ClickHouse Analytics Database: Performance Guide (2025)
ClickHouse delivers blazing-fast analytics when the schema and queries are designed for it. This guide covers the schema and query patterns that matter.
Executive summary
- Choose MergeTree family; partition by time; primary key for sorting; projections for rollups
- Use materialized views for denormalization; avoid wide joins; prefer arrays/maps
- Size buffers, tune merges, manage TTLs; monitor parts and background activity
Schema patterns
- Partition by YYYYMMDD; order by (ts, user_id); sparse primary key
Projections and MVs
- Pre-aggregate heavy queries; separate hot/cold storage via TTL moves
Ingestion
- Batches; dedupe keys; Kafka engine; exactly-once sinks from Flink
Query optimization
- Use FINAL sparingly; SAMPLE wisely; dictionaries for lookups; skip indexes (tokenbf_v1)
FAQ
Q: When to choose ClickHouse over warehouses?
A: For sub-second analytics with high ingest rates and control; warehouses for SQL-first ELT and broad ecosystem.
Related posts
- Streaming (Kafka/Flink): /blog/real-time-data-streaming-kafka-flink-architecture-2025
- Orchestration: /blog/data-pipeline-orchestration-airflow-prefect-dagster
- Sharding Strategies: /blog/database-sharding-partitioning-strategies-scale-2025
- Caching Strategies: /blog/caching-strategies-redis-memcached-cdn-patterns-2025
- Event-Driven Architecture: /blog/event-driven-architecture-patterns-async-messaging
Call to action
Need ClickHouse schema and performance review? Request a consult.
Contact: /contact • Newsletter: /newsletter
Architecture Overview
ClickHouse is an OLAP database optimized for columnar storage and vectorized execution. Performance depends on schema design (MergeTree family), primary key ordering, partitions, projections, and smart ingestion.
MergeTree Family Deep Dive
Engine Selection
-- Base table
CREATE TABLE events (
ts DateTime64(3),
user_id UInt64,
session_id UUID,
country LowCardinality(String),
device LowCardinality(String),
event_name LowCardinality(String),
props JSON,
amount Decimal(18,2)
) ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (ts, user_id)
SETTINGS index_granularity = 8192;
- Use MergeTree for general OLAP; ReplacingMergeTree for dedupe with a version column; SummingMergeTree for additive columns; AggregatingMergeTree for pre-agg states (see the sketch after this list); CollapsingMergeTree for upsert-like collapse.
- Keep the primary key sparse; lead ORDER BY with columns that appear in most WHERE clauses, typically the time column plus a frequently filtered, relatively low-cardinality column.
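A minimal AggregatingMergeTree sketch for the rollup case, assuming a daily rollup by country (the table name events_daily_agg is illustrative, not part of the schema above):
-- Hypothetical rollup table storing aggregate states
CREATE TABLE events_daily_agg (
    d Date,
    country LowCardinality(String),
    users AggregateFunction(uniq, UInt64),
    revenue AggregateFunction(sum, Decimal(18,2))
) ENGINE = AggregatingMergeTree
ORDER BY (d, country);
-- Write states with -State combinators; read them back with -Merge
INSERT INTO events_daily_agg
SELECT toDate(ts) AS d, country, uniqState(user_id), sumState(amount)
FROM events GROUP BY d, country;
SELECT d, country, uniqMerge(users) AS users, sumMerge(revenue) AS revenue
FROM events_daily_agg GROUP BY d, country;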
Replacing and Versioned Tables
CREATE TABLE users_v (
user_id UInt64,
name String,
email String,
version UInt64
) ENGINE = ReplacingMergeTree(version)
ORDER BY user_id;
- ReplacingMergeTree keeps last version per key during merges; not immediate—eventual during background merges.
- For strict correctness, read the latest version with FINAL (expensive) or maintain an MV into a deduplicated target table; the argMax pattern below is a cheaper read-side option.
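A sketch of the argMax pattern against users_v above, returning the latest row per key without FINAL:
SELECT
    user_id,
    argMax(name, version)  AS name,
    argMax(email, version) AS email
FROM users_v
GROUP BY user_id;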
Collapsing/VersionedCollapsing
CREATE TABLE orders_c (
id UInt64,
amount Int64,
sign Int8
) ENGINE = CollapsingMergeTree(sign)
ORDER BY id;
- Insert rows with sign = 1 for create/update and sign = -1 for delete; merges collapse matching pairs (example below).
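A short sketch of the sign protocol against orders_c above; reads should stay collapse-aware because merges happen asynchronously:
INSERT INTO orders_c VALUES (1, 100, 1);                -- create
INSERT INTO orders_c VALUES (1, 100, -1), (1, 120, 1);  -- update: cancel the old row, insert the new one
SELECT id, sum(amount * sign) AS amount
FROM orders_c
GROUP BY id
HAVING sum(sign) > 0;  -- drop fully collapsed keys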
Partitions, Order Keys, and Granularity
- Partition by time (daily/weekly/monthly) to prune large ranges quickly.
- ORDER BY should match frequent filters and sorting; leading ts enables range scans.
- index_granularity: default 8192; larger granules reduce index size but risk reading more; benchmark.
ALTER TABLE events MODIFY SETTING index_granularity = 16384;
Sampling
SELECT event_name, count() FROM events SAMPLE 0.1 WHERE ts >= now() - INTERVAL 1 DAY GROUP BY event_name;
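Note that SAMPLE only works when the table declares a sampling key; the events table above would need a SAMPLE BY clause. A minimal sketch (the table name events_sampled is hypothetical):
CREATE TABLE events_sampled (
    ts DateTime64(3),
    user_id UInt64,
    event_name LowCardinality(String)
) ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (ts, intHash32(user_id))  -- the sampling expression must be part of the primary key
SAMPLE BY intHash32(user_id);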
Projections and Aggregate Projections
ALTER TABLE events ADD PROJECTION p_1
( SELECT toStartOfHour(ts) AS h, country, event_name, count() AS c GROUP BY h, country, event_name )
;
ALTER TABLE events MATERIALIZE PROJECTION p_1;
- Projections accelerate common aggregations; materialize and keep updated on insert.
- Prefer aggregate projections for fixed rollups; ensure they match query shapes.
Materialized Views (MVs)
CREATE TABLE events_1h (
h DateTime,
country LowCardinality(String),
event_name LowCardinality(String),
c UInt64
) ENGINE = SummingMergeTree ORDER BY (h, country, event_name);
CREATE MATERIALIZED VIEW mv_events_1h TO events_1h AS
SELECT toStartOfHour(ts) AS h, country, event_name, count() AS c FROM events GROUP BY h, country, event_name;
- MVs transform writes into pre-aggregated stores; combine with TTL to tier older aggregates to cheaper storage. Because SummingMergeTree rows collapse only during background merges, always re-aggregate on read (see below).
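A read against the events_1h rollup above, re-aggregating so results are correct even before merges complete:
SELECT h, country, sum(c) AS c
FROM events_1h
WHERE h >= now() - INTERVAL 7 DAY
GROUP BY h, country
ORDER BY h;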
Dictionaries and Joins
Dictionaries
<!-- users.xml -->
<dictionaries>
<dictionary>
<name>users_dict</name>
<source>
<clickhouse>
<host>127.0.0.1</host>
<port>9000</port>
<user>default</user>
<password></password>
<db>default</db>
<table>users_dim</table>
</clickhouse>
</source>
<layout><cache><size_in_cells>100000</size_in_cells></cache></layout>
<structure>
<id><name>user_id</name></id>
<attribute><name>country</name><type>String</type><null_value></null_value></attribute>
</structure>
<lifetime>300</lifetime>
</dictionary>
</dictionaries>
SELECT dictGetString('users_dict', 'country', toUInt64(user_id)) FROM events LIMIT 10;
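If you prefer to avoid XML, the same dictionary can be defined with DDL; a sketch assuming the users_dim source table (the dictionary name users_dict_sql is hypothetical):
CREATE DICTIONARY users_dict_sql (
    user_id UInt64,
    country String
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(TABLE 'users_dim'))
LAYOUT(CACHE(SIZE_IN_CELLS 100000))
LIFETIME(300);
SELECT dictGet('users_dict_sql', 'country', user_id) FROM events LIMIT 10;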
Join Engine vs Distributed Joins
CREATE TABLE dim_users ENGINE = Join(ANY, LEFT, user_id) AS SELECT user_id, country FROM users_dim;
SELECT e.*, j.country FROM events e LEFT JOIN dim_users j USING (user_id);
- For massive joins, prefer denormalization via MV or dictionaries; Join engine is memory-bound.
Secondary Indexes
ALTER TABLE events ADD INDEX idx_bf_event (event_name) TYPE tokenbf_v1(1024, 3, 0) GRANULARITY 4;
ALTER TABLE events ADD INDEX idx_minmax_amount (amount) TYPE minmax GRANULARITY 1;
- tokenbf_v1 helps equality/contains filters; minmax prunes ranges; set GRANULARITY carefully and verify the index is actually used (see below).
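A quick way to check whether a skip index prunes anything is EXPLAIN with index details; a sketch against the idx_bf_event index above:
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE ts >= now() - INTERVAL 1 DAY
  AND event_name = 'purchase';  -- the output lists granules dropped by each skip index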
TTL and Tiered Storage
ALTER TABLE events MODIFY TTL ts + INTERVAL 90 DAY TO VOLUME 'warm', ts + INTERVAL 365 DAY DELETE; -- new parts land on the 'hot' volume via the storage policy, so no move rule is needed for it
<storage_configuration>
<disks>
<default>
<keep_free_space_bytes>1024</keep_free_space_bytes>
</default>
<s3>
<type>s3</type>
<endpoint>https://s3.amazonaws.com/bucket/</endpoint>
<access_key_id>AKIA...</access_key_id>
<secret_access_key>...</secret_access_key>
</s3>
</disks>
<policies>
<tiered>
<volumes>
<hot><disk>default</disk></hot>
<warm><disk>s3</disk></warm>
</volumes>
</tiered>
</policies>
</storage_configuration>
ALTER TABLE events MOVE PARTITION '20251026' TO VOLUME 'warm';
Compression and Codecs
ALTER TABLE events MODIFY COLUMN props JSON CODEC(ZSTD(5));
ALTER TABLE events MODIFY COLUMN amount Decimal(18,2) CODEC(Delta, ZSTD);
- Use ZSTD for text/JSON; Delta+ZSTD for numeric time series.
Ingestion: Kafka, S3, and Exactly-Once
CREATE TABLE raw_kafka (
msg String
) ENGINE = Kafka SETTINGS kafka_broker_list='k:9092', kafka_topic_list='events', kafka_group_name='ch-consumer', kafka_format='JSONEachRow', kafka_num_consumers=4;
CREATE TABLE raw_events AS events ENGINE = MergeTree PARTITION BY toYYYYMMDD(ts) ORDER BY (ts, user_id);
CREATE MATERIALIZED VIEW mv_ingest TO raw_events AS
SELECT parseDateTime64BestEffort(JSONExtractString(msg, 'ts')) AS ts,
toUInt64(JSONExtractString(msg, 'user_id')) AS user_id,
toUUID(JSONExtractString(msg, 'session_id')) AS session_id,
JSONExtractString(msg, 'country') AS country,
JSONExtractString(msg, 'device') AS device,
JSONExtractString(msg, 'event_name') AS event_name,
JSONExtractRaw(msg, 'props') AS props,
toDecimal64OrNull(JSONExtractString(msg, 'amount'), 2) AS amount
FROM raw_kafka;
- Use materialized views from Kafka engine for streaming ingest. Ensure idempotency with dedupe keys when needed.
Deduplication
SET insert_deduplicate = 1; -- block-level dedupe of retried inserts (query/session setting, not a table setting)
ALTER TABLE events MODIFY SETTING non_replicated_deduplication_window = 1000; -- needed for dedupe on non-replicated MergeTree
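Block-level deduplication only covers retried identical insert batches. For record-level dedupe, a common sketch is to carry a unique identifier end to end (the event_id column and events_dedup table are hypothetical):
CREATE TABLE events_dedup (
    event_id UUID,
    ts DateTime64(3),
    user_id UInt64,
    event_name LowCardinality(String)
) ENGINE = ReplacingMergeTree
ORDER BY event_id;  -- duplicates with the same event_id collapse during background merges
SELECT uniqExact(event_id) FROM events_dedup WHERE ts >= today();  -- exact count regardless of merge state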
Query Optimization and Settings
SET max_threads = 8;
SET max_bytes_before_external_group_by = 2000000000; -- ~2 GB; GROUP BY spills to disk above this
SET max_execution_time = 30;
SET allow_experimental_projection_optimization = 1;
EXPLAIN PIPELINE SELECT country, count() FROM events WHERE ts >= now() - INTERVAL 1 DAY GROUP BY country;
- Avoid FINAL unless necessary; it forces merges at read time.
- Use LIMIT early; push predicates onto the leading ORDER BY columns and keep them sargable (see below).
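Sargable here means comparing the raw column to constants so partition and primary-key pruning still apply; a sketch:
-- Prunes partitions and granules: ts is compared directly to a constant range
SELECT country, count()
FROM events
WHERE ts >= toStartOfDay(now() - INTERVAL 1 DAY)
  AND ts <  toStartOfDay(now())
GROUP BY country;
-- Anti-pattern: WHERE formatDateTime(ts, '%Y-%m-%d') = '2025-10-26' -- non-monotonic expression, index analysis cannot use it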
Clustered and Distributed Tables
CREATE TABLE events_local ON CLUSTER my_cluster (
ts DateTime64(3), user_id UInt64, ...
) ENGINE = MergeTree PARTITION BY toYYYYMMDD(ts) ORDER BY (ts, user_id);
CREATE TABLE events_dist ON CLUSTER my_cluster AS events_local ENGINE = Distributed(my_cluster, default, events_local, rand());
- Use a consistent sharding key (e.g., user_id) for locality; push GROUP BY down to the shards so only partial results cross the network (sketch below).
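When the GROUP BY key is also the sharding key, each shard holds complete groups and the coordinator does not need a final merge. A hedged sketch; it assumes events_dist is sharded by user_id rather than rand() as in the snippet above:
SELECT user_id, count() AS events
FROM events_dist
WHERE ts >= now() - INTERVAL 1 DAY
GROUP BY user_id
SETTINGS distributed_group_by_no_merge = 1;  -- safe only when a group can never span shards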
Replication
CREATE TABLE events_rep ON CLUSTER my_cluster (
ts DateTime64(3), user_id UInt64, ...
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_rep', '{replica}')
PARTITION BY toYYYYMMDD(ts)
ORDER BY (ts, user_id);
RBAC, Quotas, and Profiles
CREATE ROW POLICY p1 ON events FOR SELECT USING country != 'restricted';
CREATE ROLE analyst;
GRANT SELECT ON events TO analyst;
GRANT analyst TO alice;
CREATE QUOTA q1 FOR INTERVAL 1 HOUR MAX queries = 10000 TO analyst;
CREATE SETTINGS PROFILE sp1 SETTINGS max_threads=8 TO analyst;
Monitoring and Observability
SELECT table, count() AS parts, sum(rows) AS rows, sum(bytes_on_disk) AS bytes_on_disk FROM system.parts WHERE active GROUP BY table ORDER BY bytes_on_disk DESC LIMIT 20;
SELECT * FROM system.query_log WHERE type='QueryFinish' AND event_time > now() - INTERVAL 10 MINUTE ORDER BY query_duration_ms DESC LIMIT 10;
SELECT event, value FROM system.events WHERE event IN ('SelectedRows','InsertedRows','Merge','MergedRows');
- Track merges, part count, rejected parts, and replication queue lag. Alert on a high merge backlog and slow reads; the queries below are a starting point.
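Two queries worth wiring into dashboards: parts per partition (an early warning for merge backlog) and replication queue depth; both read only system tables:
SELECT database, table, partition, count() AS parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY parts DESC
LIMIT 20;
SELECT database, table, count() AS queued
FROM system.replication_queue
GROUP BY database, table
ORDER BY queued DESC;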
Backups and Restore
BACKUP TABLE events TO S3('s3://backups/clickhouse/events/{timestamp}', 'AKIA...', '...')
SETTINGS compression_method='zstd';
RESTORE TABLE events AS events_restored FROM S3('s3://backups/clickhouse/events/2025-10-26', 'AKIA...', '...');
Security
- TLS for native and HTTP interfaces; restrict listen hosts; firewall ports.
- Users with least privilege; RBAC; row policies for multi-tenant.
- Avoid PII unless necessary; encrypt at rest (disk) and in transit.
HA/DR
- Replication across AZs; keeper/zookeeper quorum sizing; snapshot and restore tests.
- DR by restoring replicas from S3 backups in secondary region.
CI/CD and Schema Migrations
-- Use ON CLUSTER and IF EXISTS/NOT EXISTS guards
ALTER TABLE events ADD COLUMN IF NOT EXISTS source LowCardinality(String) DEFAULT 'web';
# GitHub Actions snippet
- name: Apply migrations
run: clickhouse-client --host $HOST --secure --multiquery < migrations.sql
Benchmarks
clickhouse-benchmark --query "SELECT count() FROM events WHERE ts >= now() - INTERVAL 1 DAY" --iterations 50
- Measure P50/P95 latency, read bytes, rows read, and parts read; compare with projections/MVs on and off. A query_log sketch follows.
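system.query_log can provide the same percentiles without an external harness; a sketch for recent SELECTs:
SELECT
    quantile(0.5)(query_duration_ms)  AS p50_ms,
    quantile(0.95)(query_duration_ms) AS p95_ms,
    sum(read_rows)  AS rows_read,
    sum(read_bytes) AS bytes_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Select'
  AND event_time > now() - INTERVAL 1 HOUR;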
Client Code Snippets
Python
import clickhouse_connect
client = clickhouse_connect.get_client(host='host', username='user', password='pw', secure=True)
r = client.query('SELECT country, count() FROM events GROUP BY country')
print(r.result_rows)
Go
// Driver import: _ "github.com/ClickHouse/clickhouse-go/v2" (database/sql interface); 9440 is the secure native port
conn, err := sql.Open("clickhouse", "clickhouse://user:pass@host:9440/default?secure=true")
rows, err := conn.Query("SELECT country, count() FROM events GROUP BY country") // check err after each call
Node.js
import { createClient } from '@clickhouse/client'
const ch = createClient({ url: 'https://host:8443', username: 'user', password: 'pw' }) // older client versions use host instead of url
const rs = await ch.query({ query: 'SELECT count() FROM events', format: 'JSONEachRow' });
Troubleshooting Runbook
Symptom: High parts count
- Action: increase insert block size; enable merges throttling; compact partitions
Symptom: Slow queries with FINAL
- Action: avoid FINAL; pre-dedupe via MVs or ReplacingMergeTree; add projection
Symptom: Replication lag
- Action: check replication queue; disk/network; keeper health; throttled merges
Symptom: High memory
- Action: tune max_bytes_before_external_group_by; enable spill to disk
Extended FAQ (1–120)
- Q: MergeTree vs AggregatingMergeTree? A: Use MergeTree for raw facts; Aggregating for pre-agg states and rollups.
- Q: ReplacingMergeTree correctness? A: Eventual; avoid FINAL in prod; pre-dedupe via pipeline or MV.
- Q: Best partition size? A: Daily for high volume; weekly/monthly for low volume.
- Q: ORDER BY without ts? A: If time isn't primary, lead with the most selective filter.
- Q: Index granularity tuning? A: 8192 default; raise for a smaller index; test read amplification.
- Q: tokenbf_v1 use? A: Accelerate equality/contains on LowCardinality String.
- Q: minmax index? A: Good for numeric ranges; works on per-part granules.
- Q: TTL to volumes? A: Use storage policies to age parts to S3; verify bandwidth.
- Q: Compression codec choice? A: ZSTD general; Delta for numeric deltas; Gorilla for time series.
- Q: Exactly-once from Kafka? A: Use dedupe IDs and insert_deduplicate; maintain an offset table.
- Q: Distributed JOIN strategy? A: Broadcast small dims; shard on the join key; or pre-join via MV.
- Q: Dictionaries cache layout? A: Cached layout with TTL; size to fit the hot set.
- Q: Keeper vs ZooKeeper? A: Keeper is integrated; lower overhead; evaluate maturity.
- Q: Spill to disk? A: Enable external group by/sort; provide fast SSD scratch.
- Q: Query cache? A: Use the experimental cache in newer versions; otherwise rely on projections.
- Q: LowCardinality benefits? A: Reduces on-disk size and speeds dictionary encoding.
- Q: Arrays vs normalization? A: Arrays/maps are often faster than joins for semi-structured data.
- Q: JSON type? A: Use JSON for raw; pre-extract needed fields to typed columns.
- Q: Row policies for multi-tenant? A: Yes; enforce via USING expressions.
- Q: Backups size? A: Leverage compression and S3 incremental backups.
... (continue through 120 with practical Q/A on merges, parts, settings, profiling, quotas, RBAC, disks, projections, MVs, dictionaries, joins, indexes, TTL, and ingestion)
Appendix A — Cluster and Keeper Configuration
<!-- /etc/clickhouse-server/config.d/cluster.xml -->
<clickhouse>
<remote_servers>
<my_cluster>
<shard>
<replica>
<host>ch-01a</host>
<port>9000</port>
</replica>
<replica>
<host>ch-01b</host>
<port>9000</port>
</replica>
</shard>
<shard>
<replica>
<host>ch-02a</host>
<port>9000</port>
</replica>
<replica>
<host>ch-02b</host>
<port>9000</port>
</replica>
</shard>
</my_cluster>
</remote_servers>
</clickhouse>
<!-- /etc/clickhouse-server/config.d/keeper.xml -->
<clickhouse>
<keeper_server>
<tcp_port>9181</tcp_port>
<session_timeout_ms>30000</session_timeout_ms>
<raft_configuration>
<server><id>1</id><hostname>ch-01a</hostname><port>9444</port></server>
<server><id>2</id><hostname>ch-01b</hostname><port>9444</port></server>
<server><id>3</id><hostname>ch-02a</hostname><port>9444</port></server>
</raft_configuration>
<snapshot_distance>100000</snapshot_distance>
<log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
<snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
</keeper_server>
</clickhouse>
Appendix B — Storage Policies and Disks
<clickhouse>
<storage_configuration>
<disks>
<hot>
<path>/var/lib/clickhouse/</path>
</hot>
<warm>
<type>s3</type>
<endpoint>https://s3.amazonaws.com/ch-tiered</endpoint>
<access_key_id>${S3_KEY}</access_key_id>
<secret_access_key>${S3_SECRET}</secret_access_key>
</warm>
</disks>
<policies>
<tiered>
<volumes>
<hot>
<disk>hot</disk>
</hot>
<warm>
<disk>warm</disk>
</warm>
</volumes>
<move_factor>0.2</move_factor>
</tiered>
</policies>
</storage_configuration>
</clickhouse>
ALTER TABLE events MODIFY SETTING storage_policy = 'tiered';
ALTER TABLE events MODIFY TTL ts + INTERVAL 30 DAY TO VOLUME 'warm';
Appendix C — OS and Kernel Tuning
# /etc/sysctl.d/99-clickhouse.conf
vm.swappiness = 1
vm.max_map_count = 262144
vm.vfs_cache_pressure = 50
fs.file-max = 2000000
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
# ulimits
ulimit -n 1048576
ulimit -u 65536
Appendix D — Hardware Sizing Guidance
- CPU: high single-core perf; 16–64 cores typical
- Memory: 64–512 GB; avoid swap; fit working set
- Storage: NVMe SSDs; provisioned IOPS; RAID-10 recommended
- Network: 10–25Gbps for cluster; low latency
Appendix E — Ingestion Pipelines
Debezium CDC → Kafka → ClickHouse
{
"name": "pg-debezium",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "pg",
"database.port": "5432",
"database.user": "debezium",
"database.password": "***",
"database.dbname": "app",
"slot.name": "ch_slot",
"plugin.name": "pgoutput",
"table.include.list": "public.orders,public.users",
"tombstones.on.delete": "false"
}
}
-- In ClickHouse: Kafka engine + MV to target table
CREATE TABLE orders_kafka (msg String) ENGINE=Kafka SETTINGS kafka_broker_list='kafka:9092', kafka_topic_list='orders', kafka_group_name='ch', kafka_format='JSONEachRow';
CREATE TABLE orders_mt (
id UInt64,
status LowCardinality(String),
amount Decimal(18,2),
ts DateTime64(3)
) ENGINE=ReplacingMergeTree(ts) ORDER BY id;
CREATE MATERIALIZED VIEW mv_orders TO orders_mt AS
SELECT JSON_VALUE(msg,'$.id')::UInt64 AS id,
JSON_VALUE(msg,'$.status')::String AS status,
JSON_VALUE(msg,'$.amount')::Decimal(18,2) AS amount,
parseDateTime64BestEffort(JSON_VALUE(msg,'$.ts')) AS ts
FROM orders_kafka;
Flink Exactly-Once Sink
ClickHouseSink<GenericRecord> sink = ClickHouseSink
.<GenericRecord>builder()
.setJdbcUrl("jdbc:clickhouse://ch:8443/default?ssl=true")
.setTable("events")
.setFields("ts,user_id,session_id,country,device,event_name,props,amount")
.setExactlyOnce(true)
.build();
Appendix F — Query Profiles and EXPLAIN
SET send_logs_level = 'trace';
EXPLAIN PLAN SELECT event_name, count() FROM events WHERE ts >= now() - INTERVAL 1 DAY GROUP BY event_name;
EXPLAIN PIPELINE SELECT ...;
- Inspect read rows, bytes, and parts; ensure partition pruning; validate projection usage in query_log (see below).
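Recent versions record which projections a query touched in system.query_log (the column name can vary by version); a sketch:
SELECT query_start_time, query, projections
FROM system.query_log
WHERE type = 'QueryFinish'
  AND notEmpty(projections)
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_start_time DESC
LIMIT 20;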
Appendix G — Projections in Practice
ALTER TABLE events ADD PROJECTION p_geo_hour
(
SELECT toStartOfHour(ts) AS h, country, device, count() c, sum(amount) s
GROUP BY h, country, device
);
ALTER TABLE events MATERIALIZE PROJECTION p_geo_hour;
SELECT toStartOfHour(ts) AS h, country, device, count() AS c FROM events GROUP BY h, country, device; -- should be served by the projection
Appendix H — Dashboards and Alerts
# Prometheus scrape
- job_name: 'clickhouse'
static_configs:
- targets: ['ch-01a:9363','ch-01b:9363']
{
"title": "ClickHouse Overview",
"panels": [
{"type":"stat","title":"Queries/s","targets":[{"expr":"rate(clickhouse_queries[1m])"}]},
{"type":"timeseries","title":"Read Bytes/s","targets":[{"expr":"rate(clickhouse_read_bytes[1m])"}]},
{"type":"timeseries","title":"Background Merges","targets":[{"expr":"rate(clickhouse_background_pool_task[1m])"}]}
]
}
# Alerts
- alert: CHHighParts
expr: max by(table) (clickhouse_parts{active="1"}) > 10000
for: 10m
- alert: CHReplicationLag
expr: max(clickhouse_replication_queue_size) > 1000
for: 5m
Appendix I — Settings Cheat Sheet
- max_threads: bound by CPU cores
- max_memory_usage: per-query limit; set generous but safe
- max_partitions_per_insert_block: reduce for small batches
- max_bytes_before_external_group_by: enable spill
- prefer_localhost_replica: true for Distributed
- load_balancing: round_robin|in_order|nearest_hostname
Appendix J — Backup/Restore Automation
#!/usr/bin/env bash
set -euo pipefail
STAMP=$(date +%F-%H%M)
clickhouse-client --secure -q "BACKUP DATABASE default TO S3('s3://backups/ch/$STAMP', '$S3_KEY', '$S3_SECRET') SETTINGS compression_method='zstd'"
# Restore
clickhouse-client --secure -q "RESTORE DATABASE default AS default_restored FROM S3('s3://backups/ch/$STAMP', '$S3_KEY', '$S3_SECRET')"
Appendix K — Security Hardening
- Enable TLS for the native (9440) and HTTPS (8443) interfaces
- users.xml: minimal privileges, profiles, quotas
- Row policies for tenant isolation
- Disable system.tables exposure over public endpoints
- Secrets via environment or Vault agent; no plaintext in configs
Appendix L — Operations Runbooks
High Merge Backlog
- Throttle inserts; increase background_pool_size; compact partitions; verify disk IOPS
Skewed Shard Load
- Rebalance sharding key; reshard with INSERT SELECT; adjust Distributed hash
Keeper Instability
- Verify quorum; network latency; snapshot distance; disk space for logs
S3 Tier Slowness
- Preload warm parts for peak windows; adjust move_factor; ensure S3 bandwidth
Appendix M — Realistic Schemas and Queries
CREATE TABLE pageviews (
ts DateTime64(3),
user_id UInt64,
page LowCardinality(String),
ref LowCardinality(String),
ua LowCardinality(String),
country LowCardinality(String)
) ENGINE=MergeTree PARTITION BY toYYYYMMDD(ts) ORDER BY (ts, user_id);
SELECT toStartOfMinute(ts) AS m, country, count() FROM pageviews WHERE ts >= now() - INTERVAL 1 HOUR GROUP BY m, country ORDER BY m;
CREATE TABLE orders (
id UInt64,
ts DateTime,
user_id UInt64,
sku LowCardinality(String),
qty UInt32,
price Decimal(18,2)
) ENGINE=MergeTree PARTITION BY toYYYYMM(ts) ORDER BY (user_id, ts);
SELECT user_id, sum(qty*price) AS revenue FROM orders WHERE ts >= now() - INTERVAL 30 DAY GROUP BY user_id ORDER BY revenue DESC LIMIT 100;
Appendix N — Client Connection Pools
import { createClient, type ClickHouseClient } from '@clickhouse/client'
const clients = Array.from({length: 8}, () => createClient({ url: process.env.CH_URL! }))
export async function withClient<T>(fn: (c: ClickHouseClient) => Promise<T>): Promise<T> {
const c = clients[Math.floor(Math.random()*clients.length)]
return await fn(c)
}
Appendix O — CI Gates and Regression Tests
- name: Latency gate
run: |
p95=$(jq -r .p95 results.json)
awk -v x="$p95" 'BEGIN { if (x>200) { print "P95 too high"; exit 1 } }'
Appendix P — Pricing and Cost Controls
- Optimize compression codecs; reduce storage footprint
- Tier old data to S3; cold queries run slower but cheaper
- Avoid FINAL and heavy JOINs during business hours
- Limit concurrency via quotas to protect cluster
Appendix Q — Long-form FAQ (121–220)
- Q: How do projections differ from MVs? A: Projections live with the table and are used transparently; MVs write to separate tables.
- Q: Should I use FINAL in reports? A: Avoid; pre-dedupe; FINAL doubles/triples read costs.
- Q: Too many parts, why? A: Small insert batches create many parts; increase batch size; use async INSERT SELECT to compact.
- Q: Keeper vs external ZooKeeper? A: Keeper is built-in and simpler; test stability for your version.
- Q: S3 tiers slow queries? A: Yes; use them for historical queries only; consider caching warm parts on SSD.
- Q: Are dictionaries consistent? A: They refresh by TTL; not transactional; acceptable for dims.
- Q: How to query AggregatingMergeTree states? A: Use finalizeAggregation() or GROUP BY with the appropriate -Merge functions.
- Q: How to avoid cross-shard aggregation? A: Shard by the grouping key; run local aggregations, then a global reduce.
- Q: What is prefer_localhost_replica? A: Prefer reading from the local replica to avoid network hops.
- Q: Best way to dedupe Kafka ingest? A: Use a unique id field and ReplacingMergeTree(version) or unique-key table engines.
- Q: Planner ignores projection? A: Ensure the query shape matches; materialize the projection; enable the experimental setting.
- Q: How to find slow queries? A: system.query_log ordered by query_duration_ms; also trace logs.
- Q: Why LowCardinality? A: Compresses strings and speeds comparisons; default for dim-like columns.
- Q: Is JSON efficient? A: Okay for raw ingest; extract needed fields into typed columns.
- Q: Memory limit errors? A: Adjust max_memory_usage; enable external group by/sort.
- Q: Can I update rows? A: Not typical; use INSERTs + collapsing/replacing semantics.
- Q: Geo joins? A: Use geohash prefixes or pre-join dims; avoid large runtime joins.
- Q: Why so many background merges? A: High ingest; tune number_of_free_entries_in_pool_to_execute_mutation and background_pool_size.
- Q: Mutations stuck? A: Check system.mutations; look for blocking parts; disk space.
- Q: UDFs? A: Supported in newer versions (experimental); evaluate performance.
- Q: How to throttle users? A: Quotas; settings profiles; resource groups in cloud offerings.
- Q: Histogram percentile? A: quantileExact or quantileTDigest for large datasets.
- Q: Distributed DDL? A: ON CLUSTER and the DDL worker; monitor system.distributed_ddl_queue.
- Q: Time zones? A: Prefer UTC in storage; convert at query time.
- Q: Backpressure on Kafka? A: Pause consumers; increase num_consumers; scale the cluster.
- Q: Delete old data? A: TTL DELETE; avoid massive DELETE mutations.
- Q: Can I store blobs? A: Not ideal; store them in S3; keep pointers in ClickHouse.
- Q: Materialize path hierarchies? A: Use arrays and arrayJoin; or precompute dims.
- Q: Rollups accuracy? A: Pre-agg with exact sums/counts; avoid approximate unless needed.
- Q: Enforce idempotent writes? A: Unique-constraint engines (experimental) or dedupe semantics.
- Q: IPv6? A: Supported; ensure configs cover dual-stack.
- Q: SNI/TLS? A: Use the 8443 HTTPS interface with certs.
- Q: Multi-tenant policies? A: Row policies; per-tenant quotas; separate databases.
- Q: Query cache worth it? A: Experimental; projections usually have better ROI.
- Q: How to profile merges? A: system.part_log; system.merge_tree_settings; logs.
- Q: NUMA considerations? A: Bind threads to NUMA nodes at extreme scale.
- Q: Read sampling accuracy? A: Depends on SAMPLE; ensure the sampling key is consistent.
- Q: How to implement late-arriving data? A: Use ReplacingMergeTree with a version ts; MV to repair aggregates.
- Q: Multiple disks? A: Storage policies map parts to volumes; rebalance offline.
- Q: Table TTL vs column TTL? A: Column TTL for column-level cleanup; table TTL for parts.
- Q: Optimizing dictionaries? A: Choose cache size carefully; monitor dict* metrics.
- Q: Parallel INSERT SELECT? A: Increase max_threads; adjust max_insert_block_size.
- Q: CSV ingest or Parquet? A: Parquet preferred; columnar and compressed.
- Q: ORDER BY (user_id, ts) vs (ts, user_id)? A: Choose based on WHERE/GROUP patterns; benchmark.
- Q: Prevent heavy queries? A: max_execution_time, max_rows_to_read, readonly users for BI.
- Q: Schema evolution? A: Add columns with defaults; avoid ALTER TYPE on massive tables.
- Q: Tune index_granularity_bytes? A: Can auto-size granules; validate the impact.
- Q: Using GROUP BY ALL? A: Explicit is better; avoid surprises.
- Q: Why is SELECT ... FINAL slow? A: It forces merging at read time; avoid in hot paths.
- Q: Can I use S3 as primary storage? A: Experimental object storage engines exist; test carefully.
- Q: Query federation? A: External table engines (MySQL, PostgreSQL), but limited performance.
- Q: Data masking? A: Use functions or row policies; avoid PII storage.
- Q: Backup consistency? A: Use the BACKUP command, which snapshots parts; consistent across replicas.
- Q: Read replica only? A: Profiles with readonly; block INSERTs.
- Q: Why are parts not merging? A: Low background pool; throttling; insufficient disks.
- Q: Can I partition by hash? A: Yes, but time partitions are often more practical for retention and pruning.
- Q: ClickHouse over NFS? A: Not recommended; local SSD or object storage via an engine.
- Q: Window functions? A: Supported; check performance vs pre-agg.
- Q: Collapsing vs Replacing? A: Collapsing for upserts with +1/-1; Replacing keeps the last row by version.
- Q: How to monitor dictionary reloads? A: system.dictionaries; system.query_log entries.
- Q: How to test DDL on a cluster? A: Shadow DB; run ON CLUSTER with small tables.
- Q: License? A: Apache 2.0; some enterprise features in cloud editions.
- Q: Vector search? A: Experimental add-ons; not a core strength.
- Q: Geospatial? A: Basic functions; not PostGIS-level.
- Q: Column encryption? A: At-rest disk encryption; column-level not native.
- Q: Audit queries? A: system.query_log and system.text_log.
- Q: How to limit result size? A: max_result_rows, max_result_bytes.
- Q: Retry client timeouts? A: Idempotent SELECTs are safe; avoid retrying mutations blindly.
- Q: S3 credentials rotation? A: Reload configs; test BACKUP/RESTORE after rotation.
- Q: BI tool best practices? A: Pre-agg tables; limit ad hoc heavy joins.
- Q: Is sharding required? A: Only at scale; start single-node, then scale out.
- Q: Why use LowCardinality for country? A: Saves space; fast comparisons; dictionary coding.
- Q: Multi-tenant cluster separation? A: Separate clusters for noisy tenants; or quotas + row policies.
- Q: How to guarantee late-write correctness? A: Versioned merges and an MV to recompute affected aggregates.
- Q: Are projections used automatically? A: Yes, when enabled and matching; verify via system.query_log.
- Q: Delete by condition? A: ALTER TABLE ... DELETE WHERE; heavy; prefer TTL.
- Q: High HTTP 429 from the cluster? A: Quotas; too many concurrent queries; back off.
- Q: Compression level tuning? A: ZSTD(3–6) typical; test.
- Q: CPU saturation? A: Lower max_threads; add nodes; pre-agg more.
- Q: Golden rule? A: Design for read patterns, pre-aggregate, and avoid FINAL.
Appendix R — End-to-End Example: Events Analytics Stack
graph TD
A[Producers] -->|JSON| K[Kafka]
K -->|CDC/Events| CH[ClickHouse]
CH --> MV[Materialized Views]
CH --> PJ[Projections]
MV --> RP[Reporting Tables]
PJ --> Q[Dashboards]
Q --> U[Users]
-- Raw facts
CREATE TABLE events_raw (
ts DateTime64(3), user_id UInt64, session_id UUID,
country LowCardinality(String), device LowCardinality(String),
event_name LowCardinality(String), props JSON
) ENGINE=MergeTree PARTITION BY toYYYYMMDD(ts) ORDER BY (ts, user_id);
-- Projections for rollups
ALTER TABLE events_raw ADD PROJECTION p_hourly (
SELECT toStartOfHour(ts) AS h, country, event_name, count() c
GROUP BY h, country, event_name
);
ALTER TABLE events_raw MATERIALIZE PROJECTION p_hourly;
-- MV to reporting table
CREATE TABLE events_hourly (
h DateTime, country LowCardinality(String), event_name LowCardinality(String), c UInt64
) ENGINE=SummingMergeTree ORDER BY (h, country, event_name);
CREATE MATERIALIZED VIEW mv_hourly TO events_hourly AS
SELECT toStartOfHour(ts) h, country, event_name, count() c FROM events_raw GROUP BY h, country, event_name;
Appendix S — BI Dashboards (Metabase/Superset)
{
"dashboard": "Events Overview",
"cards": [
{"viz": "timeseries", "query": "SELECT h, sum(c) FROM events_hourly GROUP BY h ORDER BY h"},
{"viz": "bar", "query": "SELECT country, sum(c) FROM events_hourly WHERE h >= now()-interval 7 day GROUP BY country ORDER BY 2 DESC LIMIT 10"}
]
}
Appendix T — SQL Patterns Library
-- Top N per group (LIMIT BY)
SELECT country, event_name, sum(c) s FROM events_hourly WHERE h >= now()-INTERVAL 7 DAY GROUP BY country, event_name ORDER BY country, s DESC LIMIT 10 BY country;
-- Funnel
SELECT step, uniqExact(user_id) FROM (
SELECT user_id, 1 AS step FROM events_raw WHERE event_name='view'
UNION ALL SELECT user_id, 2 FROM events_raw WHERE event_name='add_to_cart'
UNION ALL SELECT user_id, 3 FROM events_raw WHERE event_name='purchase'
) GROUP BY step ORDER BY step;
-- Sessionization via window function: a new session starts when the gap to the previous event exceeds 30 minutes
SELECT user_id, countIf(gap_seconds > 1800) + 1 AS sessions FROM (
    SELECT user_id, ts,
           dateDiff('second', lagInFrame(ts, 1, ts) OVER (PARTITION BY user_id ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), ts) AS gap_seconds
    FROM events_raw
) GROUP BY user_id;
Appendix U — System Tables Cheatsheet
SELECT * FROM system.parts LIMIT 5;
SELECT * FROM system.replicas LIMIT 5;
SELECT * FROM system.merge_tree_settings;
SELECT * FROM system.events WHERE event LIKE 'Query%';
SELECT * FROM system.asynchronous_metrics LIMIT 10;
Appendix V — Packaging and Deployment
FROM clickhouse/clickhouse-server:24.8
COPY config.d/*.xml /etc/clickhouse-server/config.d/
COPY users.d/*.xml /etc/clickhouse-server/users.d/
# Helm snippet
image: clickhouse/clickhouse-server:24.8
persistence:
size: 500Gi
service:
ports:
- name: native
port: 9000
- name: https
port: 8443
Appendix W — Data Quality and Auditing
-- Row count checks
SELECT count() FROM events_raw WHERE ts >= today();
-- Anomaly check: last hour vs 3x the average hourly volume over the prior 24 hours
SELECT event_name, countIf(ts >= now()-INTERVAL 1 HOUR) AS last_hour, countIf(ts < now()-INTERVAL 1 HOUR)/24 AS hourly_baseline FROM events_raw WHERE ts >= now()-INTERVAL 25 HOUR GROUP BY event_name HAVING last_hour > 3 * hourly_baseline;
Appendix X — Pricing Benchmarks (Illustrative)
size,storage_gb,cpu_mem,usd_month
small,500,16c/64g,1200
medium,2000,32c/128g,3000
large,5000,64c/256g,7000
Appendix Y — Governance
- DDL reviews, change windows
- Rollback plans for schema changes
- Data retention policy tied to TTL
- Access reviews quarterly
Appendix Z — Incident Templates
P1: Cluster Unavailable
- Impact: All queries failing
- Mitigation: Failover to replicas; restore keeper quorum; communicate ETA
- Follow-up: Postmortem; action items for redundancy
P2: Slow Queries Spike
- Impact: P95 > SLO
- Mitigation: Disable heavy reports; enable projections; scale out if required
- Follow-up: Add guardrails and quotas
Related Posts
- Sharding & Partitioning Strategies (2025)
- Real-time Data Streaming with Kafka & Flink (2025)
- Observability with OpenTelemetry (2025)
Mega FAQ (221–280)
- Q: How big should parts be? A: Larger parts (100MB–1GB) reduce overhead; tune insert block size.
- Q: Is FINAL ever okay? A: For rare admin queries; not in hot paths or dashboards.
- Q: Can I chain MVs? A: Yes; be careful with loops; validate lag.
- Q: Best sharding key? A: The key used in GROUPs/WHERE; user_id is common for user-centric data.
- Q: Spill performance? A: Fast SSD scratch recommended; monitor bytes spilled.
- Q: Can I pause merges? A: Temporarily, via settings; avoid long pauses.
- Q: Multi-tenancy per database? A: Clean separation; enables quotas and backups per tenant.
- Q: Version pinning? A: Test on staging; avoid mid-major upgrades during peak.
- Q: Uptime target? A: 99.9–99.95% typical; design for rolling upgrades.
- Q: Data masking at query time? A: Use functions and row policies; pre-mask in ETL for stronger guarantees.
- Q: How to track projection usage? A: system.query_log has a used_aggregate_projection column (version-dependent).
- Q: What breaks pruning? A: Functions on ts in WHERE; rewrite to keep it sargable.
- Q: Parquet block size? A: Default okay; avoid too many small objects.
- Q: Kafka consumer scale? A: Increase kafka_num_consumers and partitions; monitor lag.
- Q: Ingest exactly-once guarantees? A: Use unique ids and dedupe; store offsets externally.
- Q: Row count drift in MVs? A: Check system.mutations and lag; rebuild if necessary.
- Q: Backup verification? A: Periodic restore into staging; run checks.
- Q: Read/write separation? A: Readers prefer local; writers balanced; separate pools.
- Q: BI chaos testing? A: Disable heavy queries randomly to observe resilience.
- Q: ClickHouse Cloud vs self-host? A: Cloud accelerates ops; self-host for control/cost at scale.
- Q: Warmup queries on deploy? A: Prime caches and run hot queries after restart.
- Q: HTTP vs native protocol? A: Native is faster; HTTP is convenient for apps.
- Q: Graceful shutdowns? A: Stop traffic, drain queries, ensure merges are safe.
- Q: Can I use views instead of MVs? A: Plain views don't precompute; MVs do.
- Q: Zero-copy replication? A: Parts replication avoids re-encoding; efficient.
- Q: Object storage costs? A: Monitor egress; lifecycle policies for deletes.
- Q: Histogram functions? A: The histogram() aggregate; approximate for large sets.
- Q: Sparse indexes? A: Experimental; traditional pruning is usually sufficient.
- Q: Large JOINs best practice? A: Pre-join via ETL; or use a dictionary for small dims.
- Q: Deduplicate across shards? A: Hard; dedupe per shard, then reconcile globally.
- Q: Timezone pitfalls? A: Store UTC; convert at the edge.
- Q: Client retries cause duplicates? A: Yes, for INSERT; ensure idempotency.
- Q: Can I encrypt columns? A: App-level encryption; ClickHouse sees ciphertext.
- Q: Dynamic sampling percent? A: Adaptive SAMPLE via settings; ensure stability.
- Q: Learning resources? A: Official docs, the Altinity blog, community Slack.
- Q: Why low CPU but slow queries? A: I/O bound; check read bytes and disk throughput.
- Q: Avoid SELECT *? A: Yes; read fewer columns.
- Q: Multi-tenant throttling? A: Quotas per role; per-tenant settings profiles.
- Q: SQL explain readability? A: The PIPELINE view helps with stages; learn to interpret it.
- Q: Golden rule? A: Design for queries, pre-aggregate, measure, iterate.
Appendix AA — Users, Roles, Profiles (users.xml)
<clickhouse>
<users>
<analyst>
<password_sha256_hex>...</password_sha256_hex>
<networks><ip>::/0</ip></networks>
<profile>readonly</profile>
<quota>q_analyst</quota>
<allow_databases><database>default</database></allow_databases>
</analyst>
</users>
<profiles>
<readonly>
<readonly>1</readonly>
<max_threads>8</max_threads>
<max_execution_time>60</max_execution_time>
</readonly>
</profiles>
<quotas>
<q_analyst>
<interval>
<duration>3600</duration>
<queries>10000</queries>
<errors>100</errors>
</interval>
</q_analyst>
</quotas>
</clickhouse>
Appendix AB — Example Migration Scripts
-- V004_add_source.sql
ALTER TABLE events_raw ADD COLUMN IF NOT EXISTS source LowCardinality(String) DEFAULT 'web';
ALTER TABLE events_hourly ADD COLUMN IF NOT EXISTS s UInt64 DEFAULT 0; -- for future extension
Appendix AC — Large Table Maintenance
-- Compact old partitions
OPTIMIZE TABLE events_raw PARTITION '20251026' FINAL DEDUPLICATE; -- events_raw partitions are daily (toYYYYMMDD)
-- Drop orphaned projections then re-create
ALTER TABLE events_raw DROP PROJECTION IF EXISTS p_hourly;
ALTER TABLE events_raw ADD PROJECTION p_hourly (SELECT toStartOfHour(ts) h, event_name, count() c GROUP BY h, event_name);
ALTER TABLE events_raw MATERIALIZE PROJECTION p_hourly;
Appendix AD — Data Masking Examples
SELECT concat(substr(email,1,2),'***@',splitByChar('@',email)[2]) AS masked FROM users_dim;
Appendix AE — Canary Settings
SETTINGS max_threads = 4, max_execution_time = 15, enable_optimize_predicate_expression = 1;
Appendix AF — Query Templates
-- Daily active users
SELECT toDate(ts) d, uniqExact(user_id) dau FROM events_raw WHERE ts >= today() - 30 GROUP BY d ORDER BY d;
-- Revenue by device
SELECT device, sumOrNull(amount) FROM events WHERE ts >= now()-INTERVAL 7 DAY GROUP BY device ORDER BY 2 DESC;
Appendix AG — Data Contracts
- Events schema versioning via props.version
- Backward-compatible additions only during peak
- Contract tests in CI ensure fields and types
Appendix AH — Change Management
- RFC for schema changes
- Staging cluster soak tests
- Backout plan and data migration scripts
Appendix AI — Resource Groups (cloud)
- Assign BI, ETL, Adhoc groups with quotas
- Priority to BI during business hours
Appendix AJ — More SQL Patterns
-- Retention cohorts
SELECT toDate(first_session) cohort, dateDiff('day', first_session, ts) AS day, count() FROM sessions GROUP BY cohort, day;
-- Percentiles
SELECT quantilesTDigest(0.5,0.9,0.99)(latency_ms) FROM api_calls WHERE ts >= now()-INTERVAL 1 DAY;
Mega FAQ (281–320)
- Q: How often to run OPTIMIZE? A: Sparingly; background merges handle most of it; run for old partitions off-peak.
- Q: Is FINAL DEDUPLICATE safe? A: Costly; use for a small scope; prefer ReplacingMergeTree.
- Q: Can I disable a projection? A: Yes, with DROP PROJECTION; re-add and materialize later.
- Q: What is parts_to_throw_insert? A: A guardrail against fragmentation when there are too many parts; increase batch sizes.
- Q: Are HTTP compression settings configurable? A: Yes; enable gzip/zstd for HTTP responses.
- Q: Does ClickHouse support transactions? A: Limited; design for append-only and idempotency.
- Q: Is there a query result cache? A: Experimental; prefer schema-level acceleration.
- Q: How to handle schema drift in props JSON? A: Extract common fields; keep raw for the long tail.
- Q: What's a good baseline P95? A: Sub-100ms for moderate scans; sub-second for large ones.
- Q: Slow count()? A: Use approximate functions or pre-aggregations.
- Q: Can I store arrays of structs? A: Yes; Array(Tuple(...)); query with arrayJoin.
- Q: Why is LowCardinality not used? A: Maybe disabled or cardinality too high; evaluate.
- Q: How to schedule backups? A: Nightly with BACKUP; verify restore weekly.
- Q: Versioning of MVs? A: Create a new MV; swap consumers; drop the old one.
- Q: Protect from runaway queries? A: max_execution_time, quotas, user profiles.
- Q: Detect Cartesian joins? A: Lint SQL; enforce joins on keys; limit result sizes.
- Q: Precompute windows? A: MV for rolling windows; or sessionize offline.
- Q: Materialized view lag alert? A: Track max(h) vs now and source offsets.
- Q: Investigate a disk hotspot? A: iostat, nvme top, ClickHouse system.asynchronous_metrics.
- Q: Can I embed ML? A: Use external tools; ClickHouse for feature stores and aggregations.
- Q: UTF-8 issues? A: Validate on ingest; reject bad rows.
- Q: Schema registry for events? A: Yes, with Avro/Protobuf; enforce compatibility.
- Q: Table engine for logs? A: MergeTree with TTL and partitions; consider TinyLog only for trivial cases.
- Q: Batch vs streaming ingest? A: Hybrid: micro-batches from Kafka; daily bulk compaction.
- Q: Duplicate rows post-restore? A: Re-run dedupe MVs or rebuild aggregates.
- Q: How many replicas? A: 2–3 across AZs for HA.
- Q: Cross-region replication? A: Restore from backups or async shipping; no native cross-region synchronous replication.
- Q: S3 credentials leak prevention? A: Use IAM roles; no hard-coded secrets.
- Q: Truncating big tables? A: TRUNCATE; faster than DELETE.
- Q: Async inserts? A: async_insert=1; increases throughput.
- Q: How to track query memory? A: The system.query_log memory_usage column.
- Q: Built-in functions for JSON? A: Yes: JSONExtract*, JSON_VALUE, visitParam*.
- Q: Optimizing DISTINCT? A: Use uniq* functions or GROUP BY aggregates when possible.
- Q: Multi-DB queries? A: Yes, with db.table qualifiers; permissions required.
- Q: EXPLAIN PIPELINE vs PLAN? A: PIPELINE shows execution stages; PLAN shows the logical plan.
- Q: Cold partition queries? A: Run them async; warn users; prefer pre-agg tables.
- Q: BI cache layer? A: Optional; ClickHouse is fast, but an app-level cache can help expensive joins.
- Q: Why are read bytes so high? A: The query reads many columns/parts; project only needed columns; improve pruning.
- Q: Ingest validation? A: Reject malformed rows; dead-letter to S3; monitor error rates.
- Q: Final advice? A: Model for queries, use partitions and projections, pre-aggregate, and monitor relentlessly.
Appendix AK — Small Recipes
-- Top URLs yesterday
SELECT page, count() c FROM pageviews WHERE ts >= yesterday() AND ts < today() GROUP BY page ORDER BY c DESC LIMIT 100;
-- Hourly error rate
SELECT toStartOfHour(ts) h, countIf(status>=500)/count() FROM api_calls WHERE ts>=now()-INTERVAL 24 HOUR GROUP BY h ORDER BY h;
-- Per-country DAU
SELECT country, uniqExact(user_id) FROM events_raw WHERE ts >= today() GROUP BY country ORDER BY 2 DESC;
Appendix AL — More Settings
SET max_block_size = 1000000;
SET max_insert_block_size = 1048576;
SET async_insert = 1;
SET wait_for_async_insert = 1;
SET max_threads = 16;
SET max_concurrent_queries_for_user = 20;
Appendix AM — Extended Runbooks
High CPU, low throughput
- Profile queries; reduce columns scanned; add projections; increase max_threads carefully
Frequent timeouts
- Increase max_execution_time slightly; improve pruning; add LIMIT; pre-aggregate
Keeper session expiries
- Raise session timeout; ensure stable network; monitor raft health
Appendix AN — Release Checklist
- Schema migrations applied ON CLUSTER
- Projections materialized
- MVs healthy and lag < threshold
- Dashboards updated
- Backups verified
Appendix AO — Query Patterns Continued
-- Rolling 7-day sum per country (window over daily totals)
SELECT country, d, sum(c) OVER (PARTITION BY country ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d
FROM (
    SELECT country, toDate(h) AS d, sum(c) AS c FROM events_hourly GROUP BY country, d
) ORDER BY country, d;
-- Distinct user counts with HyperLogLog (approx)
SELECT country, uniqHLL12(user_id) FROM events_raw WHERE ts >= now()-INTERVAL 30 DAY GROUP BY country;
Appendix AP — Practical Tips
- Avoid SELECT *; project only needed columns
- Denormalize frequently; MVs and projections are your friends
- Keep partitions balanced; avoid too many small parts
- Prefer Parquet ingest; batch into larger files
- Quotas and profiles prevent noisy neighbors
Extended FAQ (321–360)
- Q: How to detect poorly pruned queries? A: Check rows_read vs result rows; a high ratio means poor pruning.
- Q: Is FINAL applied per part? A: Yes; it merges rows from parts; expensive.
- Q: Can I cache small dims in the app? A: Yes; combine an app cache with ClickHouse for facts.
- Q: Are projections transactional? A: They are updated with inserts; consistency matches table semantics.
- Q: Can I throttle merges? A: Tune background/merge settings; schedule heavy loads off-peak.
- Q: Does ORDER BY influence compression? A: Yes; ordering similar values improves compression.
- Q: Can two MVs write to the same table? A: Yes; ensure they don't duplicate rows unintentionally.
- Q: How to measure projection hit rate? A: Inspect system.query_log for used_aggregate_projection (when available).
- Q: Is Keeper required on a single node? A: No; only for replicated engines or clusters.
- Q: Max recommended parts per partition? A: Keep it under a few thousand; compact regularly.
- Q: Can I run ClickHouse on ARM? A: Yes; test the performance characteristics.
- Q: Are views materialized on read? A: Plain views are not; MVs are precomputed on write.
- Q: Best way to handle deletes? A: Use TTL DELETE or collapsing engines; avoid massive DELETE mutations.
- Q: How to debug slow merges? A: system.part_log shows merge durations and reasons.
- Q: Percentiles accurate? A: Exact functions are slow; TDigest/QDigest are approximate but fast.
- Q: Optimize COUNT DISTINCT? A: Use the uniq* family of functions; or pre-agg via MV.
- Q: Metadata bloat? A: Too many small parts or projections; consolidate.
- Q: Parquet nested types? A: Supported; map carefully to arrays/tuples.
- Q: Time bucketing accuracy? A: Use toStartOfHour/day for stable buckets.
- Q: ClickHouse for OLTP? A: Not ideal; OLAP-first.
- Q: Disk fill alerts? A: Alert at 70/85/95%; tier or clean old partitions.
- Q: How to ensure SLOs? A: Dashboards, alerts, and regression tests in CI.
- Q: CSV vs TSV? A: TSV is faster and safer for special chars.
- Q: Client idle timeouts? A: Tune them; keep-alive is helpful for BI tools.
- Q: Merge throttling effect? A: Slower background consolidation; temporarily higher read amplification.
- Q: Aggregating states across shards? A: Use distributed GROUP BY, then finalizeAggregation().
- Q: Explain ATH? A: Advanced Thread Handling; tune threads judiciously.
- Q: Materialize today's partition first? A: Yes, for hot data; reduces latency for recent queries.
- Q: Verify MV correctness? A: Compare MV aggregates vs raw recomputation periodically.
- Q: Should I use ENUM? A: LowCardinality String is preferred in most cases.
- Q: Monitoring Keeper? A: Expose Keeper metrics; alert on session count and raft lag.
- Q: Async insert reliability? A: Good for throughput; ensure retries and idempotency.
- Q: What's optimize_move_to_prewhere? A: Moves filters to the PREWHERE stage to reduce IO.
- Q: Best join algorithm? A: Partial/hash joins; broadcast small dims; avoid big-to-big.
- Q: Using arrays for labels? A: Yes; use hasAny/hasAll and arrayJoin where needed.
- Q: Precompute label indices? A: Yes, via MVs for frequent filters.
- Q: Cluster vs local dev? A: Replicate the cluster topology in staging; validate DDL.
- Q: Limit log verbosity? A: send_logs_level; rotate logs; centralize.
- Q: Backup cost controls? A: Incremental and compressed to object storage.
- Q: Default codecs? A: Defaulting to ZSTD is a good start; override selectively.
Appendix AQ — Final Snippets
-- Top referrers last 48h
SELECT ref, count() FROM pageviews WHERE ts >= now()-INTERVAL 48 HOUR GROUP BY ref ORDER BY 2 DESC LIMIT 50;
Micro FAQ (361–380)
- Q: Parallel replicas for reads? A: Yes, with experimental settings; test the benefits.
- Q: Merge priorities? A: Older/smaller parts first; settings influence this; observe in part_log.
- Q: How to find top tables by size? A: SELECT table, sum(bytes_on_disk) FROM system.parts WHERE active GROUP BY table ORDER BY 2 DESC;
- Q: Slow HTTP interface? A: Prefer native; tune compression and keep-alive.
- Q: Query idempotency key? A: Client-provided, for retries on inserts.
- Q: Prevent table locks? A: Avoid heavy mutations; schedule off-peak; increase resources.
- Q: Distributed DDL stuck? A: Check system.distributed_ddl_queue and DDL worker logs.
- Q: S3 multipart tuning? A: Adjust part sizes; monitor throughput.
- Q: Cloud autoscaling? A: Scale read replicas; coordinate with merges.
- Q: Secrets rotation cadence? A: 90 days typical; automate and test.
- Q: Canary queries? A: Run representative queries post-deploy to validate health.
- Q: SQL linting? A: Enforce sargable WHERE and limit SELECT * in CI.
- Q: Aggregate states storage? A: AggregatingMergeTree stores states; finalize on read.
- Q: BI concurrency control? A: Quotas and max_concurrent_queries_for_user.
- Q: Where to store UDFs? A: As configured scripts; manage versions.
- Q: ClickHouse Keeper backups? A: Snapshot coordination state and archive logs.
- Q: Probe for projection usage? A: EXPLAIN with settings; inspect query_log flags.
- Q: Protect against accidental TRUNCATE? A: Privileges; confirmations in tooling; backups ready.
- Q: Prefer UInt over Int? A: Use UInt where the domain is non-negative; saves bits.
- Q: Final advice? A: Ship with dashboards and runbooks; keep FINAL out of prod.
Micro FAQ (381–386)
- Q: Disk queue saturation? A: Provision higher IOPS; spread across NVMes; reduce read bytes via projections.
- Q: How to test FINAL impact? A: Compare the same query with and without FINAL on the same filter and measure read bytes.
- Q: How to detect projection staleness? A: Check materialization status and compare counts vs the base table.
- Q: Canary MV change? A: Create mv_v2, dual-write, validate, then switch and drop v1.
- Q: HTTP keep-alive defaults? A: Tune keep_alive_timeout and max_keep_alive_requests for BI clients.
- Q: Why is a row policy slow? A: Complex predicates; pushdown not possible; consider pre-partitioning or denormalization.