Set up ClickHouse and Kafka real-time data pipeline with streaming analytics

Advanced · 45 min · Apr 03, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Build a production-ready real-time data pipeline using ClickHouse for high-performance analytics and Apache Kafka for streaming data ingestion. Configure clustering, replication, and automated data processing workflows.

Prerequisites

  • Root or sudo access
  • Minimum 8GB RAM
  • Java 11 or higher
  • Network access to download packages
  • At least 50GB free disk space
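
Before installing anything, a quick preflight check against the RAM and disk requirements above can save a failed install later. This is a minimal sketch; the thresholds match the prerequisite list:

```shell
# Preflight: verify the host meets the RAM and disk prerequisites above
ram_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
avail_kb=$(df -k --output=avail / | tail -n 1)
[ "$ram_kb" -ge $((8 * 1024 * 1024)) ] && echo "RAM: OK" || echo "RAM: below 8GB"
[ "$avail_kb" -ge $((50 * 1024 * 1024)) ] && echo "Disk: OK" || echo "Disk: below 50GB free"
```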

What this solves

Real-time data analytics requires a robust pipeline that can ingest, process, and analyze streaming data at scale. This tutorial sets up ClickHouse as your analytical database with Apache Kafka for stream processing, creating a production-grade pipeline capable of handling millions of events per second with sub-second query latency.

Step-by-step installation

Update system packages

Start by updating your package manager and installing essential dependencies for the data pipeline setup.

# Ubuntu / Debian:
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget gnupg software-properties-common apt-transport-https ca-certificates unzip

# AlmaLinux / Rocky Linux:
sudo dnf update -y
sudo dnf install -y curl wget gnupg2 ca-certificates unzip

Install Java for Kafka

Apache Kafka requires Java 8 or higher. Install OpenJDK 11 for optimal compatibility and performance.

# Ubuntu / Debian:
sudo apt install -y openjdk-11-jdk
java -version

# AlmaLinux / Rocky Linux:
sudo dnf install -y java-11-openjdk java-11-openjdk-devel
java -version

Install ClickHouse

Add the ClickHouse repository and install the server with client tools for high-performance analytical processing.

# Ubuntu / Debian:
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt update
sudo apt install -y clickhouse-server clickhouse-client

# AlmaLinux / Rocky Linux:
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
sudo yum install -y clickhouse-server clickhouse-client

Configure ClickHouse for production

Set up ClickHouse with optimized settings for real-time analytics workloads and enable clustering support. Place the following overrides in the server configuration (for example in /etc/clickhouse-server/config.d/pipeline.xml, which ClickHouse merges over the main config.xml), then restart clickhouse-server.

<clickhouse>
    <logger>
        <level>information</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <size>1000M</size>
        <count>10</count>
    </logger>

    <http_port>8123</http_port>
    <tcp_port>9000</tcp_port>
    <mysql_port>9004</mysql_port>
    <postgresql_port>9005</postgresql_port>

    <listen_host>0.0.0.0</listen_host>

    <max_connections>4096</max_connections>
    <keep_alive_timeout>3</keep_alive_timeout>
    <max_concurrent_queries>100</max_concurrent_queries>
    <max_server_memory_usage>0</max_server_memory_usage>

    <users_config>users.xml</users_config>
    <default_profile>default</default_profile>
    <default_database>default</default_database>

    <timezone>UTC</timezone>

    <mlock_executable>true</mlock_executable>

    <remote_servers>
        <!-- single-shard cluster; the cluster name here is an example -->
        <analytics_cluster>
            <shard>
                <replica>
                    <host>localhost</host>
                    <port>9000</port>
                </replica>
            </shard>
        </analytics_cluster>
    </remote_servers>

    <!-- global defaults for Kafka engine tables; setting names reconstructed -->
    <kafka>
        <auto_offset_reset>earliest</auto_offset_reset>
        <session_timeout_ms>5000</session_timeout_ms>
    </kafka>
</clickhouse>
Configure ClickHouse users and security

Set up user authentication and access controls in /etc/clickhouse-server/users.xml (or a users.d override file) for secure database operations.

<clickhouse>
    <users>
        <default>
            <password></password>
            <networks>
                <ip>::1</ip>
                <ip>127.0.0.1</ip>
                <ip>10.0.0.0/8</ip>
                <ip>172.16.0.0/12</ip>
                <ip>192.168.0.0/16</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
            <allow_databases>
                <database>default</database>
            </allow_databases>
        </default>

        <analytics>
            <!-- this is the SHA-256 of an empty string; replace with your own hash -->
            <password_sha256_hex>e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855</password_sha256_hex>
            <networks>
                <ip>127.0.0.1</ip>
                <ip>10.0.0.0/8</ip>
            </networks>
            <profile>analytics</profile>
            <quota>default</quota>
            <allow_databases>
                <database>analytics</database>
                <database>default</database>
            </allow_databases>
        </analytics>
    </users>

    <profiles>
        <default>
            <max_memory_usage>10000000000</max_memory_usage>
            <use_uncompressed_cache>0</use_uncompressed_cache>
            <load_balancing>random</load_balancing>
        </default>

        <analytics>
            <max_memory_usage>20000000000</max_memory_usage>
            <max_bytes_before_external_group_by>20000000000</max_bytes_before_external_group_by>
            <max_bytes_before_external_sort>20000000000</max_bytes_before_external_sort>
            <max_result_bytes>1000000000</max_result_bytes>
            <max_result_rows>1000000</max_result_rows>
            <readonly>0</readonly>
        </analytics>
    </profiles>

    <quotas>
        <default>
            <interval>
                <duration>3600</duration>
                <queries>0</queries>
                <errors>0</errors>
                <result_rows>0</result_rows>
                <read_rows>0</read_rows>
                <execution_time>0</execution_time>
            </interval>
        </default>
    </quotas>
</clickhouse>
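
The password_sha256_hex value shown for the analytics user is the SHA-256 of an empty string. Generate a real hash before deploying; for example:

```shell
# Produce a SHA-256 hex digest suitable for password_sha256_hex
password='test'   # replace with your real password
hash=$(printf '%s' "$password" | sha256sum | awk '{print $1}')
echo "$hash"
```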
Install Apache ZooKeeper

Install ZooKeeper for Kafka cluster coordination and ClickHouse replication management.

cd /opt
sudo wget https://archive.apache.org/dist/zookeeper/zookeeper-3.8.3/apache-zookeeper-3.8.3-bin.tar.gz
sudo tar -xzf apache-zookeeper-3.8.3-bin.tar.gz
sudo mv apache-zookeeper-3.8.3-bin zookeeper
sudo chown -R root:root /opt/zookeeper

Configure ZooKeeper

Set up ZooKeeper configuration for stable cluster operations and data consistency.

sudo tee /opt/zookeeper/conf/zoo.cfg > /dev/null << 'EOF'
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
server.1=localhost:2888:3888
maxClientCnxns=60
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
EOF

sudo mkdir -p /var/lib/zookeeper
sudo chown -R root:root /var/lib/zookeeper
echo "1" | sudo tee /var/lib/zookeeper/myid
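
A single server.1 entry is enough for one node. In a production ensemble, each member lists all servers and writes its own index to myid; the host names below are placeholders:

```ini
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```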

Create ZooKeeper systemd service

Configure ZooKeeper as a system service for automatic startup and process management.

sudo tee /etc/systemd/system/zookeeper.service > /dev/null << 'EOF'
[Unit]
Description=Apache ZooKeeper server
Documentation=http://zookeeper.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=forking
User=root
Group=root
ExecStart=/opt/zookeeper/bin/zkServer.sh start
ExecStop=/opt/zookeeper/bin/zkServer.sh stop
ExecReload=/opt/zookeeper/bin/zkServer.sh restart
WorkingDirectory=/opt/zookeeper

[Install]
WantedBy=multi-user.target
EOF

Install Apache Kafka

Download and install Kafka for distributed streaming platform capabilities.

cd /opt
sudo wget https://archive.apache.org/dist/kafka/3.6.1/kafka_2.13-3.6.1.tgz
sudo tar -xzf kafka_2.13-3.6.1.tgz
sudo mv kafka_2.13-3.6.1 kafka
sudo chown -R root:root /opt/kafka

Configure Kafka server

Set up Kafka broker configuration optimized for high-throughput data streaming and ClickHouse integration.

sudo tee /opt/kafka/config/server.properties > /dev/null << 'EOF'
broker.id=0
listeners=PLAINTEXT://localhost:9092
log.dirs=/var/lib/kafka-logs
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600

num.partitions=3
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

log.retention.hours=168
log.retention.bytes=1073741824
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000

zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=18000

group.initial.rebalance.delay.ms=0
auto.create.topics.enable=true
delete.topic.enable=true

compression.type=lz4
max.request.size=10485760
message.max.bytes=10485760
replica.fetch.max.bytes=10485760
EOF

sudo mkdir -p /var/lib/kafka-logs
sudo chown -R root:root /var/lib/kafka-logs

Create Kafka systemd service

Configure Kafka as a system service with proper dependency management.

sudo tee /etc/systemd/system/kafka.service > /dev/null << 'EOF'
[Unit]
Description=Apache Kafka server
Documentation=http://kafka.apache.org/documentation.html
Requires=zookeeper.service
After=zookeeper.service

[Service]
Type=simple
User=root
Group=root
# Debian/Ubuntu path; on AlmaLinux/Rocky use e.g. /usr/lib/jvm/java-11-openjdk
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Install Kafka Connect

Set up Kafka Connect for seamless data integration between Kafka and ClickHouse.

cd /opt
sudo wget https://github.com/ClickHouse/clickhouse-kafka-connect/releases/download/v1.0.12/clickhouse-kafka-connect-v1.0.12.zip
sudo unzip clickhouse-kafka-connect-v1.0.12.zip -d kafka-connect-clickhouse
sudo mv kafka-connect-clickhouse /opt/kafka/
sudo chown -R root:root /opt/kafka/kafka-connect-clickhouse

Configure Kafka Connect for ClickHouse

Set up Kafka Connect worker configuration for distributed mode operation.

sudo tee /opt/kafka/config/connect-distributed.properties > /dev/null << 'EOF'
bootstrap.servers=localhost:9092
group.id=clickhouse-connect-cluster

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false

offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
offset.storage.partitions=25

config.storage.topic=connect-configs
config.storage.replication.factor=1

status.storage.topic=connect-status
status.storage.replication.factor=1
status.storage.partitions=5

offset.flush.interval.ms=10000

plugin.path=/opt/kafka/kafka-connect-clickhouse

rest.host.name=localhost
rest.port=8083
EOF

Start all services

Enable and start ZooKeeper, Kafka, and ClickHouse services in the correct order.

sudo systemctl daemon-reload
sudo systemctl enable --now zookeeper
sudo systemctl enable --now clickhouse-server
sudo systemctl enable --now kafka

Wait for services to start

sleep 10

Verify services are running

sudo systemctl status zookeeper
sudo systemctl status clickhouse-server
sudo systemctl status kafka

Create ClickHouse database and tables

Set up the analytics database structure optimized for real-time data ingestion from Kafka.

clickhouse-client --query="CREATE DATABASE IF NOT EXISTS analytics"

clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS events_queue (
    timestamp DateTime64(3),
    user_id String,
    event_type String,
    properties String,
    ip_address IPv4,
    user_agent String
) ENGINE = Kafka()
SETTINGS 
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list = 'user_events',
    kafka_group_name = 'clickhouse_consumer',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 3;"

clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS events (
    timestamp DateTime64(3),
    user_id String,
    event_type String,
    properties String,
    ip_address IPv4,
    user_agent String,
    date Date MATERIALIZED toDate(timestamp)
) ENGINE = MergeTree()
PARTITION BY date
ORDER BY (event_type, user_id, timestamp)
TTL date + INTERVAL 90 DAY;"
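
Because events is partitioned by the raw date column, each day gets its own partition, and for a Date partition key the partition value takes the form YYYY-MM-DD. A cutoff just past the 90-day TTL can be computed with GNU date:

```shell
# Partition value for data just past the 90-day TTL (GNU date assumed)
cutoff=$(date -d '91 days ago' '+%Y-%m-%d')
echo "$cutoff"
```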

clickhouse-client --database=analytics --query="
CREATE MATERIALIZED VIEW IF NOT EXISTS events_mv TO events AS
SELECT 
    timestamp,
    user_id,
    event_type,
    properties,
    ip_address,
    user_agent
FROM events_queue;"

Create Kafka topic for events

Set up a Kafka topic with optimal partitioning for high-throughput event streaming.

/opt/kafka/bin/kafka-topics.sh --create \
    --bootstrap-server localhost:9092 \
    --replication-factor 1 \
    --partitions 6 \
    --topic user_events \
    --config retention.ms=604800000 \
    --config compression.type=lz4
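
The retention.ms value of 604800000 is seven days expressed in milliseconds; values like this are easy to sanity-check with shell arithmetic:

```shell
# 7 days in milliseconds — matches the retention.ms topic config above
retention_ms=$((7 * 24 * 60 * 60 * 1000))
echo "$retention_ms"
```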

Create real-time analytics views

Set up materialized views for common analytics queries with automatic aggregation.

clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS hourly_stats (
    hour DateTime,
    event_type String,
    event_count UInt64,
    unique_users UInt64
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, event_type);"

clickhouse-client --database=analytics --query="
CREATE MATERIALIZED VIEW IF NOT EXISTS hourly_stats_mv TO hourly_stats AS
SELECT 
    toStartOfHour(timestamp) as hour,
    event_type,
    count() as event_count,
    uniq(user_id) as unique_users
FROM events
GROUP BY hour, event_type;"

clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS top_events_by_hour (
    hour DateTime,
    event_type String,
    event_count UInt64
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, event_count, event_type);"

clickhouse-client --database=analytics --query="
CREATE MATERIALIZED VIEW IF NOT EXISTS top_events_mv TO top_events_by_hour AS
SELECT 
    toStartOfHour(timestamp) as hour,
    event_type,
    count() as event_count
FROM events
GROUP BY hour, event_type
HAVING event_count > 10;"

Start Kafka Connect

Launch Kafka Connect in distributed mode to confirm the worker starts cleanly. Stop this manual instance (for example with pkill -f connect-distributed) before enabling the systemd service below, or port 8083 will already be in use.

/opt/kafka/bin/connect-distributed.sh /opt/kafka/config/connect-distributed.properties > /var/log/kafka-connect.log 2>&1 &

Create systemd service for Kafka Connect

sudo tee /etc/systemd/system/kafka-connect.service > /dev/null << 'EOF'
[Unit]
Description=Kafka Connect
Requires=kafka.service
After=kafka.service

[Service]
Type=simple
User=root
Group=root
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ExecStart=/opt/kafka/bin/connect-distributed.sh /opt/kafka/config/connect-distributed.properties
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now kafka-connect
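
Installing the plugin only makes the connector class available; a sink instance still has to be registered with the Connect REST API. The JSON below is a sketch: the class and property names follow the clickhouse-kafka-connect project's documented sink configuration, and the empty password matches the users.xml example above.

```json
{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "tasks.max": "1",
    "topics": "user_events",
    "hostname": "localhost",
    "port": "8123",
    "database": "analytics",
    "username": "default",
    "password": ""
  }
}
```

Save it as clickhouse-sink.json (a name chosen here for illustration) and register it with: curl -X POST -H 'Content-Type: application/json' --data @clickhouse-sink.json localhost:8083/connectors. Note that the Kafka engine table created earlier already consumes user_events; run either that table or this sink, not both, unless you want each event ingested twice.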

Verify your setup

Test the complete data pipeline by producing sample events and verifying data flow through Kafka to ClickHouse.

# Check all services are running
sudo systemctl status zookeeper clickhouse-server kafka kafka-connect

Test Kafka topic creation

/opt/kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092

Verify ClickHouse is accepting connections

clickhouse-client --query="SELECT version()"

Test data pipeline with sample events

echo '{"timestamp":"2024-01-15 10:30:00.000","user_id":"user123","event_type":"page_view","properties":"{\"page\":\"/home\"}","ip_address":"203.0.113.10","user_agent":"Mozilla/5.0"}' | /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user_events

echo '{"timestamp":"2024-01-15 10:31:00.000","user_id":"user456","event_type":"click","properties":"{\"button\":\"signup\"}","ip_address":"203.0.113.20","user_agent":"Mozilla/5.0"}' | /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user_events
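
For more than a couple of hand-written events, a small generator loop helps. This sketch emits JSONEachRow-compatible lines that can be piped into the console producer exactly like the commands above:

```shell
# Emit synthetic events, one JSON object per line (JSONEachRow-compatible)
events=$(for i in 1 2 3 4 5; do
  printf '{"timestamp":"%s","user_id":"user%03d","event_type":"page_view","properties":"{}","ip_address":"203.0.113.%d","user_agent":"loadtest"}\n' \
    "$(date '+%Y-%m-%d %H:%M:%S.000')" "$i" "$((i + 10))"
done)
echo "$events"
```

Pipe the output into /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user_events to load-test the pipeline.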

Wait a few seconds for processing

sleep 5

Verify data in ClickHouse

clickhouse-client --database=analytics --query="SELECT count() FROM events"
clickhouse-client --database=analytics --query="SELECT * FROM events LIMIT 5"
clickhouse-client --database=analytics --query="SELECT * FROM hourly_stats LIMIT 5"

Check Kafka Connect status

curl -s localhost:8083/connectors

Verify Kafka consumer group

/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group clickhouse_consumer

Performance optimization and monitoring

Configure ClickHouse performance monitoring

Set up system tables and queries for monitoring pipeline performance and identifying bottlenecks.

clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS pipeline_metrics (
    timestamp DateTime,
    metric_name String,
    metric_value Float64,
    tags String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, metric_name);"

Create monitoring queries

clickhouse-client --database=analytics --query="
SELECT 'events_per_second' as metric,
       count() / 3600 as value,
       toStartOfHour(now()) as hour
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR;"

clickhouse-client --database=analytics --query="
SELECT 'partition_size_gb' as metric,
       sum(bytes_on_disk) / (1024 * 1024 * 1024) as size_gb,
       table
FROM system.parts
WHERE database = 'analytics' AND active = 1
GROUP BY table;"
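
The divisor in the partition-size query is one GiB in bytes, the same constant used for log.retention.bytes and log.segment.bytes in the Kafka broker config:

```shell
# 1 GiB in bytes — the divisor for converting bytes_on_disk to gigabytes
gib=$((1024 * 1024 * 1024))
echo "$gib"
```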

Set up automated performance optimization

Configure automatic table optimization and partition management for sustained performance.

sudo tee /opt/clickhouse-maintenance.sh > /dev/null << 'EOF'
#!/bin/bash

# Optimize tables for better query performance
clickhouse-client --database=analytics --query="OPTIMIZE TABLE events FINAL"
clickhouse-client --database=analytics --query="OPTIMIZE TABLE hourly_stats FINAL"
clickhouse-client --database=analytics --query="OPTIMIZE TABLE top_events_by_hour FINAL"

# Clean up old partitions (the table TTL handles this too; this is a safety net)
clickhouse-client --database=analytics --query="ALTER TABLE events DROP PARTITION '$(date -d '91 days ago' '+%Y-%m-%d')'"

# Flush system logs and record the run
clickhouse-client --database=analytics --query="SYSTEM FLUSH LOGS"
echo "$(date): ClickHouse maintenance completed" >> /var/log/clickhouse-maintenance.log
EOF

sudo chmod +x /opt/clickhouse-maintenance.sh

Add to crontab for daily execution

(sudo crontab -l 2>/dev/null; echo "0 2 * * * /opt/clickhouse-maintenance.sh") | sudo crontab -

Common issues

Symptom | Cause | Fix
Kafka Connect fails to start | Missing ClickHouse connector plugin | Verify plugin.path in connect-distributed.properties and restart the service
ClickHouse not receiving data | Kafka consumer group not active | Check the consumer group status and restart the materialized view: DETACH TABLE events_mv; ATTACH TABLE events_mv;
High memory usage in ClickHouse | Large result sets or inefficient queries | Apply query result limits and optimize table partitioning
Kafka topic lag increasing | ClickHouse ingestion slower than the producer rate | Increase kafka_num_consumers and add more ClickHouse replicas
ZooKeeper connection timeouts | Network latency or ZooKeeper overload | Increase zookeeper.connection.timeout.ms in the Kafka config
Materialized view not updating | Kafka engine table not consuming messages | Check Kafka connectivity: SELECT * FROM system.kafka_consumers;
