Set up ClickHouse and Kafka real-time data pipeline with streaming analytics

Advanced · 45 min · Apr 03, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Build a production-ready real-time data pipeline using ClickHouse for high-performance analytics and Apache Kafka for streaming data ingestion. Configure clustering, replication, and automated data processing workflows.

Prerequisites

  • Root or sudo access
  • Minimum 8GB RAM
  • Java 11 or higher
  • Network access to download packages
  • At least 50GB free disk space
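
Before installing anything, a quick preflight check against the RAM and disk requirements above can save a failed install later. This is a minimal sketch; the thresholds match the prerequisite list:

```shell
# Preflight: verify the host meets the RAM and disk prerequisites above
ram_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
avail_kb=$(df -k --output=avail / | tail -n 1)
[ "$ram_kb" -ge $((8 * 1024 * 1024)) ] && echo "RAM: OK" || echo "RAM: below 8GB"
[ "$avail_kb" -ge $((50 * 1024 * 1024)) ] && echo "Disk: OK" || echo "Disk: below 50GB free"
```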

What this solves

Real-time data analytics requires a robust pipeline that can ingest, process, and analyze streaming data at scale. This tutorial sets up ClickHouse as your analytical database with Apache Kafka for stream processing, creating a production-grade pipeline capable of handling millions of events per second with sub-second query latency.

Step-by-step installation

Update system packages

Start by updating your package manager and installing essential dependencies for the data pipeline setup.

# Ubuntu / Debian:
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget gnupg software-properties-common apt-transport-https ca-certificates unzip

# AlmaLinux / Rocky Linux:
sudo dnf update -y
sudo dnf install -y curl wget gnupg2 ca-certificates unzip

Install Java for Kafka

Apache Kafka requires Java 8 or higher. Install OpenJDK 11 for optimal compatibility and performance.

# Ubuntu / Debian:
sudo apt install -y openjdk-11-jdk
java -version

# AlmaLinux / Rocky Linux:
sudo dnf install -y java-11-openjdk java-11-openjdk-devel
java -version

Install ClickHouse

Add the ClickHouse repository and install the server with client tools for high-performance analytical processing.

# Ubuntu / Debian:
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt update
sudo apt install -y clickhouse-server clickhouse-client

# AlmaLinux / Rocky Linux:
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
sudo yum install -y clickhouse-server clickhouse-client

Configure ClickHouse for production

Set up ClickHouse with optimized settings for real-time analytics workloads and enable clustering support. Place the following overrides in the server configuration (for example in /etc/clickhouse-server/config.d/pipeline.xml, which ClickHouse merges over the main config.xml), then restart clickhouse-server.

<clickhouse>
    <logger>
        <level>information</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <size>1000M</size>
        <count>10</count>
    </logger>

    <http_port>8123</http_port>
    <tcp_port>9000</tcp_port>
    <mysql_port>9004</mysql_port>
    <postgresql_port>9005</postgresql_port>

    <listen_host>0.0.0.0</listen_host>

    <max_connections>4096</max_connections>
    <keep_alive_timeout>3</keep_alive_timeout>
    <max_concurrent_queries>100</max_concurrent_queries>
    <max_server_memory_usage>0</max_server_memory_usage>

    <users_config>users.xml</users_config>
    <default_profile>default</default_profile>
    <default_database>default</default_database>

    <timezone>UTC</timezone>

    <mlock_executable>true</mlock_executable>

    <remote_servers>
        <!-- single-shard cluster; the cluster name here is an example -->
        <analytics_cluster>
            <shard>
                <replica>
                    <host>localhost</host>
                    <port>9000</port>
                </replica>
            </shard>
        </analytics_cluster>
    </remote_servers>

    <!-- global defaults for Kafka engine tables; setting names reconstructed -->
    <kafka>
        <auto_offset_reset>earliest</auto_offset_reset>
        <session_timeout_ms>5000</session_timeout_ms>
    </kafka>
</clickhouse>
Configure ClickHouse users and security

Set up user authentication and access controls in /etc/clickhouse-server/users.xml (or a users.d override file) for secure database operations.

<clickhouse>
    <users>
        <default>
            <password></password>
            <networks>
                <ip>::1</ip>
                <ip>127.0.0.1</ip>
                <ip>10.0.0.0/8</ip>
                <ip>172.16.0.0/12</ip>
                <ip>192.168.0.0/16</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
            <allow_databases>
                <database>default</database>
            </allow_databases>
        </default>

        <analytics>
            <!-- this is the SHA-256 of an empty string; replace with your own hash -->
            <password_sha256_hex>e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855</password_sha256_hex>
            <networks>
                <ip>127.0.0.1</ip>
                <ip>10.0.0.0/8</ip>
            </networks>
            <profile>analytics</profile>
            <quota>default</quota>
            <allow_databases>
                <database>analytics</database>
                <database>default</database>
            </allow_databases>
        </analytics>
    </users>

    <profiles>
        <default>
            <max_memory_usage>10000000000</max_memory_usage>
            <use_uncompressed_cache>0</use_uncompressed_cache>
            <load_balancing>random</load_balancing>
        </default>

        <analytics>
            <max_memory_usage>20000000000</max_memory_usage>
            <max_bytes_before_external_group_by>20000000000</max_bytes_before_external_group_by>
            <max_bytes_before_external_sort>20000000000</max_bytes_before_external_sort>
            <max_result_bytes>1000000000</max_result_bytes>
            <max_result_rows>1000000</max_result_rows>
            <readonly>0</readonly>
        </analytics>
    </profiles>

    <quotas>
        <default>
            <interval>
                <duration>3600</duration>
                <queries>0</queries>
                <errors>0</errors>
                <result_rows>0</result_rows>
                <read_rows>0</read_rows>
                <execution_time>0</execution_time>
            </interval>
        </default>
    </quotas>
</clickhouse>
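
The password_sha256_hex value shown for the analytics user is the SHA-256 of an empty string. Generate a real hash before deploying; for example:

```shell
# Produce a SHA-256 hex digest suitable for password_sha256_hex
password='test'   # replace with your real password
hash=$(printf '%s' "$password" | sha256sum | awk '{print $1}')
echo "$hash"
```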
Install Apache ZooKeeper

Install ZooKeeper for Kafka cluster coordination and ClickHouse replication management.

cd /opt
sudo wget https://archive.apache.org/dist/zookeeper/zookeeper-3.8.3/apache-zookeeper-3.8.3-bin.tar.gz
sudo tar -xzf apache-zookeeper-3.8.3-bin.tar.gz
sudo mv apache-zookeeper-3.8.3-bin zookeeper
sudo chown -R root:root /opt/zookeeper

Configure ZooKeeper

Set up ZooKeeper configuration for stable cluster operations and data consistency.

sudo tee /opt/zookeeper/conf/zoo.cfg > /dev/null << 'EOF'
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
server.1=localhost:2888:3888
maxClientCnxns=60
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
EOF

sudo mkdir -p /var/lib/zookeeper
sudo chown -R root:root /var/lib/zookeeper
echo "1" | sudo tee /var/lib/zookeeper/myid
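
A single server.1 entry is enough for one node. In a production ensemble, each member lists all servers and writes its own index to myid; the host names below are placeholders:

```ini
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```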

Create ZooKeeper systemd service

Configure ZooKeeper as a system service for automatic startup and process management.

sudo tee /etc/systemd/system/zookeeper.service > /dev/null << 'EOF'
[Unit]
Description=Apache ZooKeeper server
Documentation=http://zookeeper.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=forking
User=root
Group=root
ExecStart=/opt/zookeeper/bin/zkServer.sh start
ExecStop=/opt/zookeeper/bin/zkServer.sh stop
ExecReload=/opt/zookeeper/bin/zkServer.sh restart
WorkingDirectory=/opt/zookeeper

[Install]
WantedBy=multi-user.target
EOF

Install Apache Kafka

Download and install Kafka for distributed streaming platform capabilities.

cd /opt
sudo wget https://archive.apache.org/dist/kafka/3.6.1/kafka_2.13-3.6.1.tgz
sudo tar -xzf kafka_2.13-3.6.1.tgz
sudo mv kafka_2.13-3.6.1 kafka
sudo chown -R root:root /opt/kafka

Configure Kafka server

Set up Kafka broker configuration optimized for high-throughput data streaming and ClickHouse integration.

sudo tee /opt/kafka/config/server.properties > /dev/null << 'EOF'
broker.id=0
listeners=PLAINTEXT://localhost:9092
log.dirs=/var/lib/kafka-logs
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600

num.partitions=3
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

log.retention.hours=168
log.retention.bytes=1073741824
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000

zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=18000

group.initial.rebalance.delay.ms=0
auto.create.topics.enable=true
delete.topic.enable=true

compression.type=lz4
max.request.size=10485760
message.max.bytes=10485760
replica.fetch.max.bytes=10485760
EOF

sudo mkdir -p /var/lib/kafka-logs
sudo chown -R root:root /var/lib/kafka-logs

Create Kafka systemd service

Configure Kafka as a system service with proper dependency management.

sudo tee /etc/systemd/system/kafka.service > /dev/null << 'EOF'
[Unit]
Description=Apache Kafka server
Documentation=http://kafka.apache.org/documentation.html
Requires=zookeeper.service
After=zookeeper.service

[Service]
Type=simple
User=root
Group=root
# Debian/Ubuntu path; on AlmaLinux/Rocky use e.g. /usr/lib/jvm/java-11-openjdk
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Install Kafka Connect

Set up Kafka Connect for seamless data integration between Kafka and ClickHouse.

cd /opt
sudo wget https://github.com/ClickHouse/clickhouse-kafka-connect/releases/download/v1.0.12/clickhouse-kafka-connect-v1.0.12.zip
sudo unzip clickhouse-kafka-connect-v1.0.12.zip -d kafka-connect-clickhouse
sudo mv kafka-connect-clickhouse /opt/kafka/
sudo chown -R root:root /opt/kafka/kafka-connect-clickhouse

Configure Kafka Connect for ClickHouse

Set up Kafka Connect worker configuration for distributed mode operation.

sudo tee /opt/kafka/config/connect-distributed.properties > /dev/null << 'EOF'
bootstrap.servers=localhost:9092
group.id=clickhouse-connect-cluster

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false

offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
offset.storage.partitions=25

config.storage.topic=connect-configs
config.storage.replication.factor=1

status.storage.topic=connect-status
status.storage.replication.factor=1
status.storage.partitions=5

offset.flush.interval.ms=10000

plugin.path=/opt/kafka/kafka-connect-clickhouse

rest.host.name=localhost
rest.port=8083
EOF

Start all services

Enable and start ZooKeeper, Kafka, and ClickHouse services in the correct order.

sudo systemctl daemon-reload
sudo systemctl enable --now zookeeper
sudo systemctl enable --now clickhouse-server
sudo systemctl enable --now kafka

Wait for services to start

sleep 10

Verify services are running

sudo systemctl status zookeeper
sudo systemctl status clickhouse-server
sudo systemctl status kafka

Create ClickHouse database and tables

Set up the analytics database structure optimized for real-time data ingestion from Kafka.

clickhouse-client --query="CREATE DATABASE IF NOT EXISTS analytics"

clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS events_queue (
    timestamp DateTime64(3),
    user_id String,
    event_type String,
    properties String,
    ip_address IPv4,
    user_agent String
) ENGINE = Kafka()
SETTINGS 
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list = 'user_events',
    kafka_group_name = 'clickhouse_consumer',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 3;"

clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS events (
    timestamp DateTime64(3),
    user_id String,
    event_type String,
    properties String,
    ip_address IPv4,
    user_agent String,
    date Date MATERIALIZED toDate(timestamp)
) ENGINE = MergeTree()
PARTITION BY date
ORDER BY (event_type, user_id, timestamp)
TTL date + INTERVAL 90 DAY;"
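
Because events is partitioned by the raw date column, each day gets its own partition, and for a Date partition key the partition value takes the form YYYY-MM-DD. A cutoff just past the 90-day TTL can be computed with GNU date:

```shell
# Partition value for data just past the 90-day TTL (GNU date assumed)
cutoff=$(date -d '91 days ago' '+%Y-%m-%d')
echo "$cutoff"
```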

clickhouse-client --database=analytics --query="
CREATE MATERIALIZED VIEW IF NOT EXISTS events_mv TO events AS
SELECT 
    timestamp,
    user_id,
    event_type,
    properties,
    ip_address,
    user_agent
FROM events_queue;"

Create Kafka topic for events

Set up a Kafka topic with optimal partitioning for high-throughput event streaming.

/opt/kafka/bin/kafka-topics.sh --create \
    --bootstrap-server localhost:9092 \
    --replication-factor 1 \
    --partitions 6 \
    --topic user_events \
    --config retention.ms=604800000 \
    --config compression.type=lz4
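
The retention.ms value of 604800000 is seven days expressed in milliseconds; values like this are easy to sanity-check with shell arithmetic:

```shell
# 7 days in milliseconds — matches the retention.ms topic config above
retention_ms=$((7 * 24 * 60 * 60 * 1000))
echo "$retention_ms"
```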

Create real-time analytics views

Set up materialized views for common analytics queries with automatic aggregation.

clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS hourly_stats (
    hour DateTime,
    event_type String,
    event_count UInt64,
    unique_users UInt64
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, event_type);"

clickhouse-client --database=analytics --query="
CREATE MATERIALIZED VIEW IF NOT EXISTS hourly_stats_mv TO hourly_stats AS
SELECT 
    toStartOfHour(timestamp) as hour,
    event_type,
    count() as event_count,
    uniq(user_id) as unique_users
FROM events
GROUP BY hour, event_type;"

clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS top_events_by_hour (
    hour DateTime,
    event_type String,
    event_count UInt64
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, event_count, event_type);"

clickhouse-client --database=analytics --query="
CREATE MATERIALIZED VIEW IF NOT EXISTS top_events_mv TO top_events_by_hour AS
SELECT 
    toStartOfHour(timestamp) as hour,
    event_type,
    count() as event_count
FROM events
GROUP BY hour, event_type
HAVING event_count > 10;"

Start Kafka Connect

Launch Kafka Connect in distributed mode to confirm the worker starts cleanly. Stop this manual instance (for example with pkill -f connect-distributed) before enabling the systemd service below, or port 8083 will already be in use.

/opt/kafka/bin/connect-distributed.sh /opt/kafka/config/connect-distributed.properties > /var/log/kafka-connect.log 2>&1 &

Create systemd service for Kafka Connect

sudo tee /etc/systemd/system/kafka-connect.service > /dev/null << 'EOF'
[Unit]
Description=Kafka Connect
Requires=kafka.service
After=kafka.service

[Service]
Type=simple
User=root
Group=root
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ExecStart=/opt/kafka/bin/connect-distributed.sh /opt/kafka/config/connect-distributed.properties
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now kafka-connect
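
Installing the plugin only makes the connector class available; a sink instance still has to be registered with the Connect REST API. The JSON below is a sketch: the class and property names follow the clickhouse-kafka-connect project's documented sink configuration, and the empty password matches the users.xml example above.

```json
{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "tasks.max": "1",
    "topics": "user_events",
    "hostname": "localhost",
    "port": "8123",
    "database": "analytics",
    "username": "default",
    "password": ""
  }
}
```

Save it as clickhouse-sink.json (a name chosen here for illustration) and register it with: curl -X POST -H 'Content-Type: application/json' --data @clickhouse-sink.json localhost:8083/connectors. Note that the Kafka engine table created earlier already consumes user_events; run either that table or this sink, not both, unless you want each event ingested twice.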

Verify your setup

Test the complete data pipeline by producing sample events and verifying data flow through Kafka to ClickHouse.

# Check all services are running
sudo systemctl status zookeeper clickhouse-server kafka kafka-connect

Test Kafka topic creation

/opt/kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092

Verify ClickHouse is accepting connections

clickhouse-client --query="SELECT version()"

Test data pipeline with sample events

echo '{"timestamp":"2024-01-15 10:30:00.000","user_id":"user123","event_type":"page_view","properties":"{\"page\":\"/home\"}","ip_address":"203.0.113.10","user_agent":"Mozilla/5.0"}' | /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user_events

echo '{"timestamp":"2024-01-15 10:31:00.000","user_id":"user456","event_type":"click","properties":"{\"button\":\"signup\"}","ip_address":"203.0.113.20","user_agent":"Mozilla/5.0"}' | /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user_events
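
For more than a couple of hand-written events, a small generator loop helps. This sketch emits JSONEachRow-compatible lines that can be piped into the console producer exactly like the commands above:

```shell
# Emit synthetic events, one JSON object per line (JSONEachRow-compatible)
events=$(for i in 1 2 3 4 5; do
  printf '{"timestamp":"%s","user_id":"user%03d","event_type":"page_view","properties":"{}","ip_address":"203.0.113.%d","user_agent":"loadtest"}\n' \
    "$(date '+%Y-%m-%d %H:%M:%S.000')" "$i" "$((i + 10))"
done)
echo "$events"
```

Pipe the output into /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user_events to load-test the pipeline.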

Wait a few seconds for processing

sleep 5

Verify data in ClickHouse

clickhouse-client --database=analytics --query="SELECT count() FROM events"
clickhouse-client --database=analytics --query="SELECT * FROM events LIMIT 5"
clickhouse-client --database=analytics --query="SELECT * FROM hourly_stats LIMIT 5"

Check Kafka Connect status

curl -s localhost:8083/connectors

Verify Kafka consumer group

/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group clickhouse_consumer

Performance optimization and monitoring

Configure ClickHouse performance monitoring

Set up system tables and queries for monitoring pipeline performance and identifying bottlenecks.

clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS pipeline_metrics (
    timestamp DateTime,
    metric_name String,
    metric_value Float64,
    tags String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, metric_name);"

Create monitoring queries

clickhouse-client --database=analytics --query="
SELECT 'events_per_second' as metric,
       count() / 3600 as value,
       toStartOfHour(now()) as hour
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR;"

clickhouse-client --database=analytics --query="
SELECT 'partition_size_gb' as metric,
       sum(bytes_on_disk) / (1024 * 1024 * 1024) as size_gb,
       table
FROM system.parts
WHERE database = 'analytics' AND active = 1
GROUP BY table;"
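
The divisor in the partition-size query is one GiB in bytes, the same constant used for log.retention.bytes and log.segment.bytes in the Kafka broker config:

```shell
# 1 GiB in bytes — the divisor for converting bytes_on_disk to gigabytes
gib=$((1024 * 1024 * 1024))
echo "$gib"
```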

Set up automated performance optimization

Configure automatic table optimization and partition management for sustained performance.

sudo tee /opt/clickhouse-maintenance.sh > /dev/null << 'EOF'
#!/bin/bash

# Optimize tables for better query performance
clickhouse-client --database=analytics --query="OPTIMIZE TABLE events FINAL"
clickhouse-client --database=analytics --query="OPTIMIZE TABLE hourly_stats FINAL"
clickhouse-client --database=analytics --query="OPTIMIZE TABLE top_events_by_hour FINAL"

# Clean up old partitions (the table TTL handles this too; this is a safety net)
clickhouse-client --database=analytics --query="ALTER TABLE events DROP PARTITION '$(date -d '91 days ago' '+%Y-%m-%d')'"

# Flush system logs and record the run
clickhouse-client --database=analytics --query="SYSTEM FLUSH LOGS"
echo "$(date): ClickHouse maintenance completed" >> /var/log/clickhouse-maintenance.log
EOF

sudo chmod +x /opt/clickhouse-maintenance.sh

Add to crontab for daily execution

(sudo crontab -l 2>/dev/null; echo "0 2 * * * /opt/clickhouse-maintenance.sh") | sudo crontab -

Common issues

Symptom | Cause | Fix
Kafka Connect fails to start | Missing ClickHouse connector plugin | Verify plugin.path in connect-distributed.properties and restart the service
ClickHouse not receiving data | Kafka consumer group not active | Check the consumer group status and restart the materialized view: DETACH TABLE events_mv; ATTACH TABLE events_mv;
High memory usage in ClickHouse | Large result sets or inefficient queries | Apply query result limits and optimize table partitioning
Kafka topic lag increasing | ClickHouse ingestion slower than the producer rate | Increase kafka_num_consumers and add more ClickHouse replicas
ZooKeeper connection timeouts | Network latency or ZooKeeper overload | Increase zookeeper.connection.timeout.ms in the Kafka config
Materialized view not updating | Kafka engine table not consuming messages | Check Kafka connectivity: SELECT * FROM system.kafka_consumers;
