Build a production-ready real-time data pipeline using ClickHouse for high-performance analytics and Apache Kafka for streaming data ingestion. Configure clustering, replication, and automated data processing workflows.
Prerequisites
- Root or sudo access
- Minimum 8GB RAM
- Java 11 or higher
- Network access to download packages
- At least 50GB free disk space
What this solves
Real-time data analytics requires a robust pipeline that can ingest, process, and analyze streaming data at scale. This tutorial sets up ClickHouse as your analytical database with Apache Kafka for stream ingestion, creating a production-grade pipeline that can scale toward millions of events per second while keeping query latency sub-second.
Step-by-step installation
Update system packages
Start by updating your package manager and installing essential dependencies for the data pipeline setup.
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget gnupg software-properties-common apt-transport-https ca-certificates
Install Java for Kafka
Apache Kafka requires Java 8 or higher. Install OpenJDK 11 for optimal compatibility and performance.
sudo apt install -y openjdk-11-jdk
java -version
Install ClickHouse
Add the ClickHouse repository and install the server with client tools for high-performance analytical processing.
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt update
sudo apt install -y clickhouse-server clickhouse-client
Configure ClickHouse for production
Set up ClickHouse with settings tuned for real-time analytics workloads, plus a single-replica cluster definition you can extend later. Write the configuration with a heredoc so the step is reproducible:
sudo tee /etc/clickhouse-server/config.xml > /dev/null << 'EOF'
<?xml version="1.0"?>
<clickhouse>
    <logger>
        <level>information</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <size>1000M</size>
        <count>10</count>
    </logger>
    <http_port>8123</http_port>
    <tcp_port>9000</tcp_port>
    <mysql_port>9004</mysql_port>
    <postgresql_port>9005</postgresql_port>
    <listen_host>0.0.0.0</listen_host>
    <max_connections>4096</max_connections>
    <keep_alive_timeout>3</keep_alive_timeout>
    <max_concurrent_queries>100</max_concurrent_queries>
    <max_server_memory_usage>0</max_server_memory_usage>
    <path>/var/lib/clickhouse/</path>
    <tmp_path>/var/lib/clickhouse/tmp/</tmp_path>
    <users_config>users.xml</users_config>
    <default_profile>default</default_profile>
    <default_database>default</default_database>
    <timezone>UTC</timezone>
    <mlock_executable>true</mlock_executable>
    <remote_servers>
        <analytics_cluster>
            <shard>
                <replica>
                    <host>localhost</host>
                    <port>9000</port>
                </replica>
            </shard>
        </analytics_cluster>
    </remote_servers>
    <kafka>
        <auto_offset_reset>earliest</auto_offset_reset>
    </kafka>
</clickhouse>
EOF
Configure ClickHouse users and security
Set up user authentication and access controls for secure database operations.
sudo tee /etc/clickhouse-server/users.xml > /dev/null << 'EOF'
<?xml version="1.0"?>
<clickhouse>
    <users>
        <default>
            <password></password>
            <networks>
                <ip>::1</ip>
                <ip>127.0.0.1</ip>
                <ip>10.0.0.0/8</ip>
                <ip>172.16.0.0/12</ip>
                <ip>192.168.0.0/16</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
        </default>
        <analytics>
            <!-- SHA-256 of the empty string; replace with the hash of a real password -->
            <password_sha256_hex>e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855</password_sha256_hex>
            <networks>
                <ip>127.0.0.1</ip>
                <ip>10.0.0.0/8</ip>
            </networks>
            <profile>analytics</profile>
            <quota>default</quota>
        </analytics>
    </users>
    <profiles>
        <default>
            <max_memory_usage>10000000000</max_memory_usage>
            <use_uncompressed_cache>0</use_uncompressed_cache>
            <load_balancing>random</load_balancing>
        </default>
        <analytics>
            <max_memory_usage>20000000000</max_memory_usage>
            <max_bytes_before_external_group_by>20000000000</max_bytes_before_external_group_by>
            <max_bytes_before_external_sort>20000000000</max_bytes_before_external_sort>
            <max_query_size>1000000000</max_query_size>
            <max_ast_elements>1000000</max_ast_elements>
            <readonly>0</readonly>
        </analytics>
    </profiles>
    <quotas>
        <default>
            <interval>
                <duration>3600</duration>
                <queries>0</queries>
                <errors>0</errors>
                <result_rows>0</result_rows>
                <read_rows>0</read_rows>
                <execution_time>0</execution_time>
            </interval>
        </default>
    </quotas>
</clickhouse>
EOF
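The `password_sha256_hex` value above is the SHA-256 hex digest of the user's password (the example hash is the digest of the empty string, i.e. no password). To generate the hash for a real password, a minimal sketch; the example password `secret` is only an illustration:

```shell
#!/usr/bin/env sh
# Compute the SHA-256 hex digest of a password for password_sha256_hex.
# printf avoids the trailing newline that echo would feed into the hash.
PASSWORD='secret'   # example only; replace with your own password
HASH=$(printf '%s' "$PASSWORD" | sha256sum | awk '{print $1}')
echo "$HASH"
```

Paste the 64-character output into the `<password_sha256_hex>` element and restart clickhouse-server.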
Install Apache ZooKeeper
Install ZooKeeper for Kafka cluster coordination and ClickHouse replication management.
cd /opt
sudo wget https://archive.apache.org/dist/zookeeper/zookeeper-3.8.3/apache-zookeeper-3.8.3-bin.tar.gz
sudo tar -xzf apache-zookeeper-3.8.3-bin.tar.gz
sudo mv apache-zookeeper-3.8.3-bin zookeeper
sudo chown -R root:root /opt/zookeeper
Configure ZooKeeper
Set up ZooKeeper configuration for stable cluster operations and data consistency.
sudo tee /opt/zookeeper/conf/zoo.cfg > /dev/null << 'EOF'
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
server.1=localhost:2888:3888
maxClientCnxns=60
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
EOF
sudo mkdir -p /var/lib/zookeeper
sudo chown -R root:root /var/lib/zookeeper
echo "1" | sudo tee /var/lib/zookeeper/myid
Create ZooKeeper systemd service
Configure ZooKeeper as a system service for automatic startup and process management.
sudo tee /etc/systemd/system/zookeeper.service > /dev/null << 'EOF'
[Unit]
Description=Apache ZooKeeper server
Documentation=http://zookeeper.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target
[Service]
Type=forking
User=root
Group=root
ExecStart=/opt/zookeeper/bin/zkServer.sh start
ExecStop=/opt/zookeeper/bin/zkServer.sh stop
ExecReload=/opt/zookeeper/bin/zkServer.sh restart
WorkingDirectory=/opt/zookeeper
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
Install Apache Kafka
Download and install Kafka for distributed streaming platform capabilities.
cd /opt
sudo wget https://archive.apache.org/dist/kafka/2.13-3.6.1/kafka_2.13-3.6.1.tgz
sudo tar -xzf kafka_2.13-3.6.1.tgz
sudo mv kafka_2.13-3.6.1 kafka
sudo chown -R root:root /opt/kafka
Configure Kafka server
Set up Kafka broker configuration optimized for high-throughput data streaming and ClickHouse integration.
sudo tee /opt/kafka/config/server.properties > /dev/null << 'EOF'
broker.id=0
listeners=PLAINTEXT://localhost:9092
log.dirs=/var/lib/kafka-logs
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
num.partitions=3
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.retention.bytes=1073741824
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=0
auto.create.topics.enable=true
delete.topic.enable=true
compression.type=lz4
max.request.size=10485760
message.max.bytes=10485760
replica.fetch.max.bytes=10485760
EOF
sudo mkdir -p /var/lib/kafka-logs
sudo chown -R root:root /var/lib/kafka-logs
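Note that `log.retention.bytes` applies per partition, not per topic, so a topic's worst-case on-disk footprint is roughly partitions × retention bytes (plus up to one extra active segment per partition). A quick sanity check for the 1 GiB setting above, assuming a six-partition topic like the one created later:

```shell
#!/usr/bin/env sh
# Rough upper bound on disk used by one topic's logs:
# partitions * log.retention.bytes (1 GiB per partition here).
PARTITIONS=6
RETENTION_BYTES=1073741824   # log.retention.bytes from server.properties
TOTAL=$((PARTITIONS * RETENTION_BYTES))
echo "worst-case log size: $((TOTAL / 1024 / 1024 / 1024)) GiB"   # prints 6 GiB
```

Size /var/lib/kafka-logs with this bound in mind, and remember retention also triggers on `log.retention.hours`, whichever limit is hit first.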
Create Kafka systemd service
Configure Kafka as a system service with proper dependency management.
sudo tee /etc/systemd/system/kafka.service > /dev/null << 'EOF'
[Unit]
Description=Apache Kafka server
Documentation=http://kafka.apache.org/documentation.html
Requires=zookeeper.service
After=zookeeper.service
[Service]
Type=simple
User=root
Group=root
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
Install Kafka Connect
Set up Kafka Connect for seamless data integration between Kafka and ClickHouse.
cd /opt
sudo wget https://github.com/ClickHouse/clickhouse-kafka-connect/releases/download/v1.0.12/clickhouse-kafka-connect-v1.0.12.zip
sudo unzip clickhouse-kafka-connect-v1.0.12.zip -d kafka-connect-clickhouse
sudo mv kafka-connect-clickhouse /opt/kafka/
sudo chown -R root:root /opt/kafka/kafka-connect-clickhouse
Configure Kafka Connect for ClickHouse
Set up Kafka Connect worker configuration for distributed mode operation.
sudo tee /opt/kafka/config/connect-distributed.properties > /dev/null << 'EOF'
bootstrap.servers=localhost:9092
group.id=clickhouse-connect-cluster
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
offset.storage.partitions=25
config.storage.topic=connect-configs
config.storage.replication.factor=1
status.storage.topic=connect-status
status.storage.replication.factor=1
status.storage.partitions=5
offset.flush.interval.ms=10000
plugin.path=/opt/kafka/kafka-connect-clickhouse
rest.host.name=localhost
rest.port=8083
EOF
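Once the worker is running, sink connectors are registered through its REST API on port 8083, for example with `curl -X POST -H 'Content-Type: application/json' --data @connector.json localhost:8083/connectors`. The configuration below is a sketch: the connector name and topic are illustrative, and the property names follow the clickhouse-kafka-connect documentation, so verify them against the version you installed. Also note that this tutorial's Kafka engine tables already consume `user_events`; treat the sink connector as an alternative ingestion path, not an additional one, or rows will be duplicated.

```json
{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "tasks.max": "1",
    "topics": "user_events",
    "hostname": "localhost",
    "port": "8123",
    "database": "analytics",
    "username": "default",
    "password": ""
  }
}
```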
Start all services
Enable and start ZooKeeper, Kafka, and ClickHouse services in the correct order.
sudo systemctl daemon-reload
sudo systemctl enable --now zookeeper
sudo systemctl enable --now clickhouse-server
sudo systemctl enable --now kafka
Wait for services to start
sleep 10
Verify services are running
sudo systemctl status zookeeper
sudo systemctl status clickhouse-server
sudo systemctl status kafka
Create ClickHouse database and tables
Set up the analytics database structure optimized for real-time data ingestion from Kafka.
clickhouse-client --query="CREATE DATABASE IF NOT EXISTS analytics"
clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS events_queue (
timestamp DateTime64(3),
user_id String,
event_type String,
properties String,
ip_address IPv4,
user_agent String
) ENGINE = Kafka()
SETTINGS
kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'user_events',
kafka_group_name = 'clickhouse_consumer',
kafka_format = 'JSONEachRow',
kafka_num_consumers = 3;"
clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS events (
timestamp DateTime64(3),
user_id String,
event_type String,
properties String,
ip_address IPv4,
user_agent String,
date Date MATERIALIZED toDate(timestamp)
) ENGINE = MergeTree()
PARTITION BY date
ORDER BY (event_type, user_id, timestamp)
TTL date + INTERVAL 90 DAY;"
clickhouse-client --database=analytics --query="
CREATE MATERIALIZED VIEW IF NOT EXISTS events_mv TO events AS
SELECT
timestamp,
user_id,
event_type,
properties,
ip_address,
user_agent
FROM events_queue;"
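The Kafka engine table only consumes while a materialized view is attached to it, which also gives you a clean way to pause the pipeline for schema changes: detach the view, alter the target table, then re-attach. A sketch (the `ALTER` is a hypothetical example):

```sql
-- Pause consumption before altering the target table
DETACH TABLE analytics.events_mv;
-- e.g. ALTER TABLE analytics.events ADD COLUMN session_id String DEFAULT '';
-- Resume; the consumer group resumes from its committed offsets
ATTACH TABLE analytics.events_mv;
```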
Create Kafka topic for events
Set up a Kafka topic with optimal partitioning for high-throughput event streaming.
/opt/kafka/bin/kafka-topics.sh --create \
--bootstrap-server localhost:9092 \
--replication-factor 1 \
--partitions 6 \
--topic user_events \
--config retention.ms=604800000 \
--config compression.type=lz4
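To exercise the topic with more than a couple of hand-typed messages, you can pipe generated events into the console producer. This sketch emits N synthetic JSON lines matching the events_queue schema; all field values are made up for illustration:

```shell
#!/usr/bin/env sh
# Emit N synthetic JSON events, one per line, matching the events_queue schema.
# Pipe the output into:
#   /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user_events
N=${1:-10}
i=0
while [ "$i" -lt "$N" ]; do
    ts=$(date -u '+%Y-%m-%d %H:%M:%S.000')
    printf '{"timestamp":"%s","user_id":"user%d","event_type":"page_view","properties":"{}","ip_address":"203.0.113.%d","user_agent":"loadgen"}\n' \
        "$ts" "$i" "$((i % 254 + 1))"
    i=$((i + 1))
done
```

For example, `sh gen_events.sh 1000 | /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user_events` sends a thousand events in one shot.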
Create real-time analytics views
Set up materialized views for common analytics queries with automatic aggregation.
clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS hourly_stats (
hour DateTime,
event_type String,
event_count UInt64,
unique_users UInt64
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, event_type);"
clickhouse-client --database=analytics --query="
CREATE MATERIALIZED VIEW IF NOT EXISTS hourly_stats_mv TO hourly_stats AS
SELECT
toStartOfHour(timestamp) as hour,
event_type,
count() as event_count,
uniq(user_id) as unique_users
FROM events
GROUP BY hour, event_type;"
clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS top_events_by_hour (
hour DateTime,
event_type String,
event_count UInt64
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(hour)
-- sort key excludes event_count so newer counts replace older rows
ORDER BY (hour, event_type);"
clickhouse-client --database=analytics --query="
CREATE MATERIALIZED VIEW IF NOT EXISTS top_events_mv TO top_events_by_hour AS
SELECT
toStartOfHour(timestamp) as hour,
event_type,
count() as event_count
FROM events
GROUP BY hour, event_type
HAVING event_count > 10;"
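SummingMergeTree collapses rows sharing the sort key only during background merges, so hourly_stats can briefly hold several partial rows per (hour, event_type). Always aggregate at query time rather than reading raw rows. Note that unique_users becomes an upper bound here, because per-insert uniq() results are summed rather than merged; exact uniques would need AggregatingMergeTree with uniqState, which is beyond this tutorial.

```sql
-- Aggregate at read time; sum() folds any not-yet-merged partial rows
SELECT
    hour,
    event_type,
    sum(event_count)  AS event_count,
    sum(unique_users) AS unique_users  -- upper bound, see note above
FROM analytics.hourly_stats
GROUP BY hour, event_type
ORDER BY hour DESC
LIMIT 24;
```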
Start Kafka Connect
Run Kafka Connect in distributed mode under systemd rather than as a background shell job, so it starts after the broker and restarts automatically on failure.
Create systemd service for Kafka Connect
sudo tee /etc/systemd/system/kafka-connect.service > /dev/null << 'EOF'
[Unit]
Description=Kafka Connect
Requires=kafka.service
After=kafka.service
[Service]
Type=simple
User=root
Group=root
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ExecStart=/opt/kafka/bin/connect-distributed.sh /opt/kafka/config/connect-distributed.properties
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now kafka-connect
Verify your setup
Test the complete data pipeline by producing sample events and verifying data flow through Kafka to ClickHouse.
# Check all services are running
sudo systemctl status zookeeper clickhouse-server kafka kafka-connect
Test Kafka topic creation
/opt/kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092
Verify ClickHouse is accepting connections
clickhouse-client --query="SELECT version()"
Test data pipeline with sample events
echo '{"timestamp":"2024-01-15 10:30:00.000","user_id":"user123","event_type":"page_view","properties":"{\"page\":\"/home\"}","ip_address":"203.0.113.10","user_agent":"Mozilla/5.0"}' | /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user_events
echo '{"timestamp":"2024-01-15 10:31:00.000","user_id":"user456","event_type":"click","properties":"{\"button\":\"signup\"}","ip_address":"203.0.113.20","user_agent":"Mozilla/5.0"}' | /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user_events
Wait a few seconds for processing
sleep 5
Verify data in ClickHouse
clickhouse-client --database=analytics --query="SELECT count() FROM events"
clickhouse-client --database=analytics --query="SELECT * FROM events LIMIT 5"
clickhouse-client --database=analytics --query="SELECT * FROM hourly_stats LIMIT 5"
Check Kafka Connect status
curl -s localhost:8083/connectors
Verify Kafka consumer group
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group clickhouse_consumer
Performance optimization and monitoring
Configure ClickHouse performance monitoring
Set up system tables and queries for monitoring pipeline performance and identifying bottlenecks.
clickhouse-client --database=analytics --query="
CREATE TABLE IF NOT EXISTS pipeline_metrics (
timestamp DateTime,
metric_name String,
metric_value Float64,
tags String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, metric_name);"
Create monitoring queries
clickhouse-client --database=analytics --query="
SELECT
'events_per_second' as metric,
count() / 3600 as value,
toStartOfHour(now()) as hour
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR;"
clickhouse-client --database=analytics --query="
SELECT
'partition_size_gb' as metric,
sum(bytes_on_disk) / (1024 * 1024 * 1024) as size_gb,
table
FROM system.parts
WHERE database = 'analytics' AND active = 1
GROUP BY table;"
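Slow dashboard queries are usually easier to find in system.query_log than by inspection. This is a sketch assuming query logging is enabled, which is the default:

```sql
-- Ten slowest completed queries in the last hour
SELECT
    query_duration_ms,
    read_rows,
    substring(query, 1, 80) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
```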
Set up automated performance optimization
Configure automatic table optimization and partition management for sustained performance.
sudo tee /opt/clickhouse-maintenance.sh > /dev/null << 'EOF'
#!/bin/bash
# Optimize tables for better query performance
clickhouse-client --database=analytics --query="OPTIMIZE TABLE events FINAL"
clickhouse-client --database=analytics --query="OPTIMIZE TABLE hourly_stats FINAL"
clickhouse-client --database=analytics --query="OPTIMIZE TABLE top_events_by_hour FINAL"
# Drop partitions past the 90-day TTL (partition value is the date itself for a Date partition key)
clickhouse-client --database=analytics --query="ALTER TABLE events DROP PARTITION '$(date -d '91 days ago' '+%Y-%m-%d')'"
# Flush in-memory system logs to their tables
clickhouse-client --database=analytics --query="SYSTEM FLUSH LOGS"
echo "$(date): ClickHouse maintenance completed" >> /var/log/clickhouse-maintenance.log
EOF
sudo chmod +x /opt/clickhouse-maintenance.sh
Add a daily 02:00 cron entry, appending to the root crontab rather than overwriting it
(sudo crontab -l 2>/dev/null; echo "0 2 * * * /opt/clickhouse-maintenance.sh") | sudo crontab -
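Because events is partitioned directly by its date column, partition values take the date's own form (e.g. '2024-01-15'), and the maintenance script computes the 91-day-old value with `date -d`. The expression can be sanity-checked on its own; this assumes GNU date, which ships on the apt/dnf systems this tutorial targets:

```shell
#!/usr/bin/env sh
# GNU date: compute the partition value for 91 days ago in ClickHouse Date form.
PART=$(date -u -d '91 days ago' '+%Y-%m-%d')
echo "$PART"
```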
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Kafka Connect fails to start | Missing ClickHouse connector plugin | Verify plugin.path in connect-distributed.properties and restart service |
| ClickHouse not receiving data | Kafka consumer group not active | Check consumer group status and restart materialized view: DETACH TABLE events_mv; ATTACH TABLE events_mv; |
| High memory usage in ClickHouse | Large result sets or inefficient queries | Implement query result limits and optimize table partitioning |
| Kafka topic lag increasing | ClickHouse ingestion slower than the producer rate | Increase kafka_num_consumers and add more ClickHouse replicas |
| ZooKeeper connection timeouts | Network latency or ZooKeeper overload | Increase zookeeper.connection.timeout.ms in Kafka config |
| Materialized view not updating | Kafka engine table not consuming messages | Check Kafka connectivity: SELECT * FROM system.kafka_consumers; |
Next steps
- Configure ZooKeeper for ClickHouse replication with multi-node cluster setup to scale your analytics cluster
- Configure Kafka Connect for database integration with JDBC connectors and CDC for additional data sources
- Set up Prometheus and Grafana monitoring stack with Docker compose for comprehensive pipeline monitoring
- Configure Kafka Schema Registry with Avro serialization for data governance
- Implement ClickHouse backup automation with compression and S3 integration
Automated install script
Run this to automate the entire setup
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Global variables
TOTAL_STEPS=8
CLICKHOUSE_USER="analytics"
CLICKHOUSE_PASSWORD=""
# Cleanup function
cleanup() {
echo -e "${RED}[ERROR] Installation failed. Cleaning up...${NC}"
systemctl stop clickhouse-server zookeeper kafka 2>/dev/null || true
exit 1
}
trap cleanup ERR
usage() {
echo "Usage: $0 [OPTIONS]"
echo "Options:"
echo " --clickhouse-password PASSWORD Set ClickHouse analytics user password (optional)"
echo " -h, --help Show this help message"
exit 1
}
# Parse arguments
while [[ $# -gt 0 ]]; do
case $1 in
--clickhouse-password)
CLICKHOUSE_PASSWORD="$2"
shift 2
;;
-h|--help)
usage
;;
*)
echo -e "${RED}Unknown option: $1${NC}"
usage
;;
esac
done
# Check if running as root or with sudo
if [[ $EUID -ne 0 ]]; then
echo -e "${RED}This script must be run as root or with sudo${NC}"
exit 1
fi
# Auto-detect distribution
if [ -f /etc/os-release ]; then
. /etc/os-release
case "$ID" in
ubuntu|debian)
PKG_MGR="apt"
PKG_UPDATE="apt update && apt upgrade -y"
PKG_INSTALL="apt install -y"
JAVA_PKG="openjdk-11-jdk"
;;
almalinux|rocky|centos|rhel|ol)
PKG_MGR="dnf"
PKG_UPDATE="dnf update -y"
PKG_INSTALL="dnf install -y"
JAVA_PKG="java-11-openjdk java-11-openjdk-devel"
;;
fedora)
PKG_MGR="dnf"
PKG_UPDATE="dnf update -y"
PKG_INSTALL="dnf install -y"
JAVA_PKG="java-11-openjdk java-11-openjdk-devel"
;;
amzn)
PKG_MGR="yum"
PKG_UPDATE="yum update -y"
PKG_INSTALL="yum install -y"
JAVA_PKG="java-11-openjdk java-11-openjdk-devel"
;;
*)
echo -e "${RED}Unsupported distribution: $ID${NC}"
exit 1
;;
esac
else
echo -e "${RED}Cannot detect distribution. /etc/os-release not found.${NC}"
exit 1
fi
echo -e "${GREEN}[1/$TOTAL_STEPS] Updating system packages...${NC}"
eval "$PKG_UPDATE"
if [ "$PKG_MGR" = "apt" ]; then
$PKG_INSTALL curl wget gnupg software-properties-common apt-transport-https ca-certificates
else
$PKG_INSTALL curl wget gnupg2 ca-certificates yum-utils
fi
echo -e "${GREEN}[2/$TOTAL_STEPS] Installing Java for Kafka...${NC}"
$PKG_INSTALL $JAVA_PKG
java -version
echo -e "${GREEN}[3/$TOTAL_STEPS] Installing ClickHouse...${NC}"
if [ "$PKG_MGR" = "apt" ]; then
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" > /etc/apt/sources.list.d/clickhouse.list
apt update
$PKG_INSTALL clickhouse-server clickhouse-client
else
yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo || \
dnf config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
$PKG_INSTALL clickhouse-server clickhouse-client
fi
echo -e "${GREEN}[4/$TOTAL_STEPS] Configuring ClickHouse...${NC}"
cat > /etc/clickhouse-server/config.xml << 'EOF'
<?xml version="1.0"?>
<clickhouse>
<logger>
<level>information</level>
<log>/var/log/clickhouse-server/clickhouse-server.log</log>
<errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
<size>1000M</size>
<count>10</count>
</logger>
<http_port>8123</http_port>
<tcp_port>9000</tcp_port>
<mysql_port>9004</mysql_port>
<postgresql_port>9005</postgresql_port>
<listen_host>0.0.0.0</listen_host>
<max_connections>4096</max_connections>
<keep_alive_timeout>3</keep_alive_timeout>
<max_concurrent_queries>100</max_concurrent_queries>
<uncompressed_cache_size>8589934592</uncompressed_cache_size>
<mark_cache_size>5368709120</mark_cache_size>
<path>/var/lib/clickhouse/</path>
<tmp_path>/var/lib/clickhouse/tmp/</tmp_path>
<users_config>users.xml</users_config>
<default_profile>default</default_profile>
<default_database>default</default_database>
<timezone>UTC</timezone>
<mlock_executable>true</mlock_executable>
</clickhouse>
EOF
chown clickhouse:clickhouse /etc/clickhouse-server/config.xml
chmod 644 /etc/clickhouse-server/config.xml
echo -e "${GREEN}[5/$TOTAL_STEPS] Configuring ClickHouse users...${NC}"
CLICKHOUSE_PASSWORD_HASH=""
if [ -n "$CLICKHOUSE_PASSWORD" ]; then
CLICKHOUSE_PASSWORD_HASH=$(echo -n "$CLICKHOUSE_PASSWORD" | sha256sum | awk '{print $1}')
fi
cat > /etc/clickhouse-server/users.xml << EOF
<?xml version="1.0"?>
<clickhouse>
<users>
<default>
<password></password>
<networks incl="networks" replace="replace">
<ip>::1</ip>
<ip>127.0.0.1</ip>
<ip>10.0.0.0/8</ip>
<ip>172.16.0.0/12</ip>
<ip>192.168.0.0/16</ip>
</networks>
<profile>default</profile>
<quota>default</quota>
</default>
<${CLICKHOUSE_USER}>
<password_sha256_hex>${CLICKHOUSE_PASSWORD_HASH}</password_sha256_hex>
<networks>
<ip>127.0.0.1</ip>
<ip>10.0.0.0/8</ip>
</networks>
<profile>analytics</profile>
<quota>default</quota>
</${CLICKHOUSE_USER}>
</users>
<profiles>
<default>
<max_memory_usage>10000000000</max_memory_usage>
<use_uncompressed_cache>0</use_uncompressed_cache>
<load_balancing>random</load_balancing>
</default>
<analytics>
<max_memory_usage>20000000000</max_memory_usage>
<max_bytes_before_external_group_by>20000000000</max_bytes_before_external_group_by>
<max_bytes_before_external_sort>20000000000</max_bytes_before_external_sort>
<max_query_size>1000000000</max_query_size>
<max_ast_elements>1000000</max_ast_elements>
<readonly>0</readonly>
</analytics>
</profiles>
<quotas>
<default>
<interval>
<duration>3600</duration>
<queries>0</queries>
<errors>0</errors>
<result_rows>0</result_rows>
<read_rows>0</read_rows>
<execution_time>0</execution_time>
</interval>
</default>
</quotas>
</clickhouse>
EOF
chown clickhouse:clickhouse /etc/clickhouse-server/users.xml
chmod 644 /etc/clickhouse-server/users.xml
echo -e "${GREEN}[6/$TOTAL_STEPS] Installing Apache ZooKeeper...${NC}"
cd /opt
wget -q https://archive.apache.org/dist/zookeeper/zookeeper-3.8.3/apache-zookeeper-3.8.3-bin.tar.gz
tar -xzf apache-zookeeper-3.8.3-bin.tar.gz
mv apache-zookeeper-3.8.3-bin zookeeper
rm apache-zookeeper-3.8.3-bin.tar.gz
chown -R root:root /opt/zookeeper
chmod -R 755 /opt/zookeeper
# Configure ZooKeeper
mkdir -p /var/lib/zookeeper /var/log/zookeeper
echo "1" > /var/lib/zookeeper/myid
chown -R root:root /var/lib/zookeeper /var/log/zookeeper
chmod -R 755 /var/lib/zookeeper /var/log/zookeeper
cat > /opt/zookeeper/conf/zoo.cfg << 'EOF'
tickTime=2000
dataDir=/var/lib/zookeeper
dataLogDir=/var/log/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
EOF
echo -e "${GREEN}[7/$TOTAL_STEPS] Installing Apache Kafka...${NC}"
cd /opt
wget -q https://archive.apache.org/dist/kafka/2.13-3.6.1/kafka_2.13-3.6.1.tgz
tar -xzf kafka_2.13-3.6.1.tgz
mv kafka_2.13-3.6.1 kafka
rm kafka_2.13-3.6.1.tgz
chown -R root:root /opt/kafka
chmod -R 755 /opt/kafka
# Configure Kafka
mkdir -p /var/lib/kafka-logs
chown -R root:root /var/lib/kafka-logs
chmod -R 755 /var/lib/kafka-logs
# Create systemd services
cat > /etc/systemd/system/zookeeper.service << 'EOF'
[Unit]
Description=Apache ZooKeeper
After=network.target
[Service]
Type=forking
User=root
Group=root
ExecStart=/opt/zookeeper/bin/zkServer.sh start
ExecStop=/opt/zookeeper/bin/zkServer.sh stop
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
cat > /etc/systemd/system/kafka.service << 'EOF'
[Unit]
Description=Apache Kafka
After=zookeeper.service
Requires=zookeeper.service
[Service]
Type=simple
User=root
Group=root
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
echo -e "${GREEN}[8/$TOTAL_STEPS] Starting services and verification...${NC}"
systemctl daemon-reload
systemctl enable clickhouse-server zookeeper kafka
systemctl start clickhouse-server
sleep 5
systemctl start zookeeper
sleep 10
systemctl start kafka
sleep 10
# Verify services
if systemctl is-active --quiet clickhouse-server; then
echo -e "${GREEN}✓ ClickHouse is running${NC}"
else
echo -e "${RED}✗ ClickHouse failed to start${NC}"
exit 1
fi
if systemctl is-active --quiet zookeeper; then
echo -e "${GREEN}✓ ZooKeeper is running${NC}"
else
echo -e "${RED}✗ ZooKeeper failed to start${NC}"
exit 1
fi
if systemctl is-active --quiet kafka; then
echo -e "${GREEN}✓ Kafka is running${NC}"
else
echo -e "${RED}✗ Kafka failed to start${NC}"
exit 1
fi
echo -e "${GREEN}Installation completed successfully!${NC}"
echo -e "${YELLOW}ClickHouse HTTP interface: http://localhost:8123${NC}"
echo -e "${YELLOW}ClickHouse TCP port: 9000${NC}"
echo -e "${YELLOW}Kafka broker: localhost:9092${NC}"
echo -e "${YELLOW}ZooKeeper: localhost:2181${NC}"
if [ -n "$CLICKHOUSE_PASSWORD" ]; then
echo -e "${YELLOW}ClickHouse analytics user: $CLICKHOUSE_USER (password set)${NC}"
else
echo -e "${YELLOW}ClickHouse analytics user: $CLICKHOUSE_USER (no password)${NC}"
fi
Review the script before running. Execute with: bash install.sh