Build a production-ready real-time analytics pipeline using Apache Spark 3.5 streaming, Kafka for data ingestion, and MinIO for distributed object storage. This tutorial covers fault-tolerant streaming configurations and end-to-end pipeline implementation.
Prerequisites
- Root or sudo access
- At least 8GB RAM
- 50GB available disk space
- Network connectivity for downloads
What this solves
This tutorial implements a complete real-time analytics platform using Apache Spark 3.5 streaming with Kafka integration and MinIO object storage. You'll build a fault-tolerant pipeline that ingests streaming data from Kafka, processes it with Spark, and stores results in MinIO for analytics workloads. This setup handles high-throughput data streams with exactly-once processing guarantees.
Step-by-step installation
Update system packages and install Java
Spark 3.5 runs on Java 8, 11, or 17; this guide uses OpenJDK 11. Update your system packages and install it first.
sudo apt update && sudo apt upgrade -y
sudo apt install -y openjdk-11-jdk wget curl unzip
Create service accounts and directories
Create dedicated users for Spark, Kafka, and MinIO services with proper directory structure.
sudo useradd -r -s /bin/false spark
sudo useradd -r -s /bin/false kafka
sudo useradd -r -s /bin/false minio-user
sudo mkdir -p /opt/{spark,kafka,minio}
sudo mkdir -p /var/log/{spark,kafka,minio}
sudo mkdir -p /data/{kafka-logs,minio}
Install Apache Spark 3.5
Download and install Spark 3.5 with Hadoop support for distributed processing capabilities.
cd /opt
sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
sudo tar -xzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 spark/current
sudo chown -R spark:spark /opt/spark
sudo chown -R spark:spark /var/log/spark
sudo rm spark-3.5.0-bin-hadoop3.tgz
Install Apache Kafka
Install Kafka for real-time data streaming and message queuing between producers and Spark consumers.
cd /opt
sudo wget https://archive.apache.org/dist/kafka/3.6.1/kafka_2.13-3.6.1.tgz
sudo tar -xzf kafka_2.13-3.6.1.tgz
sudo mv kafka_2.13-3.6.1 kafka/current
sudo chown -R kafka:kafka /opt/kafka
sudo chown -R kafka:kafka /var/log/kafka
sudo chown -R kafka:kafka /data/kafka-logs
sudo rm kafka_2.13-3.6.1.tgz
Install MinIO object storage
Download and install MinIO for S3-compatible distributed object storage.
sudo wget https://dl.min.io/server/minio/release/linux-amd64/minio -O /opt/minio/minio
sudo chmod +x /opt/minio/minio
sudo chown minio-user:minio-user /opt/minio/minio
sudo chown -R minio-user:minio-user /data/minio
sudo chown -R minio-user:minio-user /var/log/minio
Configure environment variables
Append the following to /etc/environment so every service and shell picks up JAVA_HOME and the updated PATH:
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
SPARK_HOME=/opt/spark/current
KAFKA_HOME=/opt/kafka/current
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/spark/current/bin:/opt/spark/current/sbin:/opt/kafka/current/bin"
Reload it in the current shell (new login sessions pick it up automatically):
source /etc/environment
Configure Kafka server properties
Edit /opt/kafka/current/config/server.properties with logging, retention, and performance settings suited to a single-broker deployment:
broker.id=1
listeners=PLAINTEXT://0.0.0.0:9092
log.dirs=/data/kafka-logs
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
num.partitions=3
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=0
Configure ZooKeeper for Kafka
Create /opt/kafka/current/config/zookeeper.properties for Kafka cluster coordination and metadata management:
dataDir=/data/kafka-logs/zookeeper
clientPort=2181
maxClientCnxns=0
admin.enableServer=false
tickTime=2000
initLimit=10
syncLimit=5
sudo mkdir -p /data/kafka-logs/zookeeper
sudo chown -R kafka:kafka /data/kafka-logs/zookeeper
Configure Spark for streaming
Create /opt/spark/current/conf/spark-defaults.conf with settings tuned for streaming workloads, Kafka integration, and S3A access to MinIO:
spark.master local[*]
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.streaming.stopGracefullyOnShutdown true
spark.streaming.backpressure.enabled true
spark.streaming.kafka.consumer.poll.ms 512
spark.streaming.kafka.maxRatePerPartition 1000
spark.streaming.receiver.maxRate 1000
spark.sql.streaming.metricsEnabled true
spark.sql.streaming.noDataMicroBatches.enabled false
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint http://localhost:9000
spark.hadoop.fs.s3a.access.key minioadmin
spark.hadoop.fs.s3a.secret.key minioadmin123
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.connection.ssl.enabled false
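These defaults apply to every job submitted from this node, but individual applications can still override them at runtime. As a quick illustration (the app name and value here are arbitrary), options passed to SparkSession.builder take precedence over spark-defaults.conf:
from pyspark.sql import SparkSession

# Values passed to .config() override the same keys in spark-defaults.conf
spark = SparkSession.builder \
    .appName("ConfigOverrideExample") \
    .config("spark.streaming.kafka.maxRatePerPartition", "5000") \
    .getOrCreate()

# Confirm the effective value for this session
print(spark.conf.get("spark.streaming.kafka.maxRatePerPartition"))
spark.stop()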
Download Kafka and S3 connectors for Spark
Download the JAR files Spark needs for Kafka integration and MinIO S3 connectivity, including the transitive dependencies of the Kafka connector (token provider and commons-pool2) that are easy to miss when copying JARs by hand.
cd /opt/spark/current/jars
sudo wget https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.5.0/spark-sql-kafka-0-10_2.12-3.5.0.jar
sudo wget https://repo1.maven.org/maven2/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.5.0/spark-token-provider-kafka-0-10_2.12-3.5.0.jar
sudo wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.6.1/kafka-clients-3.6.1.jar
sudo wget https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.11.1/commons-pool2-2.11.1.jar
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar
sudo chown spark:spark *.jar
Create systemd service for ZooKeeper
Create /etc/systemd/system/zookeeper.service so ZooKeeper starts automatically and is managed by systemd.
[Unit]
Description=Apache ZooKeeper
Requires=network.target
After=network.target
[Service]
Type=simple
User=kafka
Group=kafka
ExecStart=/opt/kafka/current/bin/zookeeper-server-start.sh /opt/kafka/current/config/zookeeper.properties
ExecStop=/opt/kafka/current/bin/zookeeper-server-stop.sh
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=zookeeper
[Install]
WantedBy=multi-user.target
Create systemd service for Kafka
Create /etc/systemd/system/kafka.service with a dependency on the ZooKeeper service.
[Unit]
Description=Apache Kafka
Requires=zookeeper.service
After=zookeeper.service
[Service]
Type=simple
User=kafka
Group=kafka
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ExecStart=/opt/kafka/current/bin/kafka-server-start.sh /opt/kafka/current/config/server.properties
ExecStop=/opt/kafka/current/bin/kafka-server-stop.sh
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=kafka
[Install]
WantedBy=multi-user.target
Create systemd service for MinIO
Create /etc/systemd/system/minio.service for the object storage server. Change the default credentials below before exposing this to anything beyond a lab environment.
[Unit]
Description=MinIO Object Storage
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=minio-user
Group=minio-user
Environment=MINIO_ROOT_USER=minioadmin
Environment=MINIO_ROOT_PASSWORD=minioadmin123
ExecStart=/opt/minio/minio server /data/minio --address :9000 --console-address :9001
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=minio
[Install]
WantedBy=multi-user.target
Start and enable all services
Enable and start all services in the correct order with ZooKeeper first, then Kafka, then MinIO.
sudo systemctl daemon-reload
sudo systemctl enable --now zookeeper
sleep 10
sudo systemctl enable --now kafka
sudo systemctl enable --now minio
sudo systemctl status zookeeper kafka minio
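If you prefer a scriptable check over reading systemctl output, a minimal Python sketch (ports taken from the configurations above) confirms each service is accepting connections:
import socket

# Default ports from the configs above: ZooKeeper, Kafka, MinIO API, MinIO console
services = [("ZooKeeper", 2181), ("Kafka", 9092), ("MinIO API", 9000), ("MinIO console", 9001)]
for name, port in services:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(2)
        state = "open" if s.connect_ex(("localhost", port)) == 0 else "CLOSED"
    print(f"{name:13s} port {port}: {state}")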
Create Kafka topic for streaming data
Create a Kafka topic with multiple partitions for high-throughput streaming data ingestion.
/opt/kafka/current/bin/kafka-topics.sh --create --topic streaming-data --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
/opt/kafka/current/bin/kafka-topics.sh --describe --topic streaming-data --bootstrap-server localhost:9092
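Before wiring up Spark, it's worth a quick round-trip test against the new topic. A minimal sketch using the kafka-python package (install it with pip install kafka-python; the producer script later in this guide uses it too):
import json
from kafka import KafkaProducer, KafkaConsumer

# Send one test record to the topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8")
)
producer.send("streaming-data", {"event_type": "smoke-test", "value": 1})
producer.flush()
producer.close()

# Read it back; consumer_timeout_ms stops the loop if nothing arrives in 5s
consumer = KafkaConsumer(
    "streaming-data",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000
)
for record in consumer:
    print("Received:", record.value)
    break
consumer.close()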
Create MinIO bucket for analytics data
Create a bucket in MinIO for storing processed streaming data and analytics results.
mc is the MinIO client; download it directly from MinIO (the mc package in most distribution repositories is Midnight Commander, not the MinIO client).
sudo wget https://dl.min.io/client/mc/release/linux-amd64/mc -O /usr/local/bin/mc
sudo chmod +x /usr/local/bin/mc
mc alias set myminio http://localhost:9000 minioadmin minioadmin123
mc mb myminio/analytics-data
mc anonymous set public myminio/analytics-data
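With the bucket in place you can verify that Spark reaches MinIO through the S3A connector before running the full streaming job. A small write/read sketch (run it with spark-submit so the JARs installed earlier are on the classpath; the s3a settings are repeated inline so the file is self-contained):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MinIOSmokeTest") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()

# Write a one-row DataFrame to the bucket and read it back
df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.write.mode("overwrite").parquet("s3a://analytics-data/smoke-test/")
spark.read.parquet("s3a://analytics-data/smoke-test/").show()
spark.stop()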
Create sample streaming application
Create /opt/spark/streaming_app.py, a PySpark Structured Streaming application that reads from Kafka, aggregates events in five-minute windows, and writes the results to MinIO.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, count, avg
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
# Create Spark session
spark = SparkSession.builder \
    .appName("KafkaStreamingToMinIO") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

# Define schema for incoming JSON data
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("value", IntegerType()),
    StructField("timestamp", TimestampType())
])

# Read from Kafka
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "streaming-data") \
    .option("startingOffsets", "latest") \
    .load()

# Parse JSON data
parsed_df = df.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")

# Perform windowed aggregation with a watermark for late data
aggregated_df = parsed_df \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(col("timestamp"), "5 minutes"),
        col("event_type")
    ) \
    .agg(
        count("*").alias("event_count"),
        avg("value").alias("avg_value")
    )

# Write to MinIO
query = aggregated_df.writeStream \
    .outputMode("append") \
    .format("parquet") \
    .option("path", "s3a://analytics-data/streaming-results/") \
    .option("checkpointLocation", "s3a://analytics-data/checkpoints/") \
    .trigger(processingTime='30 seconds') \
    .start()

query.awaitTermination()
sudo chown spark:spark /opt/spark/streaming_app.py
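Once the job has committed a few micro-batches you can inspect the aggregated output directly from MinIO. A minimal read-back sketch (submit it the same way as the streaming app so the S3A settings and JARs apply):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InspectResults").getOrCreate()

# The window column is a struct holding the start and end of each 5-minute window
results = spark.read.parquet("s3a://analytics-data/streaming-results/")
results.orderBy("window").show(20, truncate=False)
spark.stop()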
Create data producer script
Create /opt/kafka/producer.py, a Python script that generates sample events and sends them to Kafka so you can test the pipeline.
import json
import time
import random
from datetime import datetime
from kafka import KafkaProducer
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda x: json.dumps(x).encode('utf-8')
)

def generate_event():
    return {
        "user_id": f"user_{random.randint(1, 1000)}",
        "event_type": random.choice(["click", "view", "purchase", "signup"]),
        "value": random.randint(1, 100),
        "timestamp": datetime.now().isoformat()
    }

print("Starting data producer...")
try:
    while True:
        event = generate_event()
        producer.send('streaming-data', event)
        print(f"Sent: {event}")
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    producer.close()
sudo chown kafka:kafka /opt/kafka/producer.py
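The producer above lets Kafka assign partitions round-robin. If downstream consumers need per-user ordering, you can key each message by user_id so all of a user's events land on the same partition; a sketch of that variant against the same topic:
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8")
)

event = {"user_id": "user_42", "event_type": "click", "value": 7}
# Kafka hashes the key to choose a partition, preserving per-user ordering
producer.send("streaming-data", key=event["user_id"], value=event)
producer.flush()
producer.close()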
Install the kafka-python client before running the producer: pip install kafka-python
Verify your setup
Test the complete streaming pipeline by starting the producer and Spark streaming application.
Check service status
sudo systemctl status zookeeper kafka minio
Test Kafka connectivity
/opt/kafka/current/bin/kafka-console-consumer.sh --topic streaming-data --from-beginning --bootstrap-server localhost:9092 --max-messages 5
Check MinIO access
mc ls myminio/
Run the streaming application
sudo -u spark /opt/spark/current/bin/spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
/opt/spark/streaming_app.py
In another terminal, start the data producer
python3 /opt/kafka/producer.py
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Kafka connection refused | ZooKeeper not started first | sudo systemctl start zookeeper then sudo systemctl start kafka |
| Spark can't write to MinIO | S3A credentials misconfigured | Check spark.hadoop.fs.s3a.* settings in spark-defaults.conf |
| OutOfMemory errors in Spark | Insufficient memory allocation | Add spark.executor.memory 4g and spark.driver.memory 2g to spark-defaults.conf |
| MinIO access denied | Bucket permissions or credentials | mc anonymous set public myminio/analytics-data and verify credentials |
| Kafka consumer lag | Slow Spark processing | Increase spark.streaming.kafka.maxRatePerPartition or add more Kafka partitions |
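For the consumer-lag row above, note that Structured Streaming tracks its progress in the checkpoint rather than committing offsets to Kafka, so the usual kafka-consumer-groups.sh view can be empty. A simple sketch that instead watches how fast the topic's end offsets grow (the delta per interval is your ingest rate):
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partitions = [TopicPartition("streaming-data", p)
              for p in consumer.partitions_for_topic("streaming-data")]

# Sample per-partition end offsets every 5 seconds for 30 seconds
for _ in range(6):
    end_offsets = consumer.end_offsets(partitions)
    print({f"partition-{tp.partition}": offset for tp, offset in end_offsets.items()})
    time.sleep(5)
consumer.close()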
Next steps
- Configure MinIO with Apache Spark 3.5 for big data analytics and object storage integration
- Install and configure Apache Kafka with cluster setup and monitoring
- Implement Spark cluster with YARN and HDFS for distributed computing
- Configure Kafka Streams for real-time data processing and analytics
- Set up a Prometheus and Grafana monitoring stack with Docker Compose
Automated install script
Save the script below as install.sh to automate the entire setup.
#!/usr/bin/env bash
set -euo pipefail
# Spark streaming with Kafka and MinIO installer
# Production-ready installation script
# Color codes
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# Global variables
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
LOG_FILE="/tmp/spark-kafka-minio-install.log"
JAVA_HOME=""
SPARK_VERSION="3.5.0"
SCALA_VERSION="2.13"
KAFKA_VERSION="3.6.1"
# Usage function
usage() {
echo "Usage: $0 [OPTIONS]"
echo "Options:"
echo " -h, --help Show this help message"
echo " --minio-user MinIO admin username (default: admin)"
echo " --minio-pass MinIO admin password (default: minioadmin)"
exit 1
}
# Parse arguments
MINIO_USER="admin"
MINIO_PASS="minioadmin"
while [[ $# -gt 0 ]]; do
case $1 in
-h|--help) usage ;;
--minio-user) MINIO_USER="$2"; shift 2 ;;
--minio-pass) MINIO_PASS="$2"; shift 2 ;;
*) echo -e "${RED}Unknown option: $1${NC}"; usage ;;
esac
done
# Logging function
log() {
echo -e "$1" | tee -a "$LOG_FILE"
}
# Error handling and cleanup
cleanup() {
log "${YELLOW}Cleaning up temporary files...${NC}"
cd /tmp
rm -f spark-*.tgz kafka_*.tgz
}
error_exit() {
log "${RED}Error on line $1. Exiting...${NC}"
cleanup
exit 1
}
trap 'error_exit $LINENO' ERR
# Check if running as root
check_root() {
if [[ $EUID -ne 0 ]]; then
log "${RED}This script must be run as root${NC}"
exit 1
fi
}
# Detect distribution
detect_distro() {
if [ -f /etc/os-release ]; then
. /etc/os-release
case "$ID" in
ubuntu|debian)
PKG_MGR="apt"
PKG_UPDATE="apt update && apt upgrade -y"
PKG_INSTALL="apt install -y"
JAVA_PKG="openjdk-11-jdk"
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
;;
almalinux|rocky|centos|rhel|ol)
PKG_MGR="dnf"
PKG_UPDATE="dnf update -y"
PKG_INSTALL="dnf install -y"
JAVA_PKG="java-11-openjdk-devel"
JAVA_HOME="/usr/lib/jvm/java-11-openjdk"
;;
fedora)
PKG_MGR="dnf"
PKG_UPDATE="dnf update -y"
PKG_INSTALL="dnf install -y"
JAVA_PKG="java-11-openjdk-devel"
JAVA_HOME="/usr/lib/jvm/java-11-openjdk"
;;
amzn)
PKG_MGR="yum"
PKG_UPDATE="yum update -y"
PKG_INSTALL="yum install -y"
JAVA_PKG="java-11-openjdk-devel"
JAVA_HOME="/usr/lib/jvm/java-11-openjdk"
;;
*)
log "${RED}Unsupported distribution: $ID${NC}"
exit 1
;;
esac
else
log "${RED}Cannot detect distribution${NC}"
exit 1
fi
}
# Main installation function
main() {
log "${GREEN}Starting Spark streaming with Kafka and MinIO installation...${NC}"
check_root
detect_distro
# [1/8] Update system and install dependencies
log "${GREEN}[1/8] Updating system packages and installing Java...${NC}"
$PKG_UPDATE
$PKG_INSTALL $JAVA_PKG wget curl unzip tar gzip
# [2/8] Create service accounts and directories
log "${GREEN}[2/8] Creating service accounts and directories...${NC}"
useradd -r -s /bin/false spark || true
useradd -r -s /bin/false kafka || true
useradd -r -s /bin/false minio-user || true
mkdir -p /opt/{spark,kafka,minio}
mkdir -p /var/log/{spark,kafka,minio}
mkdir -p /data/{kafka-logs,minio}
# [3/8] Install Apache Spark
log "${GREEN}[3/8] Installing Apache Spark ${SPARK_VERSION}...${NC}"
cd /tmp
wget -q "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz"
tar -xzf "spark-${SPARK_VERSION}-bin-hadoop3.tgz"
rm -rf /opt/spark/current
mv "spark-${SPARK_VERSION}-bin-hadoop3" /opt/spark/current
chown -R spark:spark /opt/spark
chown -R spark:spark /var/log/spark
# [4/8] Install Apache Kafka
log "${GREEN}[4/8] Installing Apache Kafka ${KAFKA_VERSION}...${NC}"
wget -q "https://downloads.apache.org/kafka/kafka_${KAFKA_VERSION}.tgz"
tar -xzf "kafka_${KAFKA_VERSION}.tgz"
rm -rf /opt/kafka/current
mv "kafka_${KAFKA_VERSION}" /opt/kafka/current
chown -R kafka:kafka /opt/kafka
chown -R kafka:kafka /var/log/kafka
chown -R kafka:kafka /data/kafka-logs
# [5/8] Install MinIO
log "${GREEN}[5/8] Installing MinIO object storage...${NC}"
wget -q https://dl.min.io/server/minio/release/linux-amd64/minio -O /opt/minio/minio
chmod 755 /opt/minio/minio
chown minio-user:minio-user /opt/minio/minio
chown -R minio-user:minio-user /data/minio
chown -R minio-user:minio-user /var/log/minio
# [6/8] Configure environment variables
log "${GREEN}[6/8] Configuring environment variables...${NC}"
cat > /etc/environment << EOF
JAVA_HOME=${JAVA_HOME}
SPARK_HOME=/opt/spark/current
KAFKA_HOME=/opt/kafka/current
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/spark/current/bin:/opt/spark/current/sbin:/opt/kafka/current/bin"
EOF
# [7/8] Configure services
log "${GREEN}[7/8] Configuring Kafka and Spark...${NC}"
# Kafka server configuration
cat > /opt/kafka/current/config/server.properties << 'EOF'
broker.id=1
listeners=PLAINTEXT://0.0.0.0:9092
log.dirs=/data/kafka-logs
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
num.partitions=3
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=0
EOF
# ZooKeeper configuration
cat > /opt/kafka/current/config/zookeeper.properties << 'EOF'
dataDir=/data/kafka-logs/zookeeper
clientPort=2181
maxClientCnxns=0
admin.enableServer=false
tickTime=2000
initLimit=10
syncLimit=5
EOF
mkdir -p /data/kafka-logs/zookeeper
chown -R kafka:kafka /data/kafka-logs/zookeeper
# Spark defaults configuration
cat > /opt/spark/current/conf/spark-defaults.conf << EOF
spark.master local[*]
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.streaming.kafka.maxRatePerPartition 10000
spark.streaming.backpressure.enabled true
spark.streaming.receiver.maxRate 10000
spark.streaming.kafka.consumer.cache.enabled true
spark.sql.streaming.checkpointLocation /data/spark-checkpoints
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint http://localhost:9000
spark.hadoop.fs.s3a.access.key ${MINIO_USER}
spark.hadoop.fs.s3a.secret.key ${MINIO_PASS}
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.connection.ssl.enabled false
EOF
mkdir -p /data/spark-checkpoints
chown -R spark:spark /data/spark-checkpoints
# Create systemd services
create_systemd_services
# [8/8] Start services
log "${GREEN}[8/8] Starting services...${NC}"
systemctl daemon-reload
systemctl enable zookeeper kafka minio
systemctl start zookeeper
sleep 5
systemctl start kafka
systemctl start minio
cleanup
verify_installation
log "${GREEN}Installation completed successfully!${NC}"
log "${YELLOW}MinIO Console: http://localhost:9001 (${MINIO_USER}/${MINIO_PASS})${NC}"
log "${YELLOW}Kafka: localhost:9092${NC}"
}
# Create systemd service files
create_systemd_services() {
# ZooKeeper service
cat > /etc/systemd/system/zookeeper.service << 'EOF'
[Unit]
Description=Apache ZooKeeper
Requires=network.target
After=network.target
[Service]
Type=simple
User=kafka
Group=kafka
ExecStart=/opt/kafka/current/bin/zookeeper-server-start.sh /opt/kafka/current/config/zookeeper.properties
ExecStop=/opt/kafka/current/bin/zookeeper-server-stop.sh
Restart=on-abnormal
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
EOF
# Kafka service
cat > /etc/systemd/system/kafka.service << EOF
[Unit]
Description=Apache Kafka
Requires=zookeeper.service
After=zookeeper.service
[Service]
Type=simple
User=kafka
Group=kafka
Environment="JAVA_HOME=${JAVA_HOME}"
ExecStart=/opt/kafka/current/bin/kafka-server-start.sh /opt/kafka/current/config/server.properties
ExecStop=/opt/kafka/current/bin/kafka-server-stop.sh
Restart=on-abnormal
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
EOF
# MinIO service
cat > /etc/systemd/system/minio.service << EOF
[Unit]
Description=MinIO Object Storage
After=network.target
[Service]
Type=simple
User=minio-user
Group=minio-user
Environment="MINIO_ROOT_USER=${MINIO_USER}"
Environment="MINIO_ROOT_PASSWORD=${MINIO_PASS}"
ExecStart=/opt/minio/minio server /data/minio --console-address ":9001"
Restart=always
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
EOF
}
# Verify installation
verify_installation() {
log "${GREEN}Verifying installation...${NC}"
# Check Java
java -version || log "${RED}Java verification failed${NC}"
# Check service status
systemctl is-active zookeeper kafka minio || log "${YELLOW}Some services may not be running${NC}"
# Check ports
sleep 10
ss -tuln | grep -E ':(2181|9092|9000|9001)' || log "${YELLOW}Some ports may not be listening${NC}"
log "${GREEN}Verification completed${NC}"
}
# Run main function
main "$@"
Review the script before running. It must run as root; execute with: sudo bash install.sh