Implement Spark streaming with Kafka and MinIO for real-time analytics and big data processing

Advanced · 45 min · Apr 02, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Build a production-ready real-time analytics pipeline using Apache Spark 3.5 streaming, Kafka for data ingestion, and MinIO for distributed object storage. This tutorial covers fault-tolerant streaming configurations and end-to-end pipeline implementation.

Prerequisites

  • Root or sudo access
  • At least 8GB RAM
  • 50GB available disk space
  • Network connectivity for downloads

What this solves

This tutorial implements a complete real-time analytics platform using Apache Spark 3.5 streaming with Kafka integration and MinIO object storage. You'll build a fault-tolerant pipeline that ingests streaming data from Kafka, processes it with Spark, and stores results in MinIO for analytics workloads. This setup handles high-throughput data streams with exactly-once processing guarantees.

Step-by-step installation

Update system packages and install Java

Spark requires Java 11 or higher. Install OpenJDK 11 and update your system packages first.

# Debian/Ubuntu
sudo apt update && sudo apt upgrade -y
sudo apt install -y openjdk-11-jdk wget curl unzip

# AlmaLinux/Rocky
sudo dnf update -y
sudo dnf install -y java-11-openjdk-devel wget curl unzip

Create service accounts and directories

Create dedicated users for Spark, Kafka, and MinIO services with proper directory structure.

sudo useradd -r -s /bin/false spark
sudo useradd -r -s /bin/false kafka
sudo useradd -r -s /bin/false minio-user
sudo mkdir -p /opt/{spark,kafka,minio}
sudo mkdir -p /var/log/{spark,kafka,minio}
sudo mkdir -p /data/{kafka-logs,minio}

Install Apache Spark 3.5

Download and install Spark 3.5 with Hadoop support for distributed processing capabilities.

cd /opt
sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
sudo tar -xzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 spark/current
sudo chown -R spark:spark /opt/spark
sudo chown -R spark:spark /var/log/spark
sudo rm spark-3.5.0-bin-hadoop3.tgz

Install Apache Kafka

Install Kafka for real-time data streaming and message queuing between producers and Spark consumers.

cd /opt
sudo wget https://archive.apache.org/dist/kafka/3.6.1/kafka_2.13-3.6.1.tgz
sudo tar -xzf kafka_2.13-3.6.1.tgz
sudo mv kafka_2.13-3.6.1 kafka/current
sudo chown -R kafka:kafka /opt/kafka
sudo chown -R kafka:kafka /var/log/kafka
sudo chown -R kafka:kafka /data/kafka-logs
sudo rm kafka_2.13-3.6.1.tgz

Install MinIO object storage

Download and install MinIO for S3-compatible distributed object storage.

sudo wget https://dl.min.io/server/minio/release/linux-amd64/minio -O /opt/minio/minio
sudo chmod +x /opt/minio/minio
sudo chown minio-user:minio-user /opt/minio/minio
sudo chown -R minio-user:minio-user /data/minio
sudo chown -R minio-user:minio-user /var/log/minio

Configure environment variables

Add these entries to /etc/environment so all services share the same PATH and JAVA_HOME. On AlmaLinux/Rocky the JVM path is /usr/lib/jvm/java-11-openjdk instead.

JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
SPARK_HOME=/opt/spark/current
KAFKA_HOME=/opt/kafka/current
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/spark/current/bin:/opt/spark/current/sbin:/opt/kafka/current/bin"

Reload the variables in your current shell:

source /etc/environment

Configure Kafka server properties

Edit /opt/kafka/current/config/server.properties with proper logging, replication, and performance settings. The replication factors are set to 1 because this is a single-broker setup; raise them on a multi-broker cluster.

broker.id=1
listeners=PLAINTEXT://0.0.0.0:9092
log.dirs=/data/kafka-logs
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
num.partitions=3
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=0
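To sanity-check the raw byte and time values above, a throwaway Python sketch (not part of the pipeline) that converts them into human-readable units:

```python
# Convert raw server.properties values into human-readable units.
def human_size(num_bytes: int) -> str:
    """Render a byte count as GiB/MiB/KiB where it divides evenly."""
    for unit, factor in (("GiB", 1024**3), ("MiB", 1024**2), ("KiB", 1024)):
        if num_bytes % factor == 0:
            return f"{num_bytes // factor} {unit}"
    return f"{num_bytes} B"

print(human_size(1073741824))  # log.segment.bytes        -> 1 GiB
print(human_size(104857600))   # socket.request.max.bytes -> 100 MiB
print(168 // 24, "days")       # log.retention.hours      -> 7 days
```

So segments roll at 1 GiB and messages are retained for a week before deletion.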

Configure ZooKeeper for Kafka

Edit /opt/kafka/current/config/zookeeper.properties for Kafka cluster coordination and metadata management.

dataDir=/data/kafka-logs/zookeeper
clientPort=2181
maxClientCnxns=0
admin.enableServer=false
tickTime=2000
initLimit=10
syncLimit=5

Create the ZooKeeper data directory:

sudo mkdir -p /data/kafka-logs/zookeeper
sudo chown -R kafka:kafka /data/kafka-logs/zookeeper

Configure Spark for streaming

Edit /opt/spark/current/conf/spark-defaults.conf with optimized settings for streaming workloads and Kafka integration. The fs.s3a credentials must match the MinIO root user and password configured in the MinIO systemd unit.

spark.master local[*]
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.streaming.stopGracefullyOnShutdown true
spark.streaming.backpressure.enabled true
spark.streaming.kafka.consumer.poll.ms 512
spark.streaming.kafka.maxRatePerPartition 1000
spark.streaming.receiver.maxRate 1000
spark.sql.streaming.metricsEnabled true
spark.sql.streaming.noDataMicroBatches.enabled false
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint http://localhost:9000
spark.hadoop.fs.s3a.access.key minioadmin
spark.hadoop.fs.s3a.secret.key minioadmin123
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.connection.ssl.enabled false
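The same S3A settings can also be supplied per application instead of globally. A sketch of the conf pairs you would pass to SparkSession.builder (session creation is left as a comment so the snippet runs without a cluster; the defaults assume the MinIO root credentials used in this tutorial):

```python
def minio_s3a_conf(endpoint: str = "http://localhost:9000",
                   access_key: str = "minioadmin",
                   secret_key: str = "minioadmin123") -> dict:
    """S3A options equivalent to the spark-defaults.conf entries above."""
    return {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
    }

# Applied in an application like:
#   builder = SparkSession.builder.appName("my-app")
#   for k, v in minio_s3a_conf().items():
#       builder = builder.config(k, v)
conf = minio_s3a_conf()
print(conf["spark.hadoop.fs.s3a.endpoint"])
```

Per-application settings take precedence over spark-defaults.conf, which is useful when different jobs target different buckets or endpoints.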

Download Kafka and S3 connectors for Spark

Download required JAR files for Kafka integration and MinIO S3 connectivity in Spark applications.

cd /opt/spark/current/jars
sudo wget https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.5.0/spark-sql-kafka-0-10_2.12-3.5.0.jar
sudo wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.6.1/kafka-clients-3.6.1.jar
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar
sudo chown spark:spark *.jar

Create systemd service for ZooKeeper

Create a systemd service file for ZooKeeper to enable automatic startup and proper service management.

[Unit]
Description=Apache ZooKeeper
Requires=network.target
After=network.target

[Service]
Type=simple
User=kafka
Group=kafka
ExecStart=/opt/kafka/current/bin/zookeeper-server-start.sh /opt/kafka/current/config/zookeeper.properties
ExecStop=/opt/kafka/current/bin/zookeeper-server-stop.sh
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=zookeeper

[Install]
WantedBy=multi-user.target

Create systemd service for Kafka

Create a systemd service file for Kafka with proper dependencies on ZooKeeper. Adjust the JAVA_HOME path on AlmaLinux/Rocky (/usr/lib/jvm/java-11-openjdk).

[Unit]
Description=Apache Kafka
Requires=zookeeper.service
After=zookeeper.service

[Service]
Type=simple
User=kafka
Group=kafka
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ExecStart=/opt/kafka/current/bin/kafka-server-start.sh /opt/kafka/current/config/server.properties
ExecStop=/opt/kafka/current/bin/kafka-server-stop.sh
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=kafka

[Install]
WantedBy=multi-user.target

Create systemd service for MinIO

Create a systemd service file for MinIO object storage with proper security and performance settings.

[Unit]
Description=MinIO Object Storage
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=minio-user
Group=minio-user
Environment=MINIO_ROOT_USER=minioadmin
Environment=MINIO_ROOT_PASSWORD=minioadmin123
ExecStart=/opt/minio/minio server /data/minio --address :9000 --console-address :9001
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=minio

[Install]
WantedBy=multi-user.target

Start and enable all services

Enable and start all services in the correct order with ZooKeeper first, then Kafka, then MinIO.

sudo systemctl daemon-reload
sudo systemctl enable --now zookeeper
sleep 10
sudo systemctl enable --now kafka
sudo systemctl enable --now minio
sudo systemctl status zookeeper kafka minio

Create Kafka topic for streaming data

Create a Kafka topic with multiple partitions for high-throughput streaming data ingestion.

/opt/kafka/current/bin/kafka-topics.sh --create --topic streaming-data --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
/opt/kafka/current/bin/kafka-topics.sh --describe --topic streaming-data --bootstrap-server localhost:9092

Create MinIO bucket for analytics data

Create a bucket in MinIO for storing processed streaming data and analytics results. Install the MinIO client (mc) from the official release; the mc package in the distro repositories is Midnight Commander, not the MinIO client.

sudo wget https://dl.min.io/client/mc/release/linux-amd64/mc -O /usr/local/bin/mc
sudo chmod +x /usr/local/bin/mc
mc alias set myminio http://localhost:9000 minioadmin minioadmin123
mc mb myminio/analytics-data
mc anonymous set public myminio/analytics-data

Create sample streaming application

Create /opt/spark/streaming_app.py, a PySpark Structured Streaming application that reads from Kafka, aggregates events in time windows, and writes Parquet results to MinIO.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, count, avg
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

# Create Spark session
spark = (SparkSession.builder
         .appName("KafkaStreamingToMinIO")
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         .getOrCreate())
spark.sparkContext.setLogLevel("WARN")

# Define schema for incoming JSON data
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("value", IntegerType()),
    StructField("timestamp", TimestampType()),
])

# Read from Kafka
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "streaming-data")
      .option("startingOffsets", "latest")
      .load())

# Parse the JSON payload out of the Kafka message value
parsed_df = df.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")

# Windowed aggregation: 5-minute windows with a 10-minute watermark for late data
aggregated_df = (parsed_df
                 .withWatermark("timestamp", "10 minutes")
                 .groupBy(window(col("timestamp"), "5 minutes"), col("event_type"))
                 .agg(count("*").alias("event_count"), avg("value").alias("avg_value")))

# Write Parquet results to MinIO through the S3A connector
query = (aggregated_df.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "s3a://analytics-data/streaming-results/")
         .option("checkpointLocation", "s3a://analytics-data/checkpoints/")
         .trigger(processingTime="30 seconds")
         .start())
query.awaitTermination()

Set ownership so the spark user can run it:

sudo chown spark:spark /opt/spark/streaming_app.py
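To make the window-plus-watermark semantics concrete, here is a plain-Python model (no Spark) of how events are bucketed into 5-minute windows and how a 10-minute watermark drops late records. The function names are illustrative, not a Spark API:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
WATERMARK = timedelta(minutes=10)

def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 5-minute window, like window(col, "5 minutes")."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)

def accept(event_ts: datetime, max_seen_ts: datetime) -> bool:
    """Keep an event only if it is no older than max event time minus the watermark."""
    return event_ts >= max_seen_ts - WATERMARK

base = datetime(2026, 4, 2, 12, 0, 0)
assert window_start(base + timedelta(minutes=3)) == base           # 12:03 -> 12:00 window
assert window_start(base + timedelta(minutes=7)) == base + WINDOW  # 12:07 -> 12:05 window
assert accept(base, base + timedelta(minutes=9))       # 9 minutes late: still aggregated
assert not accept(base, base + timedelta(minutes=11))  # 11 minutes late: dropped
```

This is why the checkpointed state stays bounded: once the watermark passes a window's end, that window is finalized and written out, and older events are discarded.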

Create data producer script

Create /opt/kafka/producer.py, a Python script that generates sample streaming data and sends it to Kafka for testing the pipeline.

import json
import time
import random
from datetime import datetime
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda x: json.dumps(x).encode('utf-8')
)

def generate_event():
    return {
        "user_id": f"user_{random.randint(1, 1000)}",
        "event_type": random.choice(["click", "view", "purchase", "signup"]),
        "value": random.randint(1, 100),
        "timestamp": datetime.now().isoformat()
    }

print("Starting data producer...")
try:
    while True:
        event = generate_event()
        producer.send('streaming-data', event)
        print(f"Sent: {event}")
        time.sleep(1)
except KeyboardInterrupt:
    producer.close()

Set ownership of the script:

sudo chown kafka:kafka /opt/kafka/producer.py
Note: Install the kafka-python library with pip install kafka-python to run the producer script.
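The producer above sends unkeyed messages, which Kafka distributes round-robin across the topic's three partitions. Keying by user_id instead pins each user to one partition, preserving per-user ordering. A deterministic sketch of that mapping (md5 stands in for Kafka's murmur2 partitioner here, so the exact partition numbers differ from a real broker):

```python
import hashlib

NUM_PARTITIONS = 3  # matches the streaming-data topic created earlier

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a key to a partition (illustrative, not Kafka's exact hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition, so per-user ordering is preserved.
assert partition_for("user_42") == partition_for("user_42")
assert all(0 <= partition_for(f"user_{i}") < NUM_PARTITIONS for i in range(100))
```

With kafka-python you would get the real behavior by passing key=event["user_id"].encode() to producer.send() along with a key_serializer.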

Verify your setup

Test the complete streaming pipeline by starting the producer and Spark streaming application.

# Check service status
sudo systemctl status zookeeper kafka minio

Test Kafka connectivity

/opt/kafka/current/bin/kafka-console-consumer.sh --topic streaming-data --from-beginning --bootstrap-server localhost:9092 --max-messages 5

Check MinIO access

mc ls myminio/

Run the streaming application

sudo -u spark /opt/spark/current/bin/spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  /opt/spark/streaming_app.py

In another terminal, start the data producer

python3 /opt/kafka/producer.py

Common issues

Symptom | Cause | Fix
Kafka connection refused | ZooKeeper not started first | sudo systemctl start zookeeper, then sudo systemctl start kafka
Spark can't write to MinIO | S3A credentials misconfigured | Check the spark.hadoop.fs.s3a.* settings in spark-defaults.conf
OutOfMemory errors in Spark | Insufficient memory allocation | Add spark.executor.memory 4g and spark.driver.memory 2g to spark-defaults.conf
MinIO access denied | Bucket permissions or credentials | Run mc anonymous set public myminio/analytics-data and verify the alias credentials
Kafka consumer lag | Slow Spark processing | Increase spark.streaming.kafka.maxRatePerPartition or add more Kafka partitions
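For the consumer-lag row: lag is the gap between each partition's log end offset and the consumer group's committed offset. A minimal sketch of the arithmetic, with made-up offset numbers (in practice you would read them from kafka-consumer-groups.sh --describe):

```python
def total_lag(end_offsets: dict, committed: dict) -> int:
    """Sum (log end offset - committed offset) over partitions; uncommitted partitions count from 0."""
    return sum(end - committed.get(p, 0) for p, end in end_offsets.items())

end = {0: 1500, 1: 1480, 2: 1510}        # broker-side log end offsets per partition
committed = {0: 1400, 1: 1480, 2: 1200}  # consumer group's committed offsets
print(total_lag(end, committed))  # 100 + 0 + 310 = 410
```

A lag that grows steadily means Spark is processing slower than the producer writes, which is when rate limits or extra partitions help.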


#spark #kafka #minio #streaming #real-time-analytics
