Set up Spark 3.5 Delta Lake with MinIO for ACID transactions and big data analytics

Advanced 45 min Apr 06, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Configure Apache Spark 3.5 with Delta Lake and MinIO object storage for ACID transactions, data versioning, and distributed analytics processing. Includes complete setup for production-grade data lake architecture.

Prerequisites

  • 4GB RAM minimum
  • Java 17 compatible system
  • Network access for package downloads
  • sudo privileges

What this solves

Apache Spark 3.5 with Delta Lake provides ACID transactions and versioned data management for big data workloads, while MinIO offers S3-compatible object storage for distributed data lakes. This setup enables reliable data processing with transaction guarantees, time travel capabilities, and schema evolution support essential for enterprise analytics pipelines.

Step-by-step installation

Update system packages and install prerequisites

Start by updating your package manager and installing Java 17 and essential build tools for Spark.

# Ubuntu / Debian
sudo apt update && sudo apt upgrade -y
sudo apt install -y openjdk-17-jdk wget curl unzip python3 python3-pip

# AlmaLinux / Rocky Linux
sudo dnf update -y
sudo dnf install -y java-17-openjdk java-17-openjdk-devel wget curl unzip python3 python3-pip

Configure Java environment variables

Set JAVA_HOME so Spark can locate the Java installation. Note that /etc/environment does not expand variables or understand export, so use a profile script instead. On AlmaLinux/Rocky the path is /usr/lib/jvm/java-17-openjdk rather than the Debian-style path shown below.

echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' | sudo tee /etc/profile.d/java.sh
echo 'export PATH=$JAVA_HOME/bin:$PATH' | sudo tee -a /etc/profile.d/java.sh
source /etc/profile.d/java.sh
java -version

Create Spark user and directories

Create a dedicated user for Spark operations and set up the required directory structure with proper permissions.

sudo useradd -m -s /bin/bash spark
sudo mkdir -p /opt/spark /opt/spark/logs /opt/spark/work
sudo chown -R spark:spark /opt/spark

Download and install Apache Spark 3.5

Download Spark 3.5 with Hadoop support and extract it to the installation directory.

cd /tmp
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3/* /opt/spark/
sudo chown -R spark:spark /opt/spark

Configure Spark environment

Create /opt/spark/conf/spark-env.sh (as the spark user) to point Spark at Java and define the master, worker, and logging locations.

#!/usr/bin/env bash
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_HOME=/opt/spark
export SPARK_CONF_DIR=/opt/spark/conf
export SPARK_LOG_DIR=/opt/spark/logs
export SPARK_WORKER_DIR=/opt/spark/work
export PYSPARK_PYTHON=python3
export SPARK_MASTER_HOST=localhost
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_WEBUI_PORT=8081

Create Spark defaults configuration

Create /opt/spark/conf/spark-defaults.conf with Delta Lake and MinIO S3A settings. The secret key must match the MINIO_ROOT_PASSWORD configured in the MinIO service below.

# Delta Lake Configuration
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

# S3/MinIO Configuration
spark.hadoop.fs.s3a.endpoint=http://localhost:9000
spark.hadoop.fs.s3a.access.key=minioadmin
spark.hadoop.fs.s3a.secret.key=minioadmin123
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.connection.ssl.enabled=false
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider

# Performance Optimization
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.executor.memory=2g
spark.driver.memory=1g
spark.executor.cores=2
spark.default.parallelism=8
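The executor and driver values above assume the 4 GB minimum host from the prerequisites. As a rough rule of thumb (a hypothetical helper, not an official Spark formula), comparable settings can be derived from the host's resources:

```python
def suggest_spark_sizing(total_ram_gb: int, cores: int) -> dict:
    """Hypothetical rule of thumb: reserve ~1 GB for the OS and MinIO,
    give roughly two thirds of the rest to executors, and aim for
    2-4 tasks per core (4x used here, matching the config above)."""
    usable_gb = max(total_ram_gb - 1, 1)
    executor_gb = max(int(usable_gb * 2 / 3), 1)
    driver_gb = max(usable_gb - executor_gb, 1)
    return {
        "spark.executor.memory": f"{executor_gb}g",
        "spark.driver.memory": f"{driver_gb}g",
        "spark.executor.cores": str(cores),
        "spark.default.parallelism": str(cores * 4),
    }

# A 4 GB, 2-core host yields the same values used in spark-defaults.conf
print(suggest_spark_sizing(4, 2))
```

Scale these values up when moving beyond the single-node minimum; the defaults file applies cluster-wide, so adjust it once per host class.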

Download Delta Lake and AWS SDK JARs

Download the required JAR files for Delta Lake functionality and S3A filesystem support with MinIO. Spark 3.5 requires Delta Lake 3.x (the core artifact was renamed from delta-core to delta-spark in 3.0); Delta 2.4 targets Spark 3.4 and fails on 3.5.

cd /opt/spark/jars
sudo wget https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.1.0/delta-spark_2.12-3.1.0.jar
sudo wget https://repo1.maven.org/maven2/io/delta/delta-storage/3.1.0/delta-storage-3.1.0.jar
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar
sudo chown spark:spark *.jar

Install and configure MinIO server

Set up MinIO object storage server to provide S3-compatible storage for Delta Lake data files.

wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
sudo mv minio /usr/local/bin/
sudo mkdir -p /opt/minio/data
sudo useradd -r -s /usr/sbin/nologin minio-user
sudo chown -R minio-user:minio-user /opt/minio

Create MinIO systemd service

Create /etc/systemd/system/minio.service to run MinIO as a managed service with automatic startup. Change the default MINIO_ROOT_PASSWORD before exposing this host beyond a test environment.

[Unit]
Description=MinIO Object Storage Server
After=network.target

[Service]
Type=simple
User=minio-user
Group=minio-user
WorkingDirectory=/opt/minio
Environment=MINIO_ROOT_USER=minioadmin
Environment=MINIO_ROOT_PASSWORD=minioadmin123
ExecStart=/usr/local/bin/minio server /opt/minio/data --address :9000 --console-address :9001
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Start MinIO and create data bucket

Enable and start the MinIO service, then create a bucket for Delta Lake data storage.

sudo systemctl daemon-reload
sudo systemctl enable --now minio
sudo systemctl status minio

Install MinIO client

wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/

Configure MinIO client and create bucket

mc alias set local http://localhost:9000 minioadmin minioadmin123
mc mb local/delta-lake
mc ls local

Create Spark master and worker systemd services

Set up systemd services for the Spark master and worker nodes. First create /etc/systemd/system/spark-master.service:

[Unit]
Description=Apache Spark Master
After=network.target

[Service]
Type=forking
User=spark
Group=spark
WorkingDirectory=/opt/spark
Environment=JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
Environment=SPARK_HOME=/opt/spark
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Then create /etc/systemd/system/spark-worker.service:

[Unit]
Description=Apache Spark Worker
After=network.target spark-master.service
Requires=spark-master.service

[Service]
Type=forking
User=spark
Group=spark
WorkingDirectory=/opt/spark
Environment=JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
Environment=SPARK_HOME=/opt/spark
ExecStart=/opt/spark/sbin/start-worker.sh spark://localhost:7077
ExecStop=/opt/spark/sbin/stop-worker.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Start Spark services

Enable and start the Spark master and worker services to create a working Spark cluster.

sudo chmod +x /opt/spark/sbin/*.sh
sudo systemctl daemon-reload
sudo systemctl enable --now spark-master
sudo systemctl enable --now spark-worker
sudo systemctl status spark-master
sudo systemctl status spark-worker

Configure firewall for Spark and MinIO

Open the necessary ports for Spark web UI, cluster communication, and MinIO access.

# Ubuntu / Debian (ufw)
sudo ufw allow 7077/tcp  # Spark Master
sudo ufw allow 8080/tcp  # Spark Master Web UI
sudo ufw allow 8081/tcp  # Spark Worker Web UI
sudo ufw allow 9000/tcp  # MinIO API
sudo ufw allow 9001/tcp  # MinIO Console
sudo ufw reload

# AlmaLinux / Rocky Linux (firewalld)
sudo firewall-cmd --permanent --add-port=7077/tcp
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --permanent --add-port=8081/tcp
sudo firewall-cmd --permanent --add-port=9000/tcp
sudo firewall-cmd --permanent --add-port=9001/tcp
sudo firewall-cmd --reload

Create Delta Lake test application

Save the following as /opt/spark/test_delta.py to exercise Delta Lake ACID transactions and data versioning against MinIO storage.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Create Spark session with Delta Lake configuration
spark = SparkSession.builder \
    .appName("DeltaLakeTest") \
    .master("spark://localhost:7077") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

# Create test data
data = [(1, "John", 25), (2, "Jane", 30), (3, "Bob", 35)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)

# Write as Delta table to MinIO
print("Writing Delta table...")
df.write.format("delta").mode("overwrite").save("s3a://delta-lake/users")

# Read Delta table
print("Reading Delta table...")
delta_df = spark.read.format("delta").load("s3a://delta-lake/users")
delta_df.show()

# Perform an update operation (ACID transaction)
print("Updating records...")
delta_table = DeltaTable.forPath(spark, "s3a://delta-lake/users")
delta_table.update(condition="id = 1", set={"age": "26"})

# Show updated data
print("After update:")
delta_table.toDF().show()

# Show table history (versioning)
print("Table history:")
delta_table.history().show()

spark.stop()

Install Python dependencies and run test

Install the Python packages matching the Delta 3.x JARs, then run the test as the spark user. On Ubuntu 24.04 and Debian 12, pip refuses system-wide installs (PEP 668); install inside a virtual environment or pass --break-system-packages.

sudo pip3 install pyspark==3.5.0 delta-spark==3.1.0
sudo chown spark:spark /opt/spark/test_delta.py
sudo -u spark python3 /opt/spark/test_delta.py

Verify your setup

Check that all services are running correctly and the Delta Lake integration is functional.

# Check service status
sudo systemctl status minio spark-master spark-worker

# Verify Spark cluster
curl -s http://localhost:8080 | grep -i "spark master"

# Check MinIO buckets
mc ls local/delta-lake

# Test Spark shell with Delta Lake
/opt/spark/bin/spark-shell \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Note: Access the Spark Master Web UI at http://your-server-ip:8080 and MinIO Console at http://your-server-ip:9001 to monitor your cluster and storage.

Configure ACID transaction settings

Optimize Delta Lake performance settings

Append advanced Delta Lake settings to /opt/spark/conf/spark-defaults.conf for better write performance and transaction isolation. Disabling the retention duration check allows VACUUM to delete recent history, so only do this if you understand the time travel implications.

# Delta Lake Performance Settings
spark.databricks.delta.retentionDurationCheck.enabled=false
spark.databricks.delta.vacuum.parallelDelete.enabled=true
spark.databricks.delta.merge.repartitionBeforeWrite.enabled=true
spark.databricks.delta.optimizeWrite.enabled=true
spark.databricks.delta.autoCompact.enabled=true

# Transaction Isolation
spark.databricks.delta.properties.defaults.isolation.level=WriteSerializable
spark.databricks.delta.properties.defaults.checkpointInterval=10

Create production Delta Lake table with schema evolution

Demonstrate advanced Delta Lake features including schema evolution and time travel capabilities.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType
from delta.tables import DeltaTable
from datetime import datetime, timedelta
import random

# S3A endpoint and credentials are picked up from spark-defaults.conf
spark = SparkSession.builder \
    .appName("ProductionDeltaExample") \
    .master("spark://localhost:7077") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Create initial schema
schema = StructType([
    StructField("transaction_id", StringType(), False),
    StructField("customer_id", IntegerType(), False),
    StructField("amount", DoubleType(), False),
    StructField("timestamp", TimestampType(), False)
])

# Generate sample transaction data
transactions = []
for i in range(1000):
    transactions.append((
        f"txn_{i:05d}",
        random.randint(1, 100),
        round(random.uniform(10.0, 500.0), 2),
        datetime.now() - timedelta(days=random.randint(0, 30))
    ))
df = spark.createDataFrame(transactions, schema)

# Write with partitioning for better performance
print("Creating partitioned Delta table...")
df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("customer_id") \
    .save("s3a://delta-lake/transactions")

# Demonstrate time travel
print("\nTable versions:")
delta_table = DeltaTable.forPath(spark, "s3a://delta-lake/transactions")
delta_table.history().select("version", "timestamp", "operation").show()

# Schema evolution - add new column
print("\nAdding new column (schema evolution)...")
new_data = [("txn_99999", 101, 299.99, datetime.now(), "credit_card")]
new_columns = ["transaction_id", "customer_id", "amount", "timestamp", "payment_method"]
new_df = spark.createDataFrame(new_data, new_columns)
new_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("s3a://delta-lake/transactions")

print("\nSchema after evolution:")
spark.read.format("delta").load("s3a://delta-lake/transactions").printSchema()

spark.stop()

Common issues

Symptom / Cause / Fix

  • Spark fails to start: Java not found or wrong version. Verify JAVA_HOME: echo $JAVA_HOME && java -version
  • Delta Lake JARs not found: JAR files missing or in the wrong location. Check: ls -l /opt/spark/jars/delta*
  • MinIO connection refused: MinIO service not running. Restart MinIO: sudo systemctl restart minio
  • S3A filesystem errors: wrong endpoint or credentials. Verify MinIO config: mc admin info local
  • Permission denied on logs: incorrect directory ownership. Fix ownership: sudo chown -R spark:spark /opt/spark
  • Worker not connecting to master: firewall blocking ports. Check port 7077: ss -tlnp | grep 7077
Warning: Never use chmod 777 on Spark directories. This gives every user full access to your data and logs. Instead, use proper ownership with chown and minimal permissions like 755 for directories and 644 for files.

Next steps


#spark #delta-lake #minio #acid-transactions #big-data
