Set up Spark 3.5 Delta Lake with MinIO for ACID transactions and big data analytics

Advanced 45 min Apr 06, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Configure Apache Spark 3.5 with Delta Lake and MinIO object storage for ACID transactions, data versioning, and distributed analytics processing. Includes complete setup for production-grade data lake architecture.

Prerequisites

  • 4GB RAM minimum
  • Java 17 compatible system
  • Network access for package downloads
  • sudo privileges

What this solves

Apache Spark 3.5 with Delta Lake provides ACID transactions and versioned data management for big data workloads, while MinIO offers S3-compatible object storage for distributed data lakes. This setup enables reliable data processing with transaction guarantees, time travel capabilities, and schema evolution support essential for enterprise analytics pipelines.

Step-by-step installation

Update system packages and install prerequisites

Start by updating your package manager and installing Java 17 and essential build tools for Spark.

# Ubuntu / Debian
sudo apt update && sudo apt upgrade -y
sudo apt install -y openjdk-17-jdk wget curl unzip python3 python3-pip

# AlmaLinux / Rocky Linux
sudo dnf update -y
sudo dnf install -y java-17-openjdk java-17-openjdk-devel wget curl unzip python3 python3-pip

Configure Java environment variables

Set JAVA_HOME so Spark can locate the Java installation. Note that /etc/environment does not expand variables or understand export, so use a profile script instead. On AlmaLinux/Rocky the path is /usr/lib/jvm/java-17-openjdk rather than the Debian-style path shown below.

echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' | sudo tee /etc/profile.d/java.sh
echo 'export PATH=$JAVA_HOME/bin:$PATH' | sudo tee -a /etc/profile.d/java.sh
source /etc/profile.d/java.sh
java -version

Create Spark user and directories

Create a dedicated user for Spark operations and set up the required directory structure with proper permissions.

sudo useradd -m -s /bin/bash spark
sudo mkdir -p /opt/spark /opt/spark/logs /opt/spark/work
sudo chown -R spark:spark /opt/spark

Download and install Apache Spark 3.5

Download Spark 3.5 with Hadoop support and extract it to the installation directory.

cd /tmp
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3/* /opt/spark/
sudo chown -R spark:spark /opt/spark

Configure Spark environment

Create /opt/spark/conf/spark-env.sh (as the spark user) to point Spark at Java and define the master, worker, and logging locations.

#!/usr/bin/env bash
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_HOME=/opt/spark
export SPARK_CONF_DIR=/opt/spark/conf
export SPARK_LOG_DIR=/opt/spark/logs
export SPARK_WORKER_DIR=/opt/spark/work
export PYSPARK_PYTHON=python3
export SPARK_MASTER_HOST=localhost
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_WEBUI_PORT=8081

Create Spark defaults configuration

Create /opt/spark/conf/spark-defaults.conf with Delta Lake and MinIO S3A settings. The secret key must match the MINIO_ROOT_PASSWORD configured in the MinIO service below.

# Delta Lake Configuration
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

# S3/MinIO Configuration
spark.hadoop.fs.s3a.endpoint=http://localhost:9000
spark.hadoop.fs.s3a.access.key=minioadmin
spark.hadoop.fs.s3a.secret.key=minioadmin123
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.connection.ssl.enabled=false
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider

# Performance Optimization
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.executor.memory=2g
spark.driver.memory=1g
spark.executor.cores=2
spark.default.parallelism=8
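The executor and driver values above assume the 4 GB minimum host from the prerequisites. As a rough rule of thumb (a hypothetical helper, not an official Spark formula), comparable settings can be derived from the host's resources:

```python
def suggest_spark_sizing(total_ram_gb: int, cores: int) -> dict:
    """Hypothetical rule of thumb: reserve ~1 GB for the OS and MinIO,
    give roughly two thirds of the rest to executors, and aim for
    2-4 tasks per core (4x used here, matching the config above)."""
    usable_gb = max(total_ram_gb - 1, 1)
    executor_gb = max(int(usable_gb * 2 / 3), 1)
    driver_gb = max(usable_gb - executor_gb, 1)
    return {
        "spark.executor.memory": f"{executor_gb}g",
        "spark.driver.memory": f"{driver_gb}g",
        "spark.executor.cores": str(cores),
        "spark.default.parallelism": str(cores * 4),
    }

# A 4 GB, 2-core host yields the same values used in spark-defaults.conf
print(suggest_spark_sizing(4, 2))
```

Scale these values up when moving beyond the single-node minimum; the defaults file applies cluster-wide, so adjust it once per host class.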

Download Delta Lake and AWS SDK JARs

Download the required JAR files for Delta Lake functionality and S3A filesystem support with MinIO. Spark 3.5 requires Delta Lake 3.x (the core artifact was renamed from delta-core to delta-spark in 3.0); Delta 2.4 targets Spark 3.4 and fails on 3.5.

cd /opt/spark/jars
sudo wget https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.1.0/delta-spark_2.12-3.1.0.jar
sudo wget https://repo1.maven.org/maven2/io/delta/delta-storage/3.1.0/delta-storage-3.1.0.jar
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar
sudo chown spark:spark *.jar

Install and configure MinIO server

Set up MinIO object storage server to provide S3-compatible storage for Delta Lake data files.

wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
sudo mv minio /usr/local/bin/
sudo mkdir -p /opt/minio/data
sudo useradd -r -s /usr/sbin/nologin minio-user
sudo chown -R minio-user:minio-user /opt/minio

Create MinIO systemd service

Create /etc/systemd/system/minio.service to run MinIO as a managed service with automatic startup. Change the default MINIO_ROOT_PASSWORD before exposing this host beyond a test environment.

[Unit]
Description=MinIO Object Storage Server
After=network.target

[Service]
Type=simple
User=minio-user
Group=minio-user
WorkingDirectory=/opt/minio
Environment=MINIO_ROOT_USER=minioadmin
Environment=MINIO_ROOT_PASSWORD=minioadmin123
ExecStart=/usr/local/bin/minio server /opt/minio/data --address :9000 --console-address :9001
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Start MinIO and create data bucket

Enable and start the MinIO service, then create a bucket for Delta Lake data storage.

sudo systemctl daemon-reload
sudo systemctl enable --now minio
sudo systemctl status minio

Install MinIO client

wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/

Configure MinIO client and create bucket

mc alias set local http://localhost:9000 minioadmin minioadmin123
mc mb local/delta-lake
mc ls local

Create Spark master and worker systemd services

Set up systemd services for the Spark master and worker nodes. First create /etc/systemd/system/spark-master.service:

[Unit]
Description=Apache Spark Master
After=network.target

[Service]
Type=forking
User=spark
Group=spark
WorkingDirectory=/opt/spark
Environment=JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
Environment=SPARK_HOME=/opt/spark
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Then create /etc/systemd/system/spark-worker.service:

[Unit]
Description=Apache Spark Worker
After=network.target spark-master.service
Requires=spark-master.service

[Service]
Type=forking
User=spark
Group=spark
WorkingDirectory=/opt/spark
Environment=JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
Environment=SPARK_HOME=/opt/spark
ExecStart=/opt/spark/sbin/start-worker.sh spark://localhost:7077
ExecStop=/opt/spark/sbin/stop-worker.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Start Spark services

Enable and start the Spark master and worker services to create a working Spark cluster.

sudo chmod +x /opt/spark/sbin/*.sh
sudo systemctl daemon-reload
sudo systemctl enable --now spark-master
sudo systemctl enable --now spark-worker
sudo systemctl status spark-master
sudo systemctl status spark-worker

Configure firewall for Spark and MinIO

Open the necessary ports for Spark web UI, cluster communication, and MinIO access.

# Ubuntu / Debian (ufw)
sudo ufw allow 7077/tcp  # Spark Master
sudo ufw allow 8080/tcp  # Spark Master Web UI
sudo ufw allow 8081/tcp  # Spark Worker Web UI
sudo ufw allow 9000/tcp  # MinIO API
sudo ufw allow 9001/tcp  # MinIO Console
sudo ufw reload

# AlmaLinux / Rocky Linux (firewalld)
sudo firewall-cmd --permanent --add-port=7077/tcp
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --permanent --add-port=8081/tcp
sudo firewall-cmd --permanent --add-port=9000/tcp
sudo firewall-cmd --permanent --add-port=9001/tcp
sudo firewall-cmd --reload

Create Delta Lake test application

Save the following as /opt/spark/test_delta.py to exercise Delta Lake ACID transactions and data versioning against MinIO storage.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Create Spark session with Delta Lake configuration
spark = SparkSession.builder \
    .appName("DeltaLakeTest") \
    .master("spark://localhost:7077") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

# Create test data
data = [(1, "John", 25), (2, "Jane", 30), (3, "Bob", 35)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)

# Write as Delta table to MinIO
print("Writing Delta table...")
df.write.format("delta").mode("overwrite").save("s3a://delta-lake/users")

# Read Delta table
print("Reading Delta table...")
delta_df = spark.read.format("delta").load("s3a://delta-lake/users")
delta_df.show()

# Perform an update operation (ACID transaction)
print("Updating records...")
delta_table = DeltaTable.forPath(spark, "s3a://delta-lake/users")
delta_table.update(condition="id = 1", set={"age": "26"})

# Show updated data
print("After update:")
delta_table.toDF().show()

# Show table history (versioning)
print("Table history:")
delta_table.history().show()

spark.stop()

Install Python dependencies and run test

Install the Python packages matching the Delta 3.x JARs, then run the test as the spark user. On Ubuntu 24.04 and Debian 12, pip refuses system-wide installs (PEP 668); install inside a virtual environment or pass --break-system-packages.

sudo pip3 install pyspark==3.5.0 delta-spark==3.1.0
sudo chown spark:spark /opt/spark/test_delta.py
sudo -u spark python3 /opt/spark/test_delta.py

Verify your setup

Check that all services are running correctly and the Delta Lake integration is functional.

# Check service status
sudo systemctl status minio spark-master spark-worker

# Verify Spark cluster
curl -s http://localhost:8080 | grep -i "spark master"

# Check MinIO buckets
mc ls local/delta-lake

# Test Spark shell with Delta Lake
/opt/spark/bin/spark-shell \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Note: Access the Spark Master Web UI at http://your-server-ip:8080 and MinIO Console at http://your-server-ip:9001 to monitor your cluster and storage.

Configure ACID transaction settings

Optimize Delta Lake performance settings

Append advanced Delta Lake settings to /opt/spark/conf/spark-defaults.conf for better write performance and transaction isolation. Disabling the retention duration check allows VACUUM to delete recent history, so only do this if you understand the time travel implications.

# Delta Lake Performance Settings
spark.databricks.delta.retentionDurationCheck.enabled=false
spark.databricks.delta.vacuum.parallelDelete.enabled=true
spark.databricks.delta.merge.repartitionBeforeWrite.enabled=true
spark.databricks.delta.optimizeWrite.enabled=true
spark.databricks.delta.autoCompact.enabled=true

# Transaction Isolation
spark.databricks.delta.properties.defaults.isolation.level=WriteSerializable
spark.databricks.delta.properties.defaults.checkpointInterval=10

Create production Delta Lake table with schema evolution

Demonstrate advanced Delta Lake features including schema evolution and time travel capabilities.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType
from delta.tables import DeltaTable
from datetime import datetime, timedelta
import random

# S3A endpoint and credentials are picked up from spark-defaults.conf
spark = SparkSession.builder \
    .appName("ProductionDeltaExample") \
    .master("spark://localhost:7077") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Create initial schema
schema = StructType([
    StructField("transaction_id", StringType(), False),
    StructField("customer_id", IntegerType(), False),
    StructField("amount", DoubleType(), False),
    StructField("timestamp", TimestampType(), False)
])

# Generate sample transaction data
transactions = []
for i in range(1000):
    transactions.append((
        f"txn_{i:05d}",
        random.randint(1, 100),
        round(random.uniform(10.0, 500.0), 2),
        datetime.now() - timedelta(days=random.randint(0, 30))
    ))
df = spark.createDataFrame(transactions, schema)

# Write with partitioning for better performance
print("Creating partitioned Delta table...")
df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("customer_id") \
    .save("s3a://delta-lake/transactions")

# Demonstrate time travel
print("\nTable versions:")
delta_table = DeltaTable.forPath(spark, "s3a://delta-lake/transactions")
delta_table.history().select("version", "timestamp", "operation").show()

# Schema evolution - add new column
print("\nAdding new column (schema evolution)...")
new_data = [("txn_99999", 101, 299.99, datetime.now(), "credit_card")]
new_columns = ["transaction_id", "customer_id", "amount", "timestamp", "payment_method"]
new_df = spark.createDataFrame(new_data, new_columns)
new_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("s3a://delta-lake/transactions")

print("\nSchema after evolution:")
spark.read.format("delta").load("s3a://delta-lake/transactions").printSchema()

spark.stop()

Common issues

Symptom / Cause / Fix

  • Spark fails to start: Java not found or wrong version. Verify JAVA_HOME: echo $JAVA_HOME && java -version
  • Delta Lake JARs not found: JAR files missing or in the wrong location. Check: ls -l /opt/spark/jars/delta*
  • MinIO connection refused: MinIO service not running. Restart MinIO: sudo systemctl restart minio
  • S3A filesystem errors: wrong endpoint or credentials. Verify MinIO config: mc admin info local
  • Permission denied on logs: incorrect directory ownership. Fix ownership: sudo chown -R spark:spark /opt/spark
  • Worker not connecting to master: firewall blocking ports. Check port 7077: ss -tlnp | grep 7077
Warning: Never use chmod 777 on Spark directories. This gives every user full access to your data and logs. Instead, use proper ownership with chown and minimal permissions like 755 for directories and 644 for files.

Next steps


#spark #delta-lake #minio #acid-transactions #big-data
