Configure MinIO with Apache Spark 3.5 for big data analytics and object storage integration

Tested on: Ubuntu 24.04, Debian 12, AlmaLinux 9, Rocky Linux 9

Set up Apache Spark 3.5 with MinIO S3-compatible object storage for scalable big data analytics. Configure distributed storage, implement data lake patterns, and run production analytics workflows on your cluster infrastructure.

Prerequisites

  • At least 8GB RAM
  • 50GB free disk space
  • Root or sudo access
  • Internet connection for package downloads

What this solves

Apache Spark with MinIO provides a powerful combination for big data analytics with S3-compatible object storage. This setup lets you process large datasets with Spark's distributed computing while storing them cost-effectively in MinIO's high-performance object storage. You need this configuration when building data lakes, running ETL pipelines, or performing large-scale analytics that require both distributed processing and scalable storage.

Step-by-step installation

Update system packages and install Java

Apache Spark 3.5 runs on Java 8, 11, or 17. Refresh your package index, then install OpenJDK 11 along with a few utilities.

# Ubuntu / Debian
sudo apt update && sudo apt upgrade -y
sudo apt install -y openjdk-11-jdk wget curl unzip

# AlmaLinux / Rocky Linux
sudo dnf update -y
sudo dnf install -y java-11-openjdk-devel wget curl unzip
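Before moving on, it's worth confirming the JDK is on the PATH and noting where it actually lives, since that path feeds JAVA_HOME in a later step:

# Confirm the JDK installed and find the real JVM path
java -version
readlink -f "$(which java)"   # JAVA_HOME is this path minus the trailing /bin/java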

Install Apache Spark 3.5

Download and install Apache Spark 3.5 with Hadoop support for distributed storage integration.

cd /opt
sudo wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
sudo tar -xzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 spark
sudo chown -R $USER:$USER /opt/spark
sudo chmod 755 /opt/spark

Configure Spark environment variables

Set up environment variables for Spark and Java paths in your shell profile.

# Set for the current session (on AlmaLinux/Rocky the JVM path is typically /usr/lib/jvm/java-11-openjdk)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"

# Persist for future logins
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH' >> ~/.bashrc
source ~/.bashrc
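If the variables took effect, spark-submit is now on the PATH and should report version 3.5.1:

# Sanity check the environment
echo "$SPARK_HOME"       # expect /opt/spark
spark-submit --version   # prints the Spark, Scala, and Java version banner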

Install MinIO server

Install MinIO object storage server that will serve as S3-compatible storage for Spark datasets.

wget https://dl.min.io/server/minio/release/linux-amd64/minio
sudo chmod +x minio
sudo mv minio /usr/local/bin/
sudo useradd -r minio-user -s /sbin/nologin
sudo mkdir -p /opt/minio/data /etc/minio
sudo chown minio-user:minio-user /opt/minio/data /etc/minio
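A quick confirmation that the binary runs before wiring it into systemd:

# Confirm the MinIO server binary is executable
minio --version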

Configure MinIO credentials and service

Create /etc/minio/minio.conf with the access credentials, then add a systemd unit at /etc/systemd/system/minio.service for automatic startup. Replace the default credentials before exposing the service beyond localhost.

# /etc/minio/minio.conf
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin123
MINIO_VOLUMES=/opt/minio/data
MINIO_OPTS="--console-address :9001"
# /etc/systemd/system/minio.service
[Unit]
Description=MinIO Object Storage
Documentation=https://docs.min.io
Wants=network-online.target
After=network-online.target

[Service]
Type=notify
User=minio-user
Group=minio-user
EnvironmentFile=/etc/minio/minio.conf
ExecStart=/usr/local/bin/minio server $MINIO_OPTS $MINIO_VOLUMES
Restart=always
LimitNOFILE=65536
TasksMax=infinity
TimeoutStopSec=infinity
SendSIGKILL=no

[Install]
WantedBy=multi-user.target

Start and enable MinIO service

Enable MinIO to start on boot and verify it's running properly.

sudo systemctl daemon-reload
sudo systemctl enable --now minio
sudo systemctl status minio

Download required JAR files for S3 integration

Download Hadoop AWS and AWS SDK JAR files needed for Spark to communicate with S3-compatible storage.

cd /opt/spark/jars
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
sudo chown $USER:$USER *.jar
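The hadoop-aws version must match the Hadoop client libraries bundled with your Spark build, and the aws-java-sdk-bundle version must in turn match hadoop-aws; a mismatch typically surfaces as a NoClassDefFoundError when the S3A filesystem initializes. You can confirm the bundled Hadoop version before downloading:

# Spark 3.5.1-bin-hadoop3 should list hadoop-client JARs at version 3.3.4
ls /opt/spark/jars/ | grep hadoop-client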

Configure Spark for MinIO integration

Create /opt/spark/conf/spark-defaults.conf (copy the bundled spark-defaults.conf.template if you prefer) with the settings below to enable the S3A file system with the MinIO endpoint and credentials.

# /opt/spark/conf/spark-defaults.conf
spark.hadoop.fs.s3a.endpoint=http://localhost:9000
spark.hadoop.fs.s3a.access.key=minioadmin
spark.hadoop.fs.s3a.secret.key=minioadmin123
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.connection.ssl.enabled=false
spark.hadoop.fs.s3a.attempts.maximum=3
spark.hadoop.fs.s3a.connection.establish.timeout=5000
spark.hadoop.fs.s3a.connection.timeout=200000
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
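These defaults apply to every job on the cluster, but any of them can be overridden per job at submit time, which is handy when pointing the same code at a different MinIO endpoint. A sketch (the endpoint and job path here are illustrative, not part of this setup):

# Per-job S3A overrides via spark-submit --conf
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio.internal:9000 \
  --conf spark.hadoop.fs.s3a.access.key=minioadmin \
  --conf spark.hadoop.fs.s3a.secret.key=minioadmin123 \
  my_job.py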

Install MinIO client for bucket management

Install the MinIO client (mc) to create and manage buckets for your Spark data.

wget https://dl.min.io/client/mc/release/linux-amd64/mc
sudo chmod +x mc
sudo mv mc /usr/local/bin/
mc alias set local http://localhost:9000 minioadmin minioadmin123
mc mb local/spark-data
mc mb local/spark-warehouse
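With the buckets in place you can smoke-test the full Spark-to-MinIO path before writing a real job. A minimal round trip, piped into the pyspark shell (assumes the spark-data bucket above and the spark-defaults.conf from the previous step):

# Quick S3A round trip: write 10 rows to MinIO and read them back
pyspark <<'EOF'
df = spark.range(10)
df.write.mode("overwrite").parquet("s3a://spark-data/smoke-test/")
print(spark.read.parquet("s3a://spark-data/smoke-test/").count())  # expect 10
EOF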

Configure firewall for MinIO and Spark

Open necessary ports for MinIO API, console, and Spark web UI access.

# Ubuntu / Debian (ufw)
sudo ufw allow 9000/tcp comment 'MinIO API'
sudo ufw allow 9001/tcp comment 'MinIO Console'
sudo ufw allow 4040/tcp comment 'Spark Web UI'
sudo ufw allow 7077/tcp comment 'Spark Master'
sudo ufw allow 8080/tcp comment 'Spark Master Web UI'

# AlmaLinux / Rocky Linux (firewalld)
sudo firewall-cmd --permanent --add-port=9000/tcp
sudo firewall-cmd --permanent --add-port=9001/tcp
sudo firewall-cmd --permanent --add-port=4040/tcp
sudo firewall-cmd --permanent --add-port=7077/tcp
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --reload

Start Spark standalone cluster

Start the Spark master and worker processes to create a standalone cluster for distributed processing.

$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://localhost:7077
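Both daemons run as JVM processes, so jps (shipped with the JDK) is the quickest way to confirm they came up; the master UI on port 8080 should also show one registered worker:

# Expect one Master and one Worker process in the output
jps | grep -E 'Master|Worker'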

Create sample data analytics job

Create a Python script at /opt/spark/examples/spark-minio-analytics.py that demonstrates reading and writing data between Spark and MinIO.

# /opt/spark/examples/spark-minio-analytics.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg, max, min
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
import random

# Create Spark session with MinIO configuration
spark = SparkSession.builder \
    .appName("SparkMinIOAnalytics") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

# Define the schema for sample sales data
schema = StructType([
    StructField("product_id", StringType(), True),
    StructField("category", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("region", StringType(), True)
])

# Generate sample data
sample_data = []
for i in range(10000):
    sample_data.append((
        f"prod_{i % 100}",
        random.choice(["Electronics", "Clothing", "Books", "Home"]),
        round(random.uniform(10.0, 500.0), 2),
        random.randint(1, 10),
        random.choice(["North", "South", "East", "West"])
    ))
df = spark.createDataFrame(sample_data, schema)

# Write data to MinIO
print("Writing data to MinIO...")
df.write.mode("overwrite").parquet("s3a://spark-data/sales/")

# Read data back from MinIO
print("Reading data from MinIO and performing analytics...")
sales_df = spark.read.parquet("s3a://spark-data/sales/")

# Perform analytics
analytics_results = sales_df.groupBy("category", "region") \
    .agg(
        count("*").alias("total_sales"),
        avg("price").alias("avg_price"),
        max("quantity").alias("max_quantity"),
        min("quantity").alias("min_quantity")
    ) \
    .orderBy(col("total_sales").desc())

print("Analytics Results:")
analytics_results.show(20)

# Save analytics results back to MinIO
analytics_results.write.mode("overwrite").parquet("s3a://spark-data/analytics-results/")
print("Analytics completed and results saved to MinIO")
spark.stop()

Verify your setup

Test the Spark and MinIO integration by running the analytics job and checking the results.

# Check MinIO is running
curl -I http://localhost:9000/minio/health/ready

# Check Spark master is running
curl -s http://localhost:8080 | grep -q "Spark Master"
echo "Spark Master Status: $?"

# Run the analytics job
spark-submit --master spark://localhost:7077 /opt/spark/examples/spark-minio-analytics.py

# Verify data was written to MinIO
mc ls local/spark-data/
mc ls local/spark-data/analytics-results/

# Check the running services in their web UIs
echo "Access Spark Master UI: http://localhost:8080"
echo "Access MinIO Console: http://localhost:9001"

Production optimization

Configure Spark memory and performance settings

Optimize Spark configuration for production workloads with proper memory allocation and performance tuning.

# Add these settings to existing configuration
spark.executor.memory=4g
spark.driver.memory=2g
spark.executor.cores=4
spark.sql.adaptive.coalescePartitions.initialPartitionNum=200
spark.sql.adaptive.advisoryPartitionSizeInBytes=128MB
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.shuffle.partitions=200
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.unsafe=true
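These cluster-wide defaults can also be pinned per application with spark-submit flags, which override spark-defaults.conf. A sketch, assuming a job script at /opt/spark/jobs/etl.py (the path is illustrative):

# Per-application resource settings override the cluster defaults
spark-submit \
  --master spark://localhost:7077 \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 4 \
  /opt/spark/jobs/etl.py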

Set up MinIO data retention and lifecycle policies

Configure lifecycle management for automated data cleanup and cost optimization. Save the policy below as lifecycle-policy.json, then import it with the mc command that follows.

{
  "Rules": [
    {
      "ID": "ArchiveOldAnalytics",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "analytics-results/"
      },
      "Expiration": {
        "Days": 90
      }
    },
    {
      "ID": "DeleteTempData",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "temp/"
      },
      "Expiration": {
        "Days": 7
      }
    }
  ]
}
mc ilm import local/spark-data < lifecycle-policy.json
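To confirm the rules were applied, list them back (note that newer mc releases move this under mc ilm rule ls):

# List the active lifecycle rules on the bucket
mc ilm ls local/spark-data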
Warning: Never use chmod 777 on Spark or MinIO directories. This gives every user on the system full access to your data. Use proper ownership with chown and minimal permissions (755 for directories, 644 for files).

Common issues

Symptom | Cause | Fix
Spark can't connect to MinIO | Wrong endpoint or credentials | Verify spark-defaults.conf settings and that MinIO is running on port 9000
Access denied errors | Incorrect S3A configuration | Check the access key, secret key, and path.style.access=true setting
Java heap space errors | Insufficient memory allocation | Increase spark.executor.memory and spark.driver.memory in the configuration
Connection timeouts | Network or firewall issues | Verify firewall rules and increase the connection timeout values
JAR file not found errors | Missing AWS SDK dependencies | Ensure the hadoop-aws and aws-java-sdk-bundle JARs are in /opt/spark/jars/
Permission denied on Spark directories | Incorrect file ownership | Use sudo chown -R $USER:$USER /opt/spark instead of chmod 777
