Configure MinIO with Apache Spark 3.5 for big data analytics and object storage integration

Tested on: Ubuntu 24.04, Debian 12, AlmaLinux 9, Rocky Linux 9

Set up Apache Spark 3.5 with MinIO S3-compatible object storage for scalable big data analytics. Configure distributed storage, implement data lake patterns, and run production analytics workflows on your cluster infrastructure.

Prerequisites

  • At least 8GB RAM
  • 50GB free disk space
  • Root or sudo access
  • Internet connection for package downloads

What this solves

Apache Spark with MinIO provides a powerful combination for big data analytics with S3-compatible object storage. This setup lets you process large datasets with Spark's distributed computing while storing them cost-effectively in MinIO's high-performance object storage. You need this configuration when building data lakes, running ETL pipelines, or performing large-scale analytics that require both distributed processing and scalable storage.

Step-by-step installation

Update system packages and install Java

Apache Spark 3.5 runs on Java 8, 11, or 17. Refresh your package index, then install OpenJDK 11 along with a few utilities.

# Ubuntu / Debian
sudo apt update && sudo apt upgrade -y
sudo apt install -y openjdk-11-jdk wget curl unzip

# AlmaLinux / Rocky Linux
sudo dnf update -y
sudo dnf install -y java-11-openjdk-devel wget curl unzip
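Before moving on, it's worth confirming the JDK is on the PATH and noting where it actually lives, since that path feeds JAVA_HOME in a later step:

# Confirm the JDK installed and find the real JVM path
java -version
readlink -f "$(which java)"   # JAVA_HOME is this path minus the trailing /bin/java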

Install Apache Spark 3.5

Download and install Apache Spark 3.5 with Hadoop support for distributed storage integration.

cd /opt
sudo wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
sudo tar -xzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 spark
sudo chown -R $USER:$USER /opt/spark
sudo chmod 755 /opt/spark

Configure Spark environment variables

Set up environment variables for Spark and Java paths in your shell profile.

# Set for the current session (on AlmaLinux/Rocky the JVM path is typically /usr/lib/jvm/java-11-openjdk)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"

# Persist for future logins
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH' >> ~/.bashrc
source ~/.bashrc
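If the variables took effect, spark-submit is now on the PATH and should report version 3.5.1:

# Sanity check the environment
echo "$SPARK_HOME"       # expect /opt/spark
spark-submit --version   # prints the Spark, Scala, and Java version banner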

Install MinIO server

Install MinIO object storage server that will serve as S3-compatible storage for Spark datasets.

wget https://dl.min.io/server/minio/release/linux-amd64/minio
sudo chmod +x minio
sudo mv minio /usr/local/bin/
sudo useradd -r minio-user -s /sbin/nologin
sudo mkdir -p /opt/minio/data /etc/minio
sudo chown minio-user:minio-user /opt/minio/data /etc/minio
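A quick confirmation that the binary runs before wiring it into systemd:

# Confirm the MinIO server binary is executable
minio --version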

Configure MinIO credentials and service

Create /etc/minio/minio.conf with the access credentials, then add a systemd unit at /etc/systemd/system/minio.service for automatic startup. Replace the default credentials before exposing the service beyond localhost.

# /etc/minio/minio.conf
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin123
MINIO_VOLUMES=/opt/minio/data
MINIO_OPTS="--console-address :9001"
# /etc/systemd/system/minio.service
[Unit]
Description=MinIO Object Storage
Documentation=https://docs.min.io
Wants=network-online.target
After=network-online.target

[Service]
Type=notify
User=minio-user
Group=minio-user
EnvironmentFile=/etc/minio/minio.conf
ExecStart=/usr/local/bin/minio server $MINIO_OPTS $MINIO_VOLUMES
Restart=always
LimitNOFILE=65536
TasksMax=infinity
TimeoutStopSec=infinity
SendSIGKILL=no

[Install]
WantedBy=multi-user.target

Start and enable MinIO service

Enable MinIO to start on boot and verify it's running properly.

sudo systemctl daemon-reload
sudo systemctl enable --now minio
sudo systemctl status minio

Download required JAR files for S3 integration

Download Hadoop AWS and AWS SDK JAR files needed for Spark to communicate with S3-compatible storage.

cd /opt/spark/jars
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
sudo chown $USER:$USER *.jar
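The hadoop-aws version must match the Hadoop client libraries bundled with your Spark build, and the aws-java-sdk-bundle version must in turn match hadoop-aws; a mismatch typically surfaces as a NoClassDefFoundError when the S3A filesystem initializes. You can confirm the bundled Hadoop version before downloading:

# Spark 3.5.1-bin-hadoop3 should list hadoop-client JARs at version 3.3.4
ls /opt/spark/jars/ | grep hadoop-client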

Configure Spark for MinIO integration

Create /opt/spark/conf/spark-defaults.conf (copy the bundled spark-defaults.conf.template if you prefer) with the settings below to enable the S3A file system with the MinIO endpoint and credentials.

# /opt/spark/conf/spark-defaults.conf
spark.hadoop.fs.s3a.endpoint=http://localhost:9000
spark.hadoop.fs.s3a.access.key=minioadmin
spark.hadoop.fs.s3a.secret.key=minioadmin123
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.connection.ssl.enabled=false
spark.hadoop.fs.s3a.attempts.maximum=3
spark.hadoop.fs.s3a.connection.establish.timeout=5000
spark.hadoop.fs.s3a.connection.timeout=200000
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
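These defaults apply to every job on the cluster, but any of them can be overridden per job at submit time, which is handy when pointing the same code at a different MinIO endpoint. A sketch (the endpoint and job path here are illustrative, not part of this setup):

# Per-job S3A overrides via spark-submit --conf
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio.internal:9000 \
  --conf spark.hadoop.fs.s3a.access.key=minioadmin \
  --conf spark.hadoop.fs.s3a.secret.key=minioadmin123 \
  my_job.py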

Install MinIO client for bucket management

Install the MinIO client (mc) to create and manage buckets for your Spark data.

wget https://dl.min.io/client/mc/release/linux-amd64/mc
sudo chmod +x mc
sudo mv mc /usr/local/bin/
mc alias set local http://localhost:9000 minioadmin minioadmin123
mc mb local/spark-data
mc mb local/spark-warehouse
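With the buckets in place you can smoke-test the full Spark-to-MinIO path before writing a real job. A minimal round trip, piped into the pyspark shell (assumes the spark-data bucket above and the spark-defaults.conf from the previous step):

# Quick S3A round trip: write 10 rows to MinIO and read them back
pyspark <<'EOF'
df = spark.range(10)
df.write.mode("overwrite").parquet("s3a://spark-data/smoke-test/")
print(spark.read.parquet("s3a://spark-data/smoke-test/").count())  # expect 10
EOF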

Configure firewall for MinIO and Spark

Open necessary ports for MinIO API, console, and Spark web UI access.

# Ubuntu / Debian (ufw)
sudo ufw allow 9000/tcp comment 'MinIO API'
sudo ufw allow 9001/tcp comment 'MinIO Console'
sudo ufw allow 4040/tcp comment 'Spark Web UI'
sudo ufw allow 7077/tcp comment 'Spark Master'
sudo ufw allow 8080/tcp comment 'Spark Master Web UI'

# AlmaLinux / Rocky Linux (firewalld)
sudo firewall-cmd --permanent --add-port=9000/tcp
sudo firewall-cmd --permanent --add-port=9001/tcp
sudo firewall-cmd --permanent --add-port=4040/tcp
sudo firewall-cmd --permanent --add-port=7077/tcp
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --reload

Start Spark standalone cluster

Start the Spark master and worker processes to create a standalone cluster for distributed processing.

$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://localhost:7077
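Both daemons run as JVM processes, so jps (shipped with the JDK) is the quickest way to confirm they came up; the master UI on port 8080 should also show one registered worker:

# Expect one Master and one Worker process in the output
jps | grep -E 'Master|Worker'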

Create sample data analytics job

Create a Python script at /opt/spark/examples/spark-minio-analytics.py that demonstrates reading and writing data between Spark and MinIO.

# /opt/spark/examples/spark-minio-analytics.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg, max, min
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
import random

# Create Spark session with MinIO configuration
spark = SparkSession.builder \
    .appName("SparkMinIOAnalytics") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

# Define the schema for sample sales data
schema = StructType([
    StructField("product_id", StringType(), True),
    StructField("category", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("region", StringType(), True)
])

# Generate sample data
sample_data = []
for i in range(10000):
    sample_data.append((
        f"prod_{i % 100}",
        random.choice(["Electronics", "Clothing", "Books", "Home"]),
        round(random.uniform(10.0, 500.0), 2),
        random.randint(1, 10),
        random.choice(["North", "South", "East", "West"])
    ))
df = spark.createDataFrame(sample_data, schema)

# Write data to MinIO
print("Writing data to MinIO...")
df.write.mode("overwrite").parquet("s3a://spark-data/sales/")

# Read data back from MinIO
print("Reading data from MinIO and performing analytics...")
sales_df = spark.read.parquet("s3a://spark-data/sales/")

# Perform analytics
analytics_results = sales_df.groupBy("category", "region") \
    .agg(
        count("*").alias("total_sales"),
        avg("price").alias("avg_price"),
        max("quantity").alias("max_quantity"),
        min("quantity").alias("min_quantity")
    ) \
    .orderBy(col("total_sales").desc())

print("Analytics Results:")
analytics_results.show(20)

# Save analytics results back to MinIO
analytics_results.write.mode("overwrite").parquet("s3a://spark-data/analytics-results/")
print("Analytics completed and results saved to MinIO")
spark.stop()

Verify your setup

Test the Spark and MinIO integration by running the analytics job and checking the results.

# Check MinIO is running
curl -I http://localhost:9000/minio/health/ready

# Check Spark master is running
curl -s http://localhost:8080 | grep -q "Spark Master"
echo "Spark Master Status: $?"

# Run the analytics job
spark-submit --master spark://localhost:7077 /opt/spark/examples/spark-minio-analytics.py

# Verify data was written to MinIO
mc ls local/spark-data/
mc ls local/spark-data/analytics-results/

# Check the running services in their web UIs
echo "Access Spark Master UI: http://localhost:8080"
echo "Access MinIO Console: http://localhost:9001"

Production optimization

Configure Spark memory and performance settings

Optimize Spark configuration for production workloads with proper memory allocation and performance tuning.

# Add these settings to existing configuration
spark.executor.memory=4g
spark.driver.memory=2g
spark.executor.cores=4
spark.sql.adaptive.coalescePartitions.initialPartitionNum=200
spark.sql.adaptive.advisoryPartitionSizeInBytes=128MB
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.shuffle.partitions=200
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.unsafe=true
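These cluster-wide defaults can also be pinned per application with spark-submit flags, which override spark-defaults.conf. A sketch, assuming a job script at /opt/spark/jobs/etl.py (the path is illustrative):

# Per-application resource settings override the cluster defaults
spark-submit \
  --master spark://localhost:7077 \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 4 \
  /opt/spark/jobs/etl.py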

Set up MinIO data retention and lifecycle policies

Configure lifecycle management for automated data cleanup and cost optimization. Save the policy below as lifecycle-policy.json, then import it with the mc command that follows.

{
  "Rules": [
    {
      "ID": "ArchiveOldAnalytics",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "analytics-results/"
      },
      "Expiration": {
        "Days": 90
      }
    },
    {
      "ID": "DeleteTempData",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "temp/"
      },
      "Expiration": {
        "Days": 7
      }
    }
  ]
}
mc ilm import local/spark-data < lifecycle-policy.json
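To confirm the rules were applied, list them back (note that newer mc releases move this under mc ilm rule ls):

# List the active lifecycle rules on the bucket
mc ilm ls local/spark-data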
Warning: Never use chmod 777 on Spark or MinIO directories. This gives every user on the system full access to your data. Use proper ownership with chown and minimal permissions (755 for directories, 644 for files).

Common issues

Symptom | Cause | Fix
Spark can't connect to MinIO | Wrong endpoint or credentials | Verify spark-defaults.conf settings and that MinIO is running on port 9000
Access denied errors | Incorrect S3A configuration | Check the access key, secret key, and path.style.access=true setting
Java heap space errors | Insufficient memory allocation | Increase spark.executor.memory and spark.driver.memory in the configuration
Connection timeouts | Network or firewall issues | Verify firewall rules and increase the connection timeout values
JAR file not found errors | Missing AWS SDK dependencies | Ensure the hadoop-aws and aws-java-sdk-bundle JARs are in /opt/spark/jars/
Permission denied on Spark directories | Incorrect file ownership | Use sudo chown -R $USER:$USER /opt/spark instead of chmod 777
