Set up Apache Spark 3.5 with MinIO S3-compatible object storage for scalable big data analytics. Configure distributed storage, implement data lake patterns, and run production analytics workflows on your cluster infrastructure.
Prerequisites
- A Linux server (the manual steps target Ubuntu/Debian; the automated script at the end also supports RHEL-family distros)
- At least 8GB RAM
- 50GB free disk space
- Root or sudo access
- Internet connection for package downloads
What this solves
Apache Spark with MinIO pairs distributed computing with S3-compatible object storage. This setup lets you process datasets far larger than a single machine's memory using Spark's distributed engine while storing them cost-effectively in MinIO's high-performance object store. You need this configuration when building data lakes, running ETL pipelines, or performing large-scale analytics that require both distributed processing and scalable storage.
Step-by-step installation
Update system packages and install Java
Spark 3.5 runs on Java 8, 11, or 17; this guide uses OpenJDK 11. Update your package manager and install the JDK along with a few utilities.
sudo apt update && sudo apt upgrade -y
sudo apt install -y openjdk-11-jdk wget curl unzip
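A quick sanity check; the second command reveals the installed JDK path in case your distro places Java somewhere other than /usr/lib/jvm/java-11-openjdk-amd64, in which case adjust JAVA_HOME in the next step:
java -version
readlink -f "$(command -v java)"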
Install Apache Spark 3.5
Download and install Apache Spark 3.5 with Hadoop support for distributed storage integration.
cd /opt
sudo wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
sudo tar -xzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 spark
sudo chown -R $USER:$USER /opt/spark
sudo chmod 755 /opt/spark
Configure Spark environment variables
Persist the Spark and Java paths in your shell profile so new sessions pick them up.
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH' >> ~/.bashrc
source ~/.bashrc
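Confirm the variables took effect in the current shell:
echo "$SPARK_HOME"
spark-submit --version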
Install MinIO server
Install the MinIO object storage server; it will provide the S3-compatible storage backend for your Spark datasets.
wget https://dl.min.io/server/minio/release/linux-amd64/minio
sudo chmod +x minio
sudo mv minio /usr/local/bin/
sudo useradd -r minio-user -s /sbin/nologin
sudo mkdir -p /opt/minio/data /etc/minio
sudo chown minio-user:minio-user /opt/minio/data /etc/minio
Configure MinIO credentials and service
Create /etc/minio/minio.conf with the access credentials and startup options below, then define a systemd service for automatic startup. The default credentials are for testing only; change them before exposing this host.
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin123
MINIO_VOLUMES=/opt/minio/data
MINIO_OPTS="--console-address :9001"
Then create the systemd unit at /etc/systemd/system/minio.service:
[Unit]
Description=MinIO Object Storage
Documentation=https://docs.min.io
Wants=network-online.target
After=network-online.target
[Service]
Type=notify
User=minio-user
Group=minio-user
EnvironmentFile=/etc/minio/minio.conf
ExecStart=/usr/local/bin/minio server $MINIO_OPTS $MINIO_VOLUMES
Restart=always
LimitNOFILE=65536
TasksMax=infinity
TimeoutStopSec=infinity
SendSIGKILL=no
[Install]
WantedBy=multi-user.target
Start and enable MinIO service
Enable MinIO to start on boot and verify it's running properly.
sudo systemctl daemon-reload
sudo systemctl enable --now minio
sudo systemctl status minio
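MinIO also exposes a liveness endpoint you can probe directly; a 200 response means the server is up:
curl -I http://localhost:9000/minio/health/live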
Download required JAR files for S3 integration
Download the Hadoop AWS connector and AWS SDK bundle that Spark needs to talk to S3-compatible storage. The hadoop-aws version must match the Hadoop version bundled with your Spark build (3.3.4 for spark-3.5.1-bin-hadoop3).
cd /opt/spark/jars
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
sudo chown $USER:$USER *.jar
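Optionally confirm both JARs landed where Spark loads them from:
ls -lh /opt/spark/jars/hadoop-aws-3.3.4.jar /opt/spark/jars/aws-java-sdk-bundle-1.12.262.jar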
Configure Spark for MinIO integration
Add the following to /opt/spark/conf/spark-defaults.conf to enable the S3A file system with your MinIO endpoint and credentials.
spark.hadoop.fs.s3a.endpoint=http://localhost:9000
spark.hadoop.fs.s3a.access.key=minioadmin
spark.hadoop.fs.s3a.secret.key=minioadmin123
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.connection.ssl.enabled=false
spark.hadoop.fs.s3a.attempts.maximum=3
spark.hadoop.fs.s3a.connection.establish.timeout=5000
spark.hadoop.fs.s3a.connection.timeout=200000
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
Install MinIO client for bucket management
Install the MinIO client (mc) to create and manage buckets for your Spark data.
wget https://dl.min.io/client/mc/release/linux-amd64/mc
sudo chmod +x mc
sudo mv mc /usr/local/bin/
mc alias set local http://localhost:9000 minioadmin minioadmin123
mc mb local/spark-data
mc mb local/spark-warehouse
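With the buckets in place, you can smoke-test the full S3A path end to end. A minimal sketch, assuming the spark-defaults.conf from the earlier step is in place; it pipes a few statements into the pyspark shell, then lists the object it wrote:
pyspark <<'EOF'
df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.write.mode("overwrite").parquet("s3a://spark-data/smoke-test/")
print(spark.read.parquet("s3a://spark-data/smoke-test/").count())
EOF
mc ls local/spark-data/smoke-test/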
Configure firewall for MinIO and Spark
Open necessary ports for MinIO API, console, and Spark web UI access.
sudo ufw allow 9000/tcp comment 'MinIO API'
sudo ufw allow 9001/tcp comment 'MinIO Console'
sudo ufw allow 4040/tcp comment 'Spark Web UI'
sudo ufw allow 7077/tcp comment 'Spark Master'
sudo ufw allow 8080/tcp comment 'Spark Master Web UI'
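These rules open the ports to any source. On an internet-facing host, consider restricting access to a trusted network instead; 203.0.113.0/24 below is a placeholder to replace with your own admin range:
sudo ufw allow from 203.0.113.0/24 to any port 9001 proto tcp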
Start Spark standalone cluster
Start the Spark master and worker processes to create a standalone cluster for distributed processing.
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://localhost:7077
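Both daemons should now show up in the JVM process list (jps ships with the JDK):
jps | grep -E 'Master|Worker'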
Create sample data analytics job
Create a Python script at /opt/spark/examples/spark-minio-analytics.py that demonstrates reading and writing data between Spark and MinIO.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg, max, min
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
import random
# Create Spark session with MinIO configuration
spark = SparkSession.builder \
.appName("SparkMinIOAnalytics") \
.config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
.config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
.config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
.config("spark.hadoop.fs.s3a.path.style.access", "true") \
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.getOrCreate()
# Create sample sales data
schema = StructType([
StructField("product_id", StringType(), True),
StructField("category", StringType(), True),
StructField("price", DoubleType(), True),
StructField("quantity", IntegerType(), True),
StructField("region", StringType(), True)
])
# Generate sample data
sample_data = []
for i in range(10000):
sample_data.append((
f"prod_{i % 100}",
random.choice(["Electronics", "Clothing", "Books", "Home"]),
round(random.uniform(10.0, 500.0), 2),
random.randint(1, 10),
random.choice(["North", "South", "East", "West"])
))
df = spark.createDataFrame(sample_data, schema)
# Write data to MinIO
print("Writing data to MinIO...")
df.write.mode("overwrite").parquet("s3a://spark-data/sales/")
# Read data from MinIO and perform analytics
print("Reading data from MinIO and performing analytics...")
sales_df = spark.read.parquet("s3a://spark-data/sales/")
# Perform analytics
analytics_results = sales_df.groupBy("category", "region") \
.agg(
count("*").alias("total_sales"),
avg("price").alias("avg_price"),
max("quantity").alias("max_quantity"),
min("quantity").alias("min_quantity")
) \
.orderBy(col("total_sales").desc())
print("Analytics Results:")
analytics_results.show(20)
# Save analytics results back to MinIO
analytics_results.write.mode("overwrite").parquet("s3a://spark-data/analytics-results/")
print("Analytics completed and results saved to MinIO")
spark.stop()
Verify your setup
Test the Spark and MinIO integration by running the analytics job and checking the results.
# Check MinIO is running
curl -I http://localhost:9000/minio/health/ready
# Check Spark master is running
curl -s http://localhost:8080 | grep -q "Spark Master"
echo "Spark Master Status: $?"
# Run the analytics job
spark-submit --master spark://localhost:7077 /opt/spark/examples/spark-minio-analytics.py
# Verify data was written to MinIO
mc ls local/spark-data/
mc ls local/spark-data/analytics-results/
# Check Spark application in web UI
echo "Access Spark Master UI: http://localhost:8080"
echo "Access MinIO Console: http://localhost:9001"
Production optimization
Configure Spark memory and performance settings
Optimize Spark configuration for production workloads with proper memory allocation and performance tuning.
# Append these settings to /opt/spark/conf/spark-defaults.conf
spark.executor.memory=4g
spark.driver.memory=2g
spark.executor.cores=4
spark.sql.adaptive.coalescePartitions.initialPartitionNum=200
spark.sql.adaptive.advisoryPartitionSizeInBytes=128MB
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.shuffle.partitions=200
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.unsafe=true
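These values can also be overridden per job at submit time rather than globally; a sketch, where your_job.py stands in for your own application:
spark-submit \
  --master spark://localhost:7077 \
  --conf spark.executor.memory=4g \
  --conf spark.driver.memory=2g \
  --conf spark.executor.cores=4 \
  your_job.py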
Set up MinIO data retention and lifecycle policies
Save the policy below as lifecycle-policy.json, then import it with mc to automate data cleanup and control storage costs.
{
"Rules": [
{
"ID": "ArchiveOldAnalytics",
"Status": "Enabled",
"Filter": {
"Prefix": "analytics-results/"
},
"Expiration": {
"Days": 90
}
},
{
"ID": "DeleteTempData",
"Status": "Enabled",
"Filter": {
"Prefix": "temp/"
},
"Expiration": {
"Days": 7
}
}
]
}
mc ilm import local/spark-data < lifecycle-policy.json
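To confirm the rules took effect, export them back out (assumes your mc version supports ilm export, the counterpart of import):
mc ilm export local/spark-data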
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Spark can't connect to MinIO | Wrong endpoint or credentials | Verify spark-defaults.conf settings and MinIO is running on port 9000 |
| Access denied errors | Incorrect S3A configuration | Check access key, secret key, and path.style.access=true setting |
| Java heap space errors | Insufficient memory allocation | Increase spark.executor.memory and spark.driver.memory in configuration |
| Connection timeouts | Network or firewall issues | Verify firewall rules and increase connection timeout values |
| JAR file not found errors | Missing AWS SDK dependencies | Ensure hadoop-aws and aws-java-sdk-bundle JARs are in /opt/spark/jars/ |
| Permission denied on Spark directories | Incorrect file ownership | Use sudo chown -R $USER:$USER /opt/spark instead of chmod 777 |
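When MinIO itself is the suspect, its service journal usually pinpoints the failure:
sudo journalctl -u minio --no-pager -n 50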
Next steps
- Install and configure MinIO object storage with SSL and clustering
- Configure Apache Airflow monitoring with Prometheus alerts
- Set up Spark Delta Lake with MinIO for ACID transactions
- Configure Spark on Kubernetes with MinIO for cloud-native analytics
- Implement Spark Streaming with Kafka and MinIO for real-time analytics
Automated install script
Run this script to automate the entire setup:
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Configuration variables
MINIO_USER="minioadmin"
MINIO_PASSWORD="${MINIO_PASSWORD:-minioadmin123}"
SPARK_VERSION="3.5.1"
HADOOP_VERSION="3.3.4"
AWS_SDK_VERSION="1.12.262"
# Print colored output
print_status() {
echo -e "${BLUE}[INFO]${NC} $1"
}
print_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
print_warning() {
echo -e "${YELLOW}[WARNING]${NC} $1"
}
print_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
# Cleanup function for rollback
cleanup() {
print_warning "Installation failed. Cleaning up..."
sudo systemctl stop minio 2>/dev/null || true
sudo systemctl disable minio 2>/dev/null || true
sudo rm -f /etc/systemd/system/minio.service
sudo rm -rf /opt/spark /opt/minio /etc/minio
sudo userdel minio-user 2>/dev/null || true
sudo rm -f /usr/local/bin/minio
}
trap cleanup ERR
# Check if running with appropriate privileges
check_privileges() {
if [[ $EUID -eq 0 ]]; then
print_error "Don't run this script as root. Use a user with sudo privileges."
exit 1
fi
if ! sudo -n true 2>/dev/null; then
print_error "This script requires sudo privileges. Please run: sudo -v"
exit 1
fi
}
# Auto-detect distribution
detect_distro() {
if [ ! -f /etc/os-release ]; then
print_error "/etc/os-release not found. Cannot detect distribution."
exit 1
fi
. /etc/os-release
case "$ID" in
ubuntu|debian)
PKG_MGR="apt"
PKG_UPDATE="apt update"
PKG_INSTALL="apt install -y"
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
JAVA_PKG="openjdk-11-jdk"
;;
almalinux|rocky|centos|rhel|ol|fedora)
PKG_MGR="dnf"
PKG_UPDATE="dnf update -y"
PKG_INSTALL="dnf install -y"
JAVA_HOME="/usr/lib/jvm/java-11-openjdk"
JAVA_PKG="java-11-openjdk-devel"
;;
amzn)
PKG_MGR="yum"
PKG_UPDATE="yum update -y"
PKG_INSTALL="yum install -y"
JAVA_HOME="/usr/lib/jvm/java-11-openjdk"
JAVA_PKG="java-11-openjdk-devel"
;;
*)
print_error "Unsupported distribution: $ID"
exit 1
;;
esac
print_success "Detected distribution: $ID"
}
# Update system and install dependencies
install_dependencies() {
print_status "[1/8] Updating system and installing dependencies..."
sudo $PKG_UPDATE
sudo $PKG_INSTALL $JAVA_PKG wget curl unzip tar
# Verify Java installation
if ! java -version >/dev/null 2>&1; then
print_error "Java installation failed"
exit 1
fi
print_success "Dependencies installed successfully"
}
# Install Apache Spark
install_spark() {
print_status "[2/8] Installing Apache Spark ${SPARK_VERSION}..."
cd /tmp
wget -q "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz"
sudo mkdir -p /opt
sudo tar -xzf "spark-${SPARK_VERSION}-bin-hadoop3.tgz" -C /opt
sudo mv "/opt/spark-${SPARK_VERSION}-bin-hadoop3" /opt/spark
sudo chown -R $USER:$USER /opt/spark
sudo chmod -R 755 /opt/spark
rm -f "spark-${SPARK_VERSION}-bin-hadoop3.tgz"
print_success "Apache Spark installed successfully"
}
# Configure environment variables
configure_environment() {
print_status "[3/8] Configuring environment variables..."
# Remove existing entries to avoid duplicates
grep -v "JAVA_HOME\|SPARK_HOME\|export.*spark" ~/.bashrc > ~/.bashrc.tmp || true
mv ~/.bashrc.tmp ~/.bashrc
# Add new environment variables
{
echo "export JAVA_HOME=${JAVA_HOME}"
echo "export SPARK_HOME=/opt/spark"
echo "export PATH=\$SPARK_HOME/bin:\$SPARK_HOME/sbin:\$PATH"
} >> ~/.bashrc
export JAVA_HOME="${JAVA_HOME}"
export SPARK_HOME="/opt/spark"
export PATH="/opt/spark/bin:/opt/spark/sbin:$PATH"
print_success "Environment variables configured"
}
# Install MinIO server
install_minio() {
print_status "[4/8] Installing MinIO server..."
cd /tmp
wget -q https://dl.min.io/server/minio/release/linux-amd64/minio
sudo chmod 755 minio
sudo mv minio /usr/local/bin/
# Create MinIO user and directories
sudo useradd -r minio-user -s /sbin/nologin 2>/dev/null || true
sudo mkdir -p /opt/minio/data /etc/minio
sudo chown minio-user:minio-user /opt/minio/data /etc/minio
sudo chmod 755 /opt/minio/data /etc/minio
print_success "MinIO server installed successfully"
}
# Configure MinIO service
configure_minio() {
print_status "[5/8] Configuring MinIO service..."
# Create MinIO configuration
sudo tee /etc/minio/minio.conf > /dev/null <<EOF
MINIO_ROOT_USER=${MINIO_USER}
MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
MINIO_VOLUMES=/opt/minio/data
MINIO_OPTS=--console-address :9001
EOF
sudo chmod 640 /etc/minio/minio.conf
sudo chown minio-user:minio-user /etc/minio/minio.conf
# Create systemd service
sudo tee /etc/systemd/system/minio.service > /dev/null <<EOF
[Unit]
Description=MinIO Object Storage
Documentation=https://docs.min.io
Wants=network-online.target
After=network-online.target
[Service]
Type=notify
User=minio-user
Group=minio-user
EnvironmentFile=/etc/minio/minio.conf
ExecStart=/usr/local/bin/minio server \$MINIO_OPTS \$MINIO_VOLUMES
Restart=always
LimitNOFILE=65536
TasksMax=infinity
TimeoutStopSec=infinity
SendSIGKILL=no
[Install]
WantedBy=multi-user.target
EOF
sudo chmod 644 /etc/systemd/system/minio.service
print_success "MinIO service configured"
}
# Start MinIO service
start_minio() {
print_status "[6/8] Starting MinIO service..."
sudo systemctl daemon-reload
sudo systemctl enable minio
sudo systemctl start minio
# Wait for service to start
sleep 5
if ! sudo systemctl is-active --quiet minio; then
print_error "MinIO service failed to start"
sudo systemctl status minio
exit 1
fi
print_success "MinIO service started successfully"
}
# Download S3 integration JAR files
download_jars() {
print_status "[7/8] Downloading S3 integration JAR files..."
cd /opt/spark/jars
sudo wget -q "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar"
sudo wget -q "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar"
sudo chown $USER:$USER hadoop-aws-*.jar aws-java-sdk-bundle-*.jar
sudo chmod 644 hadoop-aws-*.jar aws-java-sdk-bundle-*.jar
print_success "JAR files downloaded successfully"
}
# Configure Spark for MinIO integration
configure_spark_minio() {
print_status "[8/8] Configuring Spark for MinIO integration..."
sudo tee /opt/spark/conf/spark-defaults.conf > /dev/null <<EOF
spark.hadoop.fs.s3a.endpoint=http://localhost:9000
spark.hadoop.fs.s3a.access.key=${MINIO_USER}
spark.hadoop.fs.s3a.secret.key=${MINIO_PASSWORD}
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.connection.ssl.enabled=false
spark.hadoop.fs.s3a.attempts.maximum=3
spark.hadoop.fs.s3a.connection.establish.timeout=5000
spark.hadoop.fs.s3a.connection.timeout=200000
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
EOF
sudo chown $USER:$USER /opt/spark/conf/spark-defaults.conf
sudo chmod 644 /opt/spark/conf/spark-defaults.conf
print_success "Spark configuration completed"
}
# Verify installation
verify_installation() {
print_status "Verifying installation..."
# Check MinIO service
if sudo systemctl is-active --quiet minio; then
print_success "✓ MinIO service is running"
else
print_error "✗ MinIO service is not running"
return 1
fi
# Check Spark installation
if /opt/spark/bin/spark-submit --version >/dev/null 2>&1; then
print_success "✓ Spark is installed and accessible"
else
print_error "✗ Spark installation verification failed"
return 1
fi
# Check JAR files
if [[ -f "/opt/spark/jars/hadoop-aws-${HADOOP_VERSION}.jar" ]]; then
print_success "✓ Hadoop AWS JAR file is present"
else
print_error "✗ Hadoop AWS JAR file is missing"
return 1
fi
print_success "Installation verification completed successfully!"
echo
echo -e "${GREEN}MinIO Console:${NC} http://localhost:9001"
echo -e "${GREEN}MinIO API:${NC} http://localhost:9000"
echo -e "${GREEN}Credentials:${NC} ${MINIO_USER} / ${MINIO_PASSWORD}"
echo
echo -e "${YELLOW}To use Spark with MinIO, source your bashrc:${NC}"
echo "source ~/.bashrc"
}
# Main execution
main() {
check_privileges
detect_distro
install_dependencies
install_spark
configure_environment
install_minio
configure_minio
start_minio
download_jars
configure_spark_minio
verify_installation
trap - ERR
print_success "Apache Spark with MinIO installation completed successfully!"
}
main "$@"
Review the script before running. Execute with: bash install.sh