Deploy multiple Ollama instances with HAProxy load balancing for high availability AI model serving

Intermediate · 45 min · Apr 11, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up multiple Ollama instances behind an HAProxy load balancer for high-availability AI model serving, with health checks, SSL termination, and automatic failover to eliminate single points of failure.

Prerequisites

  • Multiple servers (minimum 3: 1 for HAProxy, 2 for Ollama instances)
  • Root or sudo access
  • Basic understanding of HTTP load balancing
  • At least 8GB RAM per Ollama instance

What this solves

Running a single Ollama instance creates a single point of failure for AI model serving: when that instance goes down, your entire AI service becomes unavailable. This tutorial shows you how to deploy multiple Ollama instances behind an HAProxy load balancer for high availability, automatic failover, and improved performance through load distribution.

Step-by-step installation

Update system packages

Start by updating your system packages to ensure you have the latest versions available.

# Debian/Ubuntu
sudo apt update && sudo apt upgrade -y

# AlmaLinux/Rocky
sudo dnf update -y

Install HAProxy load balancer

HAProxy will distribute requests across multiple Ollama instances and handle health checks for automatic failover.

# Debian/Ubuntu
sudo apt install -y haproxy

# AlmaLinux/Rocky
sudo dnf install -y haproxy

Install Ollama on multiple instances

Download and install Ollama on each server that will serve AI models. Run this on each target server.

curl -fsSL https://ollama.ai/install.sh | sh

Configure Ollama service binding

Configure each Ollama instance to bind to all interfaces instead of localhost only, so HAProxy can connect from another server. Since this exposes the API beyond localhost, restrict port 11434 to the load balancer in your firewall.

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
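Once the override is in place and the instances restarted (next step), you can sanity-check reachability from the HAProxy host. A minimal sketch, using the same example backend IPs as the HAProxy configuration later in this guide:

```shell
# Each backend should answer /api/tags from the HAProxy host.
# 203.0.113.x are example addresses; substitute your own.
for host in 203.0.113.10 203.0.113.11 203.0.113.12; do
  if curl -s --max-time 5 "http://$host:11434/api/tags" > /dev/null; then
    echo "$host: ok"
  else
    echo "$host: unreachable"
  fi
done
```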

Start Ollama services

Enable and start Ollama on each server instance. Check the status to confirm successful startup.

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
sudo systemctl status ollama

Download AI models on each instance

Pull the same AI model on each Ollama instance to ensure consistent responses across all backends.

ollama pull llama3.2:3b
ollama pull codellama:7b
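With several backends, pulling models by hand gets tedious. A sketch that fans the pulls out over SSH — the host list and root SSH access are assumptions, and the echo makes it a dry run (remove it to actually execute):

```shell
BACKENDS="203.0.113.10 203.0.113.11 203.0.113.12"  # example backend IPs
MODELS="llama3.2:3b codellama:7b"

for host in $BACKENDS; do
  for model in $MODELS; do
    # Dry run: prints each command instead of running it.
    echo ssh "root@$host" "ollama pull $model"
  done
done
```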

Configure HAProxy load balancer

Create the HAProxy configuration at /etc/haproxy/haproxy.cfg (back up the original first) with the Ollama backend servers, health checks, and load-balancing settings below. Replace the example 203.0.113.x addresses with your backend IPs; the certificate referenced on the 8443 bind line is generated in a later step.

global
    log         127.0.0.1:514 local0
    chroot      /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user        haproxy
    group       haproxy
    daemon

defaults
    mode        http
    log         global
    option      httplog
    option      dontlognull
    option      http-server-close
    option      forwardfor
    option      redispatch
    retries     3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn     3000

frontend ollama_frontend
    bind *:8080
    bind *:8443 ssl crt /etc/ssl/certs/ollama.pem
    redirect scheme https if !{ ssl_fc }
    default_backend ollama_backend

backend ollama_backend
    balance roundrobin
    option httpchk GET /api/tags
    http-check expect status 200
    server ollama1 203.0.113.10:11434 check inter 30s fall 3 rise 2
    server ollama2 203.0.113.11:11434 check inter 30s fall 3 rise 2
    server ollama3 203.0.113.12:11434 check inter 30s fall 3 rise 2

listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 30s
    stats admin if TRUE
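One tuning note: roundrobin spreads requests evenly regardless of how long each one takes, but LLM generations are long-lived and vary widely in duration, so leastconn (send each request to the server with the fewest active connections) is often a better fit. A drop-in alternative for the backend section:

```
backend ollama_backend
    balance leastconn
    option httpchk GET /api/tags
    http-check expect status 200
    server ollama1 203.0.113.10:11434 check inter 30s fall 3 rise 2
    server ollama2 203.0.113.11:11434 check inter 30s fall 3 rise 2
    server ollama3 203.0.113.12:11434 check inter 30s fall 3 rise 2
```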

Generate SSL certificates

Create SSL certificates for secure HTTPS connections to the load balancer. Replace example.com with your actual domain.

sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /etc/ssl/private/ollama.key \
    -out /etc/ssl/certs/ollama.crt \
    -subj "/C=US/ST=State/L=City/O=Organization/CN=example.com"
sudo sh -c 'cat /etc/ssl/certs/ollama.crt /etc/ssl/private/ollama.key > /etc/ssl/certs/ollama.pem'
sudo chown root:haproxy /etc/ssl/certs/ollama.pem
sudo chmod 640 /etc/ssl/certs/ollama.pem
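Before handing the bundle to HAProxy, it's worth confirming the certificate and key actually belong together by comparing their moduli. The sketch below runs against a throwaway pair in /tmp so it works anywhere; point the same two openssl commands at your real /etc/ssl files on the load balancer:

```shell
# Generate a throwaway self-signed pair (stand-in for the real files).
openssl req -x509 -nodes -days 1 -newkey rsa:2048 \
    -keyout /tmp/demo.key -out /tmp/demo.crt -subj "/CN=example.com" 2>/dev/null

# A matching cert/key pair yields identical modulus digests.
openssl x509 -noout -modulus -in /tmp/demo.crt | openssl md5
openssl rsa  -noout -modulus -in /tmp/demo.key | openssl md5
```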

Configure firewall rules

Open the HAProxy frontend ports (8080, 8443) and stats port (8404) on the load balancer, and port 11434 on each Ollama backend — ideally restricted to the load balancer's IP.

# Ubuntu/Debian (ufw)
sudo ufw allow 8080/tcp
sudo ufw allow 8443/tcp
sudo ufw allow 8404/tcp
sudo ufw allow 11434/tcp

# AlmaLinux/Rocky (firewalld)
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --permanent --add-port=8443/tcp
sudo firewall-cmd --permanent --add-port=8404/tcp
sudo firewall-cmd --permanent --add-port=11434/tcp
sudo firewall-cmd --reload

Start HAProxy service

Enable and start HAProxy to begin load balancing across Ollama instances.

sudo systemctl enable --now haproxy
sudo systemctl status haproxy

Configure health check monitoring

Set up a systemd service and timer that monitor Ollama instance health and restart it on failure. Create /etc/systemd/system/ollama-health-check.service (it runs as root because the health check script restarts the ollama unit via systemctl):

[Unit]
Description=Ollama Health Check Service
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ollama-health-check.sh

Then create /etc/systemd/system/ollama-health-check.timer:

[Unit]
Description=Run Ollama Health Check Every Minute
Requires=ollama-health-check.service

[Timer]
OnCalendar=minutely
Persistent=true

[Install]
WantedBy=timers.target

Create health check script

Save the following script as /usr/local/bin/ollama-health-check.sh. It checks the local Ollama API and restarts the service on failure, giving up after MAX_RESTART_ATTEMPTS consecutive failed restarts.

#!/bin/bash

OLLAMA_URL="http://localhost:11434/api/tags"
HEALTH_CHECK_TIMEOUT=10
MAX_RESTART_ATTEMPTS=3
RESTART_COUNT_FILE="/tmp/ollama_restart_count"

# Check Ollama health
check_ollama_health() {
    curl -s --max-time "$HEALTH_CHECK_TIMEOUT" "$OLLAMA_URL" > /dev/null 2>&1
}

# Get restart count
get_restart_count() {
    if [[ -f "$RESTART_COUNT_FILE" ]]; then
        cat "$RESTART_COUNT_FILE"
    else
        echo 0
    fi
}

# Increment restart count
increment_restart_count() {
    local count
    count=$(get_restart_count)
    echo $((count + 1)) > "$RESTART_COUNT_FILE"
}

# Reset restart count
reset_restart_count() {
    echo 0 > "$RESTART_COUNT_FILE"
}

# Main health check logic
if check_ollama_health; then
    echo "$(date): Ollama is healthy"
    reset_restart_count
    exit 0
else
    echo "$(date): Ollama health check failed"
    restart_count=$(get_restart_count)
    if [[ $restart_count -lt $MAX_RESTART_ATTEMPTS ]]; then
        echo "$(date): Attempting to restart Ollama (attempt $((restart_count + 1))/$MAX_RESTART_ATTEMPTS)"
        systemctl restart ollama
        increment_restart_count
        # Wait a moment and check again
        sleep 30
        if check_ollama_health; then
            echo "$(date): Ollama restart successful"
            reset_restart_count
            exit 0
        else
            echo "$(date): Ollama restart failed"
            exit 1
        fi
    else
        echo "$(date): Maximum restart attempts reached. Manual intervention required."
        exit 1
    fi
fi

Make the script executable and enable the timer:

sudo chmod 755 /usr/local/bin/ollama-health-check.sh
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-health-check.timer

Configure SSL termination and security

Enable SSL security headers

Add security headers to HAProxy configuration for enhanced HTTPS security.

sudo cp /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.backup
# Add these lines to the frontend ollama_frontend section
frontend ollama_frontend
    bind *:8080
    bind *:8443 ssl crt /etc/ssl/certs/ollama.pem
    redirect scheme https if !{ ssl_fc }
    
    # Security headers
    http-response set-header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
    http-response set-header X-Frame-Options "DENY"
    http-response set-header X-Content-Type-Options "nosniff"
    http-response set-header X-XSS-Protection "1; mode=block"
    http-response set-header Referrer-Policy "strict-origin-when-cross-origin"
    
    # Rate limiting
    stick-table type ip size 100k expire 30s store http_req_rate(10s)
    http-request track-sc0 src
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 20 }
    
    default_backend ollama_backend
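You can confirm the limiter bites by hammering the frontend from the HAProxy host: with the 20-requests-per-10-seconds table above, the tail of a 25-request burst should be cut off (curl prints the rejection status, or 000 if the connection is simply closed). A rough smoke test:

```shell
# Fire 25 back-to-back requests at the HTTP frontend.
for i in $(seq 1 25); do
  printf 'request %02d -> %s\n' "$i" \
    "$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 http://localhost:8080/api/tags)"
done
```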

Configure backend connection limits

Set connection limits and timeouts for better resource management and security. Note that long model generations can exceed 60 seconds; raise timeout server if your workloads stream long responses.

# Update backend ollama_backend section
backend ollama_backend
    balance roundrobin
    option httpchk GET /api/tags
    http-check expect status 200
    
    # Connection limits
    timeout server 60s
    timeout connect 5s
    
    server ollama1 203.0.113.10:11434 check inter 30s fall 3 rise 2 maxconn 50
    server ollama2 203.0.113.11:11434 check inter 30s fall 3 rise 2 maxconn 50
    server ollama3 203.0.113.12:11434 check inter 30s fall 3 rise 2 maxconn 50

Restart HAProxy with new configuration

Apply the updated configuration and verify HAProxy starts successfully.

sudo haproxy -f /etc/haproxy/haproxy.cfg -c
sudo systemctl restart haproxy
sudo systemctl status haproxy

Monitor performance and implement scaling

Set up HAProxy statistics monitoring

Configure detailed statistics collection for monitoring load balancer performance and backend health.

# Update the stats section with authentication
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 10s
    stats admin if TRUE
    stats auth admin:secure_password_here
    stats show-legends
    stats show-desc "Ollama Load Balancer Statistics"
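Beyond the HTML page, HAProxy serves the same data as CSV — append ;csv to the stats URI (curl -su admin:password 'http://localhost:8404/stats;csv') — which is much easier to script against. A sketch that pulls per-server health out of it, run here against an inlined sample so the field positions are visible (svname is field 2, status is field 18 of the CSV):

```shell
# Inlined sample of HAProxy's /stats;csv output (first 18 columns shown).
cat <<'EOF' > /tmp/haproxy-stats-sample.csv
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status
ollama_backend,ollama1,0,0,1,3,50,120,1024,4096,0,0,0,0,0,0,0,UP
ollama_backend,ollama2,0,0,0,2,50,98,512,2048,0,0,0,0,0,0,0,DOWN
EOF

# Print server name and health state for the ollama_backend pool.
awk -F',' '!/^#/ && $1 == "ollama_backend" { print $2, $18 }' /tmp/haproxy-stats-sample.csv
```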

Create performance monitoring script

Save the following script as /usr/local/bin/ollama-performance-monitor.sh. It records response times for the load balancer and each backend, mails an alert on slow responses, and writes a JSON metrics snapshot for external monitoring.

#!/bin/bash

LOG_FILE="/var/log/ollama-performance.log"
METRICS_FILE="/var/log/ollama-metrics.json"
ALERT_THRESHOLD_MS=5000
EMAIL_ALERT="admin@example.com"

# Measure response time (ms) and HTTP status code for a URL
test_response_time() {
    local url="$1"
    local start_time end_time http_code response_time_ms
    start_time=$(date +%s%N)
    http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "$url" 2>/dev/null)
    end_time=$(date +%s%N)
    response_time_ms=$(( (end_time - start_time) / 1000000 ))
    echo "$response_time_ms,$http_code"
}

# Backends to test directly, plus the load balancer itself
backends=("203.0.113.10:11434" "203.0.113.11:11434" "203.0.113.12:11434")
loadbalancer="localhost:8080"

# Test the load balancer
lb_result=$(test_response_time "http://$loadbalancer/api/tags")
lb_time=$(echo "$lb_result" | cut -d',' -f1)
lb_status=$(echo "$lb_result" | cut -d',' -f2)
echo "$(date): LoadBalancer response_time=${lb_time}ms status=$lb_status" >> "$LOG_FILE"

# Test individual backends, remembering results for the JSON snapshot
declare -a times statuses
for i in "${!backends[@]}"; do
    backend="${backends[$i]}"
    result=$(test_response_time "http://$backend/api/tags")
    times[$i]=$(echo "$result" | cut -d',' -f1)
    statuses[$i]=$(echo "$result" | cut -d',' -f2)
    echo "$(date): Backend$((i+1)) ($backend) response_time=${times[$i]}ms status=${statuses[$i]}" >> "$LOG_FILE"
    # Alert on slow responses
    if [[ ${times[$i]} -gt $ALERT_THRESHOLD_MS ]]; then
        echo "ALERT: Backend$((i+1)) slow response: ${times[$i]}ms" \
            | mail -s "Ollama Performance Alert" "$EMAIL_ALERT"
    fi
done

# Create JSON metrics for external monitoring
{
    echo "{"
    echo "  \"timestamp\": \"$(date -Iseconds)\","
    echo "  \"loadbalancer\": { \"response_time_ms\": $lb_time, \"status_code\": $lb_status },"
    echo "  \"backends\": ["
    for i in "${!backends[@]}"; do
        sep=","
        [[ $i -eq $(( ${#backends[@]} - 1 )) ]] && sep=""
        echo "    { \"name\": \"backend$((i+1))\", \"address\": \"${backends[$i]}\", \"response_time_ms\": ${times[$i]}, \"status_code\": ${statuses[$i]} }$sep"
    done
    echo "  ]"
    echo "}"
} > "$METRICS_FILE"

Make it executable:

sudo chmod 755 /usr/local/bin/ollama-performance-monitor.sh
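The monitor only helps if it runs regularly. One simple option is a cron entry (the five-minute interval here is an arbitrary choice):

```
*/5 * * * * /usr/local/bin/ollama-performance-monitor.sh
```

Install it with sudo crontab -e, or mirror the systemd timer pattern used for the health check.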

Configure log rotation

Set up log rotation to prevent the performance and metrics logs from consuming too much disk space, e.g. in /etc/logrotate.d/ollama-monitoring:

/var/log/ollama-performance.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}

/var/log/ollama-metrics.json {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
Security Note: Never use chmod 777 on configuration files. Keep haproxy.cfg at 644, and because the combined ollama.pem bundles the private key, restrict it to something like 640 owned by root:haproxy. Use chown haproxy:haproxy for service-owned files.

Verify your setup

Test the load balancer functionality and verify all components are working correctly.

# Check HAProxy status
sudo systemctl status haproxy

Test load balancer endpoint

curl -k https://localhost:8443/api/tags

Check backend health via HAProxy stats

curl -u admin:secure_password_here http://localhost:8404/stats

Test model inference through load balancer

curl -k -X POST https://localhost:8443/api/generate -d '{
    "model": "llama3.2:3b",
    "prompt": "What is load balancing?",
    "stream": false
}'

Check individual Ollama instances

for host in 203.0.113.10 203.0.113.11 203.0.113.12; do
    echo "Testing direct connection to Ollama on $host:"
    curl -s http://$host:11434/api/tags | jq '.models[].name'
done

Verify SSL certificate

openssl s_client -connect localhost:8443 -servername example.com < /dev/null

Common issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| 502 Bad Gateway errors | Ollama instances not responding | Check sudo systemctl status ollama on backend servers |
| SSL certificate errors | Invalid or expired certificates | Regenerate certificates with the correct domain: openssl req -x509 -nodes -days 365... |
| Health checks failing | Ollama not binding to 0.0.0.0 | Verify OLLAMA_HOST=0.0.0.0:11434 in the systemd override |
| Connection refused on 8080/8443 | HAProxy not started or firewall blocking | sudo systemctl restart haproxy && sudo ufw allow 8080,8443/tcp |
| Models not found errors | Different models on different instances | Run ollama pull model_name on all backend servers |
| High response times | Insufficient resources or overloading | Scale up backend instances or adjust maxconn limits |
