Deploy multiple Ollama instances with HAProxy load balancing for high availability AI model serving

Intermediate · 45 min · Apr 11, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up multiple Ollama instances behind an HAProxy load balancer for high-availability AI model serving, with health checks, SSL termination, and automatic failover to eliminate single points of failure.

Prerequisites

  • Multiple servers (minimum 3: 1 for HAProxy, 2 for Ollama instances)
  • Root or sudo access
  • Basic understanding of HTTP load balancing
  • At least 8GB RAM per Ollama instance

What this solves

Running a single Ollama instance creates a single point of failure for AI model serving: when that instance goes down, your entire AI service becomes unavailable. This tutorial shows you how to deploy multiple Ollama instances behind an HAProxy load balancer for high availability, automatic failover, and improved performance through load distribution.

Step-by-step installation

Update system packages

Start by updating your system packages to ensure you have the latest versions available.

# Debian/Ubuntu
sudo apt update && sudo apt upgrade -y

# AlmaLinux/Rocky
sudo dnf update -y

Install HAProxy load balancer

HAProxy will distribute requests across multiple Ollama instances and handle health checks for automatic failover.

# Debian/Ubuntu
sudo apt install -y haproxy

# AlmaLinux/Rocky
sudo dnf install -y haproxy

Install Ollama on multiple instances

Download and install Ollama on each server that will serve AI models. Run this on each target server.

curl -fsSL https://ollama.ai/install.sh | sh

Configure Ollama service binding

Configure each Ollama instance to bind to all interfaces instead of localhost only, so HAProxy can connect from another server. Since this exposes the API beyond localhost, restrict port 11434 to the load balancer in your firewall.

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
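Once the override is in place and the instances restarted (next step), you can sanity-check reachability from the HAProxy host. A minimal sketch, using the same example backend IPs as the HAProxy configuration later in this guide:

```shell
# Each backend should answer /api/tags from the HAProxy host.
# 203.0.113.x are example addresses; substitute your own.
for host in 203.0.113.10 203.0.113.11 203.0.113.12; do
  if curl -s --max-time 5 "http://$host:11434/api/tags" > /dev/null; then
    echo "$host: ok"
  else
    echo "$host: unreachable"
  fi
done
```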

Start Ollama services

Enable and start Ollama on each server instance. Check the status to confirm successful startup.

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
sudo systemctl status ollama

Download AI models on each instance

Pull the same AI model on each Ollama instance to ensure consistent responses across all backends.

ollama pull llama3.2:3b
ollama pull codellama:7b
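With several backends, pulling models by hand gets tedious. A sketch that fans the pulls out over SSH — the host list and root SSH access are assumptions, and the echo makes it a dry run (remove it to actually execute):

```shell
BACKENDS="203.0.113.10 203.0.113.11 203.0.113.12"  # example backend IPs
MODELS="llama3.2:3b codellama:7b"

for host in $BACKENDS; do
  for model in $MODELS; do
    # Dry run: prints each command instead of running it.
    echo ssh "root@$host" "ollama pull $model"
  done
done
```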

Configure HAProxy load balancer

Create the HAProxy configuration at /etc/haproxy/haproxy.cfg (back up the original first) with the Ollama backend servers, health checks, and load-balancing settings below. Replace the example 203.0.113.x addresses with your backend IPs; the certificate referenced on the 8443 bind line is generated in a later step.

global
    log         127.0.0.1:514 local0
    chroot      /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user        haproxy
    group       haproxy
    daemon

defaults
    mode        http
    log         global
    option      httplog
    option      dontlognull
    option      http-server-close
    option      forwardfor
    option      redispatch
    retries     3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn     3000

frontend ollama_frontend
    bind *:8080
    bind *:8443 ssl crt /etc/ssl/certs/ollama.pem
    redirect scheme https if !{ ssl_fc }
    default_backend ollama_backend

backend ollama_backend
    balance roundrobin
    option httpchk GET /api/tags
    http-check expect status 200
    server ollama1 203.0.113.10:11434 check inter 30s fall 3 rise 2
    server ollama2 203.0.113.11:11434 check inter 30s fall 3 rise 2
    server ollama3 203.0.113.12:11434 check inter 30s fall 3 rise 2

listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 30s
    stats admin if TRUE
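One tuning note: roundrobin spreads requests evenly regardless of how long each one takes, but LLM generations are long-lived and vary widely in duration, so leastconn (send each request to the server with the fewest active connections) is often a better fit. A drop-in alternative for the backend section:

```
backend ollama_backend
    balance leastconn
    option httpchk GET /api/tags
    http-check expect status 200
    server ollama1 203.0.113.10:11434 check inter 30s fall 3 rise 2
    server ollama2 203.0.113.11:11434 check inter 30s fall 3 rise 2
    server ollama3 203.0.113.12:11434 check inter 30s fall 3 rise 2
```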

Generate SSL certificates

Create SSL certificates for secure HTTPS connections to the load balancer. Replace example.com with your actual domain.

sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /etc/ssl/private/ollama.key \
    -out /etc/ssl/certs/ollama.crt \
    -subj "/C=US/ST=State/L=City/O=Organization/CN=example.com"
sudo sh -c 'cat /etc/ssl/certs/ollama.crt /etc/ssl/private/ollama.key > /etc/ssl/certs/ollama.pem'
sudo chown root:haproxy /etc/ssl/certs/ollama.pem
sudo chmod 640 /etc/ssl/certs/ollama.pem
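Before handing the bundle to HAProxy, it's worth confirming the certificate and key actually belong together by comparing their moduli. The sketch below runs against a throwaway pair in /tmp so it works anywhere; point the same two openssl commands at your real /etc/ssl files on the load balancer:

```shell
# Generate a throwaway self-signed pair (stand-in for the real files).
openssl req -x509 -nodes -days 1 -newkey rsa:2048 \
    -keyout /tmp/demo.key -out /tmp/demo.crt -subj "/CN=example.com" 2>/dev/null

# A matching cert/key pair yields identical modulus digests.
openssl x509 -noout -modulus -in /tmp/demo.crt | openssl md5
openssl rsa  -noout -modulus -in /tmp/demo.key | openssl md5
```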

Configure firewall rules

Open the HAProxy frontend ports (8080, 8443) and stats port (8404) on the load balancer, and port 11434 on each Ollama backend — ideally restricted to the load balancer's IP.

# Ubuntu/Debian (ufw)
sudo ufw allow 8080/tcp
sudo ufw allow 8443/tcp
sudo ufw allow 8404/tcp
sudo ufw allow 11434/tcp

# AlmaLinux/Rocky (firewalld)
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --permanent --add-port=8443/tcp
sudo firewall-cmd --permanent --add-port=8404/tcp
sudo firewall-cmd --permanent --add-port=11434/tcp
sudo firewall-cmd --reload

Start HAProxy service

Enable and start HAProxy to begin load balancing across Ollama instances.

sudo systemctl enable --now haproxy
sudo systemctl status haproxy

Configure health check monitoring

Set up a systemd service and timer that monitor Ollama instance health and restart it on failure. Create /etc/systemd/system/ollama-health-check.service (it runs as root because the health check script restarts the ollama unit via systemctl):

[Unit]
Description=Ollama Health Check Service
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ollama-health-check.sh

Then create /etc/systemd/system/ollama-health-check.timer:

[Unit]
Description=Run Ollama Health Check Every Minute
Requires=ollama-health-check.service

[Timer]
OnCalendar=minutely
Persistent=true

[Install]
WantedBy=timers.target

Create health check script

Save the following script as /usr/local/bin/ollama-health-check.sh. It checks the local Ollama API and restarts the service on failure, giving up after MAX_RESTART_ATTEMPTS consecutive failed restarts.

#!/bin/bash

OLLAMA_URL="http://localhost:11434/api/tags"
HEALTH_CHECK_TIMEOUT=10
MAX_RESTART_ATTEMPTS=3
RESTART_COUNT_FILE="/tmp/ollama_restart_count"

# Check Ollama health
check_ollama_health() {
    curl -s --max-time "$HEALTH_CHECK_TIMEOUT" "$OLLAMA_URL" > /dev/null 2>&1
}

# Get restart count
get_restart_count() {
    if [[ -f "$RESTART_COUNT_FILE" ]]; then
        cat "$RESTART_COUNT_FILE"
    else
        echo 0
    fi
}

# Increment restart count
increment_restart_count() {
    local count
    count=$(get_restart_count)
    echo $((count + 1)) > "$RESTART_COUNT_FILE"
}

# Reset restart count
reset_restart_count() {
    echo 0 > "$RESTART_COUNT_FILE"
}

# Main health check logic
if check_ollama_health; then
    echo "$(date): Ollama is healthy"
    reset_restart_count
    exit 0
else
    echo "$(date): Ollama health check failed"
    restart_count=$(get_restart_count)
    if [[ $restart_count -lt $MAX_RESTART_ATTEMPTS ]]; then
        echo "$(date): Attempting to restart Ollama (attempt $((restart_count + 1))/$MAX_RESTART_ATTEMPTS)"
        systemctl restart ollama
        increment_restart_count
        # Wait a moment and check again
        sleep 30
        if check_ollama_health; then
            echo "$(date): Ollama restart successful"
            reset_restart_count
            exit 0
        else
            echo "$(date): Ollama restart failed"
            exit 1
        fi
    else
        echo "$(date): Maximum restart attempts reached. Manual intervention required."
        exit 1
    fi
fi

Make the script executable and enable the timer:

sudo chmod 755 /usr/local/bin/ollama-health-check.sh
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-health-check.timer

Configure SSL termination and security

Enable SSL security headers

Add security headers to HAProxy configuration for enhanced HTTPS security.

sudo cp /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.backup
# Add these lines to the frontend ollama_frontend section
frontend ollama_frontend
    bind *:8080
    bind *:8443 ssl crt /etc/ssl/certs/ollama.pem
    redirect scheme https if !{ ssl_fc }
    
    # Security headers
    http-response set-header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
    http-response set-header X-Frame-Options "DENY"
    http-response set-header X-Content-Type-Options "nosniff"
    http-response set-header X-XSS-Protection "1; mode=block"
    http-response set-header Referrer-Policy "strict-origin-when-cross-origin"
    
    # Rate limiting
    stick-table type ip size 100k expire 30s store http_req_rate(10s)
    http-request track-sc0 src
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 20 }
    
    default_backend ollama_backend
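You can confirm the limiter bites by hammering the frontend from the HAProxy host: with the 20-requests-per-10-seconds table above, the tail of a 25-request burst should be cut off (curl prints the rejection status, or 000 if the connection is simply closed). A rough smoke test:

```shell
# Fire 25 back-to-back requests at the HTTP frontend.
for i in $(seq 1 25); do
  printf 'request %02d -> %s\n' "$i" \
    "$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 http://localhost:8080/api/tags)"
done
```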

Configure backend connection limits

Set connection limits and timeouts for better resource management and security. Note that long model generations can exceed 60 seconds; raise timeout server if your workloads stream long responses.

# Update backend ollama_backend section
backend ollama_backend
    balance roundrobin
    option httpchk GET /api/tags
    http-check expect status 200
    
    # Connection limits
    timeout server 60s
    timeout connect 5s
    
    server ollama1 203.0.113.10:11434 check inter 30s fall 3 rise 2 maxconn 50
    server ollama2 203.0.113.11:11434 check inter 30s fall 3 rise 2 maxconn 50
    server ollama3 203.0.113.12:11434 check inter 30s fall 3 rise 2 maxconn 50

Restart HAProxy with new configuration

Apply the updated configuration and verify HAProxy starts successfully.

sudo haproxy -f /etc/haproxy/haproxy.cfg -c
sudo systemctl restart haproxy
sudo systemctl status haproxy

Monitor performance and implement scaling

Set up HAProxy statistics monitoring

Configure detailed statistics collection for monitoring load balancer performance and backend health.

# Update the stats section with authentication
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 10s
    stats admin if TRUE
    stats auth admin:secure_password_here
    stats show-legends
    stats show-desc "Ollama Load Balancer Statistics"
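Beyond the HTML page, HAProxy serves the same data as CSV — append ;csv to the stats URI (curl -su admin:password 'http://localhost:8404/stats;csv') — which is much easier to script against. A sketch that pulls per-server health out of it, run here against an inlined sample so the field positions are visible (svname is field 2, status is field 18 of the CSV):

```shell
# Inlined sample of HAProxy's /stats;csv output (first 18 columns shown).
cat <<'EOF' > /tmp/haproxy-stats-sample.csv
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status
ollama_backend,ollama1,0,0,1,3,50,120,1024,4096,0,0,0,0,0,0,0,UP
ollama_backend,ollama2,0,0,0,2,50,98,512,2048,0,0,0,0,0,0,0,DOWN
EOF

# Print server name and health state for the ollama_backend pool.
awk -F',' '!/^#/ && $1 == "ollama_backend" { print $2, $18 }' /tmp/haproxy-stats-sample.csv
```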

Create performance monitoring script

Save the following script as /usr/local/bin/ollama-performance-monitor.sh. It records response times for the load balancer and each backend, mails an alert on slow responses, and writes a JSON metrics snapshot for external monitoring.

#!/bin/bash

LOG_FILE="/var/log/ollama-performance.log"
METRICS_FILE="/var/log/ollama-metrics.json"
ALERT_THRESHOLD_MS=5000
EMAIL_ALERT="admin@example.com"

# Measure response time (ms) and HTTP status code for a URL
test_response_time() {
    local url="$1"
    local start_time end_time http_code response_time_ms
    start_time=$(date +%s%N)
    http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "$url" 2>/dev/null)
    end_time=$(date +%s%N)
    response_time_ms=$(( (end_time - start_time) / 1000000 ))
    echo "$response_time_ms,$http_code"
}

# Backends to test directly, plus the load balancer itself
backends=("203.0.113.10:11434" "203.0.113.11:11434" "203.0.113.12:11434")
loadbalancer="localhost:8080"

# Test the load balancer
lb_result=$(test_response_time "http://$loadbalancer/api/tags")
lb_time=$(echo "$lb_result" | cut -d',' -f1)
lb_status=$(echo "$lb_result" | cut -d',' -f2)
echo "$(date): LoadBalancer response_time=${lb_time}ms status=$lb_status" >> "$LOG_FILE"

# Test individual backends, remembering results for the JSON snapshot
declare -a times statuses
for i in "${!backends[@]}"; do
    backend="${backends[$i]}"
    result=$(test_response_time "http://$backend/api/tags")
    times[$i]=$(echo "$result" | cut -d',' -f1)
    statuses[$i]=$(echo "$result" | cut -d',' -f2)
    echo "$(date): Backend$((i+1)) ($backend) response_time=${times[$i]}ms status=${statuses[$i]}" >> "$LOG_FILE"
    # Alert on slow responses
    if [[ ${times[$i]} -gt $ALERT_THRESHOLD_MS ]]; then
        echo "ALERT: Backend$((i+1)) slow response: ${times[$i]}ms" \
            | mail -s "Ollama Performance Alert" "$EMAIL_ALERT"
    fi
done

# Create JSON metrics for external monitoring
{
    echo "{"
    echo "  \"timestamp\": \"$(date -Iseconds)\","
    echo "  \"loadbalancer\": { \"response_time_ms\": $lb_time, \"status_code\": $lb_status },"
    echo "  \"backends\": ["
    for i in "${!backends[@]}"; do
        sep=","
        [[ $i -eq $(( ${#backends[@]} - 1 )) ]] && sep=""
        echo "    { \"name\": \"backend$((i+1))\", \"address\": \"${backends[$i]}\", \"response_time_ms\": ${times[$i]}, \"status_code\": ${statuses[$i]} }$sep"
    done
    echo "  ]"
    echo "}"
} > "$METRICS_FILE"

Make it executable:

sudo chmod 755 /usr/local/bin/ollama-performance-monitor.sh
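The monitor only helps if it runs regularly. One simple option is a cron entry (the five-minute interval here is an arbitrary choice):

```
*/5 * * * * /usr/local/bin/ollama-performance-monitor.sh
```

Install it with sudo crontab -e, or mirror the systemd timer pattern used for the health check.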

Configure log rotation

Set up log rotation to prevent the performance and metrics logs from consuming too much disk space, e.g. in /etc/logrotate.d/ollama-monitoring:

/var/log/ollama-performance.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}

/var/log/ollama-metrics.json {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
Security Note: Never use chmod 777 on configuration files. Keep haproxy.cfg at 644, and because the combined ollama.pem bundles the private key, restrict it to something like 640 owned by root:haproxy. Use chown haproxy:haproxy for service-owned files.

Verify your setup

Test the load balancer functionality and verify all components are working correctly.

# Check HAProxy status
sudo systemctl status haproxy

Test load balancer endpoint

curl -k https://localhost:8443/api/tags

Check backend health via HAProxy stats

curl -u admin:secure_password_here http://localhost:8404/stats

Test model inference through load balancer

curl -k -X POST https://localhost:8443/api/generate -d '{
    "model": "llama3.2:3b",
    "prompt": "What is load balancing?",
    "stream": false
}'

Check individual Ollama instances

for host in 203.0.113.10 203.0.113.11 203.0.113.12; do
    echo "Testing direct connection to Ollama on $host:"
    curl -s http://$host:11434/api/tags | jq '.models[].name'
done

Verify SSL certificate

openssl s_client -connect localhost:8443 -servername example.com < /dev/null

Common issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| 502 Bad Gateway errors | Ollama instances not responding | Check sudo systemctl status ollama on backend servers |
| SSL certificate errors | Invalid or expired certificates | Regenerate certificates with the correct domain: openssl req -x509 -nodes -days 365... |
| Health checks failing | Ollama not binding to 0.0.0.0 | Verify OLLAMA_HOST=0.0.0.0:11434 in the systemd override |
| Connection refused on 8080/8443 | HAProxy not started or firewall blocking | sudo systemctl restart haproxy && sudo ufw allow 8080,8443/tcp |
| Models not found errors | Different models on different instances | Run ollama pull model_name on all backend servers |
| High response times | Insufficient resources or overloading | Scale up backend instances or adjust maxconn limits |
