Set up Jaeger high availability clustering with load balancing and failover

Difficulty: Advanced | Time: 45 min | Published: Apr 06, 2026
Distributions: Ubuntu 24.04, Debian 12, AlmaLinux 9, Rocky Linux 9

Deploy a production-grade Jaeger distributed tracing cluster with Elasticsearch backend, load-balanced collectors, and automatic failover for enterprise-scale microservices monitoring.

Prerequisites

  • Root or sudo access
  • Minimum 8GB RAM
  • Java 11+ installed
  • Network access for package downloads
  • Multiple servers for true HA (optional)

What this solves

A single Jaeger instance creates a bottleneck and single point of failure for distributed tracing in production environments. This tutorial sets up a high-availability Jaeger cluster with multiple collectors behind a load balancer, Elasticsearch backend storage, and automatic failover to ensure continuous tracing collection even when individual components fail.

Step-by-step installation

Update system packages and install dependencies

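Start by updating your package manager and installing the dependencies for the Jaeger cluster setup. This includes jq, which the health-check and verification commands later in this tutorial use.

On Ubuntu and Debian:

sudo apt update && sudo apt upgrade -y
sudo apt install -y wget curl jq openjdk-11-jre-headless systemd

On AlmaLinux and Rocky Linux:

sudo dnf update -y
sudo dnf install -y wget curl jq java-11-openjdk-headless systemd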
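If the same provisioning script has to run on both distribution families, the package manager can be picked from /etc/os-release. The following is a minimal sketch; the function names pkg_manager_for and detect_os_id are hypothetical helpers, not part of any tool used in this tutorial.

```shell
# Map an os-release ID to the package manager used in this tutorial.
# The function names and the mapping are illustrative assumptions.
pkg_manager_for() {
    case "$1" in
        ubuntu|debian)   echo "apt" ;;
        almalinux|rocky) echo "dnf" ;;
        *)               echo "unsupported"; return 1 ;;
    esac
}

# Read the ID field from an os-release style file (default: /etc/os-release).
# The file is sourced in a subshell so its variables do not leak into the caller.
detect_os_id() {
    local file="${1:-/etc/os-release}"
    ( . "$file" && printf '%s\n' "$ID" )
}
```

For example, pkg_manager_for "$(detect_os_id)" prints apt on Ubuntu 24.04 and dnf on Rocky Linux 9.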

Create Jaeger system user and directories

Create a dedicated system user for Jaeger components and set up the necessary directory structure with proper permissions.

sudo useradd --system --shell /bin/false --home-dir /opt/jaeger jaeger
sudo mkdir -p /opt/jaeger/{bin,config,logs,data}
sudo mkdir -p /etc/jaeger
sudo chown -R jaeger:jaeger /opt/jaeger
sudo chmod 755 /opt/jaeger/{bin,config}
sudo chmod 775 /opt/jaeger/{logs,data}

Download and install Jaeger binaries

Download Jaeger v1.53.0, the release this tutorial is written against, and copy the collector, query, and agent binaries into /opt/jaeger/bin.

cd /tmp
wget https://github.com/jaegertracing/jaeger/releases/download/v1.53.0/jaeger-1.53.0-linux-amd64.tar.gz
tar -xzf jaeger-1.53.0-linux-amd64.tar.gz
sudo cp jaeger-1.53.0-linux-amd64/jaeger-collector /opt/jaeger/bin/
sudo cp jaeger-1.53.0-linux-amd64/jaeger-query /opt/jaeger/bin/
sudo cp jaeger-1.53.0-linux-amd64/jaeger-agent /opt/jaeger/bin/
sudo chown jaeger:jaeger /opt/jaeger/bin/*
sudo chmod 755 /opt/jaeger/bin/*
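Before copying binaries into place, it is worth checking the downloaded archive against a known SHA-256 digest. The sketch below assumes you have obtained the expected digest from a source you trust, such as the release page; verify_sha256 is a hypothetical helper name.

```shell
# verify_sha256 FILE EXPECTED_HEX
# Prints "OK: FILE" and returns 0 when the SHA-256 digest of FILE equals
# EXPECTED_HEX; prints a mismatch message to stderr and returns 1 otherwise.
verify_sha256() {
    local file="$1" expected="$2" actual
    actual=$(sha256sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}
```

A typical call would be verify_sha256 /tmp/jaeger-1.53.0-linux-amd64.tar.gz followed by the digest you looked up, run before the tar step.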

Install and configure Elasticsearch cluster

Set up Elasticsearch as the backend storage for Jaeger traces. This provides persistence and enables distributed storage across multiple nodes.

On Ubuntu and Debian:

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-8.x.list
sudo apt update
sudo apt install -y elasticsearch

On AlmaLinux and Rocky Linux:

sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
sudo tee /etc/yum.repos.d/elasticsearch.repo > /dev/null <

Configure Elasticsearch for Jaeger

Configure Elasticsearch for Jaeger workloads. This example runs a single node (discovery.type: single-node) with security disabled and listens on all interfaces, so restrict access to port 9200 to trusted hosts.

/etc/elasticsearch/elasticsearch.yml:

cluster.name: jaeger-cluster
node.name: jaeger-es-01
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.type: single-node
xpack.security.enabled: false
xpack.monitoring.collection.enabled: false
action.destructive_requires_name: false
indices.query.bool.max_clause_count: 10000
bootstrap.memory_lock: true

JVM heap size (for example in a file under /etc/elasticsearch/jvm.options.d/):

-Xms2g
-Xmx2g

Enable and start Elasticsearch:

sudo systemctl enable elasticsearch
sudo systemctl start elasticsearch
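Elasticsearch can take a while to accept connections after systemctl start returns, so later steps are more reliable if they wait for it. The following is a minimal retry sketch; wait_for is a hypothetical helper, and the attempt count and delay in the usage note are arbitrary.

```shell
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
wait_for() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, wait_for 30 2 curl -fsS http://localhost:9200/_cluster/health polls the health endpoint for up to about a minute before giving up.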

Install and configure HAProxy for load balancing

Set up HAProxy to load balance traffic across multiple Jaeger collectors with health checks and automatic failover.

On Ubuntu and Debian:

sudo apt install -y haproxy

On AlmaLinux and Rocky Linux:

sudo dnf install -y haproxy

For more advanced HAProxy configuration patterns, see our tutorial on HAProxy high availability load balancing.

Configure HAProxy for Jaeger collector load balancing

Configure HAProxy with health checks and failover for the Jaeger collector cluster. Replace the contents of /etc/haproxy/haproxy.cfg with:

global
    daemon
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms
    option httplog
    option dontlognull
    option redispatch
    retries 3

frontend jaeger-collector-grpc
    bind *:14250
    mode tcp
    default_backend jaeger-collectors-grpc

frontend jaeger-collector-http
    bind *:14268
    mode http
    default_backend jaeger-collectors-http

frontend jaeger-query
    bind *:16686
    mode http
    default_backend jaeger-query-servers

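backend jaeger-collectors-grpc
    mode tcp
    balance roundrobin
    option tcp-check
    # No "port" override: each check connects to that server's own port.
    # Checking port 14250 would probe the HAProxy frontend, not the collectors.
    tcp-check connect
    server collector1 127.0.0.1:14251 check
    server collector2 127.0.0.1:14252 check
    server collector3 127.0.0.1:14253 check

backend jaeger-collectors-http
    mode http
    balance roundrobin
    # The collector HTTP API port does not answer GET / with a 2xx/3xx status,
    # so the default TCP connect check performed by "check" is used instead.
    server collector1 127.0.0.1:14269 check
    server collector2 127.0.0.1:14270 check
    server collector3 127.0.0.1:14271 check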

backend jaeger-query-servers
    mode http
    balance roundrobin
    option httpchk GET /
    server query1 127.0.0.1:16687 check
    server query2 127.0.0.1:16688 check

listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 30s
    stats admin if TRUE
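A syntax error in haproxy.cfg takes the load balancer down on the next restart, so it helps to validate the file first. The sketch below uses HAProxy's own check mode (haproxy -c -f) and only reloads when it passes; check_and_reload_haproxy is a hypothetical wrapper and is meant to be run as root.

```shell
# check_and_reload_haproxy [CONFIG]
# Validates CONFIG with HAProxy's built-in check mode and reloads (or starts)
# the service only when validation succeeds. Intended to be run as root.
check_and_reload_haproxy() {
    local cfg="${1:-/etc/haproxy/haproxy.cfg}"
    if haproxy -c -f "$cfg"; then
        systemctl reload-or-restart haproxy
    else
        echo "$cfg failed validation; not reloading" >&2
        return 1
    fi
}
```

Calling check_and_reload_haproxy with no argument checks the default /etc/haproxy/haproxy.cfg path.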

Create Jaeger collector configuration

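Configure three Jaeger collectors with the Elasticsearch backend. All three run on the same host, so each instance needs its own gRPC, HTTP, and admin ports, and the OTLP receivers are disabled because their fixed default ports would collide. The admin ports (14281 to 14283) are arbitrary free ports chosen for this example. Jaeger v1 selects the storage backend from the SPAN_STORAGE_TYPE environment variable, which must be set to elasticsearch for each collector process; the span-storage entry in the file documents the choice but does not replace the variable.

/etc/jaeger/collector-01.yaml:

collector:
  grpc-server:
    host-port: 127.0.0.1:14251
  http-server:
    host-port: 127.0.0.1:14269
  queue-size: 2000
  num-workers: 50
  otlp:
    enabled: false

admin:
  http:
    host-port: 127.0.0.1:14281  # the default :14269 would collide with the HTTP port above

span-storage:
  type: elasticsearch

es:
  server-urls: http://localhost:9200
  index-prefix: jaeger
  create-index-templates: true

log-level: info

/etc/jaeger/collector-02.yaml:

collector:
  grpc-server:
    host-port: 127.0.0.1:14252
  http-server:
    host-port: 127.0.0.1:14270
  queue-size: 2000
  num-workers: 50
  otlp:
    enabled: false

admin:
  http:
    host-port: 127.0.0.1:14282

span-storage:
  type: elasticsearch

es:
  server-urls: http://localhost:9200
  index-prefix: jaeger
  create-index-templates: true

log-level: info

/etc/jaeger/collector-03.yaml:

collector:
  grpc-server:
    host-port: 127.0.0.1:14253
  http-server:
    host-port: 127.0.0.1:14271
  queue-size: 2000
  num-workers: 50
  otlp:
    enabled: false

admin:
  http:
    host-port: 127.0.0.1:14283

span-storage:
  type: elasticsearch

es:
  server-urls: http://localhost:9200
  index-prefix: jaeger
  create-index-templates: true

log-level: info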
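The three collector files differ only in their listening ports, which follow a simple offset from the HAProxy frontend ports (14250 for gRPC, 14268 for HTTP). The sketch below makes that scheme explicit; collector_ports is a hypothetical helper, useful mainly when generating the files or adding a fourth collector.

```shell
# collector_ports N
# Prints the gRPC and HTTP ports for collector number N (1-based), using the
# offsets from this tutorial: gRPC = 14250 + N, HTTP = 14268 + N.
collector_ports() {
    local n="$1"
    echo "grpc=$((14250 + n)) http=$((14268 + n))"
}
```

Running collector_ports for 1, 2, and 3 reproduces the host-port values in the files above and the server lines in the HAProxy backends.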

Create Jaeger query configuration

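Configure two Jaeger query instances for the web UI, which HAProxy load balances with failover. Each instance needs its own HTTP, gRPC, and admin ports because both run on the same host; the gRPC ports (16691, 16692) and admin ports (16693, 16694) are arbitrary free ports chosen for this example. The static-files option is omitted because the jaeger-query binary embeds the UI assets and this tutorial does not install any files under /usr/share/jaeger/. As with the collectors, the SPAN_STORAGE_TYPE environment variable must be set to elasticsearch for each query process.

/etc/jaeger/query-01.yaml:

query:
  base-path: /
  ui-config: /etc/jaeger/ui.json
  http-server:
    host-port: 127.0.0.1:16687
  grpc-server:
    host-port: 127.0.0.1:16691

admin:
  http:
    host-port: 127.0.0.1:16693  # the default :16687 would collide with the HTTP port above

span-storage:
  type: elasticsearch

es:
  server-urls: http://localhost:9200
  index-prefix: jaeger

log-level: info

/etc/jaeger/query-02.yaml:

query:
  base-path: /
  ui-config: /etc/jaeger/ui.json
  http-server:
    host-port: 127.0.0.1:16688
  grpc-server:
    host-port: 127.0.0.1:16692

admin:
  http:
    host-port: 127.0.0.1:16694

span-storage:
  type: elasticsearch

es:
  server-urls: http://localhost:9200
  index-prefix: jaeger

log-level: info

/etc/jaeger/ui.json:

{
  "monitor": {
    "menuEnabled": true
  },
  "dependencies": {
    "menuEnabled": true
  },
  "archiveEnabled": true
}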

Create systemd services for Jaeger collectors

Create a systemd service file for each Jaeger collector instance with resource limits and a restart policy. Each unit sets SPAN_STORAGE_TYPE=elasticsearch, because Jaeger v1 reads the storage backend from that environment variable rather than from the config file.

/etc/systemd/system/jaeger-collector-01.service:

[Unit]
Description=Jaeger Collector 01
After=network.target elasticsearch.service
Requires=elasticsearch.service

[Service]
Type=simple
User=jaeger
Group=jaeger
Environment=SPAN_STORAGE_TYPE=elasticsearch
ExecStart=/opt/jaeger/bin/jaeger-collector --config-file=/etc/jaeger/collector-01.yaml
Restart=always
RestartSec=5
LimitNOFILE=65536
WorkingDirectory=/opt/jaeger
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

/etc/systemd/system/jaeger-collector-02.service:

[Unit]
Description=Jaeger Collector 02
After=network.target elasticsearch.service
Requires=elasticsearch.service

[Service]
Type=simple
User=jaeger
Group=jaeger
Environment=SPAN_STORAGE_TYPE=elasticsearch
ExecStart=/opt/jaeger/bin/jaeger-collector --config-file=/etc/jaeger/collector-02.yaml
Restart=always
RestartSec=5
LimitNOFILE=65536
WorkingDirectory=/opt/jaeger
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

/etc/systemd/system/jaeger-collector-03.service:

[Unit]
Description=Jaeger Collector 03
After=network.target elasticsearch.service
Requires=elasticsearch.service

[Service]
Type=simple
User=jaeger
Group=jaeger
Environment=SPAN_STORAGE_TYPE=elasticsearch
ExecStart=/opt/jaeger/bin/jaeger-collector --config-file=/etc/jaeger/collector-03.yaml
Restart=always
RestartSec=5
LimitNOFILE=65536
WorkingDirectory=/opt/jaeger
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Create systemd services for Jaeger query instances

Create systemd service files for the Jaeger query web UI instances. As with the collectors, each unit sets SPAN_STORAGE_TYPE=elasticsearch so the query service reads from Elasticsearch.

/etc/systemd/system/jaeger-query-01.service:

[Unit]
Description=Jaeger Query 01
After=network.target elasticsearch.service
Requires=elasticsearch.service

[Service]
Type=simple
User=jaeger
Group=jaeger
Environment=SPAN_STORAGE_TYPE=elasticsearch
ExecStart=/opt/jaeger/bin/jaeger-query --config-file=/etc/jaeger/query-01.yaml
Restart=always
RestartSec=5
LimitNOFILE=65536
WorkingDirectory=/opt/jaeger
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

/etc/systemd/system/jaeger-query-02.service:

[Unit]
Description=Jaeger Query 02
After=network.target elasticsearch.service
Requires=elasticsearch.service

[Service]
Type=simple
User=jaeger
Group=jaeger
Environment=SPAN_STORAGE_TYPE=elasticsearch
ExecStart=/opt/jaeger/bin/jaeger-query --config-file=/etc/jaeger/query-02.yaml
Restart=always
RestartSec=5
LimitNOFILE=65536
WorkingDirectory=/opt/jaeger
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Set proper file ownership and permissions

Ensure all configuration files have the correct ownership and minimal required permissions.

Never use chmod 777. It gives every user on the system full access to your files. Instead, fix ownership with chown and use minimal permissions.
sudo chown jaeger:jaeger /etc/jaeger/*.yaml /etc/jaeger/*.json
sudo chmod 644 /etc/jaeger/*.yaml /etc/jaeger/*.json
sudo chown root:root /etc/systemd/system/jaeger-*.service
sudo chmod 644 /etc/systemd/system/jaeger-*.service

Configure firewall rules

Open the firewall ports that clients use to reach the cluster. Elasticsearch runs without authentication in this setup and the HAProxy stats page allows admin actions, so expose ports 9200 and 8404 only to trusted networks.

On Ubuntu and Debian (ufw):

sudo ufw allow 14250/tcp comment "Jaeger Collector gRPC"
sudo ufw allow 14268/tcp comment "Jaeger Collector HTTP"
sudo ufw allow 16686/tcp comment "Jaeger Query UI"
sudo ufw allow 8404/tcp comment "HAProxy Stats"
sudo ufw allow 9200/tcp comment "Elasticsearch HTTP"
sudo ufw reload

On AlmaLinux and Rocky Linux (firewalld):

sudo firewall-cmd --permanent --add-port=14250/tcp --zone=public
sudo firewall-cmd --permanent --add-port=14268/tcp --zone=public
sudo firewall-cmd --permanent --add-port=16686/tcp --zone=public
sudo firewall-cmd --permanent --add-port=8404/tcp --zone=public
sudo firewall-cmd --permanent --add-port=9200/tcp --zone=public
sudo firewall-cmd --reload

Enable and start all services

Start the Elasticsearch backend, Jaeger components, and HAProxy load balancer with proper startup order.

sudo systemctl daemon-reload
sudo systemctl enable --now elasticsearch
sudo systemctl enable --now jaeger-collector-01 jaeger-collector-02 jaeger-collector-03
sudo systemctl enable --now jaeger-query-01 jaeger-query-02
sudo systemctl enable --now haproxy
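After starting everything, a quick pass over the units shows whether any of them failed to come up. The sketch below relies only on systemctl is-active; inactive_units is a hypothetical helper name.

```shell
# inactive_units UNIT...
# Prints each unit that systemd does not report as active.
# Returns 0 when all units are active, 1 when at least one is not.
inactive_units() {
    local unit rc=0
    for unit in "$@"; do
        if ! systemctl is-active --quiet "$unit"; then
            echo "$unit"
            rc=1
        fi
    done
    return "$rc"
}
```

For this cluster the call would list elasticsearch, the three jaeger-collector units, the two jaeger-query units, and haproxy as arguments; no output means every unit is running.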

Create monitoring script for cluster health

Set up automated monitoring that checks every cluster component, logs failures, and restarts stopped Jaeger services. Save the script as /opt/jaeger/bin/health-check.sh:

#!/bin/bash

set -euo pipefail

ELASTICSEARCH_URL="http://localhost:9200"
HAPROXY_STATS="http://localhost:8404/stats;csv"
LOG_FILE="/opt/jaeger/logs/health-check.log"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Check Elasticsearch
if ! curl -s "$ELASTICSEARCH_URL/_cluster/health" | jq -e '.status == "green" or .status == "yellow"' > /dev/null; then
    log "ERROR: Elasticsearch cluster unhealthy"
    exit 1
fi

# Check HAProxy backends
if ! curl -s "$HAPROXY_STATS" | grep -q "jaeger-collectors-grpc,collector1,.*,UP"; then
    log "WARNING: Collector 1 is down"
fi

# Check collector services
for i in {01..03}; do
    if ! systemctl is-active --quiet "jaeger-collector-$i"; then
        log "ERROR: jaeger-collector-$i service is not running"
        systemctl restart "jaeger-collector-$i"
    fi
done

# Check query services
for i in {01..02}; do
    if ! systemctl is-active --quiet "jaeger-query-$i"; then
        log "ERROR: jaeger-query-$i service is not running"
        systemctl restart "jaeger-query-$i"
    fi
done

log "Health check completed successfully"

Make the script executable. It calls systemctl restart, which requires root privileges, so keep it owned by root:

sudo chmod 755 /opt/jaeger/bin/health-check.sh
sudo chown root:root /opt/jaeger/bin/health-check.sh
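The HAProxy check in the script looks at a single server. A more general approach is to parse the stats CSV and report every server whose status is not UP. The sketch below locates the status column from the CSV header rather than assuming its position; haproxy_down_servers is a hypothetical helper that reads the CSV on standard input.

```shell
# haproxy_down_servers
# Reads HAProxy stats CSV on stdin and prints "proxy/server STATUS" for each
# server row whose status does not start with UP or "no check".
# FRONTEND and BACKEND summary rows are skipped.
haproxy_down_servers() {
    awk -F, '
        NR == 1 {
            sub(/^# /, "", $1)                 # header line starts with "# "
            for (i = 1; i <= NF; i++) if ($i == "status") col = i
            next
        }
        col && $2 != "FRONTEND" && $2 != "BACKEND" && $col !~ /^(UP|no check)/ {
            print $1 "/" $2 " " $col
        }
    '
}
```

With the stats URL used in this tutorial, the call would be curl -s "http://localhost:8404/stats;csv" piped into haproxy_down_servers; empty output means no server is marked down.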

Set up automated health monitoring with cron

Schedule regular health checks to monitor cluster status and automatically restart failed components. The script restarts services with systemctl, which requires root privileges, so add the job to root's crontab:

sudo crontab -e

Add the following line to run health checks every 5 minutes:

*/5 * * * * /opt/jaeger/bin/health-check.sh

Configure high availability and failover

Configure Elasticsearch index lifecycle management

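Set up an index lifecycle policy to manage Jaeger trace data retention and storage. The policy below rolls indices over at 1 GB or 1 day, moves them to the warm phase with no replicas 2 days after rollover, and deletes them 7 days after rollover. Jaeger applies it only when its Elasticsearch ILM support is enabled (the es.use-ilm and es.use-aliases options), so creating the policy alone does not change the daily indices Jaeger writes by default. For detailed ILM configuration, see our tutorial on Elasticsearch index lifecycle management.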

curl -X PUT "localhost:9200/_ilm/policy/jaeger-ilm-policy" -H "Content-Type: application/json" -d'
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "1GB",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "set_priority": {
            "priority": 50
          },
          "allocate": {
            "number_of_replicas": 0
          }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}'

Configure Jaeger span sampling strategies

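Set up file-based sampling strategies to manage trace volume and keep performance consistent under high load. Each entry sets the probability that a trace from that service is kept, and default_strategy covers every service not listed. Save the following as /etc/jaeger/sampling.json: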

{
  "service_strategies": [
    {
      "service": "frontend",
      "type": "probabilistic",
      "param": 1.0
    },
    {
      "service": "backend",
      "type": "probabilistic",
      "param": 0.5
    }
  ],
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.1
  }
}
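With a probabilistic strategy, each trace is kept with probability param, so the expected stored volume is the incoming trace rate multiplied by param. The sketch below does that arithmetic for capacity planning; expected_sampled is a hypothetical helper and the traffic figures in the usage note are made up for illustration.

```shell
# expected_sampled TRACES_PER_SECOND PROBABILITY
# Prints the expected number of sampled traces per second, rounded to an
# integer, for a probabilistic sampler with the given probability.
expected_sampled() {
    awk -v rate="$1" -v p="$2" 'BEGIN { printf "%.0f\n", rate * p }'
}
```

For instance, a service that is not listed and emits 2000 traces per second would fall under the default strategy of 0.1, and expected_sampled 2000 0.1 gives the resulting rate that reaches the collectors.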

Update collector configurations to use sampling strategies:

echo "sampling:
  strategies-file: /etc/jaeger/sampling.json" | sudo tee -a /etc/jaeger/collector-01.yaml
echo "sampling:
  strategies-file: /etc/jaeger/sampling.json" | sudo tee -a /etc/jaeger/collector-02.yaml
echo "sampling:
  strategies-file: /etc/jaeger/sampling.json" | sudo tee -a /etc/jaeger/collector-03.yaml

Verify your setup

Check that all components are running and the cluster is operational:

# Check Elasticsearch cluster health
curl -s http://localhost:9200/_cluster/health | jq '.status'

# Check all Jaeger services
sudo systemctl status jaeger-collector-01 jaeger-collector-02 jaeger-collector-03
sudo systemctl status jaeger-query-01 jaeger-query-02
sudo systemctl status haproxy

# Test load balancer endpoints
# /api/traces only accepts POST, so print the status code instead of using -f;
# any HTTP response (such as 405) shows the collector is reachable through HAProxy
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:14268/api/traces
curl -f http://localhost:16686/

# Check HAProxy stats
curl -s http://localhost:8404/stats

# Verify Elasticsearch indices
curl -s "http://localhost:9200/_cat/indices/jaeger*"

Access the Jaeger UI at http://your-server-ip:16686 and verify that traces are being collected and stored properly.

Monitor cluster health and performance

Set up Prometheus metrics collection

Configure Jaeger components to expose Prometheus metrics for monitoring and alerting. Each component serves them at /metrics on its admin port.

echo "metrics-backend: prometheus" | sudo tee -a /etc/jaeger/collector-01.yaml
echo "metrics-backend: prometheus" | sudo tee -a /etc/jaeger/collector-02.yaml
echo "metrics-backend: prometheus" | sudo tee -a /etc/jaeger/collector-03.yaml
echo "metrics-backend: prometheus" | sudo tee -a /etc/jaeger/query-01.yaml
echo "metrics-backend: prometheus" | sudo tee -a /etc/jaeger/query-02.yaml

For comprehensive monitoring setup, see our tutorial on Prometheus and Grafana monitoring.

Create performance monitoring dashboard

Track key indicators of cluster throughput and storage growth: the total span count and the on-disk size of the Jaeger indices. Save the script as /opt/jaeger/bin/performance-monitor.sh:

#!/bin/bash

set -euo pipefail

METRICS_FILE="/opt/jaeger/logs/performance.log"

# Collect performance metrics
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
SPAN_COUNT=$(curl -s "http://localhost:9200/jaeger-span-*/_count" | jq '.count // 0')
# Request only store.size, in bytes, so the values can be summed numerically
INDEX_SIZE=$(curl -s "http://localhost:9200/_cat/indices/jaeger*?h=store.size&bytes=b" | awk '{sum+=$1} END {printf "%.0f\n", sum}')
echo "$TIMESTAMP,spans=$SPAN_COUNT,index_size_bytes=$INDEX_SIZE" >> "$METRICS_FILE"

# Check for performance issues
if [ "$SPAN_COUNT" -gt 1000000 ]; then
    logger "WARNING: High span count detected: $SPAN_COUNT"
fi

Make the script executable:

sudo chmod 755 /opt/jaeger/bin/performance-monitor.sh
sudo chown jaeger:jaeger /opt/jaeger/bin/performance-monitor.sh
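Because the script appends one comma-separated line per run, growth between runs can be read straight from the log. The sketch below computes the change in span count between the last two samples; span_growth is a hypothetical helper and assumes the line format written by the script above (timestamp, spans=N, index_size_bytes=M).

```shell
# span_growth LOG_FILE
# Prints the difference in span count between the last two lines of LOG_FILE.
# Prints nothing when the file has fewer than two samples.
span_growth() {
    tail -n 2 "$1" | awk -F'[,=]' '
        NR == 1 { first = $3 }          # $3 is the value after "spans="
        NR == 2 { print $3 - first }
    '
}
```

Dividing the result by the interval between runs gives an approximate ingest rate, which is useful when tuning the sampling probabilities.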

Common issues

Symptom | Cause | Fix
Collector not starting | Elasticsearch not accessible | Check the Elasticsearch service: sudo systemctl status elasticsearch
HAProxy backend servers down | Collector ports not listening | Verify collector binds: sudo ss -tlnp | grep 142
High memory usage | Large trace volumes | Lower the sampling rates in /etc/jaeger/sampling.json
Spans not appearing in UI | Index template not created | Check the Elasticsearch logs and recreate the templates
Query timeout errors | Large time range queries | Limit query time ranges and add more query replicas
Disk space filling up | No index lifecycle policy | Configure an ILM policy to auto-delete old indices
