Setup Jaeger sampling strategies for high-volume production tracing

Advanced, 45 min, Jun 11, 2026
Ubuntu 24.04, Debian 12, AlmaLinux 9, Rocky Linux 9

Configure advanced Jaeger sampling strategies to efficiently capture traces in high-traffic production environments while controlling storage costs and maintaining observability.

Prerequisites

  • Existing Jaeger installation
  • Prometheus for metrics
  • Root or sudo access
  • Basic understanding of distributed tracing

What this solves

In high-volume production environments, tracing every request creates overwhelming data volumes and storage costs. Jaeger sampling strategies help you capture meaningful traces while controlling resource usage. This tutorial shows you how to implement adaptive sampling, per-service policies, and remote sampling configuration for production-scale distributed tracing.

Prerequisites

You need a running Jaeger deployment with Elasticsearch or another storage backend. If you don't have this yet, follow our Jaeger Kubernetes deployment guide.

Understanding sampling strategies

Jaeger supports several sampling strategies that determine which traces to collect:

| Strategy type | Use case | Configuration |
|---|---|---|
| Const | All-or-nothing sampling | Sample every trace (param 1) or none (param 0) |
| Probabilistic | Random sampling at a fixed rate | Sample each trace with a fixed probability, derived from the trace ID |
| RateLimiting | Maximum traces per second | Cap at N traces/second |
| Adaptive | Dynamic adjustment | Adjust based on traffic patterns |
| PerService | Service-specific rules | Different rates per service |
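
To make the difference between the probabilistic and rate-limiting rows concrete, here is a minimal Python sketch of both decisions. It is an illustration only, not Jaeger's implementation: the class names are hypothetical, the probabilistic sampler assumes a 63-bit slice of the trace ID is uniformly distributed, and the rate limiter is a plain token bucket.

```python
import time

MASK_63 = (1 << 63) - 1  # keep the low 63 bits of the trace ID


class ProbabilisticSampler:
    """Keep a trace when its ID falls below a boundary derived from the rate."""

    def __init__(self, rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.boundary = int(rate * (1 << 63))

    def is_sampled(self, trace_id):
        # The same trace ID always yields the same decision
        return (trace_id & MASK_63) < self.boundary


class RateLimitingSampler:
    """Token bucket that admits at most max_per_second traces per second."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.rate = float(max_per_second)
        self.capacity = max(self.rate, 1.0)  # allow at least one whole token
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def is_sampled(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond the bucket size
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The probabilistic sampler keeps a roughly constant share of traffic, so its output grows with load; the rate limiter keeps a constant absolute volume regardless of load. That trade-off is what the per-service choices later in this tutorial are balancing.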

Step-by-step configuration

Create sampling strategies configuration

Create a JSON configuration file that defines your sampling strategies and save it as /etc/jaeger/sampling_strategies.json. This file tells Jaeger how to sample traces for different services and operations. The file has two top-level keys, default_strategy and service_strategies. Each entry takes one type: probabilistic (param is a probability between 0 and 1) or ratelimiting (param is a maximum number of traces per second). Per-operation overrides nest under the owning service as operation_strategies.

{
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.1
  },
  "service_strategies": [
    {
      "service": "frontend-service",
      "type": "probabilistic",
      "param": 0.5,
      "operation_strategies": [
        {
          "operation": "GET /health",
          "type": "probabilistic",
          "param": 0.001
        }
      ]
    },
    {
      "service": "payment-service",
      "type": "probabilistic",
      "param": 1.0
    },
    {
      "service": "logging-service",
      "type": "probabilistic",
      "param": 0.01
    },
    {
      "service": "health-check",
      "type": "probabilistic",
      "param": 0.001
    },
    {
      "service": "api-gateway",
      "type": "probabilistic",
      "param": 0.1,
      "operation_strategies": [
        {
          "operation": "POST /api/orders",
          "type": "probabilistic",
          "param": 0.8
        }
      ]
    }
  ]
}
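
Because a malformed strategies file is easy to miss, it can help to lint the file before deploying it. The sketch below is one possible check, assuming the default_strategy / service_strategies layout and the probabilistic and ratelimiting type names that Jaeger's file-based sampling reads; the function names are hypothetical and the rules are deliberately minimal.

```python
ALLOWED_TYPES = {"probabilistic", "ratelimiting"}
ALLOWED_TOP_LEVEL = {"default_strategy", "service_strategies"}


def check_strategy(entry, where):
    """Return a list of problems found in one strategy entry."""
    problems = []
    kind, param = entry.get("type"), entry.get("param")
    if kind not in ALLOWED_TYPES:
        problems.append(f"{where}: unknown type {kind!r}")
    elif isinstance(param, bool) or not isinstance(param, (int, float)):
        problems.append(f"{where}: param must be a number")
    elif kind == "probabilistic" and not 0.0 <= param <= 1.0:
        problems.append(f"{where}: probability {param} is outside 0..1")
    elif kind == "ratelimiting" and param < 0:
        problems.append(f"{where}: rate {param} is negative")
    return problems


def validate_strategies(doc):
    """Return every problem found in a parsed strategies document."""
    problems = [
        f"unexpected top-level key {key!r}"
        for key in doc
        if key not in ALLOWED_TOP_LEVEL
    ]
    if "default_strategy" in doc:
        problems += check_strategy(doc["default_strategy"], "default_strategy")
    for service in doc.get("service_strategies", []):
        name = service.get("service", "<unnamed>")
        problems += check_strategy(service, name)
        for op in service.get("operation_strategies", []):
            label = f"{name}/{op.get('operation', '<unnamed>')}"
            problems += check_strategy(op, label)
    return problems
```

A typical use is to load the file with json.load, pass the result to validate_strategies, and refuse to deploy when the returned list is non-empty.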

Configure Jaeger Collector with sampling strategies

Update your Jaeger Collector configuration to use the sampling strategies file. This enables remote sampling, where the collector serves sampling strategies to clients and each client applies them locally. In the YAML file passed with --config-file, keys mirror the command-line flags, so the listener settings nest under collector.

sampling:
  strategies-file: /etc/jaeger/sampling_strategies.json
  strategies-reload-interval: 30s

collector:
  http-server:
    host-port: :14268
  grpc-server:
    host-port: :14250
Setup adaptive sampling with volume control

Create an advanced configuration that varies sampling by service importance and caps volume where it matters. The strategies file itself accepts only the probabilistic and ratelimiting types; Jaeger's adaptive sampling is a separate collector mode and is not selected with a type value in this file. Use ratelimiting entries to put a hard ceiling on high-volume services.

{
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.1
  },
  "service_strategies": [
    {
      "service": "user-service",
      "type": "probabilistic",
      "param": 0.3,
      "operation_strategies": [
        {
          "operation": "login",
          "type": "probabilistic",
          "param": 0.8
        },
        {
          "operation": "register",
          "type": "probabilistic",
          "param": 1.0
        }
      ]
    },
    {
      "service": "database-service",
      "type": "ratelimiting",
      "param": 50
    },
    {
      "service": "cache-service",
      "type": "probabilistic",
      "param": 0.05
    }
  ]
}
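
When operation-level, service-level, and default entries coexist, it helps to be explicit about which one wins. The following sketch shows one simplified precedence (operation override, then service entry, then default), assuming the default_strategy / service_strategies layout; resolve_strategy is a hypothetical helper and does not reproduce every detail of how Jaeger merges strategies.

```python
def resolve_strategy(doc, service, operation):
    """Return (type, param) for a span's service and operation."""
    chosen = doc["default_strategy"]
    for entry in doc.get("service_strategies", []):
        if entry.get("service") == service:
            chosen = entry  # a listed service replaces the default
            break
    for op in chosen.get("operation_strategies", []):
        if op.get("operation") == operation:
            return op["type"], op["param"]  # most specific match wins
    return chosen["type"], chosen["param"]


example = {
    "default_strategy": {"type": "probabilistic", "param": 0.1},
    "service_strategies": [
        {
            "service": "user-service",
            "type": "probabilistic",
            "param": 0.3,
            "operation_strategies": [
                {"operation": "login", "type": "probabilistic", "param": 0.8}
            ],
        },
        {"service": "database-service", "type": "ratelimiting", "param": 50},
    ],
}

print(resolve_strategy(example, "user-service", "login"))
print(resolve_strategy(example, "unlisted-service", "anything"))
```

Walking a few service and operation pairs through a helper like this is a quick way to confirm that a new file expresses the intended policy before the collector serves it.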

Configure environment-specific sampling

Create different sampling configurations for development, staging, and production environments. The two files below cover production (low default rate) and development (sample everything); a staging file follows the same pattern with rates in between.

Production, saved as /etc/jaeger/production_sampling.json:

{
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.01
  },
  "service_strategies": [
    {
      "service": "critical-payment-service",
      "type": "probabilistic",
      "param": 0.5
    },
    {
      "service": "user-analytics",
      "type": "probabilistic",
      "param": 0.001
    }
  ]
}

Development:

{
  "default_strategy": {
    "type": "probabilistic",
    "param": 1.0
  },
  "service_strategies": [
    {
      "service": "test-service",
      "type": "probabilistic",
      "param": 1.0
    }
  ]
}

Enable remote sampling in Jaeger Collector

Configure the Jaeger Collector to serve sampling strategies to client applications over HTTP. Stop the collector first:

sudo systemctl stop jaeger-collector

Then update the collector's systemd unit file so it loads the strategies file:

[Unit]
Description=Jaeger Collector
After=network.target

[Service]
Type=simple
User=jaeger
Group=jaeger
ExecStart=/usr/local/bin/jaeger-collector \
  --config-file=/etc/jaeger/collector.yaml \
  --sampling.strategies-file=/etc/jaeger/production_sampling.json \
  --sampling.strategies-reload-interval=60s \
  --collector.http-server.host-port=:14268 \
  --collector.grpc-server.host-port=:14250
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Reload systemd and start the collector again:

sudo systemctl daemon-reload
sudo systemctl start jaeger-collector
sudo systemctl status jaeger-collector

Configure client applications for remote sampling

Update your application configuration to fetch sampling strategies from the Jaeger Collector instead of using local configuration.

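package main

import (
    "io"
    "time"

    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go"
    "github.com/uber/jaeger-client-go/config"
)

// initJaeger builds a tracer that polls the collector for sampling strategies.
// The caller must Close the returned io.Closer on shutdown to flush spans.
func initJaeger() (opentracing.Tracer, io.Closer) {
    cfg := config.Configuration{
        ServiceName: "my-service",
        Sampler: &config.SamplerConfig{
            Type:              jaeger.SamplerTypeRemote,
            Param:             0.1, // initial rate, used until a strategy is fetched
            SamplingServerURL: "http://jaeger-collector:14268/api/sampling",
            // time.Duration: a bare 60 would mean 60 nanoseconds
            SamplingRefreshInterval: 60 * time.Second,
        },
        Reporter: &config.ReporterConfig{
            LocalAgentHostPort: "jaeger-agent:6831",
        },
    }

    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        panic(err)
    }
    opentracing.SetGlobalTracer(tracer)
    return tracer, closer
}

func main() {
    _, closer := initJaeger()
    // Close at process exit, not inside initJaeger, so the tracer stays usable
    defer closer.Close()

    // application code runs here
}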
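
To see what a remote-sampling client actually requests and receives, the sketch below builds the polling URL and summarizes a response. It assumes the collector expects the service name as a service query parameter and answers with JSON containing probabilisticSampling.samplingRate or rateLimitingSampling.maxTracesPerSecond; both helper names are hypothetical, and the sample response is illustrative rather than captured from a live collector.

```python
from urllib.parse import urlencode


def sampling_url(base_url, service):
    """Build the URL a client polls for its sampling strategy."""
    return f"{base_url}?{urlencode({'service': service})}"


def describe_strategy(response):
    """Summarize a parsed sampling response in one line."""
    probabilistic = response.get("probabilisticSampling")
    if probabilistic is not None:
        rate = probabilistic.get("samplingRate", 0.0)
        return f"probabilistic at {rate:.1%}"
    rate_limit = response.get("rateLimitingSampling")
    if rate_limit is not None:
        limit = rate_limit.get("maxTracesPerSecond", 0)
        return f"rate limited to {limit} traces/s"
    return "no recognized strategy in response"


# Illustrative response shape (an assumption, not real collector output)
sample_response = {
    "strategyType": "PROBABILISTIC",
    "probabilisticSampling": {"samplingRate": 0.5},
}

print(sampling_url("http://jaeger-collector:14268/api/sampling", "my-service"))
print(describe_strategy(sample_response))
```

Fetching that URL with curl during a rollout is a simple way to confirm that a given service resolves to the strategy you expect.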

Setup sampling strategy monitoring

Create a monitoring script to track sampling effectiveness and adjust strategies based on metrics. Save it as /usr/local/bin/monitor-sampling.sh.

#!/bin/bash

# Get sampling stats from Jaeger.
# The sampling endpoint requires a service name; frontend-service is an example.
SAMPLING_URL="http://localhost:14268/api/sampling?service=frontend-service"
METRICS_URL="http://localhost:14269/metrics"

# Check current sampling strategies
echo "Current sampling strategies:"
curl -s "$SAMPLING_URL" | jq .

# Get trace volume metrics
printf '\nTrace volume metrics:\n'
curl -s "$METRICS_URL" | grep '^jaeger_collector_traces_received_total'

# Check storage usage
printf '\nStorage usage:\n'
curl -s "$METRICS_URL" | grep '^jaeger_collector_spans_saved_total'

# Calculate sampling efficiency: spans saved as a share of spans received.
# Both counters are summed across all label sets so the units match.
METRICS=$(curl -s "$METRICS_URL")
RECEIVED=$(echo "$METRICS" | awk '/^jaeger_collector_spans_received_total/ {sum += $NF} END {printf "%d", sum}')
SAVED=$(echo "$METRICS" | awk '/^jaeger_collector_spans_saved_total/ {sum += $NF} END {printf "%d", sum}')
if [ "$RECEIVED" -gt 0 ]; then
  # Multiply before dividing so bc does not truncate the ratio to two decimals
  EFFICIENCY=$(echo "scale=2; $SAVED * 100 / $RECEIVED" | bc)
  printf '\nSampling efficiency: %s%%\n' "$EFFICIENCY"
fi

Make the script executable:

sudo chmod +x /usr/local/bin/monitor-sampling.sh
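
The metrics endpoint returns the Prometheus text format, where one counter appears as several lines (one per label set) preceded by comment lines. A minimal sketch of summing such a counter in Python follows; sum_counter is a hypothetical helper, and it assumes each sample line ends with the value and carries no trailing timestamp.

```python
def sum_counter(metrics_text, name):
    """Sum every sample of one metric across all of its label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        # Match the bare name or the name followed by a label set
        if series == name or series.startswith(name + "{"):
            total += float(value)
    return total


sample = "\n".join([
    "# HELP jaeger_collector_spans_saved_total Spans written to storage",
    "# TYPE jaeger_collector_spans_saved_total counter",
    'jaeger_collector_spans_saved_total{svc="a"} 10',
    'jaeger_collector_spans_saved_total{svc="b"} 2.5e+01',
])

print(sum_counter(sample, "jaeger_collector_spans_saved_total"))
```

Matching on the name followed by an opening brace avoids accidentally counting a different metric that merely shares the same prefix.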

Create automated sampling adjustment script

Implement a script that automatically adjusts sampling rates based on system load and storage capacity.

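#!/usr/bin/env python3
import json
import logging
import time

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

METRIC = "jaeger_collector_traces_received_total"


class SamplingAdjuster:
    def __init__(self, metrics_url, strategies_file):
        self.metrics_url = metrics_url
        self.strategies_file = strategies_file
        self.last_total = None
        self.last_time = None

    def get_counter_total(self):
        """Sum the cumulative traces-received counter across all label sets"""
        response = requests.get(self.metrics_url, timeout=10)
        response.raise_for_status()
        total = 0.0
        for line in response.text.splitlines():
            # Sample lines start with the metric name; "# HELP"/"# TYPE" lines do not
            if line.startswith(METRIC):
                total += float(line.split()[-1])
        return total

    def get_current_load(self):
        """Return traces per second since the previous poll, or None if unknown"""
        try:
            total, now = self.get_counter_total(), time.monotonic()
        except Exception as e:
            logger.error(f"Failed to get metrics: {e}")
            return None
        previous_total, previous_time = self.last_total, self.last_time
        self.last_total, self.last_time = total, now
        # No rate on the first poll or after a counter reset (collector restart)
        if previous_total is None or total < previous_total:
            return None
        return (total - previous_total) / (now - previous_time)

    def adjust_sampling_rate(self, current_load):
        """Adjust the default sampling probability based on load"""
        with open(self.strategies_file, 'r') as f:
            strategies = json.load(f)

        # Adjust default strategy based on load
        if current_load > 10000:  # High load
            strategies['default_strategy']['param'] = 0.01
        elif current_load > 1000:  # Medium load
            strategies['default_strategy']['param'] = 0.05
        else:  # Low load
            strategies['default_strategy']['param'] = 0.1

        # Write updated strategies
        with open(self.strategies_file, 'w') as f:
            json.dump(strategies, f, indent=2)

        logger.info(f"Adjusted sampling for load: {current_load:.1f} traces/sec")


def main():
    adjuster = SamplingAdjuster(
        # Metrics are served on the collector admin port, not the span port
        metrics_url="http://localhost:14269/metrics",
        strategies_file="/etc/jaeger/production_sampling.json"
    )

    while True:
        load = adjuster.get_current_load()
        if load is not None:
            adjuster.adjust_sampling_rate(load)
        time.sleep(300)  # Check every 5 minutes


if __name__ == "__main__":
    main()

Save the script as /usr/local/bin/adjust-sampling.py and make it executable:

sudo chmod +x /usr/local/bin/adjust-sampling.py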

Setup sampling strategy validation

Create a validation script to ensure sampling configurations are working correctly.

#!/bin/bash

# The sampling and query endpoints require a service name;
# frontend-service is an example.
SERVICE="frontend-service"
JAEGER_COLLECTOR="http://localhost:14268"
JAEGER_QUERY="http://localhost:16686"

echo "Validating Jaeger sampling configuration..."

# Test sampling endpoint
echo "1. Testing sampling endpoint:"
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$JAEGER_COLLECTOR/api/sampling?service=$SERVICE")
if [ "$HTTP_CODE" = "200" ]; then
  echo "✓ Sampling endpoint accessible"
else
  echo "✗ Sampling endpoint failed (HTTP $HTTP_CODE)"
  exit 1
fi

# Validate JSON structure
printf '\n2. Validating sampling strategy JSON:\n'
SAMPLING_JSON=$(curl -s "$JAEGER_COLLECTOR/api/sampling?service=$SERVICE")
if echo "$SAMPLING_JSON" | jq -e . > /dev/null 2>&1; then
  echo "✓ Valid JSON structure"
else
  echo "✗ Invalid JSON structure"
  exit 1
fi

# Check for required fields: the response describes one resolved strategy
printf '\n3. Checking required fields:\n'
if echo "$SAMPLING_JSON" | jq -e 'has("strategyType") or has("probabilisticSampling") or has("rateLimitingSampling") or has("operationSampling")' > /dev/null 2>&1; then
  echo "✓ Strategy returned for $SERVICE"
else
  echo "✗ Response contains no sampling strategy"
fi

# Test trace collection
printf '\n4. Testing trace collection:\n'
TRACE_COUNT=$(curl -s "$JAEGER_QUERY/api/traces?service=$SERVICE&limit=1" | jq -r '.data | length' 2>/dev/null)
if [ "${TRACE_COUNT:-0}" -gt 0 ]; then
  echo "✓ Traces are being collected"
else
  echo "! No recent traces found (this may be normal)"
fi

printf '\nSampling validation complete.\n'

Save the script as /usr/local/bin/validate-sampling.sh and make it executable:

sudo chmod +x /usr/local/bin/validate-sampling.sh

Configure per-service sampling policies

Create service-tier based sampling

Implement different sampling rates based on service criticality and business importance.

{
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.1
  },
  "service_strategies": [
    {
      "service": "tier1-payment-gateway",
      "type": "probabilistic",
      "param": 0.8,
      "operation_strategies": [
        {
          "operation": "process_payment",
          "type": "probabilistic",
          "param": 1.0
        },
        {
          "operation": "refund_payment",
          "type": "probabilistic",
          "param": 1.0
        }
      ]
    },
    {
      "service": "tier2-user-service",
      "type": "probabilistic",
      "param": 0.3
    },
    {
      "service": "tier3-analytics",
      "type": "probabilistic",
      "param": 0.05
    },
    {
      "service": "tier4-background-jobs",
      "type": "probabilistic",
      "param": 0.01
    }
  ]
}
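
Before settling on tier rates, it is worth estimating how many traces per second each rate will actually keep, since a low rate on a very busy service can still outweigh a high rate on a quiet one. The sketch below does that arithmetic; the traffic figures are invented for illustration and estimate_kept_traces is a hypothetical helper.

```python
def estimate_kept_traces(traffic_rps, rates, default_rate):
    """Map each service to its expected sampled traces per second."""
    return {
        service: rps * rates.get(service, default_rate)
        for service, rps in traffic_rps.items()
    }


# Hypothetical request rates (requests per second) for three services
traffic = {
    "tier1-payment-gateway": 200,
    "tier3-analytics": 4000,
    "unlisted-service": 100,
}
# Probabilities taken from the tier configuration above
tier_rates = {"tier1-payment-gateway": 0.8, "tier3-analytics": 0.05}

for service, kept in estimate_kept_traces(traffic, tier_rates, 0.1).items():
    print(f"{service}: about {kept:.0f} traces/sec")
```

With these made-up numbers the analytics tier, at a 5% rate, would still contribute more stored traces than the payment gateway at 80%, which is the kind of imbalance this estimate is meant to surface.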

Setup error-based sampling boost

Configure higher sampling rates for services experiencing errors to improve debugging. Operation-level rules that should apply across services go under default_strategy as operation_strategies; the strategies file has no wildcard service name.

{
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.1,
    "operation_strategies": [
      {
        "operation": "error",
        "type": "probabilistic",
        "param": 0.8
      },
      {
        "operation": "exception",
        "type": "probabilistic",
        "param": 0.8
      }
    ]
  },
  "service_strategies": [
    {
      "service": "error-prone-service",
      "type": "probabilistic",
      "param": 0.5,
      "operation_strategies": [
        {
          "operation": "failing_endpoint",
          "type": "probabilistic",
          "param": 1.0
        }
      ]
    }
  ]
}

Setup remote sampling with Jaeger Collector

Configure collector for high availability

Set up multiple Jaeger Collectors with load balancing for sampling strategy distribution. Run the same configuration and the same strategies file on every collector so that each instance behind the load balancer serves identical strategies. Keys in the YAML file mirror the command-line flags; select the storage backend with the SPAN_STORAGE_TYPE=elasticsearch environment variable on the collector process.

sampling:
  strategies-file: /etc/jaeger/production_sampling.json
  strategies-reload-interval: 30s

collector:
  http-server:
    host-port: 0.0.0.0:14268
  grpc-server:
    host-port: 0.0.0.0:14250

es:
  server-urls: http://elasticsearch-1:9200,http://elasticsearch-2:9200
  index-prefix: jaeger

Create sampling strategy hot reload

Implement a script that backs up and validates the strategies file, then confirms the collector is serving strategies, all without restarting the collector. A running collector re-reads the file on every strategies-reload-interval, so no signal is needed; the Go runtime's default reaction to an unhandled SIGHUP is to terminate the process. Save the script as /usr/local/bin/reload-sampling.sh.

#!/bin/bash

STRATEGIES_FILE="/etc/jaeger/production_sampling.json"
BACKUP_DIR="/var/backups/jaeger"
# Seconds to wait for the collector to re-read the file; keep this in line
# with --sampling.strategies-reload-interval.
RELOAD_INTERVAL=60
# The sampling endpoint requires a service name; frontend-service is an example.
SERVICE="frontend-service"

# Create backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
sudo mkdir -p "$BACKUP_DIR"
sudo cp "$STRATEGIES_FILE" "$BACKUP_DIR/sampling_strategies_$TIMESTAMP.json"

# Validate new configuration
echo "Validating new sampling configuration..."
if ! jq -e . "$STRATEGIES_FILE" > /dev/null 2>&1; then
  echo "Error: Invalid JSON in strategies file"
  exit 1
fi

# Wait for the periodic reload, or restart the collector if it is not running
if systemctl is-active --quiet jaeger-collector; then
  echo "Waiting ${RELOAD_INTERVAL}s for the collector to reload sampling strategies..."
  sleep "$RELOAD_INTERVAL"
else
  echo "Collector not running, restarting service..."
  sudo systemctl restart jaeger-collector
  sleep 2
fi

# Verify reload
echo "Verifying configuration reload..."
if curl -sf "http://localhost:14268/api/sampling?service=$SERVICE" | jq -e . > /dev/null; then
  echo "✓ Sampling endpoint is serving strategies"
else
  echo "✗ Failed to reload sampling strategies"
  exit 1
fi

Make the script executable:

sudo chmod +x /usr/local/bin/reload-sampling.sh
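
Because the collector may re-read the strategies file at any moment, whatever writes that file should never leave it half-written. One common approach, sketched below, is to write a temporary file in the same directory and rename it over the original, which is atomic on POSIX filesystems. The function name is hypothetical, and the 0644 mode is an assumption about how the collector's user reads the file.

```python
import json
import os
import tempfile


def write_strategies_atomically(path, strategies):
    """Replace the strategies file in one step so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(strategies, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk first
        # mkstemp creates the file as 0600; widen it so another user can read it
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Serializing before the rename also means a value that cannot be encoded as JSON raises an error while the previous file is still in place.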

Monitor and optimize sampling performance

Setup Prometheus metrics collection

Configure Prometheus to scrape Jaeger metrics for sampling analysis. This helps you monitor sampling effectiveness and storage impact.

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'jaeger-collector'
    static_configs:
      - targets: ['localhost:14269']
    scrape_interval: 10s
    metrics_path: /metrics
    
  - job_name: 'jaeger-agent'
    static_configs:
      - targets: ['localhost:14271']
    scrape_interval: 30s
    
  - job_name: 'jaeger-query'
    static_configs:
      - targets: ['localhost:16687']
    scrape_interval: 30s

Create sampling performance dashboard

Setup Grafana dashboard to visualize sampling metrics and trace volumes.

{
  "dashboard": {
    "title": "Jaeger Sampling Performance",
    "panels": [
      {
        "title": "Traces Received vs Stored",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(jaeger_collector_traces_received_total[5m])",
            "legendFormat": "Traces Received/sec"
          },
          {
            "expr": "rate(jaeger_collector_spans_saved_total[5m])", 
            "legendFormat": "Spans Saved/sec"
          }
        ]
      },
      {
        "title": "Sampling Rate by Service",
        "type": "table",
        "targets": [
          {
            "expr": "jaeger_collector_sampling_rate by (service)",
            "format": "table"
          }
        ]
      },
      {
        "title": "Storage Growth Rate", 
        "type": "singlestat",
        "targets": [
          {
            "expr": "rate(jaeger_collector_spans_saved_total[1h])"
          }
        ]
      }
    ]
  }
}

Setup sampling alerting rules

Create alerts for sampling issues like excessive trace volumes or sampling failures.

groups:
  - name: jaeger-sampling
    rules:
      - alert: JaegerHighTraceVolume
        expr: rate(jaeger_collector_traces_received_total[5m]) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High trace volume detected"
          description: "Jaeger is receiving {{ $value }} traces/sec, consider reducing sampling rates"
          
      - alert: JaegerSamplingEfficiencyLow
        expr: (rate(jaeger_collector_spans_saved_total[5m]) / rate(jaeger_collector_spans_received_total[5m])) < 0.01
        for: 10m
        labels:
          severity: warning 
        annotations:
          summary: "Very low sampling efficiency"
          description: "Only {{ $value | humanizePercentage }} of received spans are being saved"
          
      - alert: JaegerSamplingEndpointDown
        expr: up{job="jaeger-collector"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Jaeger sampling endpoint unavailable"
          description: "Applications cannot fetch sampling strategies"
          
      - alert: JaegerStorageGrowthHigh
        expr: increase(jaeger_collector_spans_saved_total[1h]) > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High storage growth rate"
          description: "Stored {{ $value }} spans in the last hour, storage will fill quickly"

Create sampling optimization script

Implement automated sampling optimization based on observed patterns and storage constraints.

#!/usr/bin/env python3
import json
import requests
import logging
from datetime import datetime, timedelta

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class SamplingOptimizer:
    def __init__(self, prometheus_url, collector_url, strategies_file):
        self.prometheus_url = prometheus_url
        self.collector_url = collector_url
        self.strategies_file = strategies_file
        
    def get_service_metrics(self, service_name, hours=24):
        """Get trace volume and error rate for a service"""
        end_time = datetime.now()
        start_time = end_time - timedelta(hours=hours)
        
        # Query trace volume
        volume_query = f'sum(rate(jaeger_collector_traces_received_total{{service="{service_name}"}}[1h]))'
        # Query error rate  
        error_query = f'sum(rate(jaeger_collector_spans_received_total{{service="{service_name}",error="true"}}[1h])) / sum(rate(jaeger_collector_spans_received_total{{service="{service_name}"}}[1h]))'
        
        try:
            volume_resp = requests.get(f"{self.prometheus_url}/api/v1/query", params={'query': volume_query})
            error_resp = requests.get(f"{self.prometheus_url}/api/v1/query", params={'query': error_query})
            
            volume = float(volume_resp.json()['data']['result'][0]['value'][1]) if volume_resp.json()['data']['result'] else 0
            error_rate = float(error_resp.json()['data']['result'][0]['value'][1]) if error_resp.json()['data']['result'] else 0
            
            return {'volume': volume, 'error_rate': error_rate}
        except Exception as e:
            logger.error(f"Failed to get metrics for {service_name}: {e}")
            return {'volume': 0, 'error_rate': 0}
            
    def calculate_optimal_rate(self, metrics):
        """Calculate optimal sampling rate based on metrics"""
        volume = metrics['volume']
        error_rate = metrics['error_rate']
        
        # Base rate calculation
        if volume > 1000:  # High volume service
            base_rate = 0.01
        elif volume > 100:  # Medium volume
            base_rate = 0.05
        else:  # Low volume
            base_rate = 0.2
            
        # Boost for high error rates
        if error_rate > 0.05:  # More than 5% errors
            base_rate = min(base_rate * 3, 0.5)
        elif error_rate > 0.01:  # More than 1% errors
            base_rate = min(base_rate * 2, 0.3)
            
        return round(base_rate, 3)
        
    def update_sampling_strategies(self):
        """Update sampling strategies based on observed metrics"""
        with open(self.strategies_file, 'r') as f:
            strategies = json.load(f)
            
        updated = False
        
        # service_strategies is the key Jaeger reads from the strategies file
        for strategy in strategies.get('service_strategies', []):
            if strategy.get('type') != 'probabilistic':
                continue  # param is a traces/sec cap here, not a probability
            service_name = strategy['service']
            metrics = self.get_service_metrics(service_name)
            optimal_rate = self.calculate_optimal_rate(metrics)
            
            if abs(strategy['param'] - optimal_rate) > 0.01:  # Significant change
                logger.info(f"Updating {service_name}: {strategy['param']} -> {optimal_rate}")
                strategy['param'] = optimal_rate
                updated = True
                
        if updated:
            # Back up the file as it exists on disk, before overwriting it
            backup_file = f"{self.strategies_file}.backup.{int(datetime.now().timestamp())}"
            with open(self.strategies_file, 'r') as src, open(backup_file, 'w') as dst:
                dst.write(src.read())
                
            # Write updated strategies
            with open(self.strategies_file, 'w') as f:
                json.dump(strategies, f, indent=2)
                
            logger.info("Sampling strategies updated and backed up")
            return True
            
        logger.info("No sampling strategy changes needed")
        return False
        
def main():
    optimizer = SamplingOptimizer(
        prometheus_url="http://localhost:9090",
        collector_url="http://localhost:14268", 
        strategies_file="/etc/jaeger/production_sampling.json"
    )
    
    if optimizer.update_sampling_strategies():
        # Reload strategies if updated
        import subprocess
        subprocess.run(['/usr/local/bin/reload-sampling.sh'])
        
if __name__ == "__main__":
    main()

Save the script as /usr/local/bin/optimize-sampling.py and make it executable:

sudo chmod +x /usr/local/bin/optimize-sampling.py

Install Python dependencies

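Install requests from the distribution packages. A system-wide pip3 install is refused on Debian 12 and Ubuntu 24.04, which mark the system Python as externally managed.

# Debian and Ubuntu
sudo apt install -y python3-requests

# AlmaLinux and Rocky Linux
sudo dnf install -y python3-requests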

Schedule automated optimization

Create a systemd service and timer to run sampling optimization periodically.

Service unit, /etc/systemd/system/jaeger-sampling-optimizer.service:

[Unit]
Description=Jaeger Sampling Optimizer
After=network.target

[Service]
Type=oneshot
User=jaeger
Group=jaeger
ExecStart=/usr/local/bin/optimize-sampling.py
StandardOutput=journal
StandardError=journal

Timer unit, /etc/systemd/system/jaeger-sampling-optimizer.timer:

[Unit]
Description=Run Jaeger Sampling Optimizer
Requires=jaeger-sampling-optimizer.service

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

Reload systemd, then enable and start the timer:

sudo systemctl daemon-reload
sudo systemctl enable jaeger-sampling-optimizer.timer
sudo systemctl start jaeger-sampling-optimizer.timer
sudo systemctl status jaeger-sampling-optimizer.timer

Verify your setup

Test that your sampling strategies are working correctly:

# Check sampling endpoint (the service query parameter is required)
curl -s "http://localhost:14268/api/sampling?service=frontend-service" | jq .

# Validate configuration
/usr/local/bin/validate-sampling.sh

# Check collector status
sudo systemctl status jaeger-collector

# Monitor trace volume
/usr/local/bin/monitor-sampling.sh

# Test hot reload
sudo /usr/local/bin/reload-sampling.sh

The sampling endpoint should return the strategy configured for the requested service, and the validation script should show all checks passing.

Common issues

| Symptom | Cause | Fix |
|---|---|---|
| Clients not getting sampling strategies | Collector not serving on :14268 | Check collector configuration and firewall rules |
| Sampling strategies not updating | Invalid JSON in strategies file | Validate JSON with jq . /etc/jaeger/sampling_strategies.json |
| Too many traces stored | Sampling rates too high | Reduce probabilistic parameters in configuration |
| Missing critical traces | Sampling rates too low | Increase sampling for important services/operations |
| High collector CPU usage | Processing too many traces | Implement rate limiting and reduce sampling rates |
| Storage growing too fast | Insufficient sampling limits | Cap high-volume services with rate-limiting strategies |
