Configure advanced Jaeger sampling strategies to efficiently capture traces in high-traffic production environments while controlling storage costs and maintaining observability.
Prerequisites
- Existing Jaeger installation
- Prometheus for metrics
- Root or sudo access
- Basic understanding of distributed tracing
What this solves
In high-volume production environments, tracing every request creates overwhelming data volumes and storage costs. Jaeger sampling strategies help you capture meaningful traces while controlling resource usage. This tutorial shows you how to implement adaptive sampling, per-service policies, and remote sampling configuration for production-scale distributed tracing.
Prerequisites
You need a running Jaeger deployment with Elasticsearch or another storage backend. If you don't have this yet, follow our Jaeger Kubernetes deployment guide.
Understanding sampling strategies
Jaeger supports several sampling strategies that determine which traces to collect:
| Strategy Type | Use Case | Configuration |
|---|---|---|
| Const | All-or-nothing sampling | Sample every trace (param 1) or none (param 0) |
| Probabilistic | Fixed percentage sampling | Sample a fixed fraction of traces, decided from the trace ID |
| RateLimiting | Maximum traces per second | Cap at N traces/second |
| Adaptive | Dynamic adjustment | Adjust based on traffic patterns |
| PerService | Service-specific rules | Different rates per service |
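To make the difference between the probabilistic and rate-limiting rows concrete, here is a minimal sketch of both decisions. It is an illustration of the general techniques, not Jaeger's implementation; the names `probabilistic_decision` and `RateLimiter` are hypothetical, and 64-bit trace IDs are assumed.

```python
import time

MAX_TRACE_ID = 2**64  # assumption: 64-bit trace IDs for this illustration


def probabilistic_decision(trace_id: int, rate: float) -> bool:
    """Sample when the trace ID falls in the lowest `rate` fraction of the ID space."""
    return trace_id < rate * MAX_TRACE_ID


class RateLimiter:
    """Token bucket that admits at most `per_second` traces per second on average."""

    def __init__(self, per_second: float, now=time.monotonic):
        self.per_second = per_second
        self.capacity = max(per_second, 1.0)  # allow at least one trace in a burst
        self.tokens = self.capacity
        self.now = now
        self.last = now()

    def allow(self) -> bool:
        current = self.now()
        elapsed = current - self.last
        self.last = current
        # Refill tokens for the elapsed time, never above the bucket capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Because the probabilistic decision depends only on the trace ID, every service that sees the same trace reaches the same answer, while the rate limiter bounds volume regardless of how much traffic arrives.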
Step-by-step configuration
Create sampling strategies configuration
Create a JSON configuration file that defines your sampling strategies. This file tells Jaeger how to sample traces for different services and operations.
{
"default_strategy": {
"type": "probabilistic",
"param": 0.1
},
"per_service_strategies": [
{
"service": "frontend-service",
"type": "probabilistic",
"param": 0.5,
"max_traces_per_second": 100
},
{
"service": "payment-service",
"type": "probabilistic",
"param": 1.0,
"max_traces_per_second": 50
},
{
"service": "logging-service",
"type": "probabilistic",
"param": 0.01,
"max_traces_per_second": 10
},
{
"service": "health-check",
"type": "probabilistic",
"param": 0.001
}
],
"per_operation_strategies": [
{
"service": "frontend-service",
"operation": "GET /health",
"type": "probabilistic",
"param": 0.001
},
{
"service": "api-gateway",
"operation": "POST /api/orders",
"type": "probabilistic",
"param": 0.8,
"max_traces_per_second": 200
}
]
}
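Jaeger's documentation for file-based sampling describes the top-level keys `default_strategy` and `service_strategies`, per-operation overrides nested as `operation_strategies`, and the strategy types `probabilistic` and `ratelimiting`. Key names or types outside that set may not be honored by your Jaeger version, so it can help to check a file before deploying it. The sketch below assumes that documented schema; the function names are hypothetical.

```python
ALLOWED_TOP_LEVEL = {"default_strategy", "service_strategies"}
ALLOWED_TYPES = {"probabilistic", "ratelimiting"}
ALLOWED_STRATEGY_KEYS = {"service", "operation", "type", "param", "operation_strategies"}


def check_strategy(strategy, path, problems):
    """Append a message to `problems` for every unexpected key, type, or param."""
    for key in strategy:
        if key not in ALLOWED_STRATEGY_KEYS:
            problems.append(f"{path}: unexpected key '{key}'")
    stype = strategy.get("type")
    if stype not in ALLOWED_TYPES:
        problems.append(f"{path}: unexpected type '{stype}'")
    param = strategy.get("param")
    if stype == "probabilistic" and not (
        isinstance(param, (int, float)) and 0 <= param <= 1
    ):
        problems.append(f"{path}: probabilistic param must be between 0 and 1")
    for i, op in enumerate(strategy.get("operation_strategies", [])):
        check_strategy(op, f"{path}.operation_strategies[{i}]", problems)


def validate_strategies(doc):
    """Return a list of problems; an empty list means the file matches the assumed schema."""
    problems = [
        f"unexpected top-level key '{key}'" for key in doc if key not in ALLOWED_TOP_LEVEL
    ]
    if "default_strategy" in doc:
        check_strategy(doc["default_strategy"], "default_strategy", problems)
    for i, svc in enumerate(doc.get("service_strategies", [])):
        check_strategy(svc, f"service_strategies[{i}]", problems)
    return problems
```

Running `validate_strategies(json.load(open(path)))` on a strategies file lists anything that deviates from the assumed schema, which is a useful prompt to compare the file against the documentation for the Jaeger version you run.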
Configure Jaeger Collector with sampling strategies
Update your Jaeger Collector configuration to use the sampling strategies file. This enables remote sampling, where the collector serves sampling strategies to clients and each client applies them locally when deciding whether to sample a trace.
sampling:
  strategies-file: /etc/jaeger/sampling_strategies.json
  strategies-reload-interval: 30s
http-server:
  host-port: :14268
grpc-server:
  host-port: :14250
processors:
  batch:
    timeout: 1s
    send-batch-size: 1024
    send-batch-max-size: 2048
Set up adaptive sampling with volume control
Create an advanced configuration that adapts sampling rates based on traffic volume and service importance.
{
"default_strategy": {
"type": "adaptive",
"max_traces_per_second": 500,
"param": 0.1
},
"per_service_strategies": [
{
"service": "user-service",
"type": "adaptive",
"param": 0.3,
"max_traces_per_second": 100,
"operation_strategies": [
{
"operation": "login",
"type": "probabilistic",
"param": 0.8
},
{
"operation": "register",
"type": "probabilistic",
"param": 1.0
}
]
},
{
"service": "database-service",
"type": "rate_limiting",
"param": 50
},
{
"service": "cache-service",
"type": "probabilistic",
"param": 0.05,
"max_traces_per_second": 20
}
]
}
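The idea behind adapting a rate to traffic volume can be sketched as a simple feedback step: scale the current probability by the ratio of the target throughput to the observed throughput. This is an illustration of the concept under stated assumptions, not Jaeger's adaptive sampling algorithm; `adjust_probability` and its `floor` and `ceiling` defaults are hypothetical.

```python
def adjust_probability(current, observed_tps, target_tps, floor=0.001, ceiling=1.0):
    """Return a new sampling probability that moves the sampled rate toward target_tps.

    current      -- probability currently in effect
    observed_tps -- sampled traces per second measured with `current`
    target_tps   -- desired sampled traces per second
    """
    if observed_tps <= 0:
        # No sampled traffic observed: sample everything until data arrives
        return ceiling
    proposed = current * (target_tps / observed_tps)
    # Clamp so a traffic spike or lull cannot push the rate to an extreme
    return max(floor, min(ceiling, proposed))
```

For example, a service sampled at 0.1 that produces 200 sampled traces per second against a target of 100 would move to 0.05, while a quiet service is capped at the ceiling rather than exceeding a probability of 1.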
Configure environment-specific sampling
Create different sampling configurations for development, staging, and production environments.
{
"default_strategy": {
"type": "probabilistic",
"param": 0.01
},
"per_service_strategies": [
{
"service": "critical-payment-service",
"type": "probabilistic",
"param": 0.5,
"max_traces_per_second": 1000
},
{
"service": "user-analytics",
"type": "probabilistic",
"param": 0.001,
"max_traces_per_second": 10
}
]
}
{
"default_strategy": {
"type": "probabilistic",
"param": 1.0
},
"per_service_strategies": [
{
"service": "test-service",
"type": "probabilistic",
"param": 1.0
}
]
}
Enable remote sampling in Jaeger Collector
Configure the Jaeger Collector to serve sampling strategies to client applications over HTTP.
sudo systemctl stop jaeger-collector
[Unit]
Description=Jaeger Collector
After=network.target
[Service]
Type=simple
User=jaeger
Group=jaeger
ExecStart=/usr/local/bin/jaeger-collector \
--config-file=/etc/jaeger/collector.yaml \
--sampling.strategies-file=/etc/jaeger/production_sampling.json \
--sampling.strategies-reload-interval=60s \
--collector.http-server.host-port=:14268 \
--collector.grpc-server.host-port=:14250
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl start jaeger-collector
sudo systemctl status jaeger-collector
Configure client applications for remote sampling
Update your application configuration to fetch sampling strategies from the Jaeger Collector instead of using local configuration.
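package main

import (
	"io"
	"time"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/uber/jaeger-client-go"
	"github.com/uber/jaeger-client-go/config"
)

// initJaeger builds a tracer that polls the collector for sampling strategies.
// The caller owns the returned closer and should close it on shutdown.
func initJaeger() (opentracing.Tracer, io.Closer) {
	cfg := config.Configuration{
		ServiceName: "my-service",
		Sampler: &config.SamplerConfig{
			Type:                    jaeger.SamplerTypeRemote,
			Param:                   0.1, // fallback sampling rate
			SamplingServerURL:       "http://jaeger-collector:14268/api/sampling",
			SamplingRefreshInterval: 60 * time.Second, // a time.Duration; a bare 60 would mean 60ns
		},
		Reporter: &config.ReporterConfig{
			LocalAgentHostPort: "jaeger-agent:6831",
		},
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		panic(err)
	}
	opentracing.SetGlobalTracer(tracer)
	return tracer, closer
}

func main() {
	_, closer := initJaeger()
	defer closer.Close() // flush and close the tracer when the program exits
}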
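When debugging remote sampling, it helps to read what the collector actually returns for a service. The sketch below summarizes such a response, assuming the JSON field names used by the collector's sampling endpoint (`probabilisticSampling.samplingRate`, `rateLimitingSampling.maxTracesPerSecond`, `operationSampling.perOperationStrategies`); confirm them against a real response from your version. The function name `describe_strategy` is hypothetical.

```python
def describe_strategy(response: dict) -> list:
    """Summarize a sampling response as human-readable lines."""
    lines = []
    prob = response.get("probabilisticSampling")
    if prob:
        lines.append(f"probabilistic: {prob['samplingRate']:.1%} of traces")
    limit = response.get("rateLimitingSampling")
    if limit:
        lines.append(f"rate limiting: {limit['maxTracesPerSecond']} traces/second")
    per_operation = (response.get("operationSampling") or {}).get(
        "perOperationStrategies"
    ) or []
    for op in per_operation:
        rate = op["probabilisticSampling"]["samplingRate"]
        lines.append(f"operation {op['operation']}: {rate:.1%} of traces")
    return lines
```

The input would typically come from `requests.get(url, params={"service": "my-service"}).json()` against the collector's `/api/sampling` endpoint.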
Set up sampling strategy monitoring
Create a monitoring script to track sampling effectiveness and adjust strategies based on metrics.
#!/bin/bash
# Get sampling stats from Jaeger
SERVICE="${1:-frontend-service}"   # the sampling endpoint requires a service name
SAMPLING_URL="http://localhost:14268/api/sampling?service=$SERVICE"
METRICS_URL="http://localhost:14269/metrics"

# Check current sampling strategies
echo "Current sampling strategy for $SERVICE:"
curl -s "$SAMPLING_URL" | jq .

# Get trace volume metrics
printf '\nTrace volume metrics:\n'
curl -s "$METRICS_URL" | grep '^jaeger_collector_traces_received_total'

# Check storage usage
printf '\nStorage usage:\n'
curl -s "$METRICS_URL" | grep '^jaeger_collector_spans_saved_total'

# Calculate sampling efficiency as spans saved / spans received.
# Each counter has one series per label set, so sum them all.
sum_metric() {
  curl -s "$METRICS_URL" | awk -v name="$1" 'index($0, name) == 1 { total += $NF } END { printf "%.0f", total }'
}
RECEIVED=$(sum_metric jaeger_collector_spans_received_total)
SAVED=$(sum_metric jaeger_collector_spans_saved_total)
if [ "$RECEIVED" -gt 0 ]; then
  # Multiply before dividing so bc's scale=2 does not truncate the ratio
  EFFICIENCY=$(echo "scale=2; $SAVED * 100 / $RECEIVED" | bc)
  printf '\nSampling efficiency: %s%%\n' "$EFFICIENCY"
fi
sudo chmod +x /usr/local/bin/monitor-sampling.sh
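Counters exposed on the metrics endpoint usually appear once per label combination, alongside `# HELP` and `# TYPE` comment lines, so picking a single line with `grep` can under-report. A minimal sketch of summing every series of one metric from the Prometheus text format follows; `sum_metric` is a hypothetical helper, and the parsing is deliberately simple rather than a full implementation of the exposition format.

```python
def sum_metric(exposition_text: str, metric_name: str) -> float:
    """Sum the values of all series whose metric name equals `metric_name`."""
    total = 0.0
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]; label values may contain spaces
            name, rest = line.split("{", 1)
            fields = rest.rsplit("}", 1)[1].split()
        else:
            # name value [timestamp]
            name, *fields = line.split()
        if name == metric_name and fields:
            total += float(fields[0])
    return total
```

Fetching the text with `requests.get("http://localhost:14269/metrics").text` and passing it to `sum_metric` gives a total that two successive readings can turn into a per-second rate.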
Create automated sampling adjustment script
Implement a script that automatically adjusts sampling rates based on system load and storage capacity.
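#!/usr/bin/env python3
import json
import logging
import time

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

METRIC = "jaeger_collector_traces_received_total"


class SamplingAdjuster:
    def __init__(self, collector_url, strategies_file):
        self.collector_url = collector_url
        self.strategies_file = strategies_file
        self.last_total = None  # counter value seen on the previous call
        self.last_time = None

    def read_counter(self):
        """Sum every series of the received-traces counter"""
        response = requests.get(f"{self.collector_url}/metrics", timeout=10)
        response.raise_for_status()
        total = 0.0
        for line in response.text.splitlines():
            # "# HELP" and "# TYPE" lines also contain the name, so match the line start
            if line.startswith(METRIC):
                total += float(line.split()[-1])
        return total

    def get_current_load(self):
        """Return traces per second since the previous call, or None if unknown"""
        try:
            total, now = self.read_counter(), time.monotonic()
        except Exception as e:
            logger.error(f"Failed to get metrics: {e}")
            return None
        rate = None
        # The metric is a cumulative counter, so the rate is the delta over elapsed time
        if self.last_total is not None and total >= self.last_total and now > self.last_time:
            rate = (total - self.last_total) / (now - self.last_time)
        self.last_total, self.last_time = total, now
        return rate

    def adjust_sampling_rate(self, current_load):
        """Adjust sampling based on load"""
        with open(self.strategies_file, 'r') as f:
            strategies = json.load(f)
        # Adjust default strategy based on load
        if current_load > 10000:  # High load
            strategies['default_strategy']['param'] = 0.01
        elif current_load > 1000:  # Medium load
            strategies['default_strategy']['param'] = 0.05
        else:  # Low load
            strategies['default_strategy']['param'] = 0.1
        # Write updated strategies
        with open(self.strategies_file, 'w') as f:
            json.dump(strategies, f, indent=2)
        logger.info(f"Adjusted sampling for load: {current_load}")


def main():
    adjuster = SamplingAdjuster(
        collector_url="http://localhost:14269",  # admin port that serves /metrics
        strategies_file="/etc/jaeger/production_sampling.json",
    )
    adjuster.get_current_load()  # first call only records a baseline
    while True:
        time.sleep(300)  # Check every 5 minutes
        load = adjuster.get_current_load()
        if load is not None:  # leave the file alone when metrics are unavailable
            adjuster.adjust_sampling_rate(load)


if __name__ == "__main__":
    main()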
sudo chmod +x /usr/local/bin/adjust-sampling.py
Set up sampling strategy validation
Create a validation script to ensure sampling configurations are working correctly.
#!/bin/bash
JAEGER_COLLECTOR="http://localhost:14268"
JAEGER_QUERY="http://localhost:16686"
SERVICE="${1:-frontend-service}"   # both endpoints below require a service name
SAMPLING_URL="$JAEGER_COLLECTOR/api/sampling?service=$SERVICE"

echo "Validating Jaeger sampling configuration..."

# Test sampling endpoint
echo "1. Testing sampling endpoint:"
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$SAMPLING_URL")
if [ "$HTTP_CODE" = "200" ]; then
  echo "✓ Sampling endpoint accessible"
else
  echo "✗ Sampling endpoint failed (HTTP $HTTP_CODE)"
  exit 1
fi

# Validate JSON structure
printf '\n2. Validating sampling strategy JSON:\n'
SAMPLING_JSON=$(curl -s "$SAMPLING_URL")
if echo "$SAMPLING_JSON" | jq . > /dev/null 2>&1; then
  echo "✓ Valid JSON structure"
else
  echo "✗ Invalid JSON structure"
  exit 1
fi

# Check for required fields: the response describes the strategy for one service
printf '\n3. Checking required fields:\n'
if echo "$SAMPLING_JSON" | jq -e 'has("probabilisticSampling") or has("rateLimitingSampling") or has("operationSampling")' > /dev/null; then
  echo "✓ Strategy returned for $SERVICE"
else
  echo "✗ No strategy returned for $SERVICE"
fi

# Test trace collection
printf '\n4. Testing trace collection:\n'
TRACE_COUNT=$(curl -s "$JAEGER_QUERY/api/traces?service=$SERVICE&limit=1" | jq -r '.data | length')
if [ "${TRACE_COUNT:-0}" -gt 0 ] 2>/dev/null; then
  echo "✓ Traces are being collected"
else
  echo "! No recent traces found (this may be normal)"
fi

printf '\nSampling validation complete.\n'
sudo chmod +x /usr/local/bin/validate-sampling.sh
Configure per-service sampling policies
Create service-tier based sampling
Implement different sampling rates based on service criticality and business importance.
{
"default_strategy": {
"type": "probabilistic",
"param": 0.1
},
"per_service_strategies": [
{
"service": "tier1-payment-gateway",
"type": "probabilistic",
"param": 0.8,
"max_traces_per_second": 500,
"operation_strategies": [
{
"operation": "process_payment",
"type": "probabilistic",
"param": 1.0
},
{
"operation": "refund_payment",
"type": "probabilistic",
"param": 1.0
}
]
},
{
"service": "tier2-user-service",
"type": "probabilistic",
"param": 0.3,
"max_traces_per_second": 200
},
{
"service": "tier3-analytics",
"type": "probabilistic",
"param": 0.05,
"max_traces_per_second": 50
},
{
"service": "tier4-background-jobs",
"type": "probabilistic",
"param": 0.01,
"max_traces_per_second": 10
}
]
}
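With more than a handful of services, maintaining tier entries by hand gets error-prone. One option is to keep a service-to-tier mapping and generate the entries from it. The sketch below assumes the tier rates from the example above (0.8, 0.3, 0.05, 0.01); the names `TIER_RATES` and `build_service_entries` are hypothetical, and the returned list is meant to be placed under whichever per-service key your strategies file uses.

```python
# Sampling probability per tier, taken from the tiered example above
TIER_RATES = {1: 0.8, 2: 0.3, 3: 0.05, 4: 0.01}


def build_service_entries(service_tiers, tier_rates=TIER_RATES):
    """Turn {service_name: tier} into a list of probabilistic strategy entries."""
    entries = []
    for service, tier in sorted(service_tiers.items()):
        if tier not in tier_rates:
            raise ValueError(f"{service}: unknown tier {tier}")
        entries.append(
            {"service": service, "type": "probabilistic", "param": tier_rates[tier]}
        )
    return entries
```

Sorting by service name keeps the generated file stable between runs, which makes diffs and backups easier to review.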
Set up error-based sampling boost
Configure higher sampling rates for services experiencing errors to improve debugging.
{
"default_strategy": {
"type": "probabilistic",
"param": 0.1
},
"per_service_strategies": [
{
"service": "error-prone-service",
"type": "probabilistic",
"param": 0.5,
"max_traces_per_second": 100,
"operation_strategies": [
{
"operation": "failing_endpoint",
"type": "probabilistic",
"param": 1.0
}
]
}
],
"per_operation_strategies": [
{
"service": "*",
"operation": "error",
"type": "probabilistic",
"param": 0.8
},
{
"service": "*",
"operation": "exception",
"type": "probabilistic",
"param": 0.8
}
]
}
Set up remote sampling with Jaeger Collector
Configure collector for high availability
Set up multiple Jaeger Collectors behind a load balancer so that clients can always fetch sampling strategies.
sampling:
  strategies-file: /etc/jaeger/production_sampling.json
  strategies-reload-interval: 30s
http-server:
  host-port: 0.0.0.0:14268
grpc-server:
  host-port: 0.0.0.0:14250
span-storage:
  type: elasticsearch
  elasticsearch:
    server-urls: http://elasticsearch-1:9200,http://elasticsearch-2:9200
    index-prefix: jaeger
processors:
  batch:
    timeout: 1s
    send-batch-size: 2048
    send-batch-max-size: 4096
metrics-storage:
  type: prometheus
Create sampling strategy hot reload
Implement a system to update sampling strategies without restarting the collector.
#!/bin/bash
STRATEGIES_FILE="/etc/jaeger/production_sampling.json"
BACKUP_DIR="/var/backups/jaeger"
RELOAD_INTERVAL=60                 # seconds; keep in sync with --sampling.strategies-reload-interval
SERVICE="${1:-frontend-service}"   # the sampling endpoint requires a service name

# Validate new configuration
echo "Validating new sampling configuration..."
if ! jq . "$STRATEGIES_FILE" > /dev/null 2>&1; then
  echo "Error: Invalid JSON in strategies file"
  exit 1
fi

# Create backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
sudo mkdir -p "$BACKUP_DIR"
sudo cp "$STRATEGIES_FILE" "$BACKUP_DIR/sampling_strategies_$TIMESTAMP.json"

# With a reload interval set, the collector re-reads the file on its own, so
# no signal is sent here (an unhandled SIGHUP would stop the process).
# Restart only if the collector is not running.
if systemctl is-active --quiet jaeger-collector; then
  echo "Waiting ${RELOAD_INTERVAL}s for the collector to reload strategies..."
  sleep "$RELOAD_INTERVAL"
else
  echo "Collector is not running, restarting service..."
  sudo systemctl restart jaeger-collector
  sleep 2
fi

# Verify reload
echo "Verifying configuration reload..."
if RESPONSE=$(curl -sf "http://localhost:14268/api/sampling?service=$SERVICE") && echo "$RESPONSE" | jq -e . > /dev/null; then
  echo "✓ Sampling strategies successfully reloaded"
else
  echo "✗ Failed to reload sampling strategies"
  exit 1
fi
sudo chmod +x /usr/local/bin/reload-sampling.sh
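Because the collector re-reads the strategies file on a timer, a script that rewrites the file in place can be caught mid-write. A common remedy is to write to a temporary file in the same directory and rename it over the original, which is atomic on POSIX filesystems. This is a minimal sketch of that pattern; `write_strategies_atomically` is a hypothetical helper, and the `0o644` mode is an assumption about what the collector's user needs in order to read the file.

```python
import json
import os
import tempfile


def write_strategies_atomically(strategies: dict, path: str) -> None:
    """Serialize first, then replace `path` in one step so readers never see a partial file."""
    payload = json.dumps(strategies, indent=2)  # raises before touching disk if not serializable
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the rename
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600; assumed readable mode
        os.replace(tmp_path, path)  # atomic rename within the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The temporary file has to live in the same directory as the target because a rename is only atomic within a single filesystem.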
Monitor and optimize sampling performance
Set up Prometheus metrics collection
Configure Prometheus to scrape Jaeger metrics for sampling analysis. This helps you monitor sampling effectiveness and storage impact.
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'jaeger-collector'
    static_configs:
      - targets: ['localhost:14269']
    scrape_interval: 10s
    metrics_path: /metrics
  - job_name: 'jaeger-agent'
    static_configs:
      - targets: ['localhost:14271']
    scrape_interval: 30s
  - job_name: 'jaeger-query'
    static_configs:
      - targets: ['localhost:16687']
    scrape_interval: 30s
Create sampling performance dashboard
Set up a Grafana dashboard to visualize sampling metrics and trace volumes.
{
"dashboard": {
"title": "Jaeger Sampling Performance",
"panels": [
{
"title": "Traces Received vs Stored",
"type": "graph",
"targets": [
{
"expr": "rate(jaeger_collector_traces_received_total[5m])",
"legendFormat": "Traces Received/sec"
},
{
"expr": "rate(jaeger_collector_spans_saved_total[5m])",
"legendFormat": "Spans Saved/sec"
}
]
},
{
"title": "Sampling Rate by Service",
"type": "table",
"targets": [
{
"expr": "avg by (service) (jaeger_collector_sampling_rate)",
"format": "table"
}
]
},
{
"title": "Storage Growth Rate",
"type": "singlestat",
"targets": [
{
"expr": "rate(jaeger_collector_spans_saved_total[1h])"
}
]
}
]
}
}
Set up sampling alerting rules
Create alerts for sampling issues like excessive trace volumes or sampling failures.
groups:
  - name: jaeger-sampling
    rules:
      - alert: JaegerHighTraceVolume
        expr: rate(jaeger_collector_traces_received_total[5m]) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High trace volume detected"
          description: "Jaeger is receiving {{ $value }} traces/sec, consider reducing sampling rates"
      - alert: JaegerSamplingEfficiencyLow
        expr: (rate(jaeger_collector_spans_saved_total[5m]) / rate(jaeger_collector_spans_received_total[5m])) < 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Very low sampling efficiency"
          description: "Only {{ $value | humanizePercentage }} of received spans are being saved"
      - alert: JaegerSamplingEndpointDown
        expr: up{job="jaeger-collector"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Jaeger sampling endpoint unavailable"
          description: "Applications cannot fetch sampling strategies"
      - alert: JaegerStorageGrowthHigh
        # increase() over 1h yields spans per hour; rate() would yield spans per second
        expr: increase(jaeger_collector_spans_saved_total[1h]) > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High storage growth rate"
          description: "Storing {{ $value }} spans/hour, storage will fill quickly"
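To pick a threshold for the storage growth alert, it helps to translate a span rate into disk usage. The arithmetic is simply spans per second times average span size times the retention window. The sketch below is a back-of-the-envelope estimate; `estimate_storage_gb` is a hypothetical helper, and the average span size is an assumption you would measure for your own backend (index overhead and replicas are not included).

```python
def estimate_storage_gb(spans_per_second, avg_span_bytes, retention_days):
    """Estimate raw span storage in gigabytes (10^9 bytes) for a retention window."""
    seconds = retention_days * 24 * 60 * 60
    return spans_per_second * avg_span_bytes * seconds / 1e9
```

With an assumed 500 bytes per span, 1,000 saved spans per second kept for 7 days comes to 1000 × 500 × 604,800 bytes, or about 302.4 GB before indexing and replication.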
Create sampling optimization script
Implement automated sampling optimization based on observed patterns and storage constraints.
#!/usr/bin/env python3
import json
import logging
import shutil
import subprocess
import time

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class SamplingOptimizer:
    def __init__(self, prometheus_url, collector_url, strategies_file):
        self.prometheus_url = prometheus_url
        self.collector_url = collector_url
        self.strategies_file = strategies_file

    def query_value(self, query):
        """Run an instant Prometheus query; return the first value, or 0 if empty"""
        resp = requests.get(f"{self.prometheus_url}/api/v1/query",
                            params={'query': query}, timeout=10)
        resp.raise_for_status()
        result = resp.json()['data']['result']
        return float(result[0]['value'][1]) if result else 0

    def get_service_metrics(self, service_name):
        """Get trace volume and error rate for a service, or None on failure"""
        # Query trace volume
        volume_query = f'sum(rate(jaeger_collector_traces_received_total{{service="{service_name}"}}[1h]))'
        # Query error rate
        spans = 'jaeger_collector_spans_received_total'
        error_query = (f'sum(rate({spans}{{service="{service_name}",error="true"}}[1h]))'
                       f' / sum(rate({spans}{{service="{service_name}"}}[1h]))')
        try:
            return {'volume': self.query_value(volume_query),
                    'error_rate': self.query_value(error_query)}
        except Exception as e:
            logger.error(f"Failed to get metrics for {service_name}: {e}")
            return None

    def calculate_optimal_rate(self, metrics):
        """Calculate optimal sampling rate based on metrics"""
        volume = metrics['volume']
        error_rate = metrics['error_rate']
        # Base rate calculation
        if volume > 1000:  # High volume service
            base_rate = 0.01
        elif volume > 100:  # Medium volume
            base_rate = 0.05
        else:  # Low volume
            base_rate = 0.2
        # Boost for high error rates
        if error_rate > 0.05:  # More than 5% errors
            base_rate = min(base_rate * 3, 0.5)
        elif error_rate > 0.01:  # More than 1% errors
            base_rate = min(base_rate * 2, 0.3)
        return round(base_rate, 3)

    def update_sampling_strategies(self):
        """Update sampling strategies based on observed metrics"""
        with open(self.strategies_file, 'r') as f:
            strategies = json.load(f)
        updated = False
        for strategy in strategies.get('per_service_strategies', []):
            if strategy.get('type') != 'probabilistic':
                continue  # param is a probability only for probabilistic strategies
            service_name = strategy['service']
            metrics = self.get_service_metrics(service_name)
            if metrics is None:
                continue  # keep the current rate when metrics are unavailable
            optimal_rate = self.calculate_optimal_rate(metrics)
            if abs(strategy['param'] - optimal_rate) > 0.01:  # Significant change
                logger.info(f"Updating {service_name}: {strategy['param']} -> {optimal_rate}")
                strategy['param'] = optimal_rate
                updated = True
        if not updated:
            logger.info("No sampling strategy changes needed")
            return False
        # Back up the file as it is on disk, before overwriting it
        backup_file = f"{self.strategies_file}.backup.{int(time.time())}"
        shutil.copy2(self.strategies_file, backup_file)
        # Write updated strategies
        with open(self.strategies_file, 'w') as f:
            json.dump(strategies, f, indent=2)
        logger.info("Sampling strategies updated and backed up")
        return True


def main():
    optimizer = SamplingOptimizer(
        prometheus_url="http://localhost:9090",
        collector_url="http://localhost:14268",
        strategies_file="/etc/jaeger/production_sampling.json",
    )
    if optimizer.update_sampling_strategies():
        # Reload strategies if updated
        subprocess.run(['/usr/local/bin/reload-sampling.sh'])


if __name__ == "__main__":
    main()
sudo chmod +x /usr/local/bin/optimize-sampling.py
# Install Python dependencies
sudo pip3 install requests
Schedule automated optimization
Create a systemd timer to run sampling optimization periodically.
# jaeger-sampling-optimizer.service
[Unit]
Description=Jaeger Sampling Optimizer
After=network.target
[Service]
Type=oneshot
User=jaeger
Group=jaeger
ExecStart=/usr/local/bin/optimize-sampling.py
StandardOutput=journal
StandardError=journal
# jaeger-sampling-optimizer.timer
[Unit]
Description=Run Jaeger Sampling Optimizer
Requires=jaeger-sampling-optimizer.service
[Timer]
OnCalendar=hourly
Persistent=true
[Install]
WantedBy=timers.target
sudo systemctl daemon-reload
sudo systemctl enable jaeger-sampling-optimizer.timer
sudo systemctl start jaeger-sampling-optimizer.timer
sudo systemctl status jaeger-sampling-optimizer.timer
Verify your setup
Test that your sampling strategies are working correctly:
# Check sampling endpoint (the service query parameter is required)
curl "http://localhost:14268/api/sampling?service=frontend-service" | jq .
# Validate configuration
/usr/local/bin/validate-sampling.sh
# Check collector status
sudo systemctl status jaeger-collector
# Monitor trace volume
/usr/local/bin/monitor-sampling.sh
# Test hot reload
sudo /usr/local/bin/reload-sampling.sh
The sampling endpoint should return your configured strategies, and the validation script should show all checks passing.
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Clients not getting sampling strategies | Collector not serving on :14268 | Check collector configuration and firewall rules |
| Sampling strategies not updating | Invalid JSON in strategies file | Validate JSON with jq . /etc/jaeger/sampling_strategies.json |
| Too many traces stored | Sampling rates too high | Reduce probabilistic parameters in configuration |
| Missing critical traces | Sampling rates too low | Increase sampling for important services/operations |
| High collector CPU usage | Processing too many traces | Implement rate limiting and reduce sampling rates |
| Storage growing too fast | Insufficient sampling limits | Add max_traces_per_second limits to services |
Next steps
- Setup Jaeger alerting with Prometheus and Grafana for comprehensive monitoring
- Implement application monitoring with Prometheus for end-to-end observability
- Configure Jaeger data retention and archiving policies for long-term storage management
- Setup Jaeger high availability clustering for production resilience