Configure OpenTelemetry sampling strategies for high-traffic applications

Intermediate · 25 min · Apr 07, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Learn how to implement probabilistic, tail-based, and adaptive sampling strategies in OpenTelemetry to optimize distributed tracing performance and reduce storage costs in high-traffic production environments.

Prerequisites

  • Root or sudo access
  • At least 2GB RAM available
  • Basic understanding of distributed tracing concepts

What this solves

OpenTelemetry sampling strategies help you control the volume of trace data collected from your applications, reducing storage costs and performance overhead while preserving observability insights. This tutorial shows you how to configure different sampling strategies, including probabilistic, tail-based, and adaptive sampling, for high-traffic applications that generate millions of traces daily.

Step-by-step configuration

Install OpenTelemetry Collector

Download and install the OpenTelemetry Collector Contrib distribution, which will handle trace sampling and forwarding. The contrib build is required here because the sampling components used below (probabilistic_sampler, tail_sampling, groupbytrace, and the jaeger receiver) ship in contrib, not in the core collector.

curl -LO https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.91.0/otelcol-contrib_0.91.0_linux_amd64.tar.gz
tar -xzf otelcol-contrib_0.91.0_linux_amd64.tar.gz
sudo mv otelcol-contrib /usr/local/bin/
sudo chmod +x /usr/local/bin/otelcol-contrib
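
Confirm the binary is installed and on your PATH; this prints the collector version and build information.

otelcol-contrib --version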

Create OpenTelemetry user and directories

Create a dedicated user for running the collector and set up required directories with proper permissions.

sudo useradd --system --shell /bin/false otel
sudo mkdir -p /etc/otelcol /var/log/otelcol /var/lib/otelcol
sudo chown -R otel:otel /etc/otelcol /var/log/otelcol /var/lib/otelcol
sudo chmod 755 /etc/otelcol /var/log/otelcol /var/lib/otelcol

Configure probabilistic sampling

Set up probabilistic sampling, which keeps a fixed percentage of traces chosen by hashing the trace ID, so every collector instance makes the same decision for a given trace. This is ideal for simple, consistent sampling across all services.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268

processors:
  probabilistic_sampler:
    sampling_percentage: 10.0
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s

exporters:
  otlp/jaeger:
    # Jaeger accepts OTLP natively; the dedicated jaeger exporter was removed
    # from collector releases in v0.86.0. Point this at your Jaeger host's
    # OTLP gRPC port (it cannot share 4317 with this collector on one host).
    endpoint: jaeger-host:4317
    tls:
      insecure: true
  logging:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp/jaeger, logging]
  telemetry:
    logs:
      level: "info"
      development: false
      sampling:
        initial: 5
        thereafter: 200
      output_paths:
        - "/var/log/otelcol/collector.log"
      error_output_paths:
        - "/var/log/otelcol/collector-error.log"

Configure tail-based sampling

Implement tail-based sampling for more intelligent trace selection based on errors, latency, and other criteria.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors_policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: latency_policy
        type: latency
        latency:
          threshold_ms: 5000
      - name: probabilistic_policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5.0
      - name: rate_limiting_policy
        type: rate_limiting
        rate_limiting:
          spans_per_second: 100
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048
  memory_limiter:
    limit_mib: 1024
    spike_limit_mib: 256
    check_interval: 5s

exporters:
  otlp/jaeger:
    # Jaeger's OTLP gRPC endpoint; the dedicated jaeger exporter no longer exists
    endpoint: jaeger-host:4317
    tls:
      insecure: true
  logging:
    verbosity: normal

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger, logging]
  telemetry:
    logs:
      level: "info"
      output_paths:
        - "/var/log/otelcol/collector.log"

Configure layered policies for adaptive-style sampling

Approximate adaptive sampling by layering tail-sampling policies: errors and slow traces are always kept, named critical services get guaranteed coverage, a rate limit caps span throughput, and a low probabilistic rate provides a baseline. The collector has no fully adaptive sampler built in, so truly dynamic rates require an external control loop (or a mechanism such as Jaeger remote sampling) that adjusts this configuration.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  groupbytrace:
    wait_duration: 10s
    num_traces: 100000
    num_workers: 4
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 2000
    policies:
      - name: always_sample_errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: high_latency_traces
        type: latency
        latency:
          threshold_ms: 2000
      - name: service_name_policy
        type: string_attribute
        string_attribute:
          key: service.name
          values: ["critical-service", "payment-service"]
          enabled_regex_matching: false
          invert_match: false
      - name: adaptive_rate_policy
        type: rate_limiting
        rate_limiting:
          spans_per_second: 200
      - name: probabilistic_fallback
        type: probabilistic
        probabilistic:
          sampling_percentage: 1.0
  batch:
    timeout: 1s
    send_batch_size: 2048
    send_batch_max_size: 4096
  memory_limiter:
    limit_mib: 2048
    spike_limit_mib: 512
    check_interval: 5s

exporters:
  otlp/jaeger:
    # Jaeger's OTLP gRPC endpoint; collector self-metrics for Prometheus are
    # already exposed via service::telemetry::metrics on :8888 below
    endpoint: jaeger-host:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, groupbytrace, tail_sampling, batch]
      exporters: [otlp/jaeger]
  telemetry:
    logs:
      level: "info"
    metrics:
      level: "detailed"
      address: "0.0.0.0:8888"

Create systemd service file

Configure the OpenTelemetry Collector as a systemd service with hardening options. Save the unit below as /etc/systemd/system/otelcol.service.

[Unit]
Description=OpenTelemetry Collector
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=otel
Group=otel
ExecStart=/usr/local/bin/otelcol-contrib --config=/etc/otelcol/otelcol-adaptive.yaml
Restart=always
RestartSec=10
Environment=OTEL_LOG_LEVEL=info
WorkingDirectory=/var/lib/otelcol
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/otelcol /var/lib/otelcol
NoNewPrivileges=true
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
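
Assuming the unit was saved as /etc/systemd/system/otelcol.service, lint it before enabling:

sudo systemd-analyze verify /etc/systemd/system/otelcol.service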

Configure log rotation

Set up log rotation to prevent collector logs from consuming too much disk space.

/var/log/otelcol/*.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    create 644 otel otel
    # the collector does not reopen its log files on a signal, so copy the
    # file and truncate it in place instead of reloading the service
    copytruncate
}
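
Save the snippet as /etc/logrotate.d/otelcol, then dry-run it to confirm the rules parse without rotating anything:

sudo logrotate -d /etc/logrotate.d/otelcol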

Start and enable the service

Enable and start the OpenTelemetry Collector service with your chosen sampling configuration.

sudo systemctl daemon-reload
sudo systemctl enable otelcol
sudo systemctl start otelcol
sudo systemctl status otelcol

Configure application instrumentation

Update your application to send traces to the OpenTelemetry Collector with appropriate service name and attributes.

export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="your-service-name"
export OTEL_RESOURCE_ATTRIBUTES="service.version=1.0.0,deployment.environment=production"
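
If you also want head sampling in the SDK (decided at trace start, before spans ever leave the application), the standard OpenTelemetry environment variables configure it. Note that the effective rate multiplies with the collector's probabilistic_sampler percentage.

export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.25"   # sample 25% of new root traces; children follow the parent's decision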

Set up monitoring and alerting

Configure Prometheus to scrape the collector's self-telemetry metrics (exposed on port 8888) so you can monitor sampling effectiveness. Add this job under scrape_configs in prometheus.yml.

  - job_name: 'otelcol'
    static_configs:
      - targets: ['localhost:8888']
    scrape_interval: 30s
    metrics_path: /metrics
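
An alert on those metrics catches a collector that is silently shedding data. This is a sketch: the dropped-spans counter name varies across collector versions, so confirm the exact name in your /metrics output first.

groups:
  - name: otelcol
    rules:
      - alert: OtelColDroppingSpans
        # metric name is an assumption; check /metrics for your version
        expr: rate(otelcol_processor_dropped_spans_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: OpenTelemetry Collector is dropping spans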

Understanding OpenTelemetry sampling concepts

OpenTelemetry supports multiple sampling strategies, each with specific use cases. Probabilistic (head) sampling keeps a fixed percentage of traces, with the decision made when the trace starts; deterministic variants derive that decision from the trace ID, so every component independently agrees on which traces to keep. Tail-based sampling analyzes complete traces before deciding whether to keep them, allowing more intelligent decisions based on errors, latency, or custom attributes.

The sampling decision propagates through your distributed system via trace context headers. When a service makes a sampling decision, downstream services inherit that decision, ensuring complete traces are either fully sampled or fully dropped. This prevents incomplete traces that would be difficult to analyze.
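
Concretely, the sampling decision travels in the W3C traceparent header; the final trace-flags byte is 01 when the upstream service sampled the trace (the IDs below are the example values from the W3C spec).

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#            version-trace_id-parent_span_id-trace_flags (01 = sampled)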

Sampling Type | Use Case                                  | Performance Impact          | Decision Point
Probabilistic | Consistent percentage across all services | Low CPU, immediate decision | At trace start
Tail-based    | Sample based on trace characteristics     | Higher memory usage         | After trace completion
Adaptive      | Dynamic adjustment based on traffic       | Moderate CPU overhead       | Real-time adjustment

Monitor and optimize sampling performance

Use the collector's built-in metrics to monitor sampling effectiveness. Exact metric names vary by collector version, so inspect the /metrics output and look for the otelcol_processor_* counters covering sampled spans, dropped spans, and tail-sampling policy decisions. These show how much data you are shedding and which policies are most active.

Monitor memory usage closely when using tail-based sampling, as it buffers traces until sampling decisions are made. Adjust the num_traces and decision_wait parameters based on your traffic patterns and available resources. For applications with very high trace volume, consider running multiple collector instances behind a trace-aware load balancer, as sketched below.
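
Contrib's loadbalancing exporter can run in a thin front-tier collector and route spans by trace ID, so every span of a trace lands on the same tail-sampling instance. A minimal sketch with placeholder hostnames:

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - collector-1:4317
          - collector-2:4317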

Note: Tail-based sampling requires complete traces to make decisions, so ensure your decision_wait parameter is longer than your longest expected trace duration.

Verify your setup

sudo systemctl status otelcol
curl -s http://localhost:8888/metrics | grep sampling
journalctl -u otelcol -f --lines=20

Check that your application is sending traces and that sampling decisions are being logged:

tail -f /var/log/otelcol/collector.log | grep -E "(sampled|dropped)"
curl -s http://localhost:8888/metrics | grep -E "(sampled_spans|dropped_spans)"

Common issues

Symptom | Cause | Fix
High memory usage | Too many traces buffered for tail sampling | Reduce the num_traces or decision_wait parameters
Incomplete traces in backend | decision_wait too short for long traces | Increase decision_wait to cover your longest trace duration
No sampling decisions logged | Wrong processor order in pipeline | Place sampling processors before the batch processor
Service won't start | Configuration syntax error | Run otelcol-contrib validate --config=/etc/otelcol/otelcol-adaptive.yaml
Traces not being forwarded | Exporter connectivity issues | Check network connectivity and the exporter endpoint configuration
