Set up the OpenTelemetry SDK to collect custom application metrics, export them to Prometheus for storage, and visualize performance data in Grafana dashboards with automated alerting.
Prerequisites
- Root or sudo access
- At least 2GB RAM
- Python 3.8+ (for examples)
- Basic understanding of Prometheus and Grafana
- Network connectivity for package downloads
What this solves
OpenTelemetry custom metrics give you detailed insights into your application's performance beyond basic system metrics. You can track business-specific metrics like user sign-ups, order completion rates, or API response times. This tutorial shows you how to instrument applications with OpenTelemetry, send metrics to Prometheus, and build Grafana dashboards for monitoring and alerting.
Step-by-step installation
Install OpenTelemetry Collector
The OpenTelemetry Collector receives metrics from your applications and exposes them for Prometheus to scrape. Download and install the collector binary; this tutorial pins release v0.91.0.
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.91.0/otelcol_0.91.0_linux_amd64.tar.gz
tar -xzf otelcol_0.91.0_linux_amd64.tar.gz
sudo mv otelcol /usr/local/bin/
sudo chmod +x /usr/local/bin/otelcol
Create collector configuration
Configure the collector to receive OTLP metrics and expose them in Prometheus format. This config accepts OTLP over gRPC on port 4317 and over HTTP on port 4318, and serves Prometheus metrics on port 8889. Save it as /etc/otelcol-config.yaml, the path the systemd unit passes to --config.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "app"
    const_labels:
      environment: "production"
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
  telemetry:
    logs:
      level: info
Create systemd service for collector
Set up the collector as a systemd service for automatic startup and management.
[Unit]
Description=OpenTelemetry Collector
After=network.target
[Service]
Type=simple
User=nobody
Group=nogroup
ExecStart=/usr/local/bin/otelcol --config=/etc/otelcol-config.yaml
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Start OpenTelemetry Collector
Enable and start the collector service to begin accepting metrics from your applications.
sudo systemctl daemon-reload
sudo systemctl enable --now otelcol
sudo systemctl status otelcol
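Once the service reports active, it can help to confirm that the collector is actually listening on the ports from the configuration. The following is a minimal sketch using only the Python standard library; the helper names `port_open` and `check_collector_ports` and the 1-second timeout are illustrative choices, not part of the collector tooling.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_collector_ports(host: str = "localhost", ports=(4317, 4318, 8889)) -> dict:
    """Map each collector port (OTLP gRPC, OTLP HTTP, Prometheus) to its reachability."""
    return {port: port_open(host, port) for port in ports}
```

Calling `check_collector_ports()` on the collector host should report all three ports as reachable when the service is healthy.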
Install Prometheus
Install Prometheus to scrape metrics from the OpenTelemetry Collector and store them for querying.
sudo apt update
sudo apt install -y prometheus
Configure Prometheus to scrape OpenTelemetry metrics
Add the OpenTelemetry Collector as a scrape target in Prometheus configuration. This tells Prometheus to collect metrics from the collector's Prometheus endpoint.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "/etc/prometheus/rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['localhost:8889']
    scrape_interval: 10s
    metrics_path: /metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
Create Prometheus alerting rules
Set up alerting rules for custom metrics to notify you when application performance degrades.
sudo mkdir -p /etc/prometheus/rules
groups:
  - name: application_metrics
    rules:
      - alert: HighErrorRate
        expr: rate(app_http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors per second"
      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, rate(app_http_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Slow response times detected"
          description: "95th percentile response time is {{ $value }} seconds"
      - alert: LowThroughput
        expr: rate(app_http_requests_total[5m]) < 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low request throughput"
          description: "Request rate is {{ $value }} requests per second"
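The alert expressions rely on `rate()`, which turns an ever-increasing counter into a per-second average over the lookback window. The sketch below is a deliberately simplified model of that calculation with hypothetical sample data; real Prometheus also extrapolates to the window boundaries, which this version ignores.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Approximate the per-second increase of a counter across the samples."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset (e.g. process restart),
        # so the whole new value counts as increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical 5xx counter: 60 additional errors over a 5-minute window.
window = [(0, 100), (100, 120), (200, 140), (300, 160)]
error_rate = simple_rate(window)
print(f"{error_rate:.2f} errors/s, HighErrorRate condition met: {error_rate > 0.1}")
```

With these numbers the rate is 0.2 errors per second, so the `HighErrorRate` condition would hold; the alert would still wait for its `for: 2m` period before firing.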
Start Prometheus
Enable and start Prometheus to begin collecting metrics from the OpenTelemetry Collector.
sudo systemctl enable --now prometheus
sudo systemctl status prometheus
Install Grafana
Install Grafana to create dashboards and visualizations for your OpenTelemetry metrics.
sudo apt install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo apt update
sudo apt install -y grafana
Configure Grafana data source
Add Prometheus as a data source in Grafana to query your OpenTelemetry metrics.
sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server
Access Grafana at http://your-server:3000 with username admin and password admin. Navigate to Configuration > Data Sources and add Prometheus with URL http://localhost:9090.
Install OpenTelemetry SDK in your application
Add OpenTelemetry instrumentation to your application. This example shows a Python implementation with custom metrics; save the script as app_metrics.py so you can run it during verification.
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
import time
import random

# Configure OpenTelemetry: export over OTLP/HTTP to the collector every 5 seconds
metric_exporter = OTLPMetricExporter(
    endpoint="http://localhost:4318/v1/metrics",
    headers={}
)
metric_reader = PeriodicExportingMetricReader(
    exporter=metric_exporter,
    export_interval_millis=5000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter("app_metrics", "1.0.0")

# Create custom metrics
request_counter = meter.create_counter(
    name="http_requests_total",
    description="Total number of HTTP requests",
    unit="1"
)
response_time_histogram = meter.create_histogram(
    name="http_request_duration_seconds",
    description="HTTP request duration in seconds",
    unit="s"
)
active_connections_gauge = meter.create_up_down_counter(
    name="active_connections",
    description="Number of active connections",
    unit="1"
)

# Example usage
def handle_request(endpoint, status_code):
    start_time = time.time()
    # Simulate request processing
    processing_time = random.uniform(0.1, 2.0)
    time.sleep(processing_time)
    # Record metrics using the measured wall-clock duration
    duration = time.time() - start_time
    request_counter.add(1, {"endpoint": endpoint, "status": str(status_code)})
    response_time_histogram.record(duration, {"endpoint": endpoint})
    return f"Processed {endpoint} in {duration:.2f}s"

# Simulate application traffic
if __name__ == "__main__":
    endpoints = ["/api/users", "/api/orders", "/api/products"]
    for i in range(100):
        endpoint = random.choice(endpoints)
        status = random.choices([200, 404, 500], weights=[85, 10, 5])[0]
        active_connections_gauge.add(1)
        result = handle_request(endpoint, status)
        active_connections_gauge.add(-1)
        print(f"Request {i+1}: {result}")
        time.sleep(0.1)
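In a real handler the duration has to be measured around the actual work rather than simulated. One way to keep that out of the handler body is a small context manager, sketched below. It assumes only that the instrument has a `record(value, attributes)` method, as OpenTelemetry histograms do; `timed` and `FakeHistogram` are hypothetical names introduced here so the example runs without the SDK installed.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(histogram, attributes=None):
    """Time the wrapped block and record the elapsed seconds on exit."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed requests are measured too.
        histogram.record(time.perf_counter() - start, attributes or {})


class FakeHistogram:
    """Stand-in for an OpenTelemetry histogram (hypothetical, demo only)."""

    def __init__(self):
        self.recorded = []

    def record(self, value, attributes=None):
        self.recorded.append((value, attributes))


demo = FakeHistogram()
with timed(demo, {"endpoint": "/api/users"}):
    time.sleep(0.05)  # placeholder for real request handling
print(f"recorded {demo.recorded[0][0]:.3f}s for {demo.recorded[0][1]}")
```

In the application, `response_time_histogram` would take the place of `FakeHistogram`, for example `with timed(response_time_histogram, {"endpoint": endpoint}): ...`.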
Create Grafana dashboard
Define a custom dashboard to visualize your OpenTelemetry metrics with panels for request rate, error rate, response time, and active connections. Save the JSON as otel_dashboard.json so it can be imported through the Grafana API later.
{
"dashboard": {
"id": null,
"title": "OpenTelemetry Application Metrics",
"tags": ["opentelemetry", "monitoring"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "stat",
"targets": [
{
"expr": "rate(app_http_requests_total[5m])",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"min": 0
}
},
"gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Error Rate",
"type": "stat",
"targets": [
{
"expr": "sum(rate(app_http_requests_total{status=~\"5..\"}[5m])) / sum(rate(app_http_requests_total[5m]))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"min": 0,
"max": 1
}
},
"gridPos": {"h": 8, "w": 6, "x": 6, "y": 0}
},
{
"id": 3,
"title": "Response Time (95th percentile)",
"type": "stat",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(app_http_request_duration_seconds_bucket[5m]))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"min": 0
}
},
"gridPos": {"h": 8, "w": 6, "x": 12, "y": 0}
},
{
"id": 4,
"title": "Active Connections",
"type": "stat",
"targets": [
{
"expr": "app_active_connections",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"min": 0
}
},
"gridPos": {"h": 8, "w": 6, "x": 18, "y": 0}
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "10s"
}
}
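The 95th-percentile panel depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. A simplified sketch of that idea follows; the bucket boundaries and counts are hypothetical, and it is meant as an illustration of the technique rather than a reproduction of the Prometheus implementation.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs sorted by bound."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Interpolate linearly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets in seconds (cumulative counts).
buckets = [(0.5, 50), (1.0, 90), (2.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

Here the 95th of 100 observations falls halfway through the 1.0 to 2.0 second bucket, so the estimate is 1.5 seconds. The estimate can only be as precise as the bucket boundaries allow.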
Configure firewall rules
Open the necessary ports for OpenTelemetry Collector, Prometheus, and Grafana to communicate properly.
sudo ufw allow 3000/tcp
sudo ufw allow 9090/tcp
sudo ufw allow 4317/tcp
sudo ufw allow 4318/tcp
sudo ufw allow 8889/tcp
Configure custom metrics collection
Add business metrics to your application
Extend your application with business-specific metrics like user signups, revenue, or feature usage. These metrics provide insights into application performance from a business perspective.
# Additional business metrics
user_signups = meter.create_counter(
    name="user_signups_total",
    description="Total number of user signups",
    unit="1"
)
order_value_histogram = meter.create_histogram(
    name="order_value_dollars",
    description="Order value in dollars",
    unit="USD"
)
feature_usage = meter.create_counter(
    name="feature_usage_total",
    description="Feature usage by type",
    unit="1"
)

# Example business event tracking
def track_user_signup(user_type, source):
    user_signups.add(1, {
        "user_type": user_type,
        "source": source
    })

def track_order(order_value, product_category):
    order_value_histogram.record(order_value, {
        "category": product_category
    })

def track_feature_use(feature_name, user_tier):
    feature_usage.add(1, {
        "feature": feature_name,
        "tier": user_tier
    })
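Each distinct combination of attribute values recorded on a counter becomes its own time series once the metric reaches Prometheus. The sketch below models that behaviour with a small in-memory stand-in; the class `SeriesCounter` is hypothetical and is not part of the OpenTelemetry API. It may help when deciding which attributes are worth attaching.

```python
from collections import defaultdict


class SeriesCounter:
    """Toy counter that keeps one running total per attribute combination."""

    def __init__(self, name: str):
        self.name = name
        self.series = defaultdict(int)

    def add(self, amount: int, attributes: dict) -> None:
        # Sort the items so that attribute order does not create a new series.
        key = tuple(sorted(attributes.items()))
        self.series[key] += amount


signups = SeriesCounter("user_signups_total")
signups.add(1, {"user_type": "free", "source": "ads"})
signups.add(1, {"source": "ads", "user_type": "free"})
signups.add(1, {"user_type": "paid", "source": "referral"})

for labels, total in signups.series.items():
    print(dict(labels), total)
print("series count:", len(signups.series))
```

Three calls produce two series here, because the first two share the same attribute values. Attributes with many possible values multiply the series count accordingly.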
Set up metric sampling and filtering
Configure the collector to filter out irrelevant metrics, which reduces storage costs and improves query performance. The filter processor drops any metric whose name matches the debug or test patterns. The probabilistic_sampler processor works on traces and logs rather than metrics, so it is not part of this metrics pipeline.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  filter/drop_debug:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ".*_debug.*"
          - ".*_test.*"
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "app"
    const_labels:
      environment: "production"
      version: "1.0.0"
    metric_expiration: 180s
    enable_open_metrics: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop_debug, batch]
      exporters: [prometheus]
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
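Before deploying an exclude list it can be useful to check which metric names it would drop. The following sketch approximates that check in Python, assuming exclude patterns of the form `.*_debug.*` and `.*_test.*`. The collector evaluates its patterns with Go's regular expression engine, so Python's `re` is only a stand-in here, and the metric names are hypothetical.

```python
import re
from typing import Iterable, List


def surviving_metrics(names: Iterable[str], exclude_patterns: Iterable[str]) -> List[str]:
    """Return the metric names that match none of the exclude patterns."""
    compiled = [re.compile(p) for p in exclude_patterns]
    return [n for n in names if not any(rx.search(n) for rx in compiled)]


patterns = [r".*_debug.*", r".*_test.*"]
names = [
    "http_requests_total",
    "cache_debug_hits",
    "login_test_counter",
    "user_signups_total",
]
print(surviving_metrics(names, patterns))
```

Only `http_requests_total` and `user_signups_total` remain in this example; the other two names contain `_debug` or `_test` and would be filtered out.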
Set up Grafana dashboards and alerting
Create alerting rules in Grafana
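Set up Grafana alerts that send notifications when metrics exceed thresholds, so you learn about performance problems before users report them. Grafana's unified alerting exposes rule creation at /api/v1/provisioning/alert-rules. The request here is abbreviated: that API also expects a folderUID, a ruleGroup, and a datasourceUid on each query, so add those values from your own instance before sending it.
curl -X POST http://admin:admin@localhost:3000/api/v1/provisioning/alert-rules \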
-H "Content-Type: application/json" \
-d '{
"title": "High Error Rate Alert",
"condition": "B",
"data": [
{
"refId": "A",
"queryType": "",
"relativeTimeRange": {
"from": 300,
"to": 0
},
"model": {
"expr": "rate(app_http_requests_total{status=~\"5..\"}[5m])",
"refId": "A"
}
},
{
"refId": "B",
"queryType": "",
"model": {
"conditions": [
{
"evaluator": {
"params": [0.1],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A"]
},
"reducer": {
"params": [],
"type": "avg"
},
"type": "query"
}
],
"refId": "B"
}
}
],
"intervalSeconds": 60,
"noDataState": "NoData",
"execErrState": "Alerting",
"for": "2m"
}'
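The `for` setting means the condition must hold continuously for that long before the alert fires; until then the rule sits in a pending state. A minimal sketch of that state logic, with hypothetical evaluation results at 60-second intervals (the function name `alert_states` is illustrative and not a Grafana API):

```python
from typing import Iterable, List, Tuple


def alert_states(evaluations: Iterable[Tuple[int, bool]], for_seconds: int) -> List[str]:
    """Return the alert state after each (timestamp, condition_met) evaluation."""
    states = []
    pending_since = None
    for timestamp, condition_met in evaluations:
        if not condition_met:
            # Any evaluation below the threshold resets the pending timer.
            pending_since = None
            states.append("Normal")
            continue
        if pending_since is None:
            pending_since = timestamp
        held_for = timestamp - pending_since
        states.append("Alerting" if held_for >= for_seconds else "Pending")
    return states


evals = [(0, True), (60, True), (120, True), (180, False), (240, True)]
print(alert_states(evals, for_seconds=120))
```

With a two-minute `for` period the rule only reaches the alerting state at the third consecutive breach, and a single healthy evaluation sends it back to normal.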
Configure notification channels
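Set up Slack or email notifications so your team is notified when an alert fires. The /api/alert-notifications endpoint used here belongs to Grafana's legacy alerting; instances running unified alerting manage contact points through /api/v1/provisioning/contact-points instead.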
curl -X POST http://admin:admin@localhost:3000/api/alert-notifications \
-H "Content-Type: application/json" \
-d '{
"name": "slack-alerts",
"type": "slack",
"settings": {
"url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
"username": "Grafana",
"channel": "#alerts",
"title": "Application Alert",
"text": "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}"
}
}'
Import pre-built dashboard
Upload the dashboard JSON created earlier, saved as otel_dashboard.json, through the Grafana HTTP API.
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d '@otel_dashboard.json'
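The import endpoint expects the dashboard model wrapped in a `dashboard` key, optionally with an `overwrite` flag. The sketch below builds that envelope and checks for duplicate panel ids before anything is sent; the helper name `build_import_payload` is hypothetical, and the sample dashboard is a trimmed-down placeholder.

```python
import json


def build_import_payload(dashboard: dict, overwrite: bool = False) -> str:
    """Wrap a dashboard model in the envelope used by POST /api/dashboards/db."""
    # Accept either a bare dashboard model or a file that is already wrapped.
    model = dict(dashboard.get("dashboard", dashboard))
    # A null id asks Grafana to create a new dashboard rather than update one.
    model.setdefault("id", None)
    panel_ids = [panel.get("id") for panel in model.get("panels", [])]
    if len(panel_ids) != len(set(panel_ids)):
        raise ValueError("panel ids must be unique within a dashboard")
    return json.dumps({"dashboard": model, "overwrite": overwrite})


payload = build_import_payload(
    {"title": "OpenTelemetry Application Metrics", "panels": [{"id": 1}, {"id": 2}]},
    overwrite=True,
)
print(payload)
```

The resulting string can be written to a file and passed to the `curl` command in place of the hand-written JSON.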
Verify your setup
# Check OpenTelemetry Collector status
sudo systemctl status otelcol
# Verify collector is receiving metrics
curl http://localhost:8889/metrics | grep app_
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="otel-collector")'
# Test metric ingestion
python3 app_metrics.py
# Query metrics in Prometheus
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=app_http_requests_total'
# Check Grafana data source
curl http://admin:admin@localhost:3000/api/datasources
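The `grep app_` check can also be done programmatically, which is handy in a smoke test. The sketch below extracts the metric names with a given prefix from Prometheus text exposition output; the sample text is hypothetical and stands in for the response from the collector's /metrics endpoint.

```python
from typing import List


def metric_names(exposition: str, prefix: str = "app_") -> List[str]:
    """Return sorted, de-duplicated metric names that start with prefix."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


sample = """\
# HELP app_http_requests_total Total number of HTTP requests
# TYPE app_http_requests_total counter
app_http_requests_total{endpoint="/api/users",status="200"} 42
app_http_request_duration_seconds_bucket{endpoint="/api/users",le="0.5"} 17
app_active_connections 0
process_cpu_seconds_total 1.5
"""
print(metric_names(sample))
```

An empty result for the `app_` prefix suggests the application has not exported anything yet or is pointed at the wrong OTLP endpoint.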
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Metrics not appearing in Prometheus | Collector endpoint misconfigured | Check /etc/otelcol-config.yaml endpoint settings and firewall rules |
| High memory usage in collector | No memory limits configured | Add memory_limiter processor with appropriate limit_mib |
| Missing metrics in Grafana | Wrong Prometheus URL | Verify data source URL is http://localhost:9090 |
| Alerts not triggering | Alert rule query syntax error | Test queries in Prometheus before adding to Grafana alerts |
| Application metrics not exported | OTLP exporter endpoint wrong | Ensure application sends to http://localhost:4318/v1/metrics |
Next steps
- Configure Jaeger distributed tracing on Kubernetes cluster with Helm charts and Elasticsearch backend
- Implement Node.js application monitoring with Prometheus metrics collection and Grafana dashboards
- Set up OpenTelemetry distributed tracing with Kubernetes and Jaeger for microservices
- Configure Prometheus Alertmanager with custom webhook integrations for Slack, Microsoft Teams, and PagerDuty notifications
- Implement OpenTelemetry auto-instrumentation for Java, Python, and Node.js applications
Automated install script
Run this to automate the entire setup
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# Global variables
OTEL_VERSION="0.91.0"
TOTAL_STEPS=8
print_status() {
echo -e "${GREEN}[INFO]${NC} $1"
}
print_warning() {
echo -e "${YELLOW}[WARN]${NC} $1"
}
print_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
usage() {
echo "Usage: $0 [OPTIONS]"
echo "Install OpenTelemetry Collector with Prometheus and configure monitoring stack"
echo ""
echo "Options:"
echo " -h, --help Show this help message"
echo " --cleanup Remove installed components"
exit 1
}
cleanup() {
print_warning "Cleaning up due to error..."
systemctl stop otelcol 2>/dev/null || true
systemctl disable otelcol 2>/dev/null || true
rm -f /etc/systemd/system/otelcol.service
rm -f /usr/local/bin/otelcol
rm -f /etc/otelcol-config.yaml
systemctl daemon-reload
}
remove_components() {
print_status "Removing OpenTelemetry components..."
systemctl stop otelcol 2>/dev/null || true
systemctl disable otelcol 2>/dev/null || true
rm -f /etc/systemd/system/otelcol.service
rm -f /usr/local/bin/otelcol
rm -f /etc/otelcol-config.yaml
rm -rf /etc/prometheus/rules
systemctl daemon-reload
print_status "Cleanup completed"
exit 0
}
trap cleanup ERR
# Parse arguments
while [[ $# -gt 0 ]]; do
case $1 in
-h|--help) usage ;;
--cleanup) remove_components ;;
*) print_error "Unknown option: $1"; usage ;;
esac
done
# Check if running as root or with sudo
if [[ $EUID -ne 0 ]]; then
print_error "This script must be run as root or with sudo"
exit 1
fi
# Detect distribution
echo "[1/$TOTAL_STEPS] Detecting distribution..."
if [ -f /etc/os-release ]; then
. /etc/os-release
case "$ID" in
ubuntu|debian)
PKG_MGR="apt"
PKG_INSTALL="apt install -y"
PKG_UPDATE="apt update"
PROMETHEUS_PKG="prometheus"
PROM_CONFIG="/etc/prometheus/prometheus.yml"
;;
almalinux|rocky|centos|rhel|ol|fedora)
PKG_MGR="dnf"
PKG_INSTALL="dnf install -y"
PKG_UPDATE="dnf update -y"
PROMETHEUS_PKG="golang-github-prometheus"
PROM_CONFIG="/etc/prometheus/prometheus.yml"
;;
amzn)
PKG_MGR="yum"
PKG_INSTALL="yum install -y"
PKG_UPDATE="yum update -y"
PROMETHEUS_PKG="prometheus"
PROM_CONFIG="/etc/prometheus/prometheus.yml"
;;
*)
print_error "Unsupported distribution: $ID"
exit 1
;;
esac
print_status "Detected $PRETTY_NAME"
else
print_error "Cannot detect distribution"
exit 1
fi
# Install prerequisites
echo "[2/$TOTAL_STEPS] Installing prerequisites..."
$PKG_UPDATE > /dev/null 2>&1
$PKG_INSTALL wget curl tar > /dev/null 2>&1
# Download and install OpenTelemetry Collector
echo "[3/$TOTAL_STEPS] Installing OpenTelemetry Collector..."
cd /tmp
wget -q "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v${OTEL_VERSION}/otelcol_${OTEL_VERSION}_linux_amd64.tar.gz"
tar -xzf "otelcol_${OTEL_VERSION}_linux_amd64.tar.gz"
mv otelcol /usr/local/bin/
chmod 755 /usr/local/bin/otelcol
chown root:root /usr/local/bin/otelcol
rm -f "otelcol_${OTEL_VERSION}_linux_amd64.tar.gz"
# Create OpenTelemetry Collector configuration
echo "[4/$TOTAL_STEPS] Creating collector configuration..."
cat > /etc/otelcol-config.yaml << 'EOF'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "app"
    const_labels:
      environment: "production"
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
  telemetry:
    logs:
      level: info
EOF
chmod 644 /etc/otelcol-config.yaml
chown root:root /etc/otelcol-config.yaml
# Create systemd service
echo "[5/$TOTAL_STEPS] Creating systemd service..."
cat > /etc/systemd/system/otelcol.service << 'EOF'
[Unit]
Description=OpenTelemetry Collector
After=network.target
[Service]
Type=simple
User=nobody
Group=nogroup
ExecStart=/usr/local/bin/otelcol --config=/etc/otelcol-config.yaml
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
EOF
chmod 644 /etc/systemd/system/otelcol.service
systemctl daemon-reload
systemctl enable otelcol
systemctl start otelcol
# Install Prometheus
echo "[6/$TOTAL_STEPS] Installing Prometheus..."
if [[ "$PKG_MGR" == "dnf" || "$PKG_MGR" == "yum" ]]; then
if [[ "$PKG_MGR" == "dnf" ]]; then
dnf install -y epel-release > /dev/null 2>&1
else
yum install -y epel-release > /dev/null 2>&1
fi
fi
$PKG_INSTALL $PROMETHEUS_PKG > /dev/null 2>&1
# Configure Prometheus
echo "[7/$TOTAL_STEPS] Configuring Prometheus..."
mkdir -p /etc/prometheus/rules
chown prometheus:prometheus /etc/prometheus/rules
chmod 755 /etc/prometheus/rules
cat > $PROM_CONFIG << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "/etc/prometheus/rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['localhost:8889']
    scrape_interval: 10s
    metrics_path: /metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
EOF
chown prometheus:prometheus $PROM_CONFIG
chmod 644 $PROM_CONFIG
# Create alerting rules
cat > /etc/prometheus/rules/application.yml << 'EOF'
groups:
  - name: application_metrics
    rules:
      - alert: HighErrorRate
        expr: rate(app_http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors per second"
      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, rate(app_http_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Slow response time detected"
          description: "95th percentile response time is {{ $value }} seconds"
EOF
chown prometheus:prometheus /etc/prometheus/rules/application.yml
chmod 644 /etc/prometheus/rules/application.yml
systemctl enable prometheus
systemctl restart prometheus
# Configure firewall
if command -v firewall-cmd > /dev/null 2>&1; then
firewall-cmd --permanent --add-port=4317/tcp --add-port=4318/tcp --add-port=8889/tcp --add-port=9090/tcp > /dev/null 2>&1
firewall-cmd --reload > /dev/null 2>&1
elif command -v ufw > /dev/null 2>&1; then
ufw allow 4317/tcp > /dev/null 2>&1
ufw allow 4318/tcp > /dev/null 2>&1
ufw allow 8889/tcp > /dev/null 2>&1
ufw allow 9090/tcp > /dev/null 2>&1
fi
# Verification
echo "[8/$TOTAL_STEPS] Verifying installation..."
sleep 3
if ! systemctl is-active --quiet otelcol; then
print_error "OpenTelemetry Collector service is not running"
exit 1
fi
if ! systemctl is-active --quiet prometheus; then
print_error "Prometheus service is not running"
exit 1
fi
if ! curl -s http://localhost:8889/metrics > /dev/null; then
print_error "OpenTelemetry Collector metrics endpoint not accessible"
exit 1
fi
if ! curl -s http://localhost:9090/-/ready > /dev/null; then
print_error "Prometheus not ready"
exit 1
fi
print_status "Installation completed successfully!"
print_status "OpenTelemetry Collector: http://localhost:8889/metrics"
print_status "Prometheus: http://localhost:9090"
print_status "OTLP endpoint: http://localhost:4318 (HTTP), localhost:4317 (gRPC)"
Review the script before running it. The script checks for root privileges, so execute it with: sudo bash install.sh