Set up Thanos components across multiple Kubernetes clusters to enable global metrics federation, long-term storage, and unified querying of Prometheus data with high availability and unlimited retention.
Prerequisites
- Multiple Kubernetes clusters
- Object storage (MinIO/S3)
- kubectl and Helm installed
- Ingress controller configured
What this solves
Thanos multi-cluster federation addresses the limitations of single Prometheus deployments by providing global metrics aggregation across multiple clusters. This setup enables unlimited data retention with object storage, horizontal scaling of query workloads, and centralized monitoring dashboards that span your entire infrastructure. Use this when you need to monitor multiple Kubernetes clusters, require long-term metrics storage beyond local disk limits, or want to reduce storage costs while maintaining query performance.
Step-by-step configuration
Install required dependencies
Update your system and install kubectl, helm, and other tools needed for Thanos deployment.
sudo apt update && sudo apt upgrade -y
sudo apt install -y wget curl unzip
Install kubectl and helm
Install the Kubernetes command-line tool and Helm package manager for deploying Thanos components.
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
curl https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz -o helm.tar.gz
tar -zxvf helm.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
Create object storage configuration
Configure MinIO or S3-compatible storage for Thanos long-term storage. Create the storage bucket and access credentials.
kubectl create namespace thanos
kubectl create secret generic thanos-storage-config -n thanos --from-literal=config.yaml='
type: s3
config:
  bucket: "thanos-metrics"
  endpoint: "minio.example.com:9000"
  access_key: "thanos-access-key"
  secret_key: "thanos-secret-key"
  insecure: false
  signature_version2: false
  put_user_metadata: {}
  http_config:
    idle_conn_timeout: 90s
    response_header_timeout: 2m
  trace:
    enable: false
  part_size: 134217728
'
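Typing keys directly on the command line leaves them in shell history, and Secrets are namespace-scoped, so a Secret created only in thanos is not visible to workloads in other namespaces. The following is a minimal sketch, not a definitive implementation: the function names render_objstore_config and apply_objstore_secret are hypothetical, and it assumes kubectl already points at the target cluster and that the listed namespaces exist.

```shell
# Render a minimal Thanos objstore config from arguments.
# Hypothetical helper; only a subset of the S3 fields is emitted.
render_objstore_config() {
  local bucket="$1" endpoint="$2" access_key="$3" secret_key="$4"
  cat <<EOF
type: s3
config:
  bucket: "${bucket}"
  endpoint: "${endpoint}"
  access_key: "${access_key}"
  secret_key: "${secret_key}"
  insecure: false
EOF
}

# Create or update the thanos-storage-config Secret in each namespace given.
# Assumes kubectl is configured and the namespaces already exist.
apply_objstore_secret() {
  local config_file="$1"; shift
  local ns
  for ns in "$@"; do
    kubectl create secret generic thanos-storage-config \
      --namespace "$ns" \
      --from-file=config.yaml="$config_file" \
      --dry-run=client -o yaml | kubectl apply -f -
  done
}

# Example usage (not executed here):
#   render_objstore_config thanos-metrics minio.example.com:9000 \
#     "$ACCESS_KEY" "$SECRET_KEY" > /tmp/objstore.yaml
#   apply_objstore_secret /tmp/objstore.yaml thanos monitoring
```

Piping through --dry-run=client and kubectl apply makes the step repeatable, which is the same pattern the install script at the end of this guide uses for namespaces.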
Note: For AWS S3, use s3.amazonaws.com as the endpoint.
Configure Prometheus with Thanos Sidecar
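Deploy Prometheus with Thanos Sidecar in each cluster. The sidecar uploads completed TSDB blocks to object storage and serves recent data to Thanos Query over gRPC. Save the following values as prometheus-values.yaml. Because Secrets are namespace-scoped, the thanos-storage-config Secret must also exist in the monitoring namespace where Prometheus runs.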
prometheus:
  prometheusSpec:
    thanos:
      image: quay.io/thanos/thanos:v0.32.5
      version: v0.32.5
      objectStorageConfig:
        secretName: thanos-storage-config
        secretKey: config.yaml
      baseImage: quay.io/thanos/thanos
      resources:
        requests:
          memory: 512Mi
          cpu: 500m
        limits:
          memory: 1Gi
          cpu: 1000m
    retention: 2h
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleNamespaceSelector: {}
    ruleSelectorNilUsesHelmValues: false
Deploy Prometheus with Thanos Sidecar
Install Prometheus using Helm with the Thanos sidecar configuration for each cluster.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-cluster-1 prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--values prometheus-values.yaml \
--set prometheus.prometheusSpec.externalLabels.cluster="cluster-1" \
--set prometheus.prometheusSpec.externalLabels.region="us-east-1"
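Each cluster needs its own value for the cluster external label, since that label is how Thanos Query tells the sources apart. As a rough sketch of repeating the install per cluster, the helper below only prints the Helm command for review. The function name helm_install_cmd is hypothetical, the cluster:region pairs are placeholders, and it assumes one kubeconfig context per cluster, named after the cluster.

```shell
# Print (do not run) the Helm command for one cluster.
# Assumption: a kubeconfig context exists with the same name as the cluster.
helm_install_cmd() {
  local cluster="$1" region="$2"
  echo "helm upgrade --install prometheus-${cluster}" \
    "prometheus-community/kube-prometheus-stack" \
    "--kube-context ${cluster}" \
    "--namespace monitoring --create-namespace" \
    "--values prometheus-values.yaml" \
    "--set prometheus.prometheusSpec.externalLabels.cluster=${cluster}" \
    "--set prometheus.prometheusSpec.externalLabels.region=${region}"
}

# Placeholder cluster:region pairs; replace with your own.
for pair in cluster-1:us-east-1 cluster-2:eu-west-1; do
  helm_install_cmd "${pair%%:*}" "${pair#*:}"
done
```

Printing the commands first makes it easy to check the labels before anything is applied; once they look right, the output can be pasted into a shell.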
Configure Thanos Query component
Deploy Thanos Query to aggregate metrics from multiple clusters and provide a unified query interface.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: thanos
  labels:
    app: thanos-query
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
        - name: thanos-query
          image: quay.io/thanos/thanos:v0.32.5
          args:
            - query
            - --log.level=info
            - --query.replica-label=replica
            - --query.replica-label=prometheus_replica
            - --store=thanos-store-gateway:10901
            - --store=prometheus-cluster-1-prometheus.monitoring:10901
            - --store=prometheus-cluster-2-prometheus.monitoring:10901
            - --query.auto-downsampling
            - --query.partial-response
            - --query.max-concurrent=20
            - --query.timeout=2m
            - --query.lookback-delta=15m
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          resources:
            requests:
              memory: 512Mi
              cpu: 500m
            limits:
              memory: 2Gi
              cpu: 1000m
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 10902
            initialDelaySeconds: 30
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 10902
            initialDelaySeconds: 10
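The two --query.replica-label flags tell Thanos Query which labels distinguish HA replicas of the same Prometheus, so that series differing only in those labels are merged into one. The snippet below is a conceptual sketch of that idea only, not how Thanos implements deduplication. The function name dedup_series is hypothetical, and it assumes one series per line in PromQL-style notation with no commas inside label values.

```shell
# Conceptual illustration: strip the replica labels from each series,
# tidy up leftover commas, then collapse identical label sets.
dedup_series() {
  sed -E \
    -e 's/([{,])(prometheus_replica|replica)="[^"]*"/\1/g' \
    -e 's/,+/,/g' \
    -e 's/[{],/{/' \
    -e 's/,[}]/}/' | sort -u
}

# Example usage: two replicas in cluster-1 collapse to a single series.
printf '%s\n' \
  'up{cluster="cluster-1",prometheus_replica="prometheus-0"}' \
  'up{cluster="cluster-1",prometheus_replica="prometheus-1"}' \
  'up{cluster="cluster-2",prometheus_replica="prometheus-0"}' | dedup_series
```

Note that the cluster label survives, which is why each cluster needs a distinct external label: without it, series from different clusters would collapse as well.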
Deploy Thanos Query service
Create a Kubernetes service to expose Thanos Query for internal cluster access and external queries.
apiVersion: v1
kind: Service
metadata:
  name: thanos-query
  namespace: thanos
  labels:
    app: thanos-query
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 9090
      targetPort: 10902
      protocol: TCP
    - name: grpc
      port: 10901
      targetPort: 10901
      protocol: TCP
  selector:
    app: thanos-query
Apply Thanos Query configuration
Deploy the Thanos Query components to your Kubernetes cluster.
kubectl apply -f thanos-query.yaml
kubectl apply -f thanos-query-service.yaml
Configure Thanos Store Gateway
Deploy Thanos Store Gateway to serve historical metrics data from object storage with caching for improved query performance.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-store-gateway
  namespace: thanos
  labels:
    app: thanos-store-gateway
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-store-gateway
  template:
    metadata:
      labels:
        app: thanos-store-gateway
    spec:
      containers:
        - name: thanos-store
          image: quay.io/thanos/thanos:v0.32.5
          args:
            - store
            - --log.level=info
            - --data-dir=/var/thanos/store
            - --objstore.config-file=/etc/thanos/config.yaml
            - --index-cache-size=2GB
            - --chunk-pool-size=2GB
            - --store.grpc.series-max-concurrency=20
            - --sync-block-duration=3m
            - --block-sync-concurrency=20
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          volumeMounts:
            - name: storage-config
              mountPath: /etc/thanos
              readOnly: true
            - name: data
              mountPath: /var/thanos/store
          resources:
            requests:
              memory: 2Gi
              cpu: 1000m
            limits:
              memory: 8Gi
              cpu: 2000m
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 10902
            initialDelaySeconds: 30
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 10902
            initialDelaySeconds: 10
      volumes:
        - name: storage-config
          secret:
            secretName: thanos-storage-config
        - name: data
          persistentVolumeClaim:
            claimName: thanos-store-data
Create persistent volume for Store Gateway
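Create a persistent volume claim for Thanos Store Gateway cache and metadata storage. A ReadWriteOnce claim can be mounted from only one node at a time, so with replicas: 2 both pods must land on the same node; to spread replicas across nodes, use a StatefulSet with volumeClaimTemplates instead.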
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thanos-store-data
  namespace: thanos
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
Deploy Store Gateway service
Create a service for Thanos Store Gateway to enable communication with Thanos Query components.
apiVersion: v1
kind: Service
metadata:
  name: thanos-store-gateway
  namespace: thanos
  labels:
    app: thanos-store-gateway
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 10902
      targetPort: 10902
      protocol: TCP
    - name: grpc
      port: 10901
      targetPort: 10901
      protocol: TCP
  selector:
    app: thanos-store-gateway
Apply Store Gateway configuration
Deploy all Store Gateway components to your cluster.
kubectl apply -f thanos-store-pvc.yaml
kubectl apply -f thanos-store.yaml
kubectl apply -f thanos-store-service.yaml
Configure Thanos Query Frontend
Deploy Query Frontend for query caching, splitting, and retry logic to improve query performance and reliability.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query-frontend
  namespace: thanos
  labels:
    app: thanos-query-frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query-frontend
  template:
    metadata:
      labels:
        app: thanos-query-frontend
    spec:
      containers:
        - name: thanos-query-frontend
          image: quay.io/thanos/thanos:v0.32.5
          args:
            - query-frontend
            - --log.level=info
            - --query-frontend.downstream-url=http://thanos-query:9090
            - --query-range.split-interval=24h
            - --query-range.max-retries-per-request=3
            - --query-frontend.log-queries-longer-than=10s
            - --cache-compression-type=snappy
            - |
              --query-range.response-cache-config=
              type: REDIS
              config:
                # The Bitnami chart installed as release "redis" exposes the master as redis-master
                addr: redis-master:6379
                db: 0
                dial_timeout: 5s
                read_timeout: 3s
                write_timeout: 3s
                max_get_multi_concurrency: 100
                get_multi_batch_size: 100
                max_set_multi_concurrency: 100
                set_multi_batch_size: 100
          ports:
            - name: http
              containerPort: 10902
          resources:
            requests:
              memory: 512Mi
              cpu: 500m
            limits:
              memory: 1Gi
              cpu: 1000m
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 10902
            initialDelaySeconds: 30
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 10902
            initialDelaySeconds: 10
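The --query-range.split-interval=24h flag makes Query Frontend break a long range query into day-sized sub-queries that can be executed and cached independently. The arithmetic can be sketched as below. This is an illustration of the idea under simplifying assumptions (Unix timestamps in seconds, boundaries aligned to multiples of the interval), and the function name split_range is hypothetical; the exact boundaries Thanos picks may differ.

```shell
# Print "start end" pairs covering [start, end), cut at multiples of interval.
# All three arguments are integers in seconds.
split_range() {
  local start="$1" end="$2" interval="$3"
  local cur="$start" next
  while (( cur < end )); do
    # Next boundary that is a multiple of the interval
    next=$(( (cur / interval + 1) * interval ))
    if (( next > end )); then
      next="$end"
    fi
    printf '%s %s\n' "$cur" "$next"
    cur="$next"
  done
}

# Example usage: a range of a little over two days with a 24h (86400 s) interval
split_range 0 200000 86400
# prints:
#   0 86400
#   86400 172800
#   172800 200000
```

Because the sub-ranges fall on fixed boundaries, repeated dashboard queries hit the same cache keys for every fully elapsed day and only the most recent partial range has to be recomputed.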
Deploy Redis for query caching
Deploy Redis to cache query results and improve Thanos Query Frontend performance.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis \
--namespace thanos \
--set auth.enabled=false \
--set replica.replicaCount=1 \
--set master.persistence.size=10Gi
Create Query Frontend service and deploy
Apply the Query Frontend configuration and create a service for external access.
apiVersion: v1
kind: Service
metadata:
  name: thanos-query-frontend
  namespace: thanos
  labels:
    app: thanos-query-frontend
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 9090
      targetPort: 10902
      protocol: TCP
  selector:
    app: thanos-query-frontend
kubectl apply -f thanos-query-frontend.yaml
kubectl apply -f thanos-query-frontend-service.yaml
Configure Thanos Compactor
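Deploy Thanos Compactor to downsample and compact metrics data in object storage, reducing storage costs and query times. The retention flags below delete raw samples after 7 days, 5-minute downsamples after 30 days, and 1-hour downsamples after 180 days; set a flag to 0d to keep that resolution indefinitely.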
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-compactor
  namespace: thanos
  labels:
    app: thanos-compactor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
        - name: thanos-compactor
          image: quay.io/thanos/thanos:v0.32.5
          args:
            - compact
            # Keep running and wait for new blocks instead of exiting after one pass
            - --wait
            - --log.level=info
            - --data-dir=/var/thanos/compactor
            - --objstore.config-file=/etc/thanos/config.yaml
            - --consistency-delay=30m
            - --retention.resolution-raw=7d
            - --retention.resolution-5m=30d
            - --retention.resolution-1h=180d
            - --compact.concurrency=1
            - --downsample.concurrency=1
          volumeMounts:
            - name: storage-config
              mountPath: /etc/thanos
              readOnly: true
            - name: data
              mountPath: /var/thanos/compactor
          resources:
            requests:
              memory: 1Gi
              cpu: 1000m
            limits:
              memory: 4Gi
              cpu: 2000m
      volumes:
        - name: storage-config
          secret:
            secretName: thanos-storage-config
        - name: data
          persistentVolumeClaim:
            claimName: thanos-compactor-data
Deploy Compactor with storage
Create persistent storage for the Compactor and apply the configuration.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thanos-compactor-data
  namespace: thanos
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 200Gi
kubectl apply -f thanos-compactor-pvc.yaml
kubectl apply -f thanos-compactor.yaml
Configure external access with Ingress
Create an Ingress resource to expose Thanos Query Frontend with SSL termination for external access.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-query-ingress
  namespace: thanos
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  tls:
    - hosts:
        - thanos.example.com
      secretName: thanos-tls
  rules:
    - host: thanos.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: thanos-query-frontend
                port:
                  number: 9090
kubectl apply -f thanos-ingress.yaml
Verify your setup
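Check the status of all Thanos components and verify metrics federation is working correctly. The curl commands use in-cluster DNS names, so run them from a pod inside the cluster or port-forward the service first; they also require jq.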
kubectl get pods -n thanos
kubectl get pods -n monitoring
kubectl logs -n thanos deployment/thanos-query --tail=50
kubectl logs -n thanos deployment/thanos-store-gateway --tail=50
curl -s http://thanos-query-frontend.thanos:9090/api/v1/stores | jq
curl -s "http://thanos-query-frontend.thanos:9090/api/v1/query?query=up" | jq
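To confirm that every cluster is actually represented in the federated view, the response to the up query can be reduced to the distinct values of the cluster external label. The helper below is a small sketch using only grep, cut, and sort. The function name clusters_in_response is hypothetical, and it assumes the raw, compact JSON returned by the API (not output that has been pretty-printed by jq).

```shell
# Read a Prometheus-style query response on stdin and print each distinct
# value of the "cluster" label, one per line.
clusters_in_response() {
  grep -o '"cluster":"[^"]*"' | cut -d'"' -f4 | sort -u
}

# Example usage (not executed here; needs in-cluster DNS or a port-forward):
#   curl -s "http://thanos-query-frontend.thanos:9090/api/v1/query?query=up" \
#     | clusters_in_response
```

If a cluster installed earlier is missing from the output, its sidecar is not reachable from Thanos Query or its external labels were not set.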
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Store Gateway not showing data | Object storage credentials incorrect | Verify secret with kubectl get secret thanos-storage-config -n thanos -o yaml |
| Query returns no data from some clusters | Sidecar not uploading to storage | Check sidecar logs: kubectl logs -n monitoring <prometheus-pod> -c thanos-sidecar |
| High memory usage on Store Gateway | Index cache too large | Reduce --index-cache-size parameter |
| Slow queries on historical data | No downsampling configured | Wait for Compactor to process data or check retention settings |
| Compactor fails with permission errors | Insufficient object storage permissions | Grant read/write/delete permissions to storage bucket |
| Query Frontend cache not working | Redis connection failed | Check Redis deployment: kubectl get pods -n thanos -l app.kubernetes.io/name=redis |
Next steps
- Configure Thanos Ruler for distributed alerting across multiple Prometheus clusters
- Monitor Kubernetes clusters with Prometheus and Grafana for container orchestration insights
- Implement Prometheus federation for multi-cluster monitoring with centralized metrics aggregation
- Configure Prometheus long-term storage with Thanos for unlimited data retention
- Set up Thanos Receiver for remote write scalability with Prometheus integration
Automated install script
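Run this to automate the tooling install, the storage Secret, and the Prometheus deployment with Thanos Sidecar for one cluster. The Query, Store Gateway, Query Frontend, and Compactor manifests above still need to be applied separately.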
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
# Default values
CLUSTER_NAME="${1:-cluster-1}"
REGION="${2:-us-east-1}"
STORAGE_ENDPOINT="${3:-minio.example.com:9000}"
STORAGE_BUCKET="${4:-thanos-metrics}"
ACCESS_KEY="${5:-}"
SECRET_KEY="${6:-}"
usage() {
    echo "Usage: $0 [cluster_name] [region] [storage_endpoint] [bucket] [access_key] [secret_key]"
    echo "Example: $0 cluster-1 us-east-1 s3.amazonaws.com thanos-metrics AKIAIOSFODNN7EXAMPLE wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    exit 1
}
cleanup() {
    echo -e "${RED}[ERROR]${NC} Installation failed. Cleaning up..."
    rm -f /tmp/kubectl /tmp/helm.tar.gz
    rm -rf /tmp/linux-amd64
}
trap cleanup ERR
log_info() {
    echo -e "${BLUE}[INFO]${NC} $1"
}
log_success() {
    echo -e "${GREEN}[SUCCESS]${NC} $1"
}
log_warning() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}
log_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}
# Check if running as root or with sudo
if [[ $EUID -ne 0 ]]; then
    log_error "This script must be run as root or with sudo"
    exit 1
fi
# Auto-detect distribution
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian)
            PKG_MGR="apt"
            PKG_UPDATE="apt update && apt upgrade -y"
            PKG_INSTALL="apt install -y"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora)
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y"
            PKG_INSTALL="dnf install -y"
            ;;
        amzn)
            PKG_MGR="yum"
            PKG_UPDATE="yum update -y"
            PKG_INSTALL="yum install -y"
            ;;
        *)
            log_error "Unsupported distribution: $ID"
            exit 1
            ;;
    esac
else
    log_error "Cannot detect distribution"
    exit 1
fi
echo "[1/8] Updating system packages..."
# PKG_UPDATE may contain "&&", which plain expansion would pass to the
# package manager as a literal argument; eval lets the shell parse it.
eval "$PKG_UPDATE"
echo "[2/8] Installing required dependencies..."
$PKG_INSTALL wget curl unzip
echo "[3/8] Installing kubectl..."
KUBECTL_VERSION=$(curl -L -s https://dl.k8s.io/release/stable.txt)
curl -LO "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
rm -f kubectl
echo "[4/8] Installing Helm..."
curl -fsSL https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz -o /tmp/helm.tar.gz
cd /tmp
tar -zxf helm.tar.gz
install -o root -g root -m 0755 linux-amd64/helm /usr/local/bin/helm
rm -rf /tmp/helm.tar.gz /tmp/linux-amd64
echo "[5/8] Creating Thanos namespace..."
kubectl create namespace thanos --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
echo "[6/8] Creating storage configuration..."
if [[ -z "$ACCESS_KEY" ]] || [[ -z "$SECRET_KEY" ]]; then
    log_warning "Access key and secret key not provided. Using example credentials."
    log_warning "Please update the thanos-storage-config secret with your actual credentials."
    ACCESS_KEY="thanos-access-key"
    SECRET_KEY="thanos-secret-key"
fi
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: thanos-storage-config
  namespace: thanos
type: Opaque
stringData:
  config.yaml: |
    type: s3
    config:
      bucket: "${STORAGE_BUCKET}"
      endpoint: "${STORAGE_ENDPOINT}"
      access_key: "${ACCESS_KEY}"
      secret_key: "${SECRET_KEY}"
      insecure: false
      signature_version2: false
      put_user_metadata: {}
      http_config:
        idle_conn_timeout: 90s
        response_header_timeout: 2m
      trace:
        enable: false
      part_size: 134217728
EOF
echo "[7/8] Creating Prometheus values configuration..."
cat <<EOF > /tmp/prometheus-values.yaml
prometheus:
  prometheusSpec:
    thanos:
      image: quay.io/thanos/thanos:v0.32.5
      version: v0.32.5
      objectStorageConfig:
        secretName: thanos-storage-config
        secretKey: config.yaml
      baseImage: quay.io/thanos/thanos
      resources:
        requests:
          memory: 512Mi
          cpu: 500m
        limits:
          memory: 1Gi
          cpu: 1000m
    retention: 2h
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleNamespaceSelector: {}
    ruleSelectorNilUsesHelmValues: false
    externalLabels:
      cluster: "${CLUSTER_NAME}"
      region: "${REGION}"
EOF
echo "[8/8] Installing Prometheus with Thanos sidecar..."
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install prometheus-${CLUSTER_NAME} prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values /tmp/prometheus-values.yaml \
--wait --timeout=10m
log_success "Thanos multi-cluster federation setup completed!"
echo ""
log_info "Verification steps:"
echo "1. Check Prometheus pods: kubectl get pods -n monitoring"
echo "2. Check Thanos namespace: kubectl get all -n thanos"
echo "3. Verify sidecar logs: kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus -c thanos-sidecar"
echo ""
log_info "Next steps:"
echo "1. Deploy Thanos Query component for unified querying"
echo "2. Configure additional clusters with different cluster labels"
echo "3. Set up Grafana dashboards pointing to Thanos Query endpoint"
echo "4. Update storage credentials in thanos-storage-config secret if needed"
rm -f /tmp/prometheus-values.yaml
Review the script before running. It checks for root privileges, so execute it with: sudo bash install.sh