Thanos Multi-Cluster Federation Setup

Set up Thanos components across multiple Kubernetes clusters to enable global metrics federation, long-term storage, and unified querying of Prometheus data, with high availability and retention limited only by your object storage.

Prerequisites

Multiple Kubernetes clusters
Object storage (MinIO/S3)
kubectl and Helm installed
Ingress controller configured

What this solves

Thanos multi-cluster federation addresses the limits of a single Prometheus deployment by aggregating metrics from multiple clusters behind one query endpoint. Metrics are kept in object storage, so retention is bounded by bucket capacity and the Compactor's retention flags rather than by local disk. Query workloads scale horizontally, and dashboards can span your entire infrastructure. Use this setup when you monitor multiple Kubernetes clusters, need metrics retention beyond local disk limits, or want to reduce storage costs while maintaining query performance.

Step-by-step configuration

Install required dependencies

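Update your system and install the download tools used in the following steps, plus jq for the verification step. kubectl and Helm are installed in the next step. Run the block that matches your distribution.

# Debian/Ubuntu
sudo apt update && sudo apt upgrade -y
sudo apt install -y wget curl unzip jq

# RHEL, AlmaLinux, Rocky Linux, Fedora
sudo dnf update -y
sudo dnf install -y wget curl unzip jq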

Install kubectl and helm

Install the Kubernetes command-line tool and Helm package manager for deploying Thanos components.

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

curl -fsSL https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz -o helm.tar.gz
tar -zxvf helm.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
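
The kubectl download above is installed without an integrity check. As a minimal sketch, the helper below compares a file against a SHA-256 digest before you install it; the Kubernetes release server publishes a kubectl.sha256 file next to each binary. The function name verify_sha256 is hypothetical, and the sketch assumes sha256sum from GNU coreutils is available.

```shell
#!/usr/bin/env bash
# verify_sha256 FILE DIGEST_FILE
# Hypothetical helper: succeeds only if FILE's SHA-256 digest matches the hex
# digest stored in DIGEST_FILE (the format used by the published kubectl.sha256).
verify_sha256() {
    local file="$1" expected actual
    expected="$(tr -d '[:space:]' < "$2")"
    actual="$(sha256sum "$file" | cut -d' ' -f1)"
    if [ "$expected" = "$actual" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Usage with the download from this step (commented out: needs network access):
# VERSION="$(curl -L -s https://dl.k8s.io/release/stable.txt)"
# curl -LO "https://dl.k8s.io/release/${VERSION}/bin/linux/amd64/kubectl.sha256"
# verify_sha256 kubectl kubectl.sha256 && sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
```

Running the check before the install command means a truncated or tampered download never reaches /usr/local/bin.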

Create object storage configuration

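Configure MinIO or S3-compatible storage for Thanos long-term storage. The bucket and its access credentials must already exist; this step only stores their details in a Kubernetes secret. The Thanos sidecar reads the secret from the namespace Prometheus runs in, so create it in both the thanos and monitoring namespaces.

kubectl create namespace thanos
kubectl create namespace monitoring

cat > objstore.yaml <<'EOF'
type: s3
config:
  bucket: "thanos-metrics"
  endpoint: "minio.example.com:9000"
  access_key: "thanos-access-key"
  secret_key: "thanos-secret-key"
  insecure: false
  signature_version2: false
  encrypt_sse: false
  put_user_metadata: {}
  http_config:
    idle_conn_timeout: 90s
    response_header_timeout: 2m
  trace:
    enable: false
  part_size: 134217728
EOF

kubectl create secret generic thanos-storage-config -n thanos --from-file=config.yaml=objstore.yaml
kubectl create secret generic thanos-storage-config -n monitoring --from-file=config.yaml=objstore.yaml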

Note: Replace the MinIO endpoint, bucket name, and credentials with your actual object storage configuration. For AWS S3, use s3.amazonaws.com as the endpoint.
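
To avoid hard-coding real credentials in a command or file, the configuration can be rendered from environment variables. The sketch below is one way to do that; the variable names (S3_BUCKET, S3_ENDPOINT, S3_ACCESS_KEY, S3_SECRET_KEY, S3_INSECURE) and both function names are hypothetical, and it renders only the core fields of the configuration. It assumes Prometheus is installed into the monitoring namespace, as in the following steps, which is why the secret is created there as well as in thanos.

```shell
#!/usr/bin/env bash
# render_objstore_config: print a Thanos S3 object storage config built from
# environment variables. The variable names are hypothetical, chosen for this sketch.
render_objstore_config() {
    if [ -z "${S3_ACCESS_KEY:-}" ] || [ -z "${S3_SECRET_KEY:-}" ]; then
        echo "S3_ACCESS_KEY and S3_SECRET_KEY must be set" >&2
        return 1
    fi
    cat <<EOF
type: s3
config:
  bucket: "${S3_BUCKET:-thanos-metrics}"
  endpoint: "${S3_ENDPOINT:-minio.example.com:9000}"
  access_key: "${S3_ACCESS_KEY}"
  secret_key: "${S3_SECRET_KEY}"
  insecure: ${S3_INSECURE:-false}
EOF
}

# create_objstore_secret NAMESPACE...: create or update the secret in each namespace.
create_objstore_secret() {
    local ns tmp
    tmp="$(mktemp)"
    render_objstore_config > "$tmp" || { rm -f "$tmp"; return 1; }
    for ns in "$@"; do
        # --dry-run=client piped into apply makes the step safe to re-run.
        kubectl create secret generic thanos-storage-config -n "$ns" \
            --from-file=config.yaml="$tmp" --dry-run=client -o yaml |
            kubectl apply -f -
    done
    rm -f "$tmp"
}

# Usage (commented out: needs cluster access):
# S3_ACCESS_KEY=... S3_SECRET_KEY=... create_objstore_secret thanos monitoring
```

Set S3_INSECURE=true only when the endpoint serves plain HTTP, which is common for a test MinIO instance.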

Configure Prometheus with Thanos Sidecar

Deploy Prometheus with Thanos Sidecar in each cluster. The sidecar uploads completed metric blocks to object storage and exposes recent data to Thanos Query over gRPC. Save the following Helm values as prometheus-values.yaml.

prometheus:
  prometheusSpec:
    thanos:
      image: quay.io/thanos/thanos:v0.32.5
      version: v0.32.5
      objectStorageConfig:
        secretName: thanos-storage-config
        secretKey: config.yaml
      resources:
        requests:
          memory: 512Mi
          cpu: 500m
        limits:
          memory: 1Gi
          cpu: 1000m
    retention: 6h  # Thanos recommends at least 3x the 2h block duration so data survives upload outages
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleNamespaceSelector: {}
    ruleSelectorNilUsesHelmValues: false

Deploy Prometheus with Thanos Sidecar

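Install Prometheus using Helm with the Thanos sidecar configuration. Run the install once per cluster, against that cluster's kubectl context, and change the release name and the cluster and region external labels each time. The external labels must be unique per cluster so that Thanos can tell the sources apart.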

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus-cluster-1 prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml \
  --set prometheus.prometheusSpec.externalLabels.cluster="cluster-1" \
  --set prometheus.prometheusSpec.externalLabels.region="us-east-1"
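
Because the install is repeated for every cluster, it helps to generate the commands from a single list. The sketch below prints one Helm command per cluster instead of running it, so the result can be reviewed first. The kubectl context names (ctx-east, ctx-west), the second cluster's labels, and the function name are hypothetical; `helm upgrade --install` is used in place of `helm install` so the command can be re-run.

```shell
#!/usr/bin/env bash
# Each line: <kubectl context> <cluster label> <region label>.
# The context names are hypothetical; list yours with `kubectl config get-contexts`.
CLUSTERS="ctx-east cluster-1 us-east-1
ctx-west cluster-2 us-west-2"

# print_install_commands: emit one `helm upgrade --install` command per cluster.
print_install_commands() {
    printf '%s\n' "$CLUSTERS" | while read -r context cluster region; do
        printf 'helm upgrade --install prometheus-%s prometheus-community/kube-prometheus-stack' "$cluster"
        printf ' --kube-context %s --namespace monitoring --create-namespace' "$context"
        printf ' --values prometheus-values.yaml'
        printf ' --set prometheus.prometheusSpec.externalLabels.cluster=%s' "$cluster"
        printf ' --set prometheus.prometheusSpec.externalLabels.region=%s\n' "$region"
    done
}

print_install_commands
```

Once the printed commands look right, they can be piped to `bash` or pasted into a terminal.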

Configure Thanos Query component

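Deploy Thanos Query to aggregate metrics from multiple clusters and provide a unified query interface. Save the manifest as thanos-query.yaml. The sidecar addresses in it are placeholders: each one must resolve from the Query pod to a sidecar gRPC port (10901). Check the actual service names with kubectl get svc -n monitoring, and expose sidecars that run in other clusters through a load balancer or a gRPC-capable ingress.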

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: thanos
  labels:
    app: thanos-query
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - name: thanos-query
        image: quay.io/thanos/thanos:v0.32.5
        args:
        - query
        - --log.level=info
        - --query.replica-label=replica
        - --query.replica-label=prometheus_replica
        - --endpoint=thanos-store-gateway:10901
        - --endpoint=prometheus-cluster-1-prometheus.monitoring:10901
        - --endpoint=prometheus-cluster-2-prometheus.monitoring:10901
        - --query.auto-downsampling
        - --query.partial-response
        - --query.max-concurrent=20
        - --query.timeout=2m
        - --query.lookback-delta=15m
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        resources:
          requests:
            memory: 512Mi
            cpu: 500m
          limits:
            memory: 2Gi
            cpu: 1000m
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 10902
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 10902
          initialDelaySeconds: 10

Deploy Thanos Query service

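Create a ClusterIP service that exposes Thanos Query inside the cluster. External clients reach it through the Query Frontend and Ingress configured in later steps. Save the manifest as thanos-query-service.yaml.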

apiVersion: v1
kind: Service
metadata:
  name: thanos-query
  namespace: thanos
  labels:
    app: thanos-query
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 9090
    targetPort: 10902
    protocol: TCP
  - name: grpc
    port: 10901
    targetPort: 10901
    protocol: TCP
  selector:
    app: thanos-query

Apply Thanos Query configuration

Deploy the Thanos Query components to your Kubernetes cluster.

kubectl apply -f thanos-query.yaml
kubectl apply -f thanos-query-service.yaml
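
kubectl apply returns as soon as the objects are accepted, not when the pods are ready. A minimal sketch of waiting for a rollout before moving on is shown below; the function name and the KUBECTL and ROLLOUT_TIMEOUT overrides are hypothetical conveniences, while `kubectl rollout status` is the standard command.

```shell
#!/usr/bin/env bash
# wait_for_rollout NAMESPACE DEPLOYMENT...
# Blocks until every listed Deployment has finished rolling out, and returns
# non-zero if any of them fails to do so within the timeout.
# KUBECTL and ROLLOUT_TIMEOUT are hypothetical overrides introduced for this sketch.
wait_for_rollout() {
    local ns="$1" name
    shift
    for name in "$@"; do
        ${KUBECTL:-kubectl} rollout status "deployment/${name}" \
            -n "$ns" --timeout="${ROLLOUT_TIMEOUT:-120s}" || return 1
    done
}

# Usage (commented out: needs cluster access):
# wait_for_rollout thanos thanos-query
```

Setting KUBECTL=echo prints the commands instead of running them. The same helper works for the other Deployments in this guide.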

Configure Thanos Store Gateway

Deploy Thanos Store Gateway to serve historical metrics data from object storage, with caching for improved query performance. Save the manifest as thanos-store.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-store-gateway
  namespace: thanos
  labels:
    app: thanos-store-gateway
spec:
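  replicas: 1  # one ReadWriteOnce PVC cannot be shared by pods on different nodes; use a StatefulSet to scale out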
  selector:
    matchLabels:
      app: thanos-store-gateway
  template:
    metadata:
      labels:
        app: thanos-store-gateway
    spec:
      containers:
      - name: thanos-store
        image: quay.io/thanos/thanos:v0.32.5
        args:
        - store
        - --log.level=info
        - --data-dir=/var/thanos/store
        - --objstore.config-file=/etc/thanos/config.yaml
        - --index-cache-size=2GB
        - --chunk-pool-size=2GB
        - --store.grpc.series-max-concurrency=20
        - --sync-block-duration=3m
        - --block-sync-concurrency=20
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        volumeMounts:
        - name: storage-config
          mountPath: /etc/thanos
          readOnly: true
        - name: data
          mountPath: /var/thanos/store
        resources:
          requests:
            memory: 2Gi
            cpu: 1000m
          limits:
            memory: 8Gi
            cpu: 2000m
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 10902
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 10902
          initialDelaySeconds: 10
      volumes:
      - name: storage-config
        secret:
          secretName: thanos-storage-config
      - name: data
        persistentVolumeClaim:
          claimName: thanos-store-data

Create persistent volume for Store Gateway

Create a persistent volume claim for Thanos Store Gateway cache and metadata storage. Save the manifest as thanos-store-pvc.yaml.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thanos-store-data
  namespace: thanos
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi

Deploy Store Gateway service

Create a service for Thanos Store Gateway so that Thanos Query can reach it over gRPC. Save the manifest as thanos-store-service.yaml.

apiVersion: v1
kind: Service
metadata:
  name: thanos-store-gateway
  namespace: thanos
  labels:
    app: thanos-store-gateway
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 10902
    targetPort: 10902
    protocol: TCP
  - name: grpc
    port: 10901
    targetPort: 10901
    protocol: TCP
  selector:
    app: thanos-store-gateway

Apply Store Gateway configuration

Deploy all Store Gateway components to your cluster.

kubectl apply -f thanos-store-pvc.yaml
kubectl apply -f thanos-store.yaml
kubectl apply -f thanos-store-service.yaml

Configure Thanos Query Frontend

Deploy Query Frontend for query caching, splitting, and retry logic to improve query performance and reliability. Save the manifest as thanos-query-frontend.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query-frontend
  namespace: thanos
  labels:
    app: thanos-query-frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query-frontend
  template:
    metadata:
      labels:
        app: thanos-query-frontend
    spec:
      containers:
      - name: thanos-query-frontend
        image: quay.io/thanos/thanos:v0.32.5
        args:
        - query-frontend
        - --log.level=info
        - --query-frontend.downstream-url=http://thanos-query:9090
        - --query-range.split-interval=24h
        - --query-range.max-retries-per-request=3
        - --query-frontend.log-queries-longer-than=10s
        - --cache-compression-type=snappy
        - |
          --query-range.response-cache-config=
          type: REDIS
          config:
            addr: redis-master:6379
            db: 0
            dial_timeout: 5s
            read_timeout: 3s
            write_timeout: 3s
            max_get_multi_concurrency: 100
            get_multi_batch_size: 100
            max_set_multi_concurrency: 100
            set_multi_batch_size: 100
        ports:
        - name: http
          containerPort: 10902
        resources:
          requests:
            memory: 512Mi
            cpu: 500m
          limits:
            memory: 1Gi
            cpu: 1000m
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 10902
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 10902
          initialDelaySeconds: 10
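
With --query-range.split-interval=24h, the Query Frontend breaks a long range query into sub-queries of at most one day each, which can then be cached and retried independently. The arithmetic below is a rough sketch of how many sub-queries a dashboard range produces. The function name is hypothetical, and the result is a lower bound, because sub-queries are aligned to interval boundaries and a range that straddles a boundary can produce one more.

```shell
#!/usr/bin/env bash
# subquery_count RANGE_HOURS SPLIT_HOURS
# Lower bound on the number of sub-queries for a range query: ceil(range / split).
subquery_count() {
    local range_h="$1" split_h="$2"
    echo $(( (range_h + split_h - 1) / split_h ))
}

subquery_count 720 24   # 30-day range with a 24h split interval: prints 30
subquery_count 6 24     # 6-hour range: prints 1
```

A smaller split interval increases parallelism and cache granularity, at the cost of more requests to Thanos Query.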

Deploy Redis for query caching

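Deploy Redis to cache query results and improve Thanos Query Frontend performance. With the release name redis, the chart exposes the master as the redis-master service on port 6379. Authentication is disabled here for simplicity; enable it outside of test environments.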

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis \
  --namespace thanos \
  --set auth.enabled=false \
  --set replica.replicaCount=1 \
  --set master.persistence.size=10Gi

Create Query Frontend service and deploy

Create a service for external access to the Query Frontend, saved as thanos-query-frontend-service.yaml, then apply it together with the Deployment.

apiVersion: v1
kind: Service
metadata:
  name: thanos-query-frontend
  namespace: thanos
  labels:
    app: thanos-query-frontend
spec:
  type: LoadBalancer
  ports:
  - name: http
    port: 9090
    targetPort: 10902
    protocol: TCP
  selector:
    app: thanos-query-frontend

kubectl apply -f thanos-query-frontend.yaml
kubectl apply -f thanos-query-frontend-service.yaml

Configure Thanos Compactor

Deploy Thanos Compactor to downsample and compact metrics data in object storage, reducing storage costs and query times. Run exactly one Compactor per bucket. Save the manifest as thanos-compactor.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-compactor
  namespace: thanos
  labels:
    app: thanos-compactor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
      - name: thanos-compactor
        image: quay.io/thanos/thanos:v0.32.5
        args:
        - compact
        - --wait  # keep running after each pass instead of exiting
        - --log.level=info
        - --data-dir=/var/thanos/compactor
        - --objstore.config-file=/etc/thanos/config.yaml
        - --consistency-delay=30m
        - --retention.resolution-raw=7d
        - --retention.resolution-5m=30d
        - --retention.resolution-1h=180d
        - --compact.concurrency=1
        - --downsample.concurrency=1
        volumeMounts:
        - name: storage-config
          mountPath: /etc/thanos
          readOnly: true
        - name: data
          mountPath: /var/thanos/compactor
        resources:
          requests:
            memory: 1Gi
            cpu: 1000m
          limits:
            memory: 4Gi
            cpu: 2000m
      volumes:
      - name: storage-config
        secret:
          secretName: thanos-storage-config
      - name: data
        persistentVolumeClaim:
          claimName: thanos-compactor-data

Deploy Compactor with storage

Create persistent storage for the Compactor, saved as thanos-compactor-pvc.yaml, and apply both manifests.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thanos-compactor-data
  namespace: thanos
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 200Gi

kubectl apply -f thanos-compactor-pvc.yaml
kubectl apply -f thanos-compactor.yaml
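
The three --retention.resolution-* flags give each resolution its own lifetime in the bucket, so the resolution available to a query depends on how old the data is, and nothing older than 180 days is kept. The sketch below works through that arithmetic with the values from the Compactor manifest. The function name is hypothetical, and it assumes the downsampled blocks already exist for the age in question; the Compactor only creates them once blocks are old enough, so very recent data is available at raw resolution only.

```shell
#!/usr/bin/env bash
# Retention windows in days, copied from the --retention.resolution-* flags.
RAW_DAYS=7
FIVE_MIN_DAYS=30
ONE_HOUR_DAYS=180

# retained_resolutions AGE_DAYS
# Print the resolutions whose retention window still covers data of that age.
retained_resolutions() {
    local age="$1" out=""
    if [ "$age" -le "$RAW_DAYS" ]; then out="$out raw"; fi
    if [ "$age" -le "$FIVE_MIN_DAYS" ]; then out="$out 5m"; fi
    if [ "$age" -le "$ONE_HOUR_DAYS" ]; then out="$out 1h"; fi
    # Strip the leading separator; report "none" when every window has expired.
    echo "${out:- none}" | sed 's/^ //'
}

retained_resolutions 3     # prints: raw 5m 1h
retained_resolutions 20    # prints: 5m 1h
retained_resolutions 100   # prints: 1h
retained_resolutions 200   # prints: none
```

Dashboards that look back more than 30 days are therefore served from 1h samples only, which is worth keeping in mind when choosing panel step sizes.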

Configure external access with Ingress

Create an Ingress resource to expose Thanos Query Frontend with TLS termination for external access. Save the manifest as thanos-ingress.yaml.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-query-ingress
  namespace: thanos
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - thanos.example.com
    secretName: thanos-tls
  rules:
  - host: thanos.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: thanos-query-frontend
            port:
              number: 9090

kubectl apply -f thanos-ingress.yaml

Verify your setup

Check the status of all Thanos components and verify metrics federation is working correctly.

kubectl get pods -n thanos
kubectl get pods -n monitoring

kubectl logs -n thanos deployment/thanos-query --tail=50
kubectl logs -n thanos deployment/thanos-store-gateway --tail=50

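# The service DNS name only resolves inside the cluster; from a workstation, forward the port first.
kubectl port-forward -n thanos svc/thanos-query-frontend 9090:9090 &
curl -s http://localhost:9090/api/v1/stores | jq
curl -s "http://localhost:9090/api/v1/query?query=up" | jq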

Note: The stores endpoint should show all connected Prometheus sidecars and Store Gateways. The query endpoint should return metrics from all federated clusters.
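
To check that every cluster is actually contributing data, you can compare the values of the cluster external label against the clusters you expect. The sketch below does that through the Prometheus-compatible label values API. The function names and the QUERY_URL variable are hypothetical, and the default URL assumes a local port-forward to the Query Frontend service on port 9090.

```shell
#!/usr/bin/env bash
# missing_clusters EXPECTED...
# Reads cluster label values (one per line) on stdin and prints every expected
# cluster that is absent. Prints nothing when all clusters are present.
missing_clusters() {
    local found expected
    found="$(cat)"
    for expected in "$@"; do
        printf '%s\n' "$found" | grep -qxF -- "$expected" || echo "$expected"
    done
}

# fetch_cluster_labels: list the values of the "cluster" label known to Thanos.
# QUERY_URL is an assumption; point it at any reachable Thanos query endpoint.
fetch_cluster_labels() {
    curl -fsS "${QUERY_URL:-http://localhost:9090}/api/v1/label/cluster/values" |
        jq -r '.data[]'
}

# Usage (commented out: needs a reachable endpoint):
# kubectl port-forward -n thanos svc/thanos-query-frontend 9090:9090 &
# fetch_cluster_labels | missing_clusters cluster-1 cluster-2
```

An empty result means every expected cluster has reported at least one series; any name printed points at a sidecar that is unreachable or mislabeled.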

Common issues

Symptom	Cause	Fix
Store Gateway not showing data	Object storage credentials incorrect	Verify secret with `kubectl get secret thanos-storage-config -n thanos -o yaml`
Query returns no data from some clusters	Sidecar not uploading to storage, or sidecar endpoint unreachable from Thanos Query	Check sidecar logs: `kubectl logs -n monitoring <prometheus-pod> -c thanos-sidecar`
High memory usage on Store Gateway	Index cache too large	Reduce `--index-cache-size` parameter
Slow queries on historical data	No downsampling configured	Wait for Compactor to process data or check retention settings
Compactor fails with permission errors	Insufficient object storage permissions	Grant read/write/delete permissions to storage bucket
Query Frontend cache not working	Redis connection failed	Check Redis deployment: `kubectl get pods -n thanos -l app.kubernetes.io/name=redis`

Automated install script

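The script below automates the per-cluster part of the setup: tool installation, namespaces, the object storage secret, and Prometheus with the Thanos sidecar. Thanos Query, Store Gateway, Query Frontend, and Compactor still have to be applied from the manifests above.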

install.sh

#!/usr/bin/env bash

set -euo pipefail

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

# Default values
CLUSTER_NAME="${1:-cluster-1}"
REGION="${2:-us-east-1}"
STORAGE_ENDPOINT="${3:-minio.example.com:9000}"
STORAGE_BUCKET="${4:-thanos-metrics}"
ACCESS_KEY="${5:-}"
SECRET_KEY="${6:-}"

usage() {
    echo "Usage: $0 [cluster_name] [region] [storage_endpoint] [bucket] [access_key] [secret_key]"
    echo "Example: $0 cluster-1 us-east-1 s3.amazonaws.com thanos-metrics AKIAIOSFODNN7EXAMPLE wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    exit 1
}

cleanup() {
    echo -e "${RED}[ERROR]${NC} Installation failed. Cleaning up..."
    rm -f /tmp/kubectl /tmp/helm.tar.gz
    rm -rf /tmp/linux-amd64
}

trap cleanup ERR

log_info() {
    echo -e "${BLUE}[INFO]${NC} $1"
}

log_success() {
    echo -e "${GREEN}[SUCCESS]${NC} $1"
}

log_warning() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}

log_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}

# Check if running as root or with sudo
if [[ $EUID -ne 0 ]]; then
    log_error "This script must be run as root or with sudo"
    exit 1
fi

# Auto-detect distribution
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian) 
            PKG_MGR="apt"
            PKG_UPDATE="apt update && apt upgrade -y"
            PKG_INSTALL="apt install -y"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora) 
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y"
            PKG_INSTALL="dnf install -y"
            ;;
        amzn) 
            PKG_MGR="yum"
            PKG_UPDATE="yum update -y"
            PKG_INSTALL="yum install -y"
            ;;
        *) 
            log_error "Unsupported distribution: $ID"
            exit 1
            ;;
    esac
else
    log_error "Cannot detect distribution"
    exit 1
fi

echo "[1/8] Updating system packages..."
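# eval is required: PKG_UPDATE may contain "&&", which plain expansion would pass to apt as an argument
eval "$PKG_UPDATE"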

echo "[2/8] Installing required dependencies..."
$PKG_INSTALL wget curl unzip

echo "[3/8] Installing kubectl..."
KUBECTL_VERSION=$(curl -fsSL https://dl.k8s.io/release/stable.txt)
# Download to /tmp so the cleanup trap can remove a partial file
curl -fsSLo /tmp/kubectl "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
install -o root -g root -m 0755 /tmp/kubectl /usr/local/bin/kubectl
rm -f /tmp/kubectl

echo "[4/8] Installing Helm..."
curl -fsSL https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz -o /tmp/helm.tar.gz
cd /tmp
tar -zxf helm.tar.gz
install -o root -g root -m 0755 linux-amd64/helm /usr/local/bin/helm
rm -rf /tmp/helm.tar.gz /tmp/linux-amd64

echo "[5/8] Creating Thanos namespace..."
kubectl create namespace thanos --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -

echo "[6/8] Creating storage configuration..."
if [[ -z "$ACCESS_KEY" ]] || [[ -z "$SECRET_KEY" ]]; then
    log_warning "Access key and secret key not provided. Using example credentials."
    log_warning "Please update the thanos-storage-config secret with your actual credentials."
    ACCESS_KEY="thanos-access-key"
    SECRET_KEY="thanos-secret-key"
fi

# The sidecar reads the secret from the namespace Prometheus runs in, so create it in both.
for ns in thanos monitoring; do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: thanos-storage-config
  namespace: ${ns}
type: Opaque
stringData:
  config.yaml: |
    type: s3
    config:
      bucket: "${STORAGE_BUCKET}"
      endpoint: "${STORAGE_ENDPOINT}"
      access_key: "${ACCESS_KEY}"
      secret_key: "${SECRET_KEY}"
      insecure: false
      signature_version2: false
      encrypt_sse: false
      put_user_metadata: {}
      http_config:
        idle_conn_timeout: 90s
        response_header_timeout: 2m
      trace:
        enable: false
      part_size: 134217728
EOF
done

echo "[7/8] Creating Prometheus values configuration..."
cat <<EOF > /tmp/prometheus-values.yaml
prometheus:
  prometheusSpec:
    thanos:
      image: quay.io/thanos/thanos:v0.32.5
      version: v0.32.5
      objectStorageConfig:
        secretName: thanos-storage-config
        secretKey: config.yaml
      resources:
        requests:
          memory: 512Mi
          cpu: 500m
        limits:
          memory: 1Gi
          cpu: 1000m
    retention: 6h  # Thanos recommends at least 3x the 2h block duration so data survives upload outages
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleNamespaceSelector: {}
    ruleSelectorNilUsesHelmValues: false
    externalLabels:
      cluster: "${CLUSTER_NAME}"
      region: "${REGION}"
EOF

echo "[8/8] Installing Prometheus with Thanos sidecar..."
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install "prometheus-${CLUSTER_NAME}" prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values /tmp/prometheus-values.yaml \
  --wait --timeout=10m

log_success "Prometheus with Thanos sidecar installed for cluster ${CLUSTER_NAME}."

echo ""
log_info "Verification steps:"
echo "1. Check Prometheus pods: kubectl get pods -n monitoring"
echo "2. Check Thanos namespace: kubectl get all -n thanos"
echo "3. Verify sidecar logs: kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus -c thanos-sidecar"

echo ""
log_info "Next steps:"
echo "1. Deploy Thanos Query component for unified querying"
echo "2. Configure additional clusters with different cluster labels"
echo "3. Set up Grafana dashboards pointing to Thanos Query endpoint"
echo "4. Update storage credentials in thanos-storage-config secret if needed"

rm -f /tmp/prometheus-values.yaml

Review the script before running. It checks for root privileges, so execute it with: sudo bash install.sh
