Implement Thanos multi-cluster federation for global Prometheus metrics aggregation

Advanced 45 min Apr 13, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up Thanos components across multiple Kubernetes clusters to enable global metrics federation, long-term storage, and unified querying of Prometheus data with high availability and unlimited retention.

Prerequisites

  • Multiple Kubernetes clusters
  • Object storage (MinIO/S3)
  • kubectl and Helm installed
  • Ingress controller configured

What this solves

Thanos multi-cluster federation addresses the limitations of single Prometheus deployments by providing global metrics aggregation across multiple clusters. This setup enables unlimited data retention with object storage, horizontal scaling of query workloads, and centralized monitoring dashboards that span your entire infrastructure. Use this when you need to monitor multiple Kubernetes clusters, require long-term metrics storage beyond local disk limits, or want to reduce storage costs while maintaining query performance.

Step-by-step configuration

Install required dependencies

Update your system and install the download tools used in the following steps. Run only the commands for your distribution.

# Ubuntu 24.04 / Debian 12
sudo apt update && sudo apt upgrade -y
sudo apt install -y wget curl unzip

# AlmaLinux 9 / Rocky Linux 9
sudo dnf update -y
sudo dnf install -y wget curl unzip
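
If the same script has to run on both distribution families, the package manager can be detected instead of hard-coded. The following is a minimal sketch; the function name detect_pkg_manager is illustrative, and it only prints the command it would run so that nothing is installed by accident.

```shell
# Report which supported package manager is available on this host.
detect_pkg_manager() {
  if command -v apt-get >/dev/null 2>&1; then
    echo apt
  elif command -v dnf >/dev/null 2>&1; then
    echo dnf
  else
    echo unknown
  fi
}

# Print the matching install command for review instead of executing it.
case "$(detect_pkg_manager)" in
  apt) echo "sudo apt update && sudo apt install -y wget curl unzip" ;;
  dnf) echo "sudo dnf install -y wget curl unzip" ;;
  *)   echo "No supported package manager found; install wget, curl and unzip manually" ;;
esac
```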

Install kubectl and helm

Install the Kubernetes command-line tool and Helm package manager for deploying Thanos components.

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

curl https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz -o helm.tar.gz
tar -zxvf helm.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
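
Before installing a downloaded binary it is worth checking its checksum. The Kubernetes release server publishes a kubectl.sha256 file next to the kubectl binary (the same URL with .sha256 appended). The sketch below assumes both files were downloaded into the current directory; verify_sha256 is a hypothetical helper name.

```shell
# verify_sha256 FILE EXPECTED_HEX: succeed only if the file's SHA-256 matches.
verify_sha256() {
  actual="$(sha256sum "$1" | awk '{print $1}')"
  [ "$actual" = "$2" ]
}

# Only runs when both files are present in the current directory.
if [ -f kubectl ] && [ -f kubectl.sha256 ]; then
  if verify_sha256 kubectl "$(cat kubectl.sha256)"; then
    echo "kubectl: checksum OK"
  else
    echo "kubectl: checksum mismatch, do not install" >&2
  fi
fi
```

Run the check before the sudo install step so that a corrupted download never reaches /usr/local/bin.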

Create object storage configuration

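Store the object storage settings in a Kubernetes secret that the Thanos components read. The bucket and the access credentials must already exist in MinIO or your S3-compatible service; the commands below only record them for Thanos.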

kubectl create namespace thanos

kubectl create secret generic thanos-storage-config -n thanos --from-literal=config.yaml='
type: s3
config:
  bucket: "thanos-metrics"
  endpoint: "minio.example.com:9000"
  access_key: "thanos-access-key"
  secret_key: "thanos-secret-key"
  insecure: false
  signature_version2: false
  encrypt_sse: false
  put_user_metadata: {}
  http_config:
    idle_conn_timeout: 90s
    response_header_timeout: 2m
  trace:
    enable: false
  part_size: 134217728
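'

Note: Replace the MinIO endpoint, bucket name, and credentials with your actual object storage configuration. For AWS S3, use s3.amazonaws.com or a regional endpoint such as s3.us-east-1.amazonaws.com. The Prometheus sidecar reads this secret from its own namespace, so create the same secret in the monitoring namespace of every cluster that runs a sidecar.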
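
Keeping credentials out of shell history is easier when the configuration is rendered from environment variables into a file. This is a sketch under assumptions: the THANOS_* variable names and both function names are illustrative, not a Thanos convention, and the target namespaces must already exist.

```shell
# Render the objstore config from environment variables.
# Bucket and endpoint fall back to the example values used in this guide.
render_objstore_config() {
  cat <<EOF
type: s3
config:
  bucket: "${THANOS_BUCKET:-thanos-metrics}"
  endpoint: "${THANOS_S3_ENDPOINT:-minio.example.com:9000}"
  access_key: "${THANOS_ACCESS_KEY:?set THANOS_ACCESS_KEY}"
  secret_key: "${THANOS_SECRET_KEY:?set THANOS_SECRET_KEY}"
  insecure: false
EOF
}

# Create or update the secret in every namespace that reads it.
# Not called automatically; invoke it once kubectl points at the right cluster.
create_objstore_secrets() {
  tmpfile="$(mktemp)"
  render_objstore_config > "$tmpfile"
  for ns in thanos monitoring; do
    kubectl create secret generic thanos-storage-config -n "$ns" \
      --from-file=config.yaml="$tmpfile" --dry-run=client -o yaml |
      kubectl apply -f -
  done
  rm -f "$tmpfile"
}
```

Piping the client-side dry run into kubectl apply makes the step repeatable, so rotating the credentials is a matter of exporting new values and calling create_objstore_secrets again.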

Configure Prometheus with Thanos Sidecar

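Save the following values as prometheus-values.yaml and use them in each cluster. The sidecar uploads completed blocks to object storage and serves recent data to Thanos Query over gRPC.

prometheus:
  prometheusSpec:
    thanos:
      image: quay.io/thanos/thanos:v0.32.5
      version: v0.32.5
      # Recent kube-prometheus-stack releases read the secret reference from
      # existingSecret; older releases expect name/key directly under
      # objectStorageConfig. Check the values.yaml of your chart version.
      objectStorageConfig:
        existingSecret:
          name: thanos-storage-config
          key: config.yaml
      resources:
        requests:
          memory: 512Mi
          cpu: 500m
        limits:
          memory: 1Gi
          cpu: 1000m
    # Keep at least 6h locally (three 2h blocks) so that an object storage
    # outage does not cause data loss before blocks are uploaded
    retention: 6h
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleNamespaceSelector: {}
    ruleSelectorNilUsesHelmValues: false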

Deploy Prometheus with Thanos Sidecar

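Install Prometheus using Helm with the Thanos sidecar configuration. Run the installation once per cluster, changing the release name and the cluster and region external labels each time; Thanos relies on unique external labels to tell the clusters apart.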

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus-cluster-1 prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml \
  --set prometheus.prometheusSpec.externalLabels.cluster="cluster-1" \
  --set prometheus.prometheusSpec.externalLabels.region="us-east-1"
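
With more than a couple of clusters, generating the command from a small inventory avoids copy-and-paste mistakes in the labels. The sketch below only prints the commands for review. The kube context names (ctx-east, ctx-west) and the second region are hypothetical, and helm upgrade --install is used in place of helm install so that reruns do not fail on an existing release.

```shell
# Print (not run) the Helm command for one cluster.
# $1 = kube context, $2 = cluster label, $3 = region label
helm_install_cmd() {
  printf '%s' "helm upgrade --install prometheus-$2 prometheus-community/kube-prometheus-stack"
  printf '%s' " --kube-context $1 --namespace monitoring --create-namespace"
  printf '%s' " --values prometheus-values.yaml"
  printf '%s' " --set prometheus.prometheusSpec.externalLabels.cluster=$2"
  printf '%s\n' " --set prometheus.prometheusSpec.externalLabels.region=$3"
}

# Hypothetical inventory: one "context cluster region" triple per line.
while read -r context cluster region; do
  helm_install_cmd "$context" "$cluster" "$region"
done <<'EOF'
ctx-east cluster-1 us-east-1
ctx-west cluster-2 us-west-2
EOF
```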

Configure Thanos Query component

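Deploy Thanos Query to aggregate metrics from multiple clusters and provide a unified query interface. Each store address listed in the arguments must resolve from the cluster that runs Thanos Query. The in-cluster Service names below only work for sidecars in the same cluster, so expose the gRPC port (10901) of remote sidecars through a load balancer or ingress and list those addresses instead. Save the manifest as thanos-query.yaml.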

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: thanos
  labels:
    app: thanos-query
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - name: thanos-query
        image: quay.io/thanos/thanos:v0.32.5
        args:
        - query
        - --log.level=info
        - --query.replica-label=replica
        - --query.replica-label=prometheus_replica
        # --endpoint replaces the deprecated --store flag
        - --endpoint=thanos-store-gateway:10901
        - --endpoint=prometheus-cluster-1-prometheus.monitoring:10901
        - --endpoint=prometheus-cluster-2-prometheus.monitoring:10901
        - --query.auto-downsampling
        - --query.partial-response
        - --query.max-concurrent=20
        - --query.timeout=2m
        - --query.lookback-delta=15m
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        resources:
          requests:
            memory: 512Mi
            cpu: 500m
          limits:
            memory: 2Gi
            cpu: 1000m
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 10902
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 10902
          initialDelaySeconds: 10

Deploy Thanos Query service

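Create a ClusterIP service that exposes Thanos Query inside the cluster. The HTTP port is published as 9090 and forwards to the container port 10902; the gRPC port stays on 10901. Save the manifest as thanos-query-service.yaml.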

apiVersion: v1
kind: Service
metadata:
  name: thanos-query
  namespace: thanos
  labels:
    app: thanos-query
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 9090
    targetPort: 10902
    protocol: TCP
  - name: grpc
    port: 10901
    targetPort: 10901
    protocol: TCP
  selector:
    app: thanos-query

Apply Thanos Query configuration

Deploy the Thanos Query components to your Kubernetes cluster.

kubectl apply -f thanos-query.yaml
kubectl apply -f thanos-query-service.yaml

Configure Thanos Store Gateway

Deploy Thanos Store Gateway to serve historical metrics data from object storage with caching for improved query performance.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-store-gateway
  namespace: thanos
  labels:
    app: thanos-store-gateway
spec:
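  # One replica: the ReadWriteOnce claim used below cannot be shared across nodes.
  # To run more replicas, use a StatefulSet with volumeClaimTemplates.
  replicas: 1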
  selector:
    matchLabels:
      app: thanos-store-gateway
  template:
    metadata:
      labels:
        app: thanos-store-gateway
    spec:
      containers:
      - name: thanos-store
        image: quay.io/thanos/thanos:v0.32.5
        args:
        - store
        - --log.level=info
        - --data-dir=/var/thanos/store
        - --objstore.config-file=/etc/thanos/config.yaml
        - --index-cache-size=2GB
        - --chunk-pool-size=2GB
        - --store.grpc.series-max-concurrency=20
        - --sync-block-duration=3m
        - --block-sync-concurrency=20
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        volumeMounts:
        - name: storage-config
          mountPath: /etc/thanos
          readOnly: true
        - name: data
          mountPath: /var/thanos/store
        resources:
          requests:
            memory: 2Gi
            cpu: 1000m
          limits:
            memory: 8Gi
            cpu: 2000m
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 10902
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 10902
          initialDelaySeconds: 10
      volumes:
      - name: storage-config
        secret:
          secretName: thanos-storage-config
      - name: data
        persistentVolumeClaim:
          claimName: thanos-store-data

Create persistent volume for Store Gateway

Create a persistent volume claim for Thanos Store Gateway cache and metadata storage.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thanos-store-data
  namespace: thanos
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi

Deploy Store Gateway service

Create a service for Thanos Store Gateway to enable communication with Thanos Query components.

apiVersion: v1
kind: Service
metadata:
  name: thanos-store-gateway
  namespace: thanos
  labels:
    app: thanos-store-gateway
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 10902
    targetPort: 10902
    protocol: TCP
  - name: grpc
    port: 10901
    targetPort: 10901
    protocol: TCP
  selector:
    app: thanos-store-gateway

Apply Store Gateway configuration

Deploy all Store Gateway components to your cluster.

kubectl apply -f thanos-store-pvc.yaml
kubectl apply -f thanos-store.yaml
kubectl apply -f thanos-store-service.yaml

Configure Thanos Query Frontend

Deploy Query Frontend for query caching, splitting, and retry logic to improve query performance and reliability.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query-frontend
  namespace: thanos
  labels:
    app: thanos-query-frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query-frontend
  template:
    metadata:
      labels:
        app: thanos-query-frontend
    spec:
      containers:
      - name: thanos-query-frontend
        image: quay.io/thanos/thanos:v0.32.5
        args:
        - query-frontend
        - --log.level=info
        - --query-frontend.downstream-url=http://thanos-query:9090
        - --query-range.split-interval=24h
        - --query-range.max-retries-per-request=3
        - --query-frontend.log-queries-longer-than=10s
        - --cache-compression-type=snappy
        - |
          --query-range.response-cache-config=
          type: REDIS
          config:
            addr: redis-master:6379
            db: 0
            dial_timeout: 5s
            read_timeout: 3s
            write_timeout: 3s
            max_get_multi_concurrency: 100
            get_multi_batch_size: 100
            max_set_multi_concurrency: 100
            set_multi_batch_size: 100
        ports:
        - name: http
          containerPort: 10902
        resources:
          requests:
            memory: 512Mi
            cpu: 500m
          limits:
            memory: 1Gi
            cpu: 1000m
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 10902
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 10902
          initialDelaySeconds: 10

Deploy Redis for query caching

Deploy Redis to cache query results and improve Thanos Query Frontend performance.

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis \
  --namespace thanos \
  --set auth.enabled=false \
  --set replica.replicaCount=1 \
  --set master.persistence.size=10Gi

Create Query Frontend service and deploy

Apply the Query Frontend configuration and create a service for external access.

apiVersion: v1
kind: Service
metadata:
  name: thanos-query-frontend
  namespace: thanos
  labels:
    app: thanos-query-frontend
spec:
  type: LoadBalancer
  ports:
  - name: http
    port: 9090
    targetPort: 10902
    protocol: TCP
  selector:
    app: thanos-query-frontend

kubectl apply -f thanos-query-frontend.yaml
kubectl apply -f thanos-query-frontend-service.yaml

Configure Thanos Compactor

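Deploy Thanos Compactor to compact and downsample metrics in object storage, reducing storage costs and query times. The retention flags below delete raw data after 7 days, 5-minute data after 30 days, and 1-hour data after 180 days; set a flag to 0d to keep that resolution indefinitely. Run only one Compactor per bucket.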

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-compactor
  namespace: thanos
  labels:
    app: thanos-compactor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
      - name: thanos-compactor
        image: quay.io/thanos/thanos:v0.32.5
        args:
        - compact
        # --wait keeps the Compactor running; without it the process exits after one pass
        - --wait
        - --log.level=info
        - --data-dir=/var/thanos/compactor
        - --objstore.config-file=/etc/thanos/config.yaml
        - --consistency-delay=30m
        - --retention.resolution-raw=7d
        - --retention.resolution-5m=30d
        - --retention.resolution-1h=180d
        - --compact.concurrency=1
        - --downsample.concurrency=1
        volumeMounts:
        - name: storage-config
          mountPath: /etc/thanos
          readOnly: true
        - name: data
          mountPath: /var/thanos/compactor
        resources:
          requests:
            memory: 1Gi
            cpu: 1000m
          limits:
            memory: 4Gi
            cpu: 2000m
      volumes:
      - name: storage-config
        secret:
          secretName: thanos-storage-config
      - name: data
        persistentVolumeClaim:
          claimName: thanos-compactor-data
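
Retention and downsampling interact. According to the Thanos Compactor documentation, 5-minute downsampling is created for blocks older than 40 hours and 1-hour downsampling for blocks older than 10 days, so shorter retention can delete data before it is downsampled. The sketch below checks a pair of settings against those thresholds; check_retention is a hypothetical helper, and 0 stands for "keep forever".

```shell
# check_retention RAW_HOURS FIVE_MIN_DAYS
# Returns non-zero when retention is too short for downsampling to happen.
check_retention() {
  raw_hours="$1"
  five_min_days="$2"
  status=0
  if [ "$raw_hours" -ne 0 ] && [ "$raw_hours" -lt 40 ]; then
    echo "raw retention ${raw_hours}h < 40h: 5m downsampling may never run"
    status=1
  fi
  if [ "$five_min_days" -ne 0 ] && [ "$five_min_days" -lt 10 ]; then
    echo "5m retention ${five_min_days}d < 10d: 1h downsampling may never run"
    status=1
  fi
  if [ "$status" -eq 0 ]; then
    echo "retention settings leave room for downsampling"
  fi
  return "$status"
}

# Values from this step: raw = 7d (168h), 5m = 30d.
check_retention 168 30
```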

Deploy Compactor with storage

Create persistent storage for the Compactor and apply the configuration.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thanos-compactor-data
  namespace: thanos
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 200Gi

kubectl apply -f thanos-compactor-pvc.yaml
kubectl apply -f thanos-compactor.yaml

Configure external access with Ingress

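Create an Ingress resource to expose Thanos Query Frontend with TLS termination for external access. This example assumes the NGINX ingress controller and cert-manager with a ClusterIssuer named letsencrypt-prod. The endpoint is unauthenticated, so restrict access at the ingress if it is reachable from the internet.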

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-query-ingress
  namespace: thanos
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  # ingressClassName replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  tls:
  - hosts:
    - thanos.example.com
    secretName: thanos-tls
  rules:
  - host: thanos.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: thanos-query-frontend
            port:
              number: 9090

kubectl apply -f thanos-ingress.yaml

Verify your setup

Check the status of all Thanos components and verify metrics federation is working correctly.

kubectl get pods -n thanos
kubectl get pods -n monitoring

kubectl logs -n thanos deployment/thanos-query --tail=50
kubectl logs -n thanos deployment/thanos-store-gateway --tail=50

# Run these from a pod inside the cluster, or port-forward the service first
curl -s http://thanos-query-frontend.thanos:9090/api/v1/stores | jq .
curl -s "http://thanos-query-frontend.thanos:9090/api/v1/query?query=up" | jq .

Note: The stores endpoint should show all connected Prometheus sidecars and Store Gateways. The query endpoint should return metrics from all federated clusters.
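
To turn the manual check into something scriptable, the query response can be inspected for the cluster labels it contains. The sketch below uses only grep and cut and assumes the compact JSON that the Prometheus-compatible API returns. The sample response is illustrative, not captured from a real deployment, and both function names are hypothetical.

```shell
# query_ok JSON: succeed if the response reports success.
query_ok() {
  printf '%s' "$1" | grep -q '"status":"success"'
}

# clusters_in JSON: list the distinct values of the "cluster" label.
clusters_in() {
  printf '%s' "$1" | grep -o '"cluster":"[^"]*"' | cut -d'"' -f4 | sort -u
}

# Illustrative response shape for the query "up".
sample='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"up","cluster":"cluster-1"},"value":[0,"1"]},{"metric":{"__name__":"up","cluster":"cluster-2"},"value":[0,"1"]}]}}'

query_ok "$sample" && clusters_in "$sample"
```

In practice the sample would be replaced by the output of the curl command for the query endpoint, and the resulting list compared with the clusters you expect to be federated.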

Common issues

Symptom | Cause | Fix
Store Gateway not showing data | Object storage credentials incorrect | Verify the secret: kubectl get secret thanos-storage-config -n thanos -o yaml
Query returns no data from some clusters | Sidecar not uploading to storage | Check sidecar logs: kubectl logs <prometheus-pod> -n monitoring -c thanos-sidecar
High memory usage on Store Gateway | Index cache too large | Reduce the --index-cache-size parameter
Slow queries on historical data | No downsampling configured | Wait for Compactor to process data or check retention settings
Compactor fails with permission errors | Insufficient object storage permissions | Grant read/write/delete permissions on the storage bucket
Query Frontend cache not working | Redis connection failed | Check the Redis deployment: kubectl get pods -n thanos -l app.kubernetes.io/name=redis
