ADR-002: Use Thanos for Multi-Cluster Metrics Aggregation
Status
Accepted
Context
I operate multiple Kubernetes clusters (dev, staging, prod, ops) and need a unified metrics platform that provides:
- Long-term retention: Store metrics for months/years for capacity planning and historical analysis
- Global querying: Single query interface across all clusters without maintaining separate Prometheus instances
- High availability: Survive individual Prometheus pod failures
- Cost efficiency: Store compressed, downsampled data in object storage
- Scalability: Handle growing metric cardinality without constant Prometheus scaling
- Consistency: Same query language (PromQL) across all environments
Traditional Prometheus-only approaches have limitations:
- Limited retention: Prometheus local storage typically holds 2-4 weeks of data
- No global view: Each cluster has isolated Prometheus, requiring manual federation
- Single point of failure: Prometheus pod failure loses recent data
- Storage costs: Replicating full-resolution data across clusters is expensive
- Query complexity: Federating queries across multiple Prometheus instances is cumbersome
Decision
I implemented Thanos in a sidecar architecture with the following components:
- Thanos Sidecar: Attached to each Prometheus pod, uploads blocks to object storage (GCS/S3)
- Thanos Store Gateway: Serves historical data from object storage
- Thanos Query (Querier): Global query endpoint that aggregates real-time (Prometheus) and historical (Store Gateway) data (wiring sketched after this list)
- Thanos Compactor: Runs in each cluster against its environment's bucket (exactly one Compactor per bucket), compacting and downsampling historical blocks
- Thanos Ruler: Optional, for global alerting rules
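For concreteness, a minimal sketch of how the central Querier can be wired to the per-cluster Sidecars and the Store Gateway. The flag names are standard Thanos flags; the endpoint addresses, ports, and replica label are illustrative assumptions, not the actual deployment values.

```yaml
# Thanos Query container args (ops cluster). Addresses below are hypothetical;
# real deployments would use DNS service discovery or the Envoy Gateway
# endpoints mentioned later in this ADR.
args:
  - query
  - --http-address=0.0.0.0:10902                            # HTTP API used by Grafana
  - --grpc-address=0.0.0.0:10901                            # gRPC StoreAPI (e.g. for Ruler)
  - --query.replica-label=prometheus_replica                # deduplicate HA Prometheus replicas, if used
  - --endpoint=thanos-store.monitoring.svc:10901            # Store Gateway (historical data)
  - --endpoint=thanos-sidecar.dev.example.internal:10901    # dev cluster Sidecar
  - --endpoint=thanos-sidecar.prod.example.internal:10901   # prod cluster Sidecar
```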
Architecture:
- Each cluster runs Prometheus + Thanos Sidecar for real-time metrics (sidecar sketch after this list)
- Each cluster runs a Thanos Compactor (compacts and downsamples that environment's bucket)
- Ops cluster hosts Thanos Query and Store Gateway (global aggregation)
- Each cluster writes to object storage (separate per-environment buckets for isolation)
- Grafana queries Thanos Query, not individual Prometheus instances
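A minimal sketch of the per-cluster Prometheus + Sidecar pairing noted above, assuming the Sidecar runs as an extra container in the Prometheus pod and reads its bucket configuration from a mounted Secret. The image tag, volume names, and bucket name are placeholders, not the actual values.

```yaml
# Illustrative Sidecar container added to each Prometheus pod.
containers:
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.34.1          # tag is illustrative
    args:
      - sidecar
      - --tsdb.path=/prometheus                   # volume shared with Prometheus
      - --prometheus.url=http://localhost:9090    # same-pod Prometheus
      - --objstore.config-file=/etc/thanos/objstore.yml
      - --grpc-address=0.0.0.0:10901              # StoreAPI consumed by Thanos Query
    volumeMounts:
      - name: prometheus-data
        mountPath: /prometheus
      - name: thanos-objstore
        mountPath: /etc/thanos
---
# objstore.yml: one GCS bucket per environment, as described above.
type: GCS
config:
  bucket: metrics-prod                            # hypothetical bucket name
```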
Key Configuration:
- Object storage: GCS buckets with lifecycle policies
- Retention tiers: 2h raw, 6h at 5m resolution, 24h at 1h resolution, 30d at 1h resolution
- External labels: cluster and environment for multi-cluster queries (configuration sketched after this list)
- gRPC exposure: Thanos Query exposed via Envoy Gateway for secure cross-cluster access
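The external labels live in each cluster's Prometheus configuration and are stamped onto every block the Sidecar uploads, while per-resolution retention is set via Compactor flags. A sketch under those assumptions; the label values and retention durations shown are placeholders, not the exact tiers listed above.

```yaml
# Per-cluster Prometheus config fragment: external labels identify every block's origin.
global:
  external_labels:
    cluster: prod              # illustrative values
    environment: production
---
# Thanos Compactor container args: flag names are real Thanos flags,
# durations are placeholders to be replaced with the tiers above.
args:
  - compact
  - --wait                                          # keep running and compacting continuously
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --retention.resolution-raw=30d                  # how long raw blocks are kept
  - --retention.resolution-5m=90d                   # how long 5m-downsampled blocks are kept
  - --retention.resolution-1h=1y                    # how long 1h-downsampled blocks are kept
```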
Consequences
Positive
- Unified query interface: Single endpoint (Thanos Query) for all metrics across all clusters
- Long-term retention: Years of historical data in cost-effective object storage
- High availability: A Prometheus pod failure doesn’t lose historical data (uploaded blocks are already in object storage)
- Cost efficiency: Per-resolution retention plus downsampling keeps long-term storage a fraction of full-resolution retention
- Scalability: Can add clusters without changing query infrastructure
- Global alerts: Thanos Ruler enables cross-cluster alerting rules (example after this list)
- Minimal data-loss window: The Sidecar uploads each completed TSDB block (cut roughly every 2h) as soon as it is written, not only at shutdown
- Standard PromQL: Developers use familiar Prometheus query language
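As an illustration of the global-alerting point above: Thanos Ruler evaluates standard Prometheus rule files against Thanos Query, so a single rule can span clusters by grouping on the external labels. The metric and threshold below are examples, not rules from this setup.

```yaml
# Example cross-cluster rule for Thanos Ruler (standard Prometheus rule format).
groups:
  - name: cross-cluster
    rules:
      - alert: ApiServerErrorRateHigh
        expr: |
          sum by (cluster) (rate(apiserver_request_total{code=~"5.."}[5m]))
            /
          sum by (cluster) (rate(apiserver_request_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High API server error rate in cluster {{ $labels.cluster }}"
```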
Negative
- Complexity: More components to operate and monitor (Sidecar, Store, Query, Compactor)
- Operational overhead: Object storage bucket management, lifecycle policies, IAM
- Query latency: Historical queries may be slower (object storage vs local SSD)
- Storage costs: Object storage still incurs costs (though much lower than full retention)
- Learning curve: Requires understanding Thanos concepts (blocks, downsampling, external labels)
- Debugging: Issues span multiple components (Prometheus, Sidecar, Store, Query)
Mitigations
- Comprehensive documentation in docs/observability.md
- Monitoring of the Thanos components themselves (meta-monitoring; example alerts after this list)
- Clear runbooks for common issues
- Gradual rollout (start with single cluster, expand)
- Use Thanos Query UI for debugging query execution
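For the meta-monitoring item above, a couple of example alerts on the Thanos components' own metrics. The metric names are from Thanos's built-in instrumentation as I recall it and should be verified against the deployed version; thresholds and severities are illustrative.

```yaml
# Illustrative meta-monitoring alerts for Thanos itself.
groups:
  - name: thanos-meta-monitoring
    rules:
      - alert: ThanosCompactHalted
        expr: thanos_compact_halted == 1          # Compactor stopped after a fatal error
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Thanos Compactor has halted and is no longer processing blocks"
      - alert: ThanosObjstoreOperationFailures
        expr: rate(thanos_objstore_bucket_operation_failures_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "A Thanos component is failing object storage operations"
```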
Alternatives Considered
1. Prometheus Federation
Rejected because:
- No long-term retention (federated Prometheus still has local storage limits)
- Complex query patterns (need to know which Prometheus to query)
- No global view without additional aggregation layer
- Doesn’t solve the HA problem (the federating Prometheus is itself a single point of failure)
2. Cortex
Rejected because:
- Although Cortex is robust and scales well, its architecture (distributors, ingesters, query frontend, separate chunks/blocks storage) is more complex than this use case needs
- Ingestion is push-based via Prometheus remote_write, which would mean reworking the existing Prometheus deployments, whereas the Thanos Sidecar attaches to them as-is
- Operational overhead is higher than Thanos for this specific multi-cluster scenario
3. VictoriaMetrics
Rejected because:
- Although VictoriaMetrics is fast and compresses data very well, it uses its own storage engine, so adopting it would mean migrating data and reworking the ingest path of the existing Prometheus setups
- Maintaining strict Prometheus compatibility was important for this use case
- Thanos integrates with the existing Prometheus deployments instead of replacing them
4. Multiple Independent Prometheus Instances
Rejected because:
- Simple and straightforward approach, but lacks unified querying
- Each cluster must be queried separately, making cross-cluster analysis difficult
- Limited retention (2-4 weeks typical) doesn’t meet long-term analysis needs
- Storage costs scale linearly with each cluster
- No global alerting capabilities across clusters
5. Managed Solutions (Datadog, New Relic, etc.)
Rejected because:
- Although managed solutions offer excellent features and reduce operational overhead, maintaining full control over the observability stack was important here
- Cost considerations at scale favor self-hosted solutions
- Data sovereignty and GitOps-managed infrastructure were key requirements
- Self-hosted approach provides more flexibility for custom retention and query patterns
Implementation Notes
- Thanos Query exposed via Envoy Gateway with mTLS for secure cross-cluster access
- Separate object storage buckets per environment for isolation
- External labels (cluster, environment) enable filtering in queries
- The Compactor runs in each cluster and operates on that environment's bucket (one Compactor per bucket, to avoid concurrent compaction conflicts)
- Store Gateway can be scaled horizontally if query load increases
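Since Grafana talks only to Thanos Query, which serves the Prometheus HTTP API, it can be provisioned as an ordinary Prometheus datasource. A minimal provisioning sketch; the datasource name, URL, and port are hypothetical.

```yaml
# Grafana datasource provisioning: Thanos Query registered as a Prometheus datasource.
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus
    access: proxy
    url: http://thanos-query.monitoring.svc.cluster.local:10902   # hypothetical in-cluster URL
    isDefault: true
```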
References