ADR-002: Use Thanos for Multi-Cluster Metrics Aggregation

Status

Accepted

Context

I operate multiple Kubernetes clusters (dev, staging, prod, ops) and need a unified metrics platform that provides:

Traditional Prometheus-only approaches have limitations:

Decision

I implemented Thanos in a sidecar architecture with the following components:

  1. Thanos Sidecar: Attached to each Prometheus pod, uploads blocks to object storage (GCS/S3)
  2. Thanos Store Gateway: Serves historical data from object storage
  3. Thanos Query (Querier): Global query endpoint that aggregates real-time data (from Prometheus, via the Sidecars) and historical data (from the Store Gateway)
  4. Thanos Compactor: Runs as a single instance per object storage bucket (concurrent compactors on the same bucket are unsafe); compacts and downsamples historical data
  5. Thanos Ruler: Optional, for global alerting rules
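
To illustrate how a client might use the global query endpoint from component 3, here is a minimal sketch that builds a request URL for the Prometheus-compatible HTTP API exposed by Thanos Query. The service address, the port, and the `cluster` label are assumptions for illustration and are not taken from this ADR. Deduplication also assumes the Querier was started with a replica label configured.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster address of the Thanos Query service (assumption).
THANOS_QUERY_URL = "http://thanos-query.monitoring.svc:10902"


def build_instant_query_url(promql: str,
                            dedup: bool = True,
                            partial_response: bool = False) -> str:
    """Build a URL for the instant query endpoint (/api/v1/query)."""
    params = {
        "query": promql,
        # Ask the Querier to merge series coming from HA Prometheus replicas.
        "dedup": str(dedup).lower(),
        # "false" asks for an error instead of incomplete results when a
        # Sidecar or Store Gateway cannot be reached.
        "partial_response": str(partial_response).lower(),
    }
    return f"{THANOS_QUERY_URL}/api/v1/query?{urlencode(params)}"


# Example: count healthy targets per cluster, assuming each Prometheus
# sets a `cluster` external label (hypothetical label name).
url = build_instant_query_url("sum by (cluster) (up)")
print(url)
```

The function only constructs the URL and makes no network call, so it can be combined with whichever HTTP client is already in use.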

Architecture:

Key Configuration:

Consequences

Positive

Negative

Mitigations

Alternatives Considered

1. Prometheus Federation

Rejected because:

2. Cortex

Rejected because:

3. VictoriaMetrics

Rejected because:

4. Multiple Independent Prometheus Instances

Rejected because:

5. Managed Solutions (Datadog, New Relic, etc.)

Rejected because:

Implementation Notes

References