A cohesive observability platform unifies metrics, logs, and traces so operators and developers can answer: Is it slow? Why is it slow? Where did it fail? This document describes an opinionated, production‑grade, multi‑cluster design built around Prometheus + Thanos for metrics, Elasticsearch + Kibana (ECK) for logs, and Jaeger for traces, all delivered via Argo CD.
Goal: A single logical pane of glass (Grafana + Kibana + Jaeger Query UI) without tightly coupling clusters or sacrificing blast radius isolation.
The design (and the steps below) assumes the following platform components are already deployed and healthy in each relevant cluster (usually via your App‑of‑Apps / GitOps root): Argo CD, cert-manager (or an equivalent certificate pipeline), the Sealed Secrets controller, and the Envoy Gateway used for cross‑cluster ingress.
Ensure cluster trust roots / CA issuers are consistent (or mapped) so mTLS pairs validate correctly across clusters.
This platform is bootstrapped by the existing Helm chart at charts/app-of-apps. That chart renders Argo CD Application objects for every core and observability component based on simple boolean enable flags and per‑environment value files (values.dev-01.yaml, values.stag-01.yaml, values.prod-01.yaml, values.ops-01.yaml).
For each enabled component the chart standardizes:
- Source revision pinning (source.targetRevision), allowing progressive promotion (e.g. HEAD → staging → stable)
- Per‑component enable flags (<component>.enable)
- Application naming: <cluster.name>-<component> (e.g. dev-01-monitoring)
- Common labels (team: ops, cluster: <cluster>)
- Sync options (CreateNamespace=true, validation, self‑heal)

Example (charts/app-of-apps/values.dev-01.yaml):
```yaml
cluster:
  name: dev-01
  server: https://dev-01-k8s.endpoint.kubernetes
source:
  targetRevision: HEAD
  helm:
    valueFiles: values.dev-01.yaml
monitoring:
  enable: true
logging:
  enable: true
jaeger:
  enable: true
# ... other components enabled ...
```
Switching promotion simply changes source.targetRevision in values.prod-01.yaml (e.g. stable) while dev tracks HEAD.
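For instance, the production values might pin a released revision while dev stays on HEAD (illustrative excerpt mirroring the dev example above; the server endpoint is an assumption):

```yaml
# charts/app-of-apps/values.prod-01.yaml (excerpt)
cluster:
  name: prod-01
  server: https://prod-01-k8s.endpoint.kubernetes  # assumed endpoint naming
source:
  targetRevision: stable  # promoted tag/branch; dev-01 keeps tracking HEAD
  helm:
    valueFiles: values.prod-01.yaml
```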
Promotion workflow:
- Component charts live alongside the app-of-apps chart in the same repository (e.g. charts/monitoring).
- Dev environments track the default branch (HEAD).
- To promote, point the environment file (e.g. values.stag-01.yaml) source.targetRevision at the new tag/branch → merge → Argo sync.

Ensure these are consistent across all environment files:
- monitoring chart external labels (cluster, environment) aligned with object store bucket naming.
- logging chart index prefix or ILM policy references (if templated).
- jaeger chart mode (collectors only vs collectors + query) differentiated by environment (the Ops cluster hosts the Query UI).

The app-of-apps chart abstracts naming, repo URLs, finalizers, sync options, and per‑cluster targeting, removing repetitive Argo CD YAML boilerplate and letting platform engineers focus on chart evolution (monitoring, elastic-stack, jaeger, etc.) rather than Application plumbing.
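A minimal sketch of how these differences might surface in the environment files (key names are illustrative, not the charts' actual schema):

```yaml
# values.dev-01.yaml (excerpt; illustrative keys)
monitoring:
  enable: true
  externalLabels:
    cluster: dev-01
    environment: dev
jaeger:
  enable: true
  query:
    enabled: false  # collectors only outside the Ops cluster
---
# values.ops-01.yaml (excerpt; illustrative keys)
jaeger:
  enable: true
  query:
    enabled: true   # Ops cluster hosts the Jaeger Query UI
```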
Argo CD UI groups Applications by name prefix <cluster>-; quick visual scan shows which cluster component diverged. Label selectors (team=ops, cluster=dev-01) allow automated tooling (metrics export or notifications) to surface drift or sync lag per environment.
The high‑level rollout order below minimizes rework and validates each layer before enabling global federation.
- Create object storage buckets per cluster (e.g. thanos-dev, thanos-prod) with lifecycle rules (downsample retention tiers).
- Create Elasticsearch index templates / ILM policies for the shared indices (logs-*, jaeger-span-*).
- Create DNS records for the shared endpoints (thanos-query.example.com, kibana.example.com, logs-ingest.example.com, jaeger.example.com).

Use Sealed Secrets (or a secret management pipeline) to commit:
- Object storage credentials (Thanos objstore configuration).
- mTLS certificates for cross‑cluster traffic (StoreAPI, log shipping).
- Elasticsearch credentials for shippers and collectors.
Apply sealed secrets first so subsequent chart syncs mount them cleanly.
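As an example of what one of those secrets carries, a Thanos object store configuration for GCS looks roughly like the sketch below; the bucket name follows the prerequisite above and the credentials are placeholders. Seal the rendered Secret (e.g. with kubeseal) before committing.

```yaml
# objstore-dev.yaml - consumed by the Thanos sidecar, store gateway, and compactor
type: GCS
config:
  bucket: thanos-dev        # per-cluster bucket created in the prerequisites
  service_account: |
    { "type": "service_account", "...": "placeholder - never commit unsealed" }
```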
Then roll out, in order:
- Deploy the elastic-stack chart in the Ops cluster (ECK operator + ES + Kibana).
- Apply index templates / ILM policies (logs-*, jaeger-span-*).
- Deploy log shippers (elastic-stack values for remote mode or a lightweight shipper chart) in each non‑Ops cluster.
- Point shippers at the mTLS ingest endpoint (logs-ingest.example.com:443); a hedged Filebeat sketch follows this list.
- Confirm per‑cluster indices appear (logs-dev-01-*).
- Deploy the monitoring chart in each cluster with prometheus.replicaCount=2, thanos.enabled=true.
- Confirm sidecar block uploads (thanos_shipper_uploads_total).
- Verify global queries, e.g. up{cluster="dev-01"}, return results across clusters.
- Deploy Jaeger collectors in every cluster and enable the Query UI only in the Ops cluster (jaeger.query.enabled=true).
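A hedged Filebeat output sketch for the remote-shipping step above (certificate paths are assumptions; index templating and ILM settings are omitted for brevity):

```yaml
# filebeat.yml (excerpt) - ship logs from a workload cluster to the Ops gateway over mTLS
output.elasticsearch:
  hosts: ["https://logs-ingest.example.com:443"]
  ssl:
    certificate_authorities: ["/etc/filebeat/certs/ca.crt"]
    certificate: "/etc/filebeat/certs/client.crt"
    key: "/etc/filebeat/certs/client.key"
  # per-cluster index naming (requires matching setup.template.* configuration)
  index: "logs-dev-01-%{+yyyy.MM.dd}"
```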
| Check | Command / Method | Success Criteria |
|-------|------------------|------------------|
| Prometheus HA | kubectl get pods -n monitoring | 2 Ready pods per cluster + sidecars running |
| Thanos shipping | Grafana: thanos_shipper_uploads_total | Steady increase; no persistent failures |
| Global metrics | Grafana query across cluster label | Results from all clusters |
| Log ingestion | Kibana Discover | Logs from each cluster index present |
| Traces written | Jaeger UI search | Spans visible from test app |
| mTLS integrity | Inspect sidecar / Filebeat logs | No handshake / cert errors |
| ILM rotation | ES _ilm/explain | Hot → Warm transitions appear as policy dictates |
| Activity | Frequency | Notes |
|----------|-----------|-------|
| TLS cert rotation | 60–90 days | Automate via cert-manager + Sealed Secret updates |
| Object storage lifecycle review | Quarterly | Optimize retention versus cost |
| ES shard / index review | Monthly | Prevent oversharding; adjust templates |
| Capacity planning (metrics / logs) | Quarterly | Track growth; forecast storage & ingestion rates |
| Policy review (ILM / retention) | Quarterly | Align with compliance & cost constraints |
Each workload cluster runs: (a) an HA Prometheus pair (each replica with its own Sidecar), (b) a per‑cluster Compactor, (c) a per‑cluster Store Gateway, and (d) a per‑cluster local Thanos Query that federates ONLY in‑cluster components. A single Global Thanos Query (in the Ops cluster) fans out over mTLS to the local Query instances (one hop), rather than directly to every sidecar / store gateway. This reduces the number of cross‑cluster mTLS targets and cleanly encapsulates each cluster’s internal topology.
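The single-hop fan-out could be expressed as chart values along these lines (key names are illustrative, not the monitoring chart's actual schema); each endpoint is a local Query's gRPC StoreAPI exposed cross-cluster and secured with mTLS:

```yaml
# Ops cluster only - global Thanos Query federating the per-cluster local Queries
thanosQueryGlobal:
  enabled: true
  endpoints:                          # one hop: local Queries, not individual sidecars
    - dev-01-query.example.com:10901
    - stag-01-query.example.com:10901
    - prod-01-query.example.com:10901
    - ops-01-query.example.com:10901
  grpcClientTls:                      # client certs for mTLS towards the local Queries
    ca: /certs/ca.crt
    cert: /certs/client.crt
    key: /certs/client.key
```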
```mermaid
graph LR
subgraph Dev[Dev Cluster]
PDv[(Prometheus Dev R1/ R2)] --> SDv[Sidecars]
SDv --> BDv[(GCS Bucket Dev)]
CDv[Compactor Dev] --> BDv
SGDv[Store GW Dev] --> BDv
QDv[Local Query Dev] --> SGDv
QDv --> SDv
end
subgraph Prod[Prod Cluster]
PPr[(Prometheus Prod R1/ R2)] --> SPr[Sidecars]
SPr --> BPr[(GCS Bucket Prod)]
CPr[Compactor Prod] --> BPr
SGPr[Store GW Prod] --> BPr
QPr[Local Query Prod] --> SGPr
QPr --> SPr
end
subgraph Stage[Stage Cluster]
PSt[(Prometheus Stage R1/ R2)] --> SSt[Sidecars]
SSt --> BSt[(GCS Bucket Stage)]
CSt[Compactor Stage] --> BSt
SGSt[Store GW Stage] --> BSt
QSt[Local Query Stage] --> SGSt
QSt --> SSt
end
subgraph Ops[Ops Cluster]
PO[(Prometheus Ops R1/ R2)] --> SO[Sidecars]
SO --> BO[(GCS Bucket Ops)]
CO[Compactor Ops] --> BO
SGO[Store GW Ops] --> BO
QO[Local Query Ops] --> SGO
QO --> SO
GF[Grafana] --> QG[Global Thanos Query]
end
%% Global federation (single hop to local queries)
QG -. mTLS StoreAPI .-> QDv
QG -. mTLS StoreAPI .-> QPr
QG -. mTLS StoreAPI .-> QSt
QG -. mTLS StoreAPI .-> QO
```
Key Points:
- Every Prometheus replica is uniquely identified by its external labels (cluster and replica), letting the Query layer deduplicate the HA pair; a values sketch follows the legend.

Legend: Local Query = in‑cluster fan‑out; Global Query = cross‑cluster aggregator; Sidecars = live + block shipping; Store GW = historical block index; Compactor = cluster‑scoped downsampling.
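A sketch of that identity and deduplication wiring, assuming prometheus-operator-style defaults (exact keys vary by chart; prometheus_replica is the operator's default replica label):

```yaml
# Per-cluster monitoring values (illustrative keys)
prometheus:
  replicaCount: 2
  externalLabels:
    cluster: dev-01          # identifies the cluster in every series
  # each replica additionally carries a unique prometheus_replica label
thanos:
  query:
    replicaLabels:
      - prometheus_replica   # deduplicate the HA pair at query time
```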
Logs from all clusters flow to a single Elasticsearch deployment in the Ops cluster. Filebeat (or another shipper) in each remote cluster sends logs over mTLS through an Envoy Gateway. Kibana provides exploration and dashboarding; Cerebro is optional for cluster internals.
```mermaid
flowchart LR
subgraph Ops[Ops Cluster]
GW[Envoy Gateway]
ES[(Elasticsearch)]
KB[Kibana]
CBR[(Cerebro Optional)]
KB --> ES
CBR --> ES
end
subgraph Dev[Dev Cluster]
FB_DEV[Filebeat DS]
end
subgraph Stag[Staging Cluster]
FB_STAG[Filebeat DS]
end
subgraph Prod[Prod Cluster]
FB_PROD[Filebeat DS]
end
FB_DEV -- mTLS --> GW
FB_STAG -- mTLS --> GW
FB_PROD -- mTLS --> GW
GW --> ES
```
Key Points:
- Remote clusters ship through a single mTLS ingest endpoint (the Envoy Gateway) rather than exposing Elasticsearch directly.
- Index names encode the source cluster (e.g. logs-dev-01-*), keeping ILM policies and per‑cluster queries simple.
Jaeger reuses the Elasticsearch cluster used for logs (operational & cost efficiency). Collectors are deployed in every cluster; the Query UI only in the Ops cluster. Agents are optional: prefer direct OTLP export when possible.
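A hedged sketch of per-cluster Jaeger values under these constraints (key names depend on the jaeger chart in use; the Elasticsearch host is a placeholder):

```yaml
# Workload-cluster Jaeger values (illustrative keys)
collector:
  enabled: true
query:
  enabled: false                       # Query UI only in the Ops cluster
storage:
  type: elasticsearch
  elasticsearch:
    host: logs-ingest.example.com      # shared ES reached via the Ops gateway (assumption)
    port: 443
    extraFlags:
      - --es.use-ilm=true
      - --es.index-prefix=jaeger-dev   # per-environment prefix
```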
```mermaid
flowchart LR
subgraph Ops[Ops Cluster]
JQ[Jaeger Query]
JC_O[Collectors Ops]
ES_SHARED[(Elasticsearch Shared)]
end
subgraph Dev[Dev Cluster]
JC_D[Collectors Dev]
end
subgraph Stag[Staging Cluster]
JC_S[Collectors Staging]
end
subgraph Prod[Prod Cluster]
JC_P[Collectors Prod]
end
JC_D --> ES_SHARED
JC_S --> ES_SHARED
JC_P --> ES_SHARED
JC_O --> ES_SHARED
JQ --> ES_SHARED
```
Key Points:
- Collectors run in every cluster; the Query UI runs only in the Ops cluster.
- Spans land in the shared Elasticsearch under their own index prefix (jaeger-span-*), governed by ILM alongside the log indices.
| Concern | Metrics | Logs | Traces |
|---------|---------|------|--------|
| Hot Path | Local Prometheus TSDB | Ingest pipeline (Filebeat -> ES) | Collector batching -> ES |
| Long‑Term | Thanos (object storage) | ES ILM warm tier | ES ILM indices |
| Global Aggregation | Thanos Query | Kibana (single ES) | Jaeger Query (same ES) |
| Transport Security | mTLS StoreAPI / gRPC | mTLS (Gateway) | mTLS to ES |
| User Interface | Grafana | Kibana | Jaeger UI |
Sharing Elasticsearch for logs & traces decreases system sprawl but requires ILM planning (separate index prefixes: e.g., logs-*, jaeger-span-*).
```mermaid
sequenceDiagram
participant Q as Thanos Query
participant SC as Sidecar (Cluster)
participant FB as Filebeat
participant GW as Envoy Gateway
participant ES as Elasticsearch
participant JC as Jaeger Collector
participant JQ as Jaeger Query
Q->>SC: StoreAPI (mTLS cert mutual verify)
FB->>GW: HTTPS (mTLS client cert)
GW->>ES: Internal TLS (cluster CA)
JC->>ES: HTTPS (mTLS / credentials)
JQ->>ES: HTTPS (search queries)
```
Principles:
- All cross‑cluster traffic (StoreAPI, log shipping, trace writes) is mutually authenticated with mTLS.
- Certificates are issued and rotated via cert-manager and committed only as Sealed Secrets; no plaintext credentials in Git. A hedged Certificate sketch follows.
- Internal hops (Gateway → ES) reuse the cluster CA.
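For rotation, a cert-manager Certificate along these lines (issuer, names, and durations are assumptions) keeps the client material fresh; the resulting Secret is what the global Query mounts for StoreAPI mTLS:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: thanos-store-client            # hypothetical name
  namespace: monitoring
spec:
  secretName: thanos-store-client-tls  # mounted by the global Query
  duration: 2160h                      # 90 days
  renewBefore: 360h                    # renew 15 days before expiry
  issuerRef:
    name: observability-ca             # hypothetical ClusterIssuer shared across clusters
    kind: ClusterIssuer
  usages:
    - client auth
  dnsNames:
    - global-query.ops-01.example.com  # placeholder SAN
```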
| Task | Action | Tooling |
|------|--------|---------|
| Add new cluster | Deploy Prometheus + Sidecar + Filebeat + Jaeger Collector; register certs; add DNS for Gateway | Helm / Argo CD |
| Rotate certs | Update SealedSecrets / cert-manager Issuer; allow rolling restart | cert-manager / Argo sync |
| Expand retention (metrics) | Adjust Thanos compactor retention flags & object store lifecycle | Thanos / GCS console |
| Expand retention (logs/traces) | Update ES ILM policies & storage class sizing | ECK / ILM APIs |
| Investigate latency | Use Grafana (SLO panels) → jump to Kibana logs → correlate trace in Jaeger | Cross-tool drilldown |
| Outage isolation | Thanos Query continues with healthy clusters; ES ingest unaffected by one cluster's failure | Redundant topology |
Per-component configuration notes:
- Prometheus: replicaCount=2, anti‑affinity, remote write avoided (use Thanos instead).
- Thanos: per‑cluster object store configuration (objstore-dev.yaml). External labels must include cluster + replica; ensure local Queries and the global Query share consistent relabeling.
- Elasticsearch: index templates for logs-*, jaeger-span-* with tuned shard counts.
- Jaeger: es.use-ilm=true and a consistent index-prefix per environment (e.g., jaeger-dev, jaeger-prod).

| Component | Failure | Impact | Mitigation |
|-----------|---------|--------|------------|
| Single Prometheus replica | Pod loss | Minor scrape gap | Run 2 replicas + sidecar per replica |
| Thanos Sidecar | Outage | Fresh data from that replica missing; partial global view | Alert on ship lag; restart sidecar |
| Per‑cluster Compactor | Outage | Delayed downsampling / higher bucket usage (that cluster only) | Alert on compaction lag; restart |
| Per‑cluster Store Gateway | Outage | Loss of historical blocks (> local retention) for that cluster only | Queries still hit sidecars; alert & restart gateway |
| Per‑cluster object storage | Outage | No new block uploads; historical data unavailable for that cluster | Local Prometheus hot data still queryable; retry after restore |
| Envoy Gateway (ingress) | Down | Remote log shipping halts | Limited Filebeat buffering; HA Gateway / multi‑AZ |
| ES hot node | Failure | Potential write throttling | Multi‑AZ nodeSets + ILM; autoscale / shard relocation |
| Jaeger Collector | Down | Span loss if app‑side buffering is low | Horizontal collectors + backpressure tuning |
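The "alert on ship lag" mitigation above could start from a PrometheusRule like the sketch below (window and threshold are assumptions; a healthy sidecar uploads a block roughly every two hours):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: thanos-shipping
  labels:
    team: ops
spec:
  groups:
    - name: thanos-shipping.rules
      rules:
        - alert: ThanosSidecarUploadStalled
          expr: increase(thanos_shipper_uploads_total[3h]) == 0
          for: 1h
          labels:
            severity: warning
            team: ops
          annotations:
            summary: "Thanos sidecar has not uploaded any blocks in the last 3h"
```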
These practices ensure consistency, resilience, and clean multi‑cluster operations.
- Standardize a single cluster label (cluster) across Prometheus external labels, Loki / Elasticsearch index prefixes, and trace resource attributes (OTLP resource.attributes.cluster).
- Standardize an environment label (dev, stage, prod, ops) for queries, routing, and alert fan‑out.

```yaml
global:
  labels:
    cluster: dev-01
    environment: dev
```
Propagate via chart templates (Prometheus externalLabels, Filebeat / Log shipper processors, Jaeger collector OTEL_RESOURCE_ATTRIBUTES).
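For example, the rendered per-signal configuration might look like this (the filebeat and jaegerCollector keys are illustrative, not the charts' actual schema):

```yaml
# Rendered from global.labels by the chart templates (illustrative keys)
prometheus:
  externalLabels:
    cluster: dev-01
    environment: dev
filebeat:
  processors:
    - add_fields:
        target: ""          # add fields at the event root
        fields:
          cluster: dev-01
          environment: dev
jaegerCollector:
  extraEnv:
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: cluster=dev-01,environment=dev
```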
- Keep PrometheusRule CRs identical in all environments (eliminates drift and "it only alerts in prod" surprises).
- Standardize alert labels: severity=warning|critical, team=<owner>, environment=<from externalLabels>.
- Route on environment=prod first for granular receivers; a fallback route for everything else points to a consolidated non‑prod receiver (see the sketch after this list).
- Monitor notification delivery (alertmanager_notifications_total, custom counters) where possible.
- Scope log alert rules per cluster (filter: term: cluster: <cluster>).
- Create rollover aliases (e.g. filebeat with is_write_index: true) before ILM policy activation.
- Review ILM progress (_ilm/explain) and shard counts per index to avoid oversharding (target 20–40 GB primary size, not <1 GB).
- Precompute expensive expressions as recording rules (record: job:http_request_error_rate:ratio) and alert off the recording rule.
- Flag maintenance windows (maintenance=true); Alertmanager inhibition rules silence non‑critical alerts during controlled operations.
- Export pipeline telemetry (otelcol_processor_batch_batch_send_size) for capacity tuning.
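A minimal Alertmanager routing sketch for the prod-first pattern above (receiver names are hypothetical; each receiver would carry its notifier config, e.g. slack_configs or pagerduty_configs, in practice):

```yaml
# alertmanager.yml (excerpt)
route:
  receiver: non-prod-consolidated      # fallback for everything that is not prod
  routes:
    - matchers:
        - environment="prod"
      receiver: prod-oncall
      routes:
        - matchers:
            - severity="critical"
          receiver: prod-pager
receivers:
  - name: prod-oncall
  - name: prod-pager
  - name: non-prod-consolidated
```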
| Check | Target | Pass Criteria |
|-------|--------|---------------|
| External labels | Prometheus | cluster & environment present globally |
| Thanos sidecar | Metrics upload | Blocks appear in correct bucket prefix |
| Alert routing | Alertmanager | Test alert hits expected receiver (prod vs non‑prod) |
| ILM policy | Elasticsearch | Policy attached & rollover alias exists |
| Log alert engine | ElastAlert2 | Test rule fires & resolves |
| Trace ingestion | Jaeger / ES | Sampled spans visible with cluster label |
This design delivers a single logical pane of glass across metrics, logs, and traces while preserving per‑cluster blast‑radius isolation, GitOps‑driven consistency, and clear failure domains.
Adopt incrementally: start with metrics federation, then centralize logging, finally converge tracing onto shared Elasticsearch with ILM governance.