k8s

Building a Global Observability Platform (Metrics, Logs, Traces) Across Multi‑Cluster Kubernetes

A cohesive observability platform unifies metrics, logs, and traces so operators and developers can answer: Is it slow? Why is it slow? Where did it fail? This document describes an opinionated, production‑grade, multi‑cluster design using:

Metrics: HA Prometheus + Thanos (real‑time + long‑term + global aggregation)
Logs: Elastic Stack (Elasticsearch + Kibana + Filebeat) centralized with secure multi‑cluster shipping
Traces: Jaeger (distributed collection, centralized query) backed by the SAME Elasticsearch cluster (shared storage layer)
Secure transport everywhere via mTLS (Gateway API / gRPC / StoreAPI)

Goal: A single logical pane of glass (Grafana + Kibana + Jaeger Query UI) without tightly coupling clusters or sacrificing blast radius isolation.

Prerequisites (Assumed Running)

The design (and the steps below) assume the following platform components are already deployed and healthy in each relevant cluster (usually via your App‑of‑Apps / GitOps root):

Envoy Gateway (Gateway API) – for secure ingress (gRPC / HTTPS) endpoints used by Thanos Query (optional), log shipping (Filebeat -> Gateway), and potential Jaeger Query exposure.
cert-manager (+ reflector if you mirror certs) – for issuing TLS server certificates and mTLS client certs.
Sealed Secrets controller – for distributing encrypted credentials (object storage creds, ES auth, TLS private keys, client certs).
external-dns – to automate DNS records for exposed observability endpoints (Grafana, Kibana, Thanos Query, Jaeger Query, log ingest, etc.).

Ensure cluster trust roots / CA issuers are consistent (or mapped) so mTLS pairs validate correctly across clusters.

GitOps App-of-Apps Integration (Foundational)

This platform is bootstrapped by the existing Helm chart at charts/app-of-apps. That chart renders Argo CD Application objects for every core and observability component based on simple boolean enable flags and per‑environment value files (values.dev-01.yaml, values.stag-01.yaml, values.prod-01.yaml, values.ops-01.yaml).

Key Capabilities

Single helm install creates (or updates) all Argo CD Applications for a cluster
Per‑environment git revision selection (source.targetRevision) allowing progressive promotion (e.g. HEAD → staging → stable)
Toggle components on/off via <component>.enable
Uniform naming convention: <cluster.name>-<component> (e.g. dev-01-monitoring)
Centralized team/cluster labeling for governance & search (labels: team: ops, cluster: <cluster>)
Safe sync (auto CreateNamespace=true, validation, self‑heal)

Environment Value Files

Example (charts/app-of-apps/values.dev-01.yaml):

cluster:
  name: dev-01
  server: https://dev-01-k8s.endpoint.kubernetes
source:
  targetRevision: HEAD
  helm:
    valueFiles: values.dev-01.yaml
monitoring:
  enable: true
logging:
  enable: true
jaeger:
  enable: true
# ... other components enabled ...

Switching promotion simply changes source.targetRevision in values.prod-01.yaml (e.g. stable) while dev tracks HEAD.

Promotion Workflow (End-to-End)

Commit change to underlying component chart (e.g. new alert rule in charts/monitoring).
Bump chart version or just rely on git SHA (Argo CD detects diff) for dev (HEAD).
Validate in dev cluster dashboards.
Update values.stag-01.yaml source.targetRevision to new tag/branch → merge → Argo sync.
After soak, update prod environment file to same revision.

Observability Specific Value Alignment

Ensure these are consistent across all environment files:

monitoring chart external labels (cluster, environment) aligned with object store bucket naming.
logging chart index prefix or ILM policy references (if templated).
jaeger chart mode (collectors only vs collectors+query) differentiated by environment (ops cluster hosts query UI).

Why This Chart Matters

The app-of-apps chart abstracts: naming, repo URLs, finalizers, sync options, and per‑cluster targeting—removing repetitive Argo CD YAML boilerplate and letting platform engineers focus on chart evolution (monitoring, elastic-stack, jaeger, etc.) rather than Application plumbing.

Health / Drift Observability

Argo CD UI groups Applications by name prefix <cluster>-; quick visual scan shows which cluster component diverged. Label selectors (team=ops, cluster=dev-01) allow automated tooling (metrics export or notifications) to surface drift or sync lag per environment.

Getting Started: Deployment & Integration Steps

High‑level order minimizes rework and validates each layer before global federation.

0. Plan & Prepare

Choose environment naming convention (e.g. dev-01, stag-01, prod-01, ops-01).
Create (or confirm) per-cluster object storage buckets for Thanos (e.g. thanos-dev, thanos-prod) with lifecycle rules (downsample retention tiers).
Decide Elasticsearch topology (node counts, storage class, ILM policies, index prefixes: logs-*, jaeger-span-*).
Define DNS hostnames (e.g. thanos-query.example.com, kibana.example.com, logs-ingest.example.com, jaeger.example.com).
Generate (or configure cert-manager Issuers for) TLS certificates + client certs. For mTLS you will need:
- Server cert(s) for Gateway listeners
- Client cert/key per shipping component (Filebeat) and per Thanos Query (if cross-cluster gRPC)
- CA bundle for verification

1. Secrets & Credentials (All Clusters Where Needed)

Use Sealed Secrets (or secret management pipeline) to commit:

Thanos object storage credentials (service account JSON / access keys) – each cluster.
Elasticsearch credentials (user/password or API key) – clusters running Jaeger Collectors and Filebeat (if using ES auth besides mTLS).
TLS materials: server cert/key (Gateway) + client cert/key + CA bundle (per function: metrics, logs, traces).
Optional OAuth / SSO secrets (Jaeger Query, Grafana, Kibana if behind auth proxy).

Apply sealed secrets first so subsequent chart syncs mount them cleanly.

2. Deploy Logging (Ops Cluster First)

Sync elastic-stack chart in Ops cluster (ECK operator + ES + Kibana).
Verify ES green status & Kibana reachable (internal first).
Apply ILM policies & index templates (logs-*, jaeger-span-*).
(Optional) Expose Kibana via Gateway & confirm external-dns created DNS record.

3. Enable Log Shipping (Remote Clusters)

Deploy Filebeat DaemonSet (via elastic-stack values for remote mode or a lightweight shipper chart) in each non‑Ops cluster.
Provide mTLS client cert + CA secret references.
Point Filebeat output to Gateway hostname (e.g. logs-ingest.example.com:443).
Validate ingestion: indices for each environment appear (e.g. logs-dev-01-*).

4. Deploy Metrics Stack (All Clusters)

Sync monitoring chart in each cluster with prometheus.replicaCount=2, thanos.enabled=true.
Ensure sidecars start and initial blocks appear in that cluster’s object storage bucket (check bucket path or Thanos metrics: thanos_sidecar_uploads_total).
Deploy per‑cluster Thanos Compactor (chart values) – operates only on that cluster’s bucket.
Deploy per‑cluster Store Gateway.
Deploy per‑cluster (local) Thanos Query (optional external exposure; primarily for internal fan‑out to sidecar + store gateway).
Deploy a single Global Thanos Query (Ops) – secure via mTLS to the per‑cluster local Query endpoints ONLY (not directly to sidecars / store gateways).
Configure Grafana datasource pointing to Global Thanos Query endpoint.
Validate multi‑cluster view: query up{cluster="dev-01"} across clusters.

5. Deploy Tracing (Jaeger)

In Ops cluster: Deploy Jaeger chart with Query + Collectors (jaeger.query.enabled=true).
In each non‑Ops cluster: Deploy Collectors only (Query disabled).
Configure all Collectors to use Elasticsearch (host, credential secret, mTLS secret).
Optionally deploy Agents only where apps lack native OTLP.
Run Jaeger ES rollover / ILM init job (once) if not already executed.
Expose Jaeger Query via Gateway (TLS) and confirm DNS & cert.

6. Cross‑Cutting Validation

7. Hardening & Production Readiness

Add alert rules (PrometheusRule) for: sidecar upload failures, ES disk >80%, filebeat output errors, jaeger collector queue saturation.
Implement NetworkPolicies restricting access to Gateway ingest ports.
Enforce TLS minimum versions & strong cipher suites at Gateway.
Implement role‑based access: limit Kibana & Jaeger UI to SSO groups.
Configure Grafana folder structure + RBAC (dashboards per domain/team).

8. Ongoing Operations

1. Metrics Architecture (Prometheus + Thanos)

Each workload cluster runs: (a) an HA Prometheus pair (each replica with its own Sidecar), (b) a per‑cluster Compactor, (c) a per‑cluster Store Gateway, and (d) a per‑cluster local Thanos Query that federates ONLY in‑cluster components. A single Global Thanos Query (in the Ops cluster) fans out over mTLS to the local Query instances (one hop), rather than directly to every sidecar / store gateway. This reduces the number of cross‑cluster mTLS targets and cleanly encapsulates each cluster’s internal topology.

graph LR
  subgraph Dev[Dev Cluster]
    PDv[(Prometheus Dev R1/ R2)] --> SDv[Sidecars]
    SDv --> BDv[(GCS Bucket Dev)]
    CDv[Compactor Dev] --> BDv
    SGDv[Store GW Dev] --> BDv
    QDv[Local Query Dev] --> SGDv
    QDv --> SDv
  end
  subgraph Prod[Prod Cluster]
    PPr[(Prometheus Prod R1/ R2)] --> SPr[Sidecars]
    SPr --> BPr[(GCS Bucket Prod)]
    CPr[Compactor Prod] --> BPr
    SGPr[Store GW Prod] --> BPr
    QPr[Local Query Prod] --> SGPr
    QPr --> SPr
  end
  subgraph Stage[Stage Cluster]
    PSt[(Prometheus Stage R1/ R2)] --> SSt[Sidecars]
    SSt --> BSt[(GCS Bucket Stage)]
    CSt[Compactor Stage] --> BSt
    SGSt[Store GW Stage] --> BSt
    QSt[Local Query Stage] --> SGSt
    QSt --> SSt
  end
  subgraph Ops[Ops Cluster]
    PO[(Prometheus Ops R1/ R2)] --> SO[Sidecars]
    SO --> BO[(GCS Bucket Ops)]
    CO[Compactor Ops] --> BO
    SGO[Store GW Ops] --> BO
    QO[Local Query Ops] --> SGO
    QO --> SO
    GF[Grafana] --> QG[Global Thanos Query]
  end

  %% Global federation (single hop to local queries)
  QG -. mTLS StoreAPI .-> QDv
  QG -. mTLS StoreAPI .-> QPr
  QG -. mTLS StoreAPI .-> QSt
  QG -. mTLS StoreAPI .-> QO

Key Points:

Local Query per cluster encapsulates in‑cluster sidecars & store gateway; only one mTLS endpoint per cluster is exposed globally.
Global Query (QG) fans out to local queries, reducing connection surface vs. directly addressing every sidecar / store gateway.
Per‑cluster buckets isolate credentials, retention & lifecycle; compactors & store gateways never cross cluster boundaries.
Deduplication still functions (external labels must uniquely identify cluster and replica).
Grafana points ONLY to Global Query for a unified pane.

Legend: Local Query = in‑cluster fan‑out; Global Query = cross‑cluster aggregator; Sidecars = live + block shipping; Store GW = historical block index; Compactor = cluster‑scoped downsampling.

2. Logging Architecture (Elastic Stack Centralization)

Logs from all clusters flow to a single Elasticsearch domain in the Ops cluster. Filebeat (or other shippers) in remote clusters sends over mTLS through an Envoy Gateway. Kibana provides exploration & dashboarding; optional Cerebro for cluster internals.

flowchart LR
  subgraph Ops[Ops Cluster]
    GW[Envoy Gateway]
    ES[(Elasticsearch)]
    KB[Kibana]
    CBR[(Cerebro Optional)]
    KB --> ES
    CBR --> ES
  end

  subgraph Dev[Dev Cluster]
    FB_DEV[Filebeat DS]
  end
  subgraph Stag[Staging Cluster]
    FB_STAG[Filebeat DS]
  end
  subgraph Prod[Prod Cluster]
    FB_PROD[Filebeat DS]
  end

  FB_DEV -- mTLS --> GW
  FB_STAG -- mTLS --> GW
  FB_PROD -- mTLS --> GW
  GW --> ES

Key Points:

Only a controlled Gateway listener is exposed; ES internals stay inside Ops cluster.
Per‑cluster client certificates allow revocation & isolation.
ILM policies manage retention tiers (hot → warm → delete) controlling cost.

3. Tracing Architecture (Jaeger + Shared Elasticsearch)

Jaeger reuses the Elasticsearch cluster used for logs (operational & cost efficiency). Collectors are deployed in every cluster; the Query UI only in the Ops cluster. Agents are optional: prefer direct OTLP export when possible.

flowchart LR
  subgraph Ops[Ops Cluster]
    JQ[Jaeger Query]
    JC_O[Collectors Ops]
    ES_SHARED[(Elasticsearch Shared)]
  end
  subgraph Dev[Dev Cluster]
    JC_D[Collectors Dev]
  end
  subgraph Stag[Staging Cluster]
    JC_S[Collectors Staging]
  end
  subgraph Prod[Prod Cluster]
    JC_P[Collectors Prod]
  end

  JC_D --> ES_SHARED
  JC_S --> ES_SHARED
  JC_P --> ES_SHARED
  JC_O --> ES_SHARED
  JQ --> ES_SHARED

Key Points:

Single storage backend reduces operational overhead (one index lifecycle strategy for both logs & traces).
mTLS / credentials isolate ingest path (collectors authenticate to ES via sealed secrets + client certs).
Query UI can be fronted by OAuth proxy for SSO.

4. Unified Data Layer & Access Strategy

Sharing Elasticsearch for logs & traces decreases system sprawl but requires ILM planning (separate index prefixes: e.g., logs-*, jaeger-span-*).

5. Security & mTLS Overview

sequenceDiagram
  participant Q as Thanos Query
  participant SC as Sidecar (Cluster)
  participant FB as Filebeat
  participant GW as Envoy Gateway
  participant ES as Elasticsearch
  participant JC as Jaeger Collector
  participant JQ as Jaeger Query

  Q->>SC: StoreAPI (mTLS cert mutual verify)
  FB->>GW: HTTPS (mTLS client cert)
  GW->>ES: Internal TLS (cluster CA)
  JC->>ES: HTTPS (mTLS / credentials)
  JQ->>ES: HTTPS (search queries)

Principles:

Separate PKI roots for metrics vs log/trace pathways if revocation domains differ.
Short lifetime certs + automated renewal (cert-manager + reflector) for dynamic endpoints.
Least privilege: distinct ES users / API keys for collectors vs query components.

6. Operational Playbook (Highlights)

7. Value Overrides & Configuration Tips

Prometheus: replicaCount=2, anti‑affinity, remote write avoided (use Thanos instead).
Thanos: Per cluster: Sidecars (per replica), Compactor, Store Gateway, Local Query. Global: one aggregator Query (no compactor / store GW). Distinct object storage config secret per cluster (e.g., objstore-dev.yaml). External labels must include cluster + replica; ensure local Queries and global Query share consistent relabeling.
Bucket Layout: Keep per‑cluster bucket names or prefixes; enforce retention (object lifecycle) aligned with compliance; optional different downsampling windows per environment.
Elasticsearch: Distinct index templates: logs-*, jaeger-span-* with tuned shard counts.
Jaeger: Set es.use-ilm=true and consistent index-prefix per environment (e.g., jaeger-dev, jaeger-prod).
Certificates: Use wildcard + dedicated mTLS client certs (separate SealedSecrets) to narrow blast radius.

8. Failure Modes & Resilience

9. Drilldown Workflow Example

Alert fires (Grafana) for increased p99 latency.
Open related dashboard → Identify service & timeframe.
Pivot to Kibana with label filters (correlated trace ID if propagated).
Open Jaeger trace from log field to visualize critical path.
Identify downstream dependency slowness → escalate owning team.

10. Best Practices (Field-Proven Patterns)

These practices ensure consistency, resilience, and clean multi‑cluster operations.

10.1 Consistent Cluster & Environment Labeling

Use a single canonical cluster label (e.g. cluster) across Prometheus external labels, Loki / Elasticsearch index prefixes, and trace resource attributes (OTLP resource.attributes.cluster).
Pair with an environment label (dev, stage, prod, ops) for queries, routing, and alert fan‑out.
Helm convention example (global values):
```
global:
  labels:
    cluster: dev-01
    environment: dev
```
Propagate via chart templates (Prometheus externalLabels, Filebeat / Log shipper processors, Jaeger collector OTEL_RESOURCE_ATTRIBUTES).

10.2 Shared PrometheusRules, Environment-Specific Alertmanager Config

Keep one versioned set of PrometheusRule CRs applied identically in all environments (eliminates drift & “it only alerts in prod” surprises).
Special‑case notification routing per environment in Alertmanager config (e.g. non‑prod routes collapse to a single Slack / Teams channel; prod uses team‑specific receivers).
Pattern:
- Labels on alerts: severity=warning|critical, team=<owner>, environment=<from externalLabels>.
- Alertmanager route tree matches environment=prod first for granular receivers; fallback route for everything else points to a consolidated non‑prod receiver.
Benefits: identical detection logic; only delivery differs.

10.3 Per‑Cluster Alertmanager Instances

Run one Alertmanager pair (HA) per cluster (usually inside the monitoring chart) to avoid a single central SPOF halting alert delivery.
Optional: Federate notification dedup through upstream tools (PagerDuty, OpsGenie) instead of centralizing Alertmanager.
Keep configs declarative & environment‑scoped (Helm values or sealed secret for webhook URLs).

10.4 Thanos Object Storage Strategy

Isolation beats over‑aggregation: per‑cluster bucket (or prefix) reduces blast radius, compactor contention, and credential scope.
Global aggregation stays cheap: Global Query only needs mTLS to each local Query; buckets stay regional.
Regional bucket placement: create bucket in same region / zone family as the cluster to minimize sidecar upload latency & egress.
Capacity & compaction tuning per environment (e.g. shorter retention for dev; longer for prod) without cross‑impact.
Only aggregate buckets logically (via Global Query), not physically (no single mega bucket unless compliance mandates).

10.5 Log-Based Alerting (Elasticsearch)

Use ElastAlert2 (https://github.com/jertel/elastalert2) or OpenSearch Alerting equivalent for pattern / threshold log alerts not expressible easily as metrics.
Keep log alerts few & surgical—prefer transforming recurring log signals into metrics (Prometheus alertmanager_notifications_total, custom counters) where possible.
Version ElastAlert rule YAML in Git; inject cluster/environment filters automatically (filter: term: cluster: <cluster>).

10.6 Rollover & ILM Hygiene

Align ILM phases with expected search window: majority queries hit “hot” (SSD), occasional retro queries hit “warm”. Delete phase strictly > warm + hot aggregate period.
Verify rollover alias existence (e.g. filebeat with is_write_index: true) before ILM policy activation.
Alert on ILM errors (ES _ilm/explain) & shard counts per index to avoid oversharding (target 20–40GB primary size, not <1GB).

10.7 Metrics & Alert Noise Reduction

Apply recording rules for expensive cross‑cluster queries (e.g. record: job:http_request_error_rate:ratio) and alert off the recording rule.
Introduce a suppression label for known maintenance (maintenance=true)—Alertmanager inhibition rules silence non‑critical alerts during controlled operations.

10.8 Trace Sampling & Cost

Use consistent head sampling per environment (e.g. 10% dev, 5% stage, 1–5% prod) with adaptive sampling if high volume. Inject sampling decision early (sidecar / SDK) to avoid wasted export overhead.
Ensure sampled trace ratio is exposed as a metric (e.g. otelcol_processor_batch_batch_send_size) for capacity tuning.

10.9 Security & mTLS Reuse

Separate PKI roots if you need revocation domain boundaries (metrics vs logs) — otherwise a unified internal CA reduces operational overhead.
Rotate client certs automatically (short TTL) and alert on impending expiry (cert expiration metrics or custom job).

10.10 DR & Failure Isolation

Each cluster retains local Prometheus TSDB; if object storage is unavailable, only historical blocks temporarily regress.
Filebeat / log shippers buffer (disk) for brief ingest endpoint outages; alert before buffers overflow.
No alert delivery blackout if one cluster dies (per‑cluster Alertmanager model) — only that cluster’s new alerts cease.

10.11 Validation Checklist (Per New Cluster Onboarding)

10.12 Helm / Git Integration Tips

DRY via global values; per‑env overrides only for scale / retention.
Keep Alertmanager config Helm‑templated with environment conditionals for receiver blocks; avoid per‑env hand edits.
Validate policy & rule changes in dev (synthetic alerts) before promotion.

11. Summary

This design delivers:

High availability + horizontal scalability for metrics.
Centralized, secure log & trace storage with shared operational surface area.
Strong encryption-in-transit boundaries (mTLS everywhere).
Cost efficiency via object storage (metrics) + ILM (logs/traces).
Clear multi‑cluster federation without compromising isolation.

Adopt incrementally: start with metrics federation, then centralize logging, finally converge tracing onto shared Elasticsearch with ILM governance.

This site is open source. Improve this page.