These charts implement a “Platform in a Box”, a batteries‑included, GitOps driven foundation for operating a Kubernetes platform using the Argo CD App‑of‑Apps pattern. They compose the core traffic, security, observability, data, and enablement layers so teams can onboard applications quickly with consistent guardrails.
Introductory article: Bootstrapping a Production-Grade Kubernetes Platform, a narrative walkthrough of goals, architecture choices, and the bootstrap flow.
- kubectl configured with cluster access
- helm v3.8+
- kubeseal CLI (for sealed secrets)

Note: This repository focuses on platform components and assumes you have already provisioned your Kubernetes clusters and cloud infrastructure (VPCs, object storage, DNS zones, etc.) using tools like Terraform, Pulumi, or cloud provider CLIs.
We plan to add opinionated guides for Kubernetes-native infrastructure provisioning using Crossplane or similar tools in the future. For now, please provision your infrastructure using your preferred IaC tool before following the bootstrap steps below.
git clone <repo-url>
cd k8s
charts/app-of-apps/values.ops-01.yaml:

sealedSecrets:
  enable: true
certManager:
  enable: true
envoyGateway:
  enable: true
kubeseal --fetch-cert > pub-cert.pem
kubectl apply -f argocd-bootstrap-apps/ops-01.yaml
# Wait for Argo CD to sync (usually 1-2 minutes)
kubectl get applications -n argocd
# Check specific app status
argocd app get ops-01-app-of-apps
# Or via Argo CD UI
argocd app list
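With the controller certificate fetched above, application secrets can be sealed locally before committing them to Git; a minimal sketch (the secret name, namespace, and key are illustrative):

# Seal a secret offline using the fetched public certificate
kubectl create secret generic my-app-creds -n my-namespace \
  --from-literal=password=changeme --dry-run=client -o yaml \
  | kubeseal --cert pub-cert.pem --format yaml > my-app-creds-sealed.yaml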
# In values.ops-01.yaml or values.dev-01.yaml
monitoring:
  enable: true   # For metrics (Prometheus/Thanos)
logging:
  enable: true   # For centralized logs (Elasticsearch)
envoyGateway:
  enable: true   # For ingress (Gateway API)
kyverno:
  enable: true   # For policy enforcement
This top‑level document inventories charts, their relationships, and recommended installation / reconciliation order.
The docs/ folder contains deep‑dive guidance, reference architectures, and operational playbooks for major platform pillars. Use these for design decisions, hardening steps, and lifecycle operations beyond the high‑level overview in this README:
These documents evolve independently of this summary; always consult them first for implementation specifics, security hardening steps, and operational playbooks.
| Chart | Category | Purpose | Depends On / Cooperates With | Key Notes |
|-------|----------|---------|------------------------------|-----------|
| app-of-apps | GitOps Orchestration | Argo CD App‑of‑Apps root that defines Argo CD Application objects for platform components (monitoring, ingress, gateway, secrets, policies, data services, logging). | Argo CD CRDs present in cluster. Optionally Sealed Secrets controller if you enable secret management here. | Toggle components via values: sealedSecrets, ingressController, envoyGateway, externalDns, certManager, monitoring, kyverno, redis, valkey, logging, jaeger. |
| sealed-secrets | Secrets Management | Vendors upstream Bitnami Sealed Secrets controller and (optionally) renders shared/global sealed secrets. | Installed via app-of-apps (if sealedSecrets.enable=true). Consumed by charts needing encrypted creds (monitoring, external-dns, others). | Supports user‑defined controller key; global secrets only. |
| cert-manager | TLS & Certificates | Issues TLS certs via ACME (DNS‑01 GCP Cloud DNS example) and reflects cert Secrets cluster‑wide using reflector. | Sealed Secrets (for DNS svc acct), ExternalDNS (aligned DNS zones), consumers: envoy-gateway, logging, jaeger, ingress. | Upstream cert-manager + reflector; wildcard cert reuse via annotations. |
| envoy-gateway | Traffic Management & Routing | Deploys Envoy Gateway (Gateway API) plus custom GatewayClasses, Gateways, Routes, security & proxy policies. | Kubernetes >=1.27, optionally ExternalDNS & monitoring. | Deployed in every cluster (local ingress + policy attachment) but managed centrally. |
| external-dns | Traffic Management & Routing | Manages DNS records in Google Cloud DNS for Services & Gateway API (HTTPRoute/GRPCRoute). | GCP service account (sealed credentials), Gateway / Services to watch. | Multi‑domain filters, TXT registry, environment isolation. |
| monitoring | Observability: Metrics | Prometheus + Thanos components for HA metrics and global aggregation. | envoy-gateway (if gRPC exposure), Sealed Secrets, object storage. | Values control Thanos, replicas, routes, TLS. |
| nginx-ingress-controller | Traffic Management & Routing | Traditional NGINX ingress controller for legacy ingress use cases. | None (cluster only). | Prefer Gateway API for new services. |
| kyverno | Compliance & Policy | Upstream Kyverno + Policy Reporter + starter ops & security policies (Audit → Enforce). | Sealed Secrets (optional), monitoring (metrics). | Deployed in every cluster for local admission & policy enforcement; centrally versioned. |
| redis | Data Services (Shared) | Vendors upstream Bitnami Redis for cache/session workloads. | Sealed Secrets (auth), monitoring (metrics). | Enable auth & persistence in env overrides before production. Note: Bitnami is deprecating free images; consider migrating to valkey (drop-in Redis replacement). |
| valkey | Data Services (Shared) | High-performance Redis fork for cache/session workloads. Drop-in replacement for Redis with modern features. | Sealed Secrets (auth), monitoring (metrics). | Recommended alternative to Redis. Currently supports replication mode; Sentinel mode coming soon. Enable auth & persistence in env overrides before production. |
| logging | Observability: Logs | Centralized multi‑cluster logging (Elasticsearch + Kibana + Filebeat) using ECK operator & mTLS via Gateway. | envoy-gateway (ingest endpoint), Sealed Secrets (certs), eck-operator. | Deployed with Helm release name logging; ops cluster hosts ES/Kibana; other clusters ship via Filebeat. |
| jaeger | Observability: Tracing | Multi‑cluster tracing (collectors in all clusters, query UI only in Ops) storing spans in shared Elasticsearch (logging stack). | logging (Elasticsearch), Sealed Secrets (ES creds / TLS), optional Envoy Gateway (if exposing query). | Agents optional (apps can emit OTLP direct); uses upstream Jaeger chart. |
Argo CD runs in the ops cluster (namespace argocd) and serves as the central command & control plane for all platform components across every environment cluster. The ops cluster hosts:
- the platform component definitions (app-of-apps chart) for the ops cluster itself
- the bootstrap manifests in argocd-bootstrap-apps/ (e.g. ops-01.yaml) which create the root Application pointing at the app-of-apps Helm chart

From this root, Argo CD renders and manages per-cluster component Applications (monitoring, logging, jaeger, gateway, cert-manager, kyverno, etc.). Remote clusters are registered in Argo CD (cluster secrets) and targeted by generated Application specs; reconciliation therefore originates centrally while resources apply to their respective destination clusters.
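For orientation, a bootstrap root Application could look roughly like the sketch below; the repoURL is a placeholder and the field values follow common Argo CD conventions rather than the repository's actual manifest:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ops-01-app-of-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/your-org/k8s.git   # placeholder
    path: charts/app-of-apps
    targetRevision: stable                          # ops tracks the stable tag
    helm:
      valueFiles:
        - values.ops-01.yaml
  destination:
    server: https://kubernetes.default.svc          # the ops cluster itself
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true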
The Ops cluster functions as the global command center:
Workload Environment Clusters (Dev, QA/Stage, Prod) connect outward to the Ops control plane:
%% Top-down layout: Ops at top (horizontal internals), Prod | Stage | Dev row beneath
flowchart TB
classDef env fill:#f5f7fa,stroke:#cfd6dd,color:#111111;
classDef tel fill:#ffffff,stroke:#97a3ab,color:#222222,stroke-dasharray:3 3;
%% ENVIRONMENT CLUSTERS
subgraph PROD[Prod Cluster]
PRODAPP[Apps]
KYVPRD[Kyverno]
GWPRD[Envoy Gateway]
TELPRD[Telemetry Stack]
end
subgraph STAGE[Stage / QA Cluster]
STGAPP[Apps]
KYVSTG[Kyverno]
GWSTG[Envoy Gateway]
TELSTG[Telemetry Stack]
end
subgraph DEV[Dev Cluster]
DEVAPP[Apps]
KYVDEV[Kyverno]
GWDEV[Envoy Gateway]
TELDEV[Telemetry Stack]
end
%% OPS CLUSTER
subgraph OPS[Ops Cluster]
KYV[Kyverno]
GWOPS[Envoy Gateway]
ACD[Argo CD]
OBS[Observability Stack]
end
%% GitOps fan-out
ACD --> PROD
ACD --> STAGE
ACD --> DEV
class PROD,STAGE,DEV env;
class TELDEV,TELSTG,TELPRD tel;
When isolation requirements demand it (e.g., regulated vs. commercial workloads, or geographic latency), run multiple Ops clusters, each an island of control, optionally feeding a higher-level analytics layer.
flowchart LR
subgraph OPSA[Ops Cluster A]
ACD1[Argo CD A]
OBSA[Observability A]
end
subgraph OPSB[Ops Cluster B]
ACD2[Argo CD B]
OBSB[Observability B]
end
subgraph ENVA[Env Clusters A]
D1[Dev A]
S1[Stage A]
P1[Prod A]
end
subgraph ENVB[Env Clusters B]
D2[Dev B]
S2[Stage B]
P2[Prod B]
end
ACD1 --> D1
ACD1 --> S1
ACD1 --> P1
ACD2 --> D2
ACD2 --> S2
ACD2 --> P2
OBSA -. optional aggregated exports .- OBSB
Guidance
- Name each Ops cluster distinctly (e.g. ops-global, ops-regulated) and duplicate only the minimal control plane + observability roles.

Bootstrap flow:

1. Apply the bootstrap manifest (kubectl apply -f argocd-bootstrap-apps/ops-01.yaml in the ops cluster).
2. The root Application renders the app-of-apps chart with the appropriate values.<env>.yaml (e.g. values.ops-01.yaml).
3. This generates the component Application CRs for the ops cluster (and optionally for other clusters if multi-cluster definitions are embedded or generated via environment-specific invocations).
4. Add further clusters by applying their own bootstrap manifests (e.g. argocd-bootstrap-apps/dev-01.yaml) referencing the same repo but a different values file.

Result: a single Argo CD UI orchestrating desired state across all clusters; rollback, sync, health, and diff inspection are centralized.
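Registering an additional workload cluster with the ops Argo CD and bootstrapping it might look like this (the kubeconfig context name is illustrative):

# Register the remote cluster (creates a cluster secret in the argocd namespace)
argocd cluster add dev-01-context --name dev-01
# Apply its bootstrap root Application
kubectl apply -f argocd-bootstrap-apps/dev-01.yaml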
A consistent branching & promotion workflow governs both:
- the platform component charts (charts/) and
- the orchestration layer (app-of-apps and bootstrap manifests).

| Stage | Git Reference | Purpose | Typical Cluster(s) |
|-------|---------------|---------|--------------------|
| Development | dev branch | Fast iteration; immediate merges & validation | dev-01 |
| Staging | staging branch | Pre‑production soak / integration tests | stag-01 |
| Production | stable tag (cut from staging) | Immutable, audited release for production | prod-01 |
| Operations / Control | Follows stable (or a pinned tag per component) | Central GitOps control plane | ops-01 |
Optional: Create release/x.y branch off staging prior to tagging stable for more complex hardening windows.
Sequence Summary
1. Iterate on dev (fast commits).
2. staging branch created from a dev snapshot for soak.
3. Optional release/x.y branch split for hardening.
4. Fixes flow back to staging.
5. stable tag cut from the hardened branch (or staging) → production.

flowchart LR
D[dev branch\nFast iteration]
S[staging branch\nSoak & integration]
R[release/x.y branch optional]
P[Production stable tag]
D --> S
S -->|create tag stable| P
S --> R
S -->|merge fixes| R
R -->|tag stable| P
subgraph Clusters
C1[dev-01\ntracks dev]
C2[stag-01\ntracks staging]
C3[prod-01\ntracks stable]
C4[ops-01\ntracks stable]
end
D --> C1
S --> C2
P --> C3
P --> C4
Explanation:
- dev → auto-syncs dev-01.
- staging → syncs stag-01.
- release/x.y branch isolates final hardening; fixes flow back to staging.
- stable on a vetted commit (from release/x.y or staging) → prod-01 + ops-01 reconcile that immutable reference.
- Hotfixes: branch from stable (or release/x.y), apply the fix, retag stable, then forward-merge to staging & dev to avoid divergence.

Promotion in practice:

1. Commits to the dev branch auto-sync in dev-01.
2. Merge dev → staging (or cherry-pick) → Argo CD syncs stag-01 using that branch.
3. Cut or move the stable tag to the desired commit on staging.
4. prod-01 (and ops control plane components requiring production parity) track stable, ensuring deterministic deployments.

Environment values files (e.g. charts/app-of-apps/values.prod-01.yaml) set:
source.targetRevision: stable # prod
source.targetRevision: staging # stag-01
source.targetRevision: HEAD # dev-01 (or dev)
Adjust these to match your exact branch names (e.g., replace HEAD with dev if you prefer an explicit branch). The same targetRevision semantics apply to every enabled component because the app-of-apps chart propagates defaults unless overridden at a component block; per-component overrides set targetRevision inside that component's Application spec.
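With those revisions in place, promoting to production reduces to cutting (or moving) the stable tag; a sketch, assuming the tag is reused rather than versioned:

git fetch origin
git tag -f stable <vetted-staging-commit>   # cut or move the stable tag to the vetted commit
git push origin stable --force              # prod-01 and ops-01, tracking "stable", reconcile on next sync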
Argo CD in the ops cluster functions as the operational nerve center:
- a single view of every platform Application (following the <cluster>-<component> naming convention)
- promotion control through staging and stable tag creation

If an Argo CD control plane outage occurs, existing workloads keep running; only reconciliation pauses. Recovery: restore the ops cluster or the Argo CD deployment and, if necessary, reapply the bootstrap Application; state rehydrates from Git.
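Recovery after such an outage is therefore limited to re-establishing the control plane, for example:

# Verify what Argo CD still knows about, then re-create the root Application if it is missing
kubectl get applications -n argocd
kubectl apply -f argocd-bootstrap-apps/ops-01.yaml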
Each chart provides environment value files:
values.dev-01.yaml
values.stag-01.yaml
values.ops-01.yaml
values.prod-01.yaml
Use the matching file (or merge multiple with -f).
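For example, rendering the app-of-apps chart with an environment overlay plus an extra override file (my-overrides.yaml is illustrative; later -f files take precedence):

helm template app-of-apps charts/app-of-apps \
  -f charts/app-of-apps/values.dev-01.yaml \
  -f my-overrides.yaml   # optional local overrides (illustrative)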
Component toggles (values.yaml excerpt):

sealedSecrets.enable # sealed-secrets controller + global secrets
certManager.enable # cert-manager + reflector for certificate issuance
externalDns.enable # external-dns controller for DNS records
ingressController.enable # nginx ingress controller
envoyGateway.enable # envoy gateway platform ingress
monitoring.enable # monitoring stack (Prometheus/Thanos)
kyverno.enable # kyverno policies + reporter
redis.enable # redis data service (consider valkey instead)
valkey.enable # valkey data service (recommended Redis alternative)
logging.enable # elastic logging stack (Helm release name: logging)
jaeger.enable # distributed tracing (Jaeger collectors + optional query UI)
Each block also supplies:
- project: Argo CD Project name
- namespace: Target namespace for the component
- source.repoURL / path / targetRevision
- helm values (component-specific settings such as external-dns domains)

Bitnami is deprecating free container images, which affects charts that depend on Bitnami images (notably the redis chart). We recommend migrating to valkey, a drop-in replacement for Redis.
The valkey chart provides the same functionality as redis with improved long-term maintainability.
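For example, switching an environment from Redis to Valkey and filling in the per-block fields described above might look roughly like this (the project name, namespace, and repoURL are illustrative, and the block shape is an assumption based on the field list):

redis:
  enable: false
valkey:
  enable: true
  project: platform                                 # Argo CD Project name (illustrative)
  namespace: valkey
  source:
    repoURL: https://example.com/your-org/k8s.git   # placeholder
    path: charts/valkey
    targetRevision: stable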
helm template monitoring ./monitoring -f monitoring/values.dev-01.yaml | less
helm lint monitoring
A GitHub Actions workflow runs a chart scan matrix with three steps: lint, trivy, and checkov. Only charts changed in the current diff to origin/master are scanned (unless --all is used).
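A hedged sketch of such a matrix job (not the repository's actual workflow; step names follow scripts/scan.sh):

jobs:
  chart-scan:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        step: [lint, trivy, checkov]
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0                        # full history so the diff against origin/master works
      - name: Run ${{ matrix.step }}
        run: scripts/scan.sh ${{ matrix.step }}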
Script: scripts/scan.sh
Behavior highlights:

- Charts are rendered per values file; rendered output is checked for duplicate document separators (---\n---).
- Container images referenced by the rendered manifests are scanned with Trivy (results recorded in trivy-image-report.txt). Duplicate images across charts are scanned once (in-memory cache).

Outputs (under scan-output/):

- test-results/*.xml (JUnit per scan/image)
- ${STEP}-test-results.html (if xunit-viewer installed in CI)
- trivy-image-report.txt (only for trivy step; line: <image> <OK|FAIL|SKIPPED|CACHED>)
- scan-summary.txt (aggregate counts & failures)

Usage examples:

# Lint only changed charts
scripts/scan.sh lint
# Scan all charts with Trivy (include images not in diff)
scripts/scan.sh trivy --all
# Include HIGH severities too
TRIVY_SEVERITY=CRITICAL,HIGH scripts/scan.sh trivy
# Ignore unfixed vulnerabilities
TRIVY_IGNORE_UNFIXED=true scripts/scan.sh trivy
# Keep rendered manifests beside charts
KEEP_RENDERED=1 scripts/scan.sh lint
# Custom config location
CONFIG_FILE=my-scan-config.yaml scripts/scan.sh checkov
Skips are defined in scan-config.yaml (or file pointed to by CONFIG_FILE). Example:
trivy:
  skipcharts:
    - redis                                # skip all images for this chart
  skipimages:
    - ghcr.io/org/tooling-helper:latest    # skip a specific image fully
checkov:
  skipcharts:
    - logging                              # skip Checkov for this chart
Place the config at repo root (default path) or set CONFIG_FILE.
Per-chart overrides:

- .trivyignore inside a chart: standard Trivy ignore rules (CVE IDs etc.).
- .checkov.yaml inside a chart: merged with .globalcheckov.yaml (a local file may remove global check entries if they appear in skip-check).
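For example, a chart-local .checkov.yaml might look like this (the check ID is illustrative):

# <chart>/.checkov.yaml (merged with .globalcheckov.yaml by the scan script)
skip-check:
  - CKV_K8S_21   # illustrative: accept default-namespace findings for this chart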
| Variable | Purpose | Default |
|----------|---------|---------|
| TRIVY_SEVERITY | Severities to fail on (comma list) | CRITICAL |
| TRIVY_IGNORE_UNFIXED | Ignore vulns without fixes | false |
| CONFIG_FILE | Path to scan config | scan-config.yaml |
| KEEP_RENDERED | Keep rendered manifests (1/0) | 0 |
| CONCURRENCY | (Reserved) future parallelism | 4 |
| YQ_CMD | yq binary name | yq |
| OUTPUT_DIR | Output root | ./scan-output |
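These variables combine with the step arguments shown earlier, for example:

# Full scan, keep rendered manifests, write results to a custom location
OUTPUT_DIR=/tmp/scan-output KEEP_RENDERED=1 TRIVY_SEVERITY=CRITICAL,HIGH scripts/scan.sh trivy --all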
scan-summary.txt fields:
- failures (each entry formatted as Kind:Detail[:...])

Kinds emitted today:
- Render:<chart>:<values> (render failure)
- DoubleDoc:<chart>:<values> (duplicate document separator)
- Checkov:<chart>:<values> (Checkov non-zero)
- Trivy:<image>:<chart>:<values> (image vuln failure)
- Deps:<chart> (helm dependency build failed)
- NoChart:<chart> (missing Chart.yaml)

To skip an image flagged by Trivy, add it under trivy.skipimages in the config file and re-run the trivy step. The summary will reflect the Images skipped count; the image will appear with status SKIPPED in trivy-image-report.txt.
The Trivy JUnit output uses @./scripts/junit.tpl. Adjust this path if relocating the template.