This guide helps diagnose and resolve common issues when deploying and operating the Platform in a Box. Issues are organized by component and symptom.
Before diving into component-specific troubleshooting, follow these general steps:
# Check cluster access
kubectl cluster-info
kubectl get nodes
# Verify you're connected to the correct cluster
kubectl config current-context
# List all namespaces
kubectl get namespaces
# Check resources in a namespace
kubectl get all -n <namespace>
# Describe a resource for detailed status
kubectl describe <resource-type> <resource-name> -n <namespace>
# Get pod logs
kubectl logs <pod-name> -n <namespace>
# Follow logs in real-time
kubectl logs -f <pod-name> -n <namespace>
# Get logs from previous container instance (if restarted)
kubectl logs <pod-name> -n <namespace> --previous
# Get logs from all pods with a label
kubectl logs -l app=<label-value> -n <namespace>
# Get events for a namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Watch events in real-time
kubectl get events -n <namespace> --watch
# List all applications
kubectl get applications -n argocd
# Get detailed application status
kubectl describe application <app-name> -n argocd
# Check sync status
argocd app get <app-name>
# View application diff
argocd app diff <app-name>
Symptoms: Application shows Syncing status for an extended period
Diagnosis:
# Check application status
kubectl get application <app-name> -n argocd -o yaml
# Check Argo CD controller logs
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-application-controller --tail=100
# Check repo server logs
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-repo-server --tail=100
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Repository authentication failure | Verify repo credentials: argocd repo get <repo-url> |
| Helm dependency not updated | Run helm dependency update in chart directory |
| Resource conflicts (namespace/CRDs) | Check for existing resources: kubectl get <resource> -A |
| Sync wave ordering issue | Verify sync wave annotations are correct |
| Large manifest size | Check repo server memory limits |
| Git repository unreachable | Verify network connectivity and repository URL |
Actions:
# Force refresh application
argocd app get <app-name> --refresh
# Hard refresh (clears cache)
argocd app get <app-name> --hard-refresh
# Retry sync
argocd app sync <app-name>
# Check repository connection
argocd repo get <repo-url>
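If the table above points to a sync wave ordering issue: Argo CD applies resources in ascending wave order based on the argocd.argoproj.io/sync-wave annotation (unset means wave 0). A quick check of the wave assigned to a resource, with the resource type and name as placeholders:
# Inspect the sync wave annotation on a resource (empty output means the default wave 0)
kubectl get <resource-type> <resource-name> -n <namespace> \
  -o jsonpath='{.metadata.annotations.argocd\.argoproj\.io/sync-wave}'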
Symptoms: Application status is Unknown or Degraded
Diagnosis:
# Get application conditions
kubectl get application <app-name> -n argocd -o jsonpath='{.status.conditions}'
# Check health status
kubectl get application <app-name> -n argocd -o jsonpath='{.status.health}'
# View full application spec
kubectl get application <app-name> -n argocd -o yaml
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Destination cluster unreachable | Verify cluster connectivity: kubectl get --raw /healthz |
| Namespace doesn’t exist | Enable CreateNamespace=true in sync options |
| Resource validation failed | Check resource YAML for syntax errors |
| CRDs not installed | Install required CRDs before application sync |
| RBAC permissions insufficient | Verify service account has required permissions |
Actions:
# Check destination cluster
kubectl get --raw /healthz --context=<cluster-context>
# Create namespace manually if needed
kubectl create namespace <namespace>
# Check RBAC
kubectl auth can-i create deployments --namespace=<namespace>
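For the missing-namespace case above, a minimal sketch of an Application with CreateNamespace=true in syncOptions; names, repo URL, and path are placeholders, and in a GitOps workflow this manifest would normally live in the platform's Git repository rather than being applied by hand:
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: <app-name>
  namespace: argocd
spec:
  project: default
  source:
    repoURL: <repo-url>
    targetRevision: HEAD
    path: <chart-path>
  destination:
    server: https://kubernetes.default.svc
    namespace: <namespace>
  syncPolicy:
    syncOptions:
    - CreateNamespace=true      # create the destination namespace during sync
EOF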
Symptoms: Argo CD cannot fetch from Git repository
Diagnosis:
# List repositories
argocd repo list
# Get repository details
argocd repo get <repo-url>
# Test repository connection
argocd repo get <repo-url> --refresh
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| SSH key not configured | Add SSH key: argocd repo add <repo-url> --ssh-private-key-path ~/.ssh/id_rsa |
| HTTPS credentials missing | Add credentials: argocd repo add <repo-url> --username <user> --password <pass> |
| Repository URL incorrect | Verify repository URL format |
| Network/firewall blocking | Check network connectivity to Git provider |
Actions:
# Add SSH repository
argocd repo add git@github.com:org/repo.git \
--ssh-private-key-path ~/.ssh/id_rsa
# Add HTTPS repository
argocd repo add https://github.com/org/repo.git \
--username <user> \
--password <token>
# Remove and re-add repository
argocd repo rm <repo-url>
argocd repo add <repo-url> [credentials]
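Repository credentials can also be managed declaratively as a Secret labelled argocd.argoproj.io/secret-type: repository. A minimal HTTPS example with placeholder values:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: <repo-secret-name>
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository   # Argo CD discovers repos via this label
stringData:
  type: git
  url: https://github.com/org/repo.git
  username: <user>
  password: <token>
EOF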
Symptoms: SealedSecret exists but Secret is not created
Diagnosis:
# Check sealed secret status
kubectl get sealedsecrets -A
kubectl describe sealedsecret <name> -n <namespace>
# Check if secret exists
kubectl get secret <name> -n <namespace>
# Check controller logs
kubectl logs -n sealed-secrets -l app.kubernetes.io/name=sealed-secrets --tail=50
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Wrong scope (namespace vs cluster-wide) | Reseal with correct scope: kubeseal --scope=namespace-wide --namespace=<ns> |
| Controller not running | Check pod status: kubectl get pods -n sealed-secrets |
| Wrong controller key | Verify using correct sealing key for this cluster |
| Namespace mismatch | Ensure SealedSecret is in correct namespace |
Actions:
# Verify controller is running
kubectl get pods -n sealed-secrets
# Check controller logs for errors
kubectl logs -n sealed-secrets -l app.kubernetes.io/name=sealed-secrets
# Reseal with correct scope
kubeseal --cert=pub-cert.pem \
--scope=namespace-wide \
--namespace=<target-namespace> \
--format=yaml < secret.yaml > sealed-secret.yaml
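To confirm which scope a SealedSecret was sealed with, inspect its annotations: kubeseal records sealedsecrets.bitnami.com/namespace-wide: "true" or sealedsecrets.bitnami.com/cluster-wide: "true" for the non-default scopes, and the absence of both implies the default strict scope:
# Inspect scope annotations on the SealedSecret
kubectl get sealedsecret <name> -n <namespace> -o jsonpath='{.metadata.annotations}'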
Symptoms: Controller logs show decryption errors
Diagnosis:
# Check controller logs
kubectl logs -n sealed-secrets -l app.kubernetes.io/name=sealed-secrets | grep -i decrypt
# Verify active key
kubectl get secrets -n sealed-secrets -l sealedsecrets.bitnami.com/sealed-secrets-key=active
# Check sealed secret encryption
kubectl get sealedsecret <name> -n <namespace> -o yaml
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Wrong controller key (rotated/missing) | Restore original key or reseal with new cert |
| Multiple active keys | Ensure only one key has sealed-secrets-key=active label |
| Key corruption | Restore key from backup |
Actions:
# List all keys
kubectl get secrets -n sealed-secrets -l sealedsecrets.bitnami.com/sealed-secrets-key
# Verify active key
kubectl get secret -n sealed-secrets -l sealedsecrets.bitnami.com/sealed-secrets-key=active
# If key rotated, reseal secrets with new cert
kubeseal --fetch-cert > new-pub-cert.pem
kubeseal --cert=new-pub-cert.pem < secret.yaml > sealed-secret.yaml
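Before rotating keys or resealing, it is worth backing up the active sealing key so existing SealedSecrets remain decryptable. Store the export securely, since it contains the private key:
# Back up the active sealing key
kubectl get secret -n sealed-secrets \
  -l sealedsecrets.bitnami.com/sealed-secrets-key=active \
  -o yaml > sealed-secrets-key-backup.yaml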
Symptoms: Certificate stuck in Pending or shows Failed status
Diagnosis:
# Check certificate status
kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>
# Check certificate orders
kubectl get orders -A
kubectl describe order <order-name> -n <namespace>
# Check challenges
kubectl get challenges -A
kubectl describe challenge <challenge-name> -n <namespace>
# Check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=100
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| DNS provider credentials incorrect | Verify secret exists and contains correct credentials |
| DNS propagation delay | Wait for DNS TTL period, check with dig |
| ACME rate limits | Use Let’s Encrypt staging endpoint for testing |
| ClusterIssuer not configured | Verify ClusterIssuer exists and is ready |
| DNS zone permissions insufficient | Check IAM/service account permissions |
| Network connectivity to ACME | Verify egress to Let’s Encrypt API |
Actions:
# Verify DNS credentials secret
kubectl get secret <dns-credentials-secret> -n cert-manager
kubectl describe secret <dns-credentials-secret> -n cert-manager
# Check DNS challenge record
dig _acme-challenge.<domain> TXT
# Verify ClusterIssuer
kubectl get clusterissuers
kubectl describe clusterissuer <issuer-name>
# Force certificate renewal (requires the cmctl CLI)
cmctl renew <cert-name> -n <namespace>
# Check cert-manager controller logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager | grep -i error
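To avoid ACME rate limits while debugging, issue test certificates from a ClusterIssuer that points at the Let's Encrypt staging endpoint. The sketch below assumes a Google Cloud DNS DNS-01 solver with placeholder project and secret names; adjust the solver block to the DNS provider actually in use:
kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # Staging endpoint: untrusted certs, but far higher rate limits
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: <email>
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
    - dns01:
        cloudDNS:
          project: <gcp-project-id>
          serviceAccountSecretRef:
            name: <dns-credentials-secret>
            key: key.json
EOF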
Symptoms: Alert or warning about certificate expiration
Diagnosis:
# Check certificate expiration
kubectl get certificates -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.notAfter}{"\n"}{end}'
# Describe certificate
kubectl describe certificate <cert-name> -n <namespace>
Actions:
# Force renewal
kubectl delete secret <cert-secret-name> -n <namespace>
# Cert-manager will automatically renew
# Or trigger renewal with the cmctl CLI
cmctl renew <cert-name> -n <namespace>
Symptoms: Gateway status shows NotReady or listeners not accepting traffic
Diagnosis:
# Check gateway status
kubectl get gateways -A
kubectl describe gateway <gateway-name> -n <namespace>
# Check gateway controller logs
kubectl logs -n envoy-gateway-system -l app.kubernetes.io/name=envoy-gateway --tail=100
# Check envoy proxy pods
kubectl get pods -n envoy-gateway-system
kubectl logs -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-name=<gateway-name>
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Load balancer not provisioned | Check cloud provider LB status |
| TLS certificate not ready | Verify certificate exists and is valid |
| Port conflicts | Check for port conflicts with other services |
| Resource constraints | Verify node resources and pod limits |
| Network policies blocking | Check NetworkPolicy rules |
Actions:
# Check gateway listeners
kubectl get gateway <gateway-name> -n <namespace> -o yaml | grep -A 10 listeners
# Verify service and endpoints
kubectl get svc -n envoy-gateway-system
kubectl get endpoints -n envoy-gateway-system
# Check load balancer status
kubectl get svc -n envoy-gateway-system -o jsonpath='{.items[*].status.loadBalancer}'
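For reference, a minimal HTTPS Gateway sketch with a single listener. The class name, hostname, and secret name are placeholders, and the referenced TLS secret must exist in the Gateway's namespace (or be granted cross-namespace access via a ReferenceGrant):
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: <gateway-name>
  namespace: <namespace>
spec:
  gatewayClassName: <gateway-class>   # the Envoy Gateway class installed in the cluster
  listeners:
  - name: https
    port: 443
    protocol: HTTPS
    hostname: "*.<domain>"
    tls:
      mode: Terminate
      certificateRefs:
      - name: <tls-secret-name>
EOF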
Symptoms: HTTPRoute exists but traffic not reaching backend
Diagnosis:
# Check HTTPRoute status
kubectl get httproute -A
kubectl describe httproute <route-name> -n <namespace>
# Check gateway status
kubectl get gateway <gateway-name> -n <namespace>
# Check backend service
kubectl get svc <backend-service> -n <namespace>
kubectl get endpoints <backend-service> -n <namespace>
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| HTTPRoute not attached to Gateway | Verify parentRefs in HTTPRoute |
| Backend service has no endpoints | Check service selector matches pod labels |
| Port mismatch | Verify port in backendRefs matches service port |
| Hostname mismatch | Ensure hostname in HTTPRoute matches Gateway listener |
| Backend service not ready | Check backend pods are running and ready |
Actions:
# Verify HTTPRoute parent reference
kubectl get httproute <route-name> -n <namespace> -o yaml | grep -A 5 parentRefs
# Check backend service endpoints
kubectl get endpoints <backend-service> -n <namespace>
# Test backend directly
kubectl port-forward svc/<backend-service> -n <namespace> <port>:<port>
# Check envoy proxy configuration via the Envoy admin interface (default port 19000)
kubectl port-forward -n envoy-gateway-system <envoy-pod> 19000:19000
curl -s http://localhost:19000/config_dump
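A minimal HTTPRoute sketch showing the pieces the table above checks: parentRefs pointing at the Gateway, a hostname that matches a Gateway listener, and backendRefs matching the Service name and port. All values are placeholders:
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: <route-name>
  namespace: <namespace>
spec:
  parentRefs:
  - name: <gateway-name>
    namespace: <gateway-namespace>
  hostnames:
  - <app>.<domain>
  rules:
  - backendRefs:
    - name: <backend-service>
      port: <port>          # must match the Service port, not the container port
EOF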
Symptoms: Services/HTTPRoutes annotated but DNS records not appearing
Diagnosis:
# Check external-dns logs
kubectl logs -n external-dns -l app.kubernetes.io/name=external-dns --tail=100
# Verify external-dns is running
kubectl get pods -n external-dns
# Check DNS resolution from inside the cluster (the external-dns image may not ship debug tools,
# so use a throwaway pod instead of exec'ing into the controller)
kubectl run dns-check --image=busybox --rm -it --restart=Never -- nslookup <domain>
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| DNS provider credentials incorrect | Verify secret contains correct credentials |
| Domain filter mismatch | Check domainFilters in external-dns config |
| Insufficient IAM permissions | Verify service account has DNS write permissions |
| TXT registry conflict | Check for conflicting TXT records |
| Annotation missing or incorrect | Verify external-dns.alpha.kubernetes.io/hostname annotation |
Actions:
# Verify DNS credentials
kubectl get secret <dns-credentials> -n external-dns
# Check domain filters
kubectl get configmap -n external-dns external-dns -o yaml | grep domainFilters
# Verify annotations on service/route
kubectl get svc <service-name> -n <namespace> -o yaml | grep external-dns
kubectl get httproute <route-name> -n <namespace> -o yaml | grep external-dns
# Check DNS provider directly
# For Google Cloud DNS:
gcloud dns record-sets list --zone=<zone-name>
# For Route53:
aws route53 list-resource-record-sets --hosted-zone-id=<zone-id>
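If the annotation is missing, adding it to the Service (or HTTPRoute) is enough for external-dns to pick the record up on its next sync, provided the hostname falls inside the configured domain filters:
# Annotate a Service with the desired DNS name
kubectl annotate svc <service-name> -n <namespace> \
  external-dns.alpha.kubernetes.io/hostname=<app>.<domain>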
Symptoms: DNS records exist but point to old IPs
Diagnosis:
# Check external-dns sync status
kubectl logs -n external-dns -l app.kubernetes.io/name=external-dns | grep -i sync
# Verify current desired state
kubectl get svc -A -o jsonpath='{range .items[*]}{.metadata.annotations.external-dns\.alpha\.kubernetes\.io/hostname}{"\n"}{end}'
Actions:
# Force external-dns to reconcile
kubectl delete pod -n external-dns -l app.kubernetes.io/name=external-dns
# Check DNS TTL (may need to wait)
dig <hostname> +short
Symptoms: No metrics appearing in Prometheus
Diagnosis:
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Visit http://localhost:9090/targets
# Check Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus --tail=100
# Verify ServiceMonitor resources
kubectl get servicemonitors -A
kubectl describe servicemonitor <name> -n <namespace>
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| ServiceMonitor not created | Create ServiceMonitor for service |
| Service annotations missing | Add prometheus.io/scrape: "true" annotation |
| Network policies blocking | Check NetworkPolicy allows Prometheus access |
| Service endpoints empty | Verify service has ready endpoints |
| Scrape config incorrect | Check Prometheus scrape configuration |
Actions:
# Check Prometheus configuration
kubectl get configmap -n monitoring prometheus -o yaml
# Verify ServiceMonitor selector
kubectl get servicemonitor <name> -n <namespace> -o yaml | grep -A 5 selector
# Check service has endpoints
kubectl get endpoints <service-name> -n <namespace>
# Test scrape manually
kubectl port-forward svc/<service-name> -n <namespace> 8080:8080
curl http://localhost:8080/metrics
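A minimal ServiceMonitor sketch for the missing-target case: the selector must match the Service labels, and port must reference a named port on the Service. The release label shown is only an example of matching the Prometheus serviceMonitorSelector; check what your Prometheus instance actually selects on:
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <name>
  namespace: <namespace>
  labels:
    release: prometheus            # illustrative; must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: <service-label>         # must match the labels on the Service, not the pods
  endpoints:
  - port: <metrics-port-name>      # named port on the Service
    path: /metrics
    interval: 30s
EOF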
Symptoms: No blocks appearing in object storage bucket
Diagnosis:
# Check sidecar logs
kubectl logs -n monitoring -l app=prometheus -c thanos-sidecar --tail=100
# Check sidecar metrics (the sidecar exposes its own metrics endpoint, port 10902 by default)
kubectl port-forward -n monitoring <prometheus-pod> 10902:10902
curl -s http://localhost:10902/metrics | grep thanos_sidecar
# Verify object storage credentials
kubectl get secret <thanos-credentials> -n monitoring
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Object storage credentials incorrect | Verify secret contains correct credentials |
| Bucket permissions insufficient | Check IAM/service account has write permissions |
| Network egress blocked | Verify cluster can reach object storage API |
| Sidecar configuration error | Check sidecar args and environment variables |
| Bucket doesn’t exist | Create bucket before deployment |
Actions:
# Verify object storage secret
kubectl get secret <thanos-credentials> -n monitoring -o yaml
# Check sidecar configuration
kubectl get statefulset prometheus -n monitoring -o yaml | grep -A 20 thanos-sidecar
# Test object storage connectivity
kubectl exec -n monitoring <prometheus-pod> -c thanos-sidecar -- \
thanos tools bucket verify --objstore.config-file=/etc/thanos/objstore.yaml
# Check bucket contents (GCS example)
gsutil ls gs://<bucket-name>/
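For reference, a minimal GCS object storage configuration for the sidecar. The bucket and secret names are placeholders, and credentials are assumed to be provided separately (for example via workload identity or a mounted service account key):
# Create the objstore config and store it in a secret the sidecar can mount
cat > objstore.yaml <<'EOF'
type: GCS
config:
  bucket: <bucket-name>
EOF
kubectl create secret generic <thanos-objstore-secret> -n monitoring --from-file=objstore.yaml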
Symptoms: Global Thanos Query missing metrics from some clusters
Diagnosis:
# Check Thanos Query configuration
kubectl get deployment thanos-query -n monitoring -o yaml
# Check Query logs
kubectl logs -n monitoring -l app.kubernetes.io/name=thanos-query --tail=100
# Verify Store API endpoints
kubectl get svc -A | grep thanos-query
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Local Query endpoints not configured | Verify --store flags point to local queries |
| mTLS certificate issues | Check client certs are valid and trusted |
| Network connectivity | Verify Query can reach local Query endpoints |
| External labels mismatch | Ensure cluster label is consistent |
Actions:
# Check Query store endpoints
kubectl get deployment thanos-query -n monitoring -o yaml | grep -A 5 -- '--store'
# Test connectivity to local queries
kubectl exec -n monitoring <thanos-query-pod> -- \
wget -O- https://<local-query-endpoint>/api/v1/stores
# Verify external labels
kubectl get prometheus -n monitoring -o yaml | grep externalLabels
Symptoms: Cluster status shows RED or YELLOW
Diagnosis:
# Check Elasticsearch status
kubectl get elasticsearch -n logging
kubectl describe elasticsearch <name> -n logging
# Check cluster health via API (the elastic user's password is in the ECK-generated secret)
kubectl get secret elasticsearch-es-elastic-user -n logging -o jsonpath='{.data.elastic}' | base64 -d
kubectl port-forward -n logging svc/elasticsearch-es-http 9200:9200
curl -k -u elastic:<password> "https://localhost:9200/_cluster/health?pretty"
# Check pod status
kubectl get pods -n logging -l elasticsearch.k8s.elastic.co/cluster-name
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Insufficient nodes | Scale up node count or adjust replicas |
| Disk space full | Check PVC usage: kubectl get pvc -n logging |
| Shard allocation issues | Check: curl .../_cluster/allocation/explain |
| Node failures | Check pod status and events |
| Resource constraints | Verify CPU/memory requests and limits |
Actions:
# Check cluster allocation explanation
curl -k -u elastic:<password> "https://localhost:9200/_cluster/allocation/explain?pretty"
# Check disk usage
kubectl get pvc -n logging
kubectl describe pvc <pvc-name> -n logging
# Check node status
curl -k -u elastic:<password> "https://localhost:9200/_cat/nodes?v"
# Check indices status
curl -k -u elastic:<password> "https://localhost:9200/_cat/indices?v"
Symptoms: No logs appearing in Elasticsearch indices
Diagnosis:
# Check Filebeat pods
kubectl get pods -n logging -l app=filebeat
# Check Filebeat logs
kubectl logs -n logging -l app=filebeat --tail=100
# Verify Filebeat configuration
kubectl get configmap filebeat -n logging -o yaml
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Elasticsearch endpoint unreachable | Verify ES service and network policies |
| mTLS certificate issues | Check client certs are valid |
| Filebeat configuration error | Verify output.elasticsearch settings |
| Network policies blocking | Check NetworkPolicy allows egress to ES |
| Index template missing | Create index template in Elasticsearch |
Actions:
# Check Filebeat output configuration
kubectl get configmap filebeat -n logging -o yaml | grep -A 10 output
# Test Elasticsearch connectivity
kubectl exec -n logging <filebeat-pod> -- \
curl -k https://elasticsearch-es-http.logging:9200
# Verify mTLS certificates
kubectl get secret <filebeat-cert> -n logging
Symptoms: Cannot access Kibana UI
Diagnosis:
# Check Kibana pod status
kubectl get pods -n logging -l kibana.k8s.elastic.co/name
# Check Kibana logs
kubectl logs -n logging -l kibana.k8s.elastic.co/name --tail=100
# Verify HTTPRoute/Gateway
kubectl get httproute -n logging
kubectl get gateway -n logging
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| HTTPRoute not configured | Create HTTPRoute for Kibana service |
| Gateway not ready | Check Gateway status |
| TLS certificate issues | Verify certificate is valid |
| Elasticsearch connection failed | Check Kibana can reach Elasticsearch |
Actions:
# Check Kibana service
kubectl get svc -n logging | grep kibana
# Verify HTTPRoute
kubectl get httproute kibana -n logging -o yaml
# Test Kibana endpoint
kubectl port-forward -n logging svc/kibana-kb-http 5601:5601
# Visit http://localhost:5601
Symptoms: Policy exists but not enforcing or auditing
Diagnosis:
# Check policy status
kubectl get clusterpolicies
kubectl get policies -A
kubectl describe clusterpolicy <name>
# Check Kyverno logs
kubectl logs -n kyverno -l app.kubernetes.io/name=kyverno --tail=100
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Policy enabled: false in values | Set enabled: true in chart values |
| Policy file conditional mismatch | Verify Helm template condition matches values |
| Kyverno controller not running | Check pod status: kubectl get pods -n kyverno |
| Policy validation errors | Check policy YAML syntax |
Actions:
# Verify policy is enabled
kubectl get clusterpolicy <name> -o yaml | grep -A 5 spec
# Check Kyverno admission controller
kubectl get validatingwebhookconfigurations | grep kyverno
kubectl get mutatingwebhookconfigurations | grep kyverno
# Test policy manually
kubectl apply -f <test-resource.yaml>
Symptoms: Valid workloads rejected by Kyverno
Diagnosis:
# Check policy reports
kubectl get policyreports -A
kubectl describe policyreport <name> -n <namespace>
# Check admission request logs
kubectl logs -n kyverno -l app.kubernetes.io/name=kyverno | grep -i deny
# Verify policy mode
kubectl get clusterpolicy <name> -o jsonpath='{.spec.rules[*].validate.failureAction}'
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Policy switched to Enforce prematurely | Revert to Audit mode or add exception |
| Exception label missing | Add required exception label to resource |
| Policy rule too restrictive | Review and adjust policy rules |
| Namespace not excluded | Add namespace to policy exclusions |
Actions:
# Check policy failure action
kubectl get clusterpolicy <name> -o yaml | grep failureAction
# Add exception label
kubectl label <resource> policy.exception/<rule>=approved
# Temporarily disable policy
kubectl patch clusterpolicy <name> --type=merge -p '{"spec":{"validationFailureAction":"Audit"}}'
Symptoms: Pods cannot connect to other services in cluster
Diagnosis:
# Test connectivity from pod
kubectl exec -n <namespace> <pod-name> -- curl <service-name>.<namespace>.svc.cluster.local
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>
# Verify DNS resolution
kubectl exec -n <namespace> <pod-name> -- nslookup <service-name>
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Network policies blocking | Check NetworkPolicy rules |
| Service selector mismatch | Verify service selector matches pod labels |
| DNS issues | Check CoreDNS pods and configuration |
| CNI plugin issues | Verify CNI pods are running |
Actions:
# Check NetworkPolicies
kubectl get networkpolicies -A
kubectl describe networkpolicy <name> -n <namespace>
# Verify service selector
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector
kubectl get pods -n <namespace> -l <selector>
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
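If a NetworkPolicy turns out to be the blocker, a minimal ingress-allow sketch between two labelled workloads looks like the following; names and labels are placeholders, so adjust them to the selectors actually in use:
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: <allow-policy-name>
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      app: <backend-app-label>      # pods this policy applies to
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: <frontend-app-label> # pods allowed to connect
EOF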
Symptoms: Cannot reach external services from pods
Diagnosis:
# Test external connectivity
kubectl run test-pod --image=busybox --rm -it -- wget -O- https://www.google.com
# Check egress network policies
kubectl get networkpolicies -A | grep -i egress
# Verify node network
kubectl get nodes -o wide
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Egress network policies blocking | Review and adjust NetworkPolicy egress rules |
| Node network issues | Check node network configuration |
| Firewall rules | Verify cloud provider firewall rules |
| NAT gateway issues | Check NAT gateway configuration |
Symptoms: PersistentVolumeClaim stuck in Pending state
Diagnosis:
# Check PVC status
kubectl get pvc -A
kubectl describe pvc <name> -n <namespace>
# Check available storage classes
kubectl get storageclass
# Check PVs
kubectl get pv
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| No storage class specified | Set storageClassName in PVC |
| Storage class doesn’t exist | Create StorageClass or use default |
| Insufficient storage | Check available storage in cluster |
| Provisioner not working | Check storage provisioner pods |
Actions:
# Check storage classes
kubectl get storageclass
# Verify default storage class
kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}'
# Check provisioner logs
kubectl logs -n <provisioner-namespace> -l app=<provisioner>
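For the missing storage class case, a minimal PVC sketch that names the class explicitly; name, namespace, class, and size are placeholders:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <pvc-name>
  namespace: <namespace>
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: <storage-class>   # omit only if a default StorageClass exists
  resources:
    requests:
      storage: 10Gi
EOF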
Symptoms: Pods evicted or cannot write data
Diagnosis:
# Check node resource usage and disk pressure conditions
kubectl top nodes
kubectl describe nodes | grep -i diskpressure
# Check PVC usage
kubectl get pvc -A
df -h # on nodes
# Check pod eviction events
kubectl get events -A | grep -i evict
Actions:
# Expand PVC (if supported)
kubectl patch pvc <name> -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'
# Clean up unused images
# On nodes: docker system prune -a
# Delete unused PVCs
kubectl get pvc -A | grep Released
kubectl delete pvc <name> -n <namespace>
Symptoms: Pods throttled or OOMKilled
Diagnosis:
# Check resource usage
kubectl top pods -A
kubectl top nodes
# Check resource limits
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'
# Check for OOMKilled
kubectl get pods -A | grep -i oom
Actions:
# Adjust resource requests/limits
kubectl set resources deployment <name> -n <namespace> \
--requests=cpu=500m,memory=512Mi \
--limits=cpu=1000m,memory=1Gi
# Scale horizontally
kubectl scale deployment <name> -n <namespace> --replicas=<count>
Symptoms: Argo CD applications take a long time to sync
Diagnosis:
# Check sync duration
argocd app get <app-name> | grep Sync
# Check repo server performance
kubectl top pods -n argocd -l app.kubernetes.io/name=argocd-repo-server
# Check application controller queue
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-application-controller | grep queue
Actions:
# Scale repo server
kubectl scale deployment argocd-repo-server -n argocd --replicas=<count>
# Increase repo server resources
kubectl set resources deployment argocd-repo-server -n argocd \
--requests=memory=2Gi,cpu=1000m \
--limits=memory=4Gi,cpu=2000m
If you cannot resolve an issue using this guide, consult the component-specific documentation in charts/<component>/README.md and the platform guides under docs/. The quick command reference below covers the most common checks.
# Argo CD
argocd app list
argocd app get <app-name>
argocd app sync <app-name>
argocd app diff <app-name>
# Kubernetes
kubectl get all -n <namespace>
kubectl describe <resource> <name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Debugging
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
kubectl port-forward svc/<service> -n <namespace> <local-port>:<service-port>
kubectl run debug-pod --image=busybox --rm -it -- /bin/sh