Elasticsearch Best Practices (Kubernetes Platform)

Opinionated guidance for operating a production‑grade Elasticsearch (ES) deployment on Kubernetes using the Elastic Operator / Helm chart.


1. Cluster Architecture & Node Roles

Use distinct node sets (separate StatefulSets) for each role (master, data-hot, data-warm, data-cold, ingest, coordinating) to isolate resource contention and scale each tier independently.

Do NOT mix master and data roles on the same Pod at production scale: recoveries and GC pauses on data nodes can trigger master instability.

Label all Pods with es.node.role=<master|data-hot|data-warm|data-cold|ingest|coord> and use node selectors / affinities:

# ECK-style fragment; count, resources and volumeClaimTemplates omitted
nodeSets:
  - name: master
    config:
      node.roles: ["master"]
  - name: data-hot
    config:
      node.roles: ["data_hot", "data_content"]
  - name: ingest
    config:
      node.roles: ["ingest"]

Enable anti‑affinity for master and data nodes to spread them across failure domains (hosts and availability zones).
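A minimal sketch of the anti-affinity rule for the master nodeSet, assuming the ECK podTemplate layout and the es.node.role label described above. The zone topology key and the choice of a hard (required) rule are assumptions; with fewer zones than master Pods, a preferred rule or the kubernetes.io/hostname topology key may fit better.

```yaml
nodeSets:
  - name: master
    podTemplate:
      metadata:
        labels:
          es.node.role: master          # label used by the selector below
      spec:
        affinity:
          podAntiAffinity:
            # Hard rule: never schedule two master Pods in the same zone.
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    es.node.role: master
                topologyKey: topology.kubernetes.io/zone
```

The same block, with the label value changed, could be repeated for each data nodeSet.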


2. Secure External Access (Gateway API + mTLS)

External access to Elasticsearch (from outside the cluster boundary) must use mTLS.

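Gateway API example (a sketch; gatewayClassName, the listener port and the route name are placeholders):

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: es-gw
spec:
  gatewayClassName: example-class   # placeholder: use your controller's class
  listeners:
    - name: https
      protocol: HTTPS
      port: 443                     # placeholder listener port
      tls:
        mode: Terminate
        certificateRefs:
          - name: es-public-cert    # TLS Secret holding the public certificate
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: es-route                    # placeholder name
spec:
  parentRefs:
    - name: es-gw
  rules:
    - matches:
        - path: {type: PathPrefix, value: /}
      backendRefs:
        - name: elasticsearch-coord
          port: 9200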

Add an mTLS policy extension (custom filter / AuthZ) that requires a client certificate signed by the internal CA.


3. Resource Requests, Limits & JVM Heap Alignment

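Every nodeSet must define CPU and memory requests and limits. Set the JVM heap with Xms equal to Xmx, sized as a fraction of Pod memory (Elastic's general guidance is no more than 50% of container memory, staying below the compressed-oops threshold of roughly 31 GB).

Validate the configured heap with GET /_nodes/jvm and monitor heap usage (GET /_nodes/stats/jvm) against container memory.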
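As an illustration of the heap alignment, the fragment below pins requests and limits to the same values and sets the heap to half of the container memory. It assumes the ECK podTemplate layout; the 8Gi / 4g sizes and the CPU count are illustrative, not recommendations.

```yaml
nodeSets:
  - name: data-hot
    podTemplate:
      spec:
        containers:
          - name: elasticsearch
            resources:
              requests:
                cpu: "2"
                memory: 8Gi
              limits:
                cpu: "2"
                memory: 8Gi          # requests == limits avoids memory overcommit
            env:
              - name: ES_JAVA_OPTS
                value: "-Xms4g -Xmx4g"   # 50% of the 8Gi limit, Xms == Xmx
```

Recent Elasticsearch versions size the heap automatically from the container memory limit when no heap flags are set, so the explicit ES_JAVA_OPTS entry is only needed to override that behavior.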


4. JVM Heap Guidance


5. Index Lifecycle Management (ILM)

Always enable ILM for retention and tiering.

Monitor ILM execution and backlog with GET <index>/_ilm/explain, and adjust transition thresholds before shard counts explode.
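A sketch of a tiered retention policy, written as the request body for PUT _ilm/policy/<policy-name>. All thresholds and phase timings are illustrative assumptions. The body is shown in YAML for consistency with the rest of this guide; convert it to JSON before sending, or send it with Content-Type: application/yaml if your Elasticsearch version accepts YAML bodies.

```yaml
policy:
  phases:
    hot:
      actions:
        rollover:
          max_primary_shard_size: 50gb   # roll over before shards grow too large
          max_age: 7d
    warm:
      min_age: 7d
      actions:
        forcemerge:
          max_num_segments: 1
        set_priority:
          priority: 50
    cold:
      min_age: 30d
      actions:
        set_priority:
          priority: 0
    delete:
      min_age: 90d
      actions:
        delete: {}
```

With data tier roles in place, ILM moves indices to the warm and cold nodes on entering those phases, so no explicit allocation rules appear in this sketch.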


6. JVM & Performance Tuning (Advanced)

For heavier workloads, extend ES_JAVA_OPTS with additional JVM flags.

Do NOT blindly copy large GC thread counts from bare‑metal guidance; respect the container CPU quota.
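To make the CPU-quota point concrete, here is a hedged sketch that keeps GC thread counts within the container CPU limit. It assumes the ECK podTemplate layout and the default G1 collector; the thread counts are illustrative. Current JVMs are container-aware and usually pick sensible values on their own, so explicit flags like these are only worth setting when measurements show a problem.

```yaml
podTemplate:
  spec:
    containers:
      - name: elasticsearch
        resources:
          limits:
            cpu: "4"
        env:
          - name: ES_JAVA_OPTS
            # GC threads stay at or below the CPU limit (4), not the host core count.
            value: "-XX:ParallelGCThreads=4 -XX:ConcGCThreads=1"
```

If heap flags are also passed through ES_JAVA_OPTS, they belong in the same value string, since the variable is defined once per container.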


7. Shard & Index Strategy


8. Operational Settings (Recovery & Allocation)

For large clusters or many indices, tune allocation and recovery parameters.

Balance faster recovery against query latency, and monitor disk I/O saturation while tuning.
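The settings most often involved can be sketched as a request body for PUT _cluster/settings. The values are illustrative assumptions, not recommendations; raise them gradually while watching disk and network saturation. As a YAML body this needs converting to JSON, or a Content-Type: application/yaml header where supported.

```yaml
persistent:
  # Concurrent shard recoveries per node (default 2)
  cluster.routing.allocation.node_concurrent_recoveries: 4
  # Per-node recovery bandwidth cap (default 40mb)
  indices.recovery.max_bytes_per_sec: 100mb
  # Concurrent shard rebalances across the cluster (default 2)
  cluster.routing.allocation.cluster_concurrent_rebalance: 2
```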


9. ILM Poll Interval Tuning

If the cluster has thousands of indices or an ILM backlog is observed, tune the ILM poll interval (indices.lifecycle.poll_interval).
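A minimal sketch of the change, again as a body for PUT _cluster/settings. The 5m value is an assumption for the backlog case: a shorter interval makes ILM evaluate policies more often at the cost of extra work on the master, while a longer interval than the 10m default reduces that load on clusters with very many indices.

```yaml
persistent:
  indices.lifecycle.poll_interval: 5m   # default is 10m; illustrative value
```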


10. Monitoring & Alerting

Collect metrics:

Alerts:

Use Prometheus exporters or native Elastic monitoring; route alerts via Alertmanager or ElastAlert/Kibana watchers.
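As one possible shape for the alerting side, the Prometheus rule file below covers cluster health and heap pressure. The metric names assume the community elasticsearch_exporter; thresholds, durations and severities are illustrative and should be checked against the metrics your exporter actually exposes.

```yaml
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster health is RED"
      - alert: ElasticsearchHeapHigh
        expr: |
          elasticsearch_jvm_memory_used_bytes{area="heap"}
            / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch JVM heap usage above 85%"
```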


11. Security & Compliance


12. Backup & Disaster Recovery


13. Common Pitfalls / Anti‑Patterns


Reference Commands