Building and Optimizing Elasticsearch on Kubernetes: A Logging Stack Homelab Journey


Guide Overview

Guide: Building and Optimizing Elasticsearch on Kubernetes: A Logging Stack Homelab Journey
Category: Monitoring / Infrastructure Optimization
Difficulty: Easy
Estimated Time: 1-2 hours for initial deployment; the optimization steps follow a multi-week monitoring period
Cost: Free (open-source tools)
Note: This guide documents a complete Elasticsearch deployment from initial setup through optimization, with all configurations tested and validated in a live homelab environment.


What You'll Build

This guide walks through deploying a complete Elasticsearch logging stack in Kubernetes, then documents the real-world problems that emerged during operation and how they were systematically resolved.

By the end of this guide, you'll have:

  • A working Elasticsearch, Kibana, and Filebeat deployment on Kubernetes
  • Ready-to-go configurations for single-node homelab environments
  • Solutions for common operational problems (crash loops, resource sizing, index management)
  • A methodology for using AI tools to accelerate troubleshooting
  • Tested configurations handling 40+ million documents across multiple namespaces

Brief Background on the ELK Stack

The ELK Stack (Elasticsearch, Logstash, Kibana) is a popular open-source solution for centralized logging and log analysis. In modern implementations, Filebeat often replaces Logstash as a lighter-weight log shipper, a variant sometimes called the "EFK" stack.

Components:

  • Elasticsearch: A distributed search and analytics engine that stores and indexes log data
  • Filebeat: A lightweight log shipper that collects logs from various sources and forwards them to Elasticsearch
  • Kibana: A visualization and exploration tool that provides a web interface for querying and visualizing data stored in Elasticsearch

This stack is particularly valuable in Kubernetes environments where logs from hundreds of pods across multiple nodes need to be aggregated, searched, and analyzed from a single location.


Use Cases for the ELK Stack

When to Use ELK/EFK

Ideal for:

  • Complex log analysis - Full-text search across all log fields with powerful query language
  • Multi-source aggregation - Collecting logs from diverse sources (Kubernetes, applications, infrastructure)
  • Historical analysis - Investigating issues that occurred hours or days ago with fast search
  • Security monitoring - Correlating security events across multiple systems
  • Compliance - Retaining logs for audit requirements with search capabilities
  • Dashboarding - Creating visualizations and dashboards in Kibana for team visibility

Homelab use cases:

  • Learning Kubernetes logging patterns
  • Troubleshooting application deployments
  • Monitoring security tools and experiments
  • Centralizing logs from multiple namespaces/projects
  • Building skills relevant to enterprise environments

Production use cases:

  • Application performance monitoring (APM)
  • Security Information and Event Management (SIEM)
  • Infrastructure monitoring and alerting
  • Customer support troubleshooting
  • Business analytics from application logs

When NOT to Use ELK

Consider alternatives if:

Resource-constrained environments:

  • ELK requires significant resources (4GB+ RAM for Elasticsearch alone)
  • Alternative: Grafana Loki uses ~80% fewer resources for simple log aggregation
  • Trade-off: Loki is slower for full-text search but much lighter

Simple log viewing:

  • If you just need to "tail" recent logs without complex queries
  • Alternative: kubectl logs or simple file-based logging
  • Trade-off: No centralization or historical search

Managed solutions preferred:

  • If you don't want to maintain infrastructure
  • Alternative: CloudWatch, Datadog, New Relic, Splunk Cloud
  • Trade-off: Ongoing costs vs. self-hosted maintenance burden

Metric-focused monitoring:

  • ELK is optimized for logs, not time-series metrics
  • Alternative: Prometheus + Grafana for metrics
  • Note: Many deployments use both (ELK for logs, Prometheus for metrics)

ELK vs Alternatives Comparison

Feature            ELK/EFK              Loki                  Splunk       CloudWatch
Cost               Free (self-hosted)   Free (self-hosted)    Paid         Pay-per-use
Resource Usage     High (4GB+)          Low (~500MB)          High         N/A (managed)
Full-Text Search   Excellent            Slower                Excellent    Good
Complexity         Moderate             Low                   Moderate     Low
Best For           Complex queries      Simple aggregation    Enterprise   AWS workloads

Prerequisites

Required:

  • Running Kubernetes cluster (RKE, k3s, or any distribution)
  • At least one node with 4GB+ RAM available for Elasticsearch
  • Basic kubectl and Helm knowledge
  • Understanding of Kubernetes concepts (StatefulSets, DaemonSets, Deployments, ConfigMaps)

Helpful but Optional:

  • Familiarity with Elasticsearch concepts
  • Experience with log aggregation systems
  • Access to AI tools (Claude, ChatGPT) for troubleshooting assistance

Architecture Overview

This deployment creates a centralized logging pipeline for Kubernetes:

┌─────────────────────────────────────────────────────────────┐
│             Kubernetes Cluster                              │
│                                                             │
│  ┌──────────┐  ┌──────────┐   ┌──────────┐                  │
│  │   Pod    │  │   Pod    │   │   Pod    │                  │
│  │  Logs    │  │  Logs    │   │  Logs    │                  │
│  └────┬─────┘  └────┬─────┘   └────┬─────┘                  │
│       │             │              │                        │
│       └─────────────┴──────────────┘                        │
│                     │                                       │
│              ┌──────▼───────┐                               │
│              │   Filebeat   │   (DaemonSet on each node)    │
│              │  Log Shipper │                               │
│              └──────┬───────┘                               │
│                     │                                       │
│              ┌──────▼────────┐                              │
│              │ Elasticsearch │  (StatefulSet, single node)  │
│              │   Storage &   │                              │
│              │    Indexing   │                              │
│              └──────┬────────┘                              │
│                     │                                       │
│              ┌──────▼────────┐                              │
│              │    Kibana     │  (Deployment)                │
│              │ Visualization │                              │
│              └───────────────┘                              │
└─────────────────────────────────────────────────────────────┘

Data Flow:

  1. Filebeat runs on each node, tailing container logs
  2. Logs are shipped to Elasticsearch for indexing and storage
  3. Kibana provides a web UI for searching and visualizing logs

Hardware/Software Requirements

Minimum Requirements

  • Cluster: Kubernetes cluster with at least one node
  • Node Resources: 4GB+ RAM available for Elasticsearch
  • Storage: Calculate based on your log volume (see formula below)

Software Versions

  • Kubernetes: 1.24+ (any distribution)
  • Elasticsearch: 8.5.1
  • Kibana: 8.5.1
  • Filebeat: 8.5.1
  • Helm: 3.x

Storage Sizing Calculator

Formula:

Required Storage = (Daily Log Volume in GB × Retention Days) × 1.5

The 1.5 multiplier accounts for:

  • Elasticsearch overhead (indices, mappings, shards): ~20%
  • Growth buffer: ~30%

Example Calculation (this deployment):

  • Daily log volume: ~300MB
  • Retention period: 30 days
  • Calculation: (0.3 GB × 30) × 1.5 = 13.5GB
  • Allocated: 50Gi (provides nearly 4x headroom for growth)

How to measure your daily log volume:

# After running 24 hours, check index sizes:
kubectl exec elasticsearch-master-0 -n elastic-stack -- \
  curl -s -k -u "elastic:PASSWORD" \
  "https://localhost:9200/_cat/indices?v&s=store.size:desc"

# Sum the sizes of indices from the last 24 hours
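
If you'd rather script the math, here's a minimal sketch of the storage formula above (DAILY_GB and RETENTION_DAYS are placeholders; substitute your measured values):

# Storage estimate: (daily GB x retention days) x 1.5
DAILY_GB=0.3        # replace with your measured daily volume
RETENTION_DAYS=30   # replace with your retention target
awk -v d="$DAILY_GB" -v r="$RETENTION_DAYS" \
  'BEGIN { printf "Required storage: %.1f GB\n", d * r * 1.5 }'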

Quick Reference:

  • Small homelab (1-5 namespaces, low traffic): 20-30Gi
  • Medium homelab (5-15 namespaces): 50-100Gi
  • Large homelab (15+ namespaces, high traffic): 100-200Gi

Final Resource Usage (this deployment):

  • Elasticsearch: 4GB RAM, 1 CPU
  • Kibana: 2GB RAM, 1 CPU
  • Filebeat: 100-200MB RAM per node
  • Storage: 12GB used of 50Gi allocated (300MB/day × 30 days with overhead)

Quick Reference

This section provides essential commands and information for managing your Elasticsearch deployment.

Important Endpoints and Ports

  • Elasticsearch: https://elasticsearch-master.elastic-stack.svc:9200
  • Kibana: http://kibana-kibana.elastic-stack.svc:5601
  • Namespace: elastic-stack

Default Credentials

# Get Elasticsearch password
kubectl get secret elasticsearch-master-credentials \
  -n elastic-stack -o jsonpath='{.data.password}' | base64 -d

# Default username: elastic

Common kubectl Commands

Check pod status:

# All elastic-stack pods
kubectl get pods -n elastic-stack

# Watch for changes
kubectl get pods -n elastic-stack -w

# Check specific component
kubectl get pods -n elastic-stack -l app=elasticsearch-master
kubectl get pods -n elastic-stack -l app=kibana-kibana
kubectl get pods -n elastic-stack -l app=filebeat-filebeat

Check resource usage:

# Pod resource consumption
kubectl top pods -n elastic-stack

# Node resource consumption
kubectl top nodes

View logs:

# Elasticsearch logs
kubectl logs elasticsearch-master-0 -n elastic-stack

# Kibana logs
kubectl logs deployment/kibana-kibana -n elastic-stack

# Filebeat logs (specific pod)
kubectl logs <filebeat-pod-name> -n elastic-stack

# Follow logs in real-time
kubectl logs -f elasticsearch-master-0 -n elastic-stack

Access services locally:

# Port-forward Elasticsearch
kubectl port-forward svc/elasticsearch-master 9200:9200 -n elastic-stack

# Port-forward Kibana
kubectl port-forward svc/kibana-kibana 5601:5601 -n elastic-stack

# Then access:
# Elasticsearch: https://localhost:9200
# Kibana: http://localhost:5601

Elasticsearch API Commands

Cluster health:

# Get password first
ELASTIC_PASSWORD=$(kubectl get secret elasticsearch-master-credentials \
  -n elastic-stack -o jsonpath='{.data.password}' | base64 -d)

# Check cluster health
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cluster/health?pretty"

# Cluster stats
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cluster/stats?pretty"

Index management:

# List all indices
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/indices?v"

# List indices sorted by size
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/indices?v&s=store.size:desc"

# List indices sorted by creation date
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/indices?v&s=creation.date:desc"

# Delete specific index
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  -X DELETE "https://localhost:9200/index-name"
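
For bulk cleanup, a sketch like the following deletes logs-* indices older than 30 days. It assumes GNU date; dry-run it first by commenting out the DELETE call, since deletions are irreversible:

# List logs-* indices with creation time (epoch millis), delete those past the cutoff
CUTOFF=$(date -d '30 days ago' +%s)   # GNU date syntax
curl -s -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/indices/logs-*?h=index,creation.date" |
while read -r INDEX CREATED_MS; do
  if [ $((CREATED_MS / 1000)) -lt "$CUTOFF" ]; then
    echo "Deleting ${INDEX}"
    curl -s -k -u "elastic:${ELASTIC_PASSWORD}" \
      -X DELETE "https://localhost:9200/${INDEX}"
  fi
done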

Node information:

# Node stats
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/nodes?v"

# JVM memory usage
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_nodes/stats/jvm?pretty"

Helm Commands

Check current deployments:

# List Helm releases in namespace
helm list -n elastic-stack

# Get values for a release
helm get values elasticsearch -n elastic-stack
helm get values kibana -n elastic-stack
helm get values filebeat -n elastic-stack

Upgrade deployments:

# Upgrade Elasticsearch
helm upgrade elasticsearch elastic/elasticsearch \
  -f elasticsearch-values.yaml \
  -n elastic-stack

# Upgrade Kibana
helm upgrade kibana elastic/kibana \
  -f kibana-values.yaml \
  -n elastic-stack

# Upgrade Filebeat
helm upgrade filebeat elastic/filebeat \
  -f filebeat-values.yaml \
  -n elastic-stack

Rollback if needed:

# See release history
helm history elasticsearch -n elastic-stack

# Rollback to previous version
helm rollback elasticsearch -n elastic-stack

# Rollback to specific revision
helm rollback elasticsearch 2 -n elastic-stack

Quick Troubleshooting Commands

Pod won't start:

# Describe pod to see events
kubectl describe pod <pod-name> -n elastic-stack

# Check PVC status
kubectl get pvc -n elastic-stack

# Check node resources
kubectl top nodes

Performance issues:

# Check resource usage
kubectl top pods -n elastic-stack

# Check Elasticsearch heap usage
kubectl exec elasticsearch-master-0 -n elastic-stack -- \
  curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_nodes/stats/jvm?pretty"

# Check index count (too many can slow things down)
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/indices?v" | wc -l

Restart components:

# Restart Elasticsearch (StatefulSet - restarts in order)
kubectl rollout restart statefulset/elasticsearch-master -n elastic-stack

# Restart Kibana
kubectl rollout restart deployment/kibana-kibana -n elastic-stack

# Restart Filebeat
kubectl rollout restart daemonset/filebeat-filebeat -n elastic-stack

Configuration File Locations

  • Local values files: elasticsearch-values.yaml, kibana-values.yaml, filebeat-values.yaml
  • See the References and Additional Resources section at the end of this guide for GitHub repository links

Elastic Deployment

Overview

Deploy Elasticsearch as a StatefulSet with persistent storage. This initial deployment uses conservative resource estimates that we'll adjust later based on actual usage. The values file below deploys Elasticsearch in single-node mode.

Instructions

1. Create namespace

kubectl create namespace elastic-stack

2. Add Elastic Helm repository

helm repo add elastic https://helm.elastic.co
helm repo update

3. Create Elasticsearch values file

Create elasticsearch-values.yaml:

---
clusterName: "elasticsearch"
nodeGroup: "master"

# Elasticsearch roles for single-node deployment
roles:
  - master
  - data
  - data_content
  - data_hot
  - data_warm
  - data_cold
  - ingest
  - ml
  - remote_cluster_client
  - transform

replicas: 1
minimumMasterNodes: 1

# Initial resource allocation (conservative estimate)
# NOTE: This will be increased later based on actual usage
resources:
  requests:
    cpu: "1000m"
    memory: "2Gi"  # INITIAL - will increase to 4Gi after monitoring
  limits:
    cpu: "1000m"
    memory: "2Gi"  # INITIAL - will increase to 4Gi after monitoring

esJavaOpts: "-Xmx1g -Xms1g"  # INITIAL - 50% of 2Gi, will adjust later

# Persistent storage
# Use the storage calculator in the requirements section to determine your needs
# Formula: (Daily Log Volume GB × Retention Days) × 1.5
volumeClaimTemplate:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi  # Adjust based on your log volume

# Security settings
secret:
  enabled: true
  password: "elastic"  # Change in production

# Single-node optimization
antiAffinity: "hard"

protocol: https
httpPort: 9200
transportPort: 9300

# Security hardening
podSecurityContext:
  fsGroup: 1000
  runAsUser: 1000

securityContext:
  capabilities:
    drop:
      - ALL
  runAsNonRoot: true
  runAsUser: 1000

sysctlInitContainer:
  enabled: true

sysctlVmMaxMapCount: 262144

readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 5

clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s"

4. Deploy Elasticsearch

helm install elasticsearch elastic/elasticsearch \
  -f elasticsearch-values.yaml \
  -n elastic-stack

5. Wait for Elasticsearch to be ready

kubectl get pods -n elastic-stack -w

Expected output:

NAME                     READY   STATUS    RESTARTS   AGE
elasticsearch-master-0   1/1     Running   0          3m
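
If you'd rather block until the pod is ready than watch interactively, kubectl wait does the same thing:

# Block until the Elasticsearch pod reports Ready (up to 10 minutes)
kubectl wait --for=condition=Ready pod/elasticsearch-master-0 \
  -n elastic-stack --timeout=600s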

Verification

Check Elasticsearch health:

# Get the password
ELASTIC_PASSWORD=$(kubectl get secret elasticsearch-master-credentials \
  -n elastic-stack -o jsonpath='{.data.password}' | base64 -d)

# Port-forward to access Elasticsearch
kubectl port-forward svc/elasticsearch-master 9200:9200 -n elastic-stack &

# Check cluster health
curl -k -u "elastic:${ELASTIC_PASSWORD}" https://localhost:9200/_cluster/health?pretty

Expected response:

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1
}

Note: Yellow status is expected for single-node clusters (no replicas possible).


Kibana Deployment

Overview

Kibana provides the web interface for querying and visualizing logs stored in Elasticsearch.

Instructions

1. Create Kibana values file

Create kibana-values.yaml:

---
image: "docker.elastic.co/kibana/kibana"
imageTag: "8.5.1"

replicas: 1

# Resource allocation
resources:
  requests:
    memory: "1Gi"
    cpu: "1000m"
  limits:
    memory: "2Gi"
    cpu: "1000m"

# Elasticsearch connection
elasticsearchHosts: "https://elasticsearch-master:9200"

# Environment variables
extraEnvs:
  - name: "ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES"
    value: "/usr/share/kibana/config/certs/ca.crt"
  - name: "SERVER_HOST"
    value: "0.0.0.0"
  - name: "NODE_OPTIONS"
    value: "--max-old-space-size=1800"

secretMounts:
  - name: elasticsearch-certs
    secretName: elasticsearch-master-certs
    path: /usr/share/kibana/config/certs
    readOnly: true

# Security context
securityContext:
  capabilities:
    drop:
      - ALL
  runAsNonRoot: true
  runAsUser: 1000

podSecurityContext:
  fsGroup: 1000

# Service configuration
service:
  type: ClusterIP
  port: 5601

2. Create Kibana service account token

Create a service account token for Kibana to authenticate with Elasticsearch:

# Create a service account token in Elasticsearch
kubectl exec elasticsearch-master-0 -n elastic-stack -- \
  /usr/share/elasticsearch/bin/elasticsearch-service-tokens create \
  elastic/kibana kibana-token > /tmp/kibana-token.txt

# Extract the token value (4th field in output)
TOKEN=$(awk '{print $4}' /tmp/kibana-token.txt)

# Create Kubernetes secret with the token
kubectl create secret generic kibana-kibana-es-token \
  -n elastic-stack \
  --from-literal=token="${TOKEN}"

# Clean up temp file
rm /tmp/kibana-token.txt

3. Deploy Kibana

helm install kibana elastic/kibana \
  -f kibana-values.yaml \
  -n elastic-stack

4. Wait for Kibana to be ready

kubectl get pods -n elastic-stack -w

Verification

Access Kibana:

kubectl port-forward svc/kibana-kibana 5601:5601 -n elastic-stack

Navigate to http://localhost:5601 in your browser. Login with:

  • Username: elastic
  • Password: (value from ELASTIC_PASSWORD variable)
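
As an optional command-line check (with the port-forward still running), Kibana's status API should report "available"; depending on your security settings it may require the elastic credentials:

# Query Kibana's status API through the port-forward (jq used for filtering)
curl -s -u "elastic:${ELASTIC_PASSWORD}" http://localhost:5601/api/status \
  | jq '.status.overall.level'

# Expected output: "available"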

Filebeat Deployment

Overview

Filebeat runs as a DaemonSet on each node, collecting container logs and shipping them to Elasticsearch.

Instructions

1. Create Filebeat values file

Create filebeat-values.yaml:

---
image: "docker.elastic.co/beats/filebeat"
imageTag: "8.5.1"
imagePullPolicy: "IfNotPresent"

# DaemonSet configuration
daemonset:
  enabled: true

deployment:
  enabled: false

# Cluster role for reading pod metadata
clusterRoleRules:
- apiGroups:
  - ""
  resources:
  - namespaces
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - replicasets
  verbs:
  - get
  - list
  - watch

# Resource limits
resources:
  requests:
    cpu: "100m"
    memory: "100Mi"
  limits:
    cpu: "200m"
    memory: "200Mi"

# Environment variables
extraEnvs:
  - name: ELASTICSEARCH_PASSWORD
    valueFrom:
      secretKeyRef:
        name: elasticsearch-master-credentials
        key: password

# Volume mounts for certificates
secretMounts:
  - name: elasticsearch-master-certs
    secretName: elasticsearch-master-certs
    path: /usr/share/filebeat/certs/
  - name: elasticsearch-credentials
    secretName: elasticsearch-master-credentials
    path: /usr/share/filebeat/secrets/

# Filebeat configuration
filebeatConfig:
  filebeat.yml: |
    filebeat.inputs:
    - type: container
      paths:
        - /var/log/containers/*.log
      processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
          - logs_path:
              logs_path: "/var/log/containers/"
      - drop_event:
          when:
            or:
              # Drop filebeat's own logs
              - equals:
                  kubernetes.namespace: "kube-system"
                  kubernetes.container.name: "filebeat"
              # Drop metrics-server logs (noisy)
              - equals:
                  kubernetes.namespace: "kube-system"
                  kubernetes.container.name: "metrics-server"

    # Send to Elasticsearch
    output.elasticsearch:
      hosts: ['https://elasticsearch-master.elastic-stack.svc:9200']
      protocol: "https"
      username: "elastic"
      password: "${ELASTICSEARCH_PASSWORD}"
      ssl.certificate_authorities:
        - /usr/share/filebeat/certs/ca.crt
      indices:
        - index: "logs-kubernetes-%{+yyyy.MM.dd}"
          when.or:
            - not:
                has_fields: ['kubernetes.namespace']
        - index: "logs-security-%{[kubernetes.namespace]}-%{+yyyy.MM.dd}"
          when.or:
            - equals:
                kubernetes.namespace: "your-security-namespace-1"
            - equals:
                kubernetes.namespace: "your-security-namespace-2"
            - equals:
                kubernetes.namespace: "your-security-namespace-3"
        - index: "logs-app-%{[kubernetes.namespace]}-%{+yyyy.MM.dd}"

    # ILM and index template settings
    setup.ilm.enabled: true
    setup.ilm.rollover_alias: "filebeat"
    setup.ilm.pattern: "{now/d}-000001"
    setup.template.name: "filebeat"
    setup.template.pattern: "logs-*"
    setup.template.settings:
      index.number_of_shards: 1
      index.number_of_replicas: 0

readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 15

2. Deploy Filebeat

helm install filebeat elastic/filebeat \
  -f filebeat-values.yaml \
  -n elastic-stack

3. Verify Filebeat pods

kubectl get pods -n elastic-stack -l app=filebeat-filebeat

Expected output (one pod per node):

NAME                      READY   STATUS    RESTARTS   AGE
filebeat-filebeat-abc12   1/1     Running   0          2m
filebeat-filebeat-def34   1/1     Running   0          2m

Verification

Check logs are flowing to Elasticsearch:

# Port-forward to Elasticsearch
kubectl port-forward svc/elasticsearch-master 9200:9200 -n elastic-stack &

# Check indices
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/indices/logs-*?v"

Expected output:

health status index                      pri rep docs.count
yellow open   logs-kubernetes-2025.10.29  1   0      12345
yellow open   logs-app-default-2025.10.29 1   0       5678
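
You can also confirm documents are accumulating with a quick count (jq is optional; drop it to see the raw JSON):

# Total documents across all logs-* indices
curl -s -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/logs-*/_count" | jq '.count'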

View logs in Kibana:

  1. Navigate to Kibana (http://localhost:5601)
  2. Go to "Discover"
  3. Create index pattern: logs-*
  4. You should see logs flowing in

Baking Period and Monitoring

Overview

After initial deployment, I let the system run for several weeks to understand actual usage patterns and identify issues.

Monitoring Commands

Check resource usage:

# Node-level resource usage
kubectl top nodes

# Pod-level resource usage
kubectl top pods -n elastic-stack

# Elasticsearch memory usage specifically
kubectl exec elasticsearch-master-0 -n elastic-stack -- \
  curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_nodes/stats/jvm?pretty"

Check pod status and restarts:

# Watch for crash loops
kubectl get pods -n elastic-stack -w

# Check pod restart counts
kubectl get pods -n elastic-stack

Monitor Elasticsearch health:

# Cluster stats
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cluster/stats?pretty"

# Index count and sizes
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/indices?v&s=store.size:desc"

What I Observed (After 2-3 Weeks)

Resource Usage Reality:

$ kubectl top pods -n elastic-stack

NAME                       CPU    MEMORY
elasticsearch-master-0     823m   2760Mi  # Exceeding 2Gi limit!
kibana-kibana-xxx          200m   400Mi
filebeat-filebeat-xxx      50m    100Mi
filebeat-filebeat-yyy      45m    95Mi

Problems Identified:

  1. Elasticsearch Memory Issues

    • Allocated: 2Gi
    • Actual usage: 2.76Gi
    • Risk: OOM kills, performance degradation
  2. Filebeat Crash Loops

    $ kubectl get pods -n elastic-stack
    
    NAME                       READY   STATUS    RESTARTS   AGE
    filebeat-filebeat-abc12    1/1     Running   100+       3d
    

    Error in logs:

    cannot obtain lockfile: cannot start, data directory belongs to process with pid 8
    
  3. Index Sprawl

    $ curl -k -u "elastic:${ELASTIC_PASSWORD}" \
      "https://localhost:9200/_cat/indices?v" | wc -l
    
    500+ indices  # No lifecycle management!
    
  4. No Backup Strategy

    • Millions of documents stored
    • Growing data with no disaster recovery plan
    • Need automated backup solution

Utilizing AI to Enhance Deployment

Observing Findings with AI

At this point, I engaged Claude AI to help diagnose and fix these issues. This section documents that workflow and what we discovered.

The AI-Assisted Workflow

1. Gathered diagnostic data:

# Filebeat crash loop investigation
kubectl describe pod filebeat-filebeat-abc12 -n elastic-stack > filebeat-crash.txt
kubectl logs filebeat-filebeat-abc12 -n elastic-stack --tail=100 > filebeat-logs.txt

# Elasticsearch resource analysis
kubectl describe pod elasticsearch-master-0 -n elastic-stack > es-describe.txt
kubectl top pod elasticsearch-master-0 -n elastic-stack > es-usage.txt

# Index analysis
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/indices?v&s=creation.date:desc" > indices.txt

2. Presented problems to Claude AI:

  • Shared error messages and logs (sanitized of credentials)
  • Described environment constraints (homelab, single-node)
  • Asked for multiple solution options with trade-offs

3. Evaluated solutions:

  • Discussed pros/cons of each approach
  • Considered homelab vs. production best practices
  • Made decisions based on my specific context

4. Implemented fixes incrementally:

  • One change at a time
  • Validated each fix before moving to next
  • Documented decisions and findings

Mitigating Findings Suggested by AI

The following sections detail each fix that was implemented based on AI-assisted troubleshooting.

Fix #1 - Backup Strategy (Velero + MinIO)

Overview:
Before making any risky configuration changes, I established a disaster recovery strategy using Velero to back up the Elasticsearch data to MinIO running on my NAS.

Implementation Note:
This guide focuses on Elasticsearch optimization, so I won't detail the full Velero/MinIO setup here; that will be covered in a dedicated backup guide. The key points:

  • Velero installed in the cluster for Kubernetes backup/restore
  • MinIO running on NAS as S3-compatible storage target
  • Scheduled backups of the elastic-stack namespace and persistent volumes
  • Tested restores to validate the backup chain works

Why this was critical:
With backups in place, I could confidently make configuration changes knowing I could recover from any mistakes. This enabled the emptyDir fix in the next step.

Verification:

# Check Velero backups exist
velero backup get

# Validate latest backup
velero backup describe <backup-name> --details
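
For reference, a nightly schedule for this namespace might look like the following (the schedule name and cron expression are illustrative):

# Nightly backup of the elastic-stack namespace, retained for 30 days (720h)
velero schedule create elastic-stack-nightly \
  --schedule "0 3 * * *" \
  --include-namespaces elastic-stack \
  --ttl 720h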

Fix #2 - Filebeat Crash Loop (emptyDir)

Root Cause:
Filebeat was using a hostPath volume for data persistence:

# BEFORE (problematic configuration)
volumes:
  - name: data
    hostPath:
      path: /var/lib/filebeat-filebeat-elastic-stack-data
      type: DirectoryOrCreate

When Filebeat containers crashed, lock files remained on the host filesystem, preventing new containers from starting.

The Fix:

Changed to emptyDir volume type:

# Apply the patch
# NOTE: "/volumes/3" assumes the data volume is the 4th entry in the pod spec.
# Verify the index in your DaemonSet (kubectl get daemonset filebeat-filebeat
# -n elastic-stack -o yaml) and adjust if it differs.
kubectl patch daemonset filebeat-filebeat -n elastic-stack \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/volumes/3", "value": {"name": "data", "emptyDir": {}}}]'

Current deployed configuration (in DaemonSet spec):

# AFTER (fixed configuration)
# File reference: kubectl get daemonset filebeat-filebeat -n elastic-stack -o yaml
volumes:
  - name: data
    emptyDir: {}  # ← The fix!
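
To confirm the patch landed, query the DaemonSet for the data volume directly:

# Should print an emptyDir definition for the "data" volume
kubectl get daemonset filebeat-filebeat -n elastic-stack \
  -o jsonpath='{.spec.template.spec.volumes[?(@.name=="data")]}'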

Trade-offs:

Pro:

  • Clean restarts, no more crash loops
  • Lock files cleared on pod restart

Con:

  • Registry state (log file positions) lost on restart
  • Potential duplicate logs after restart

Why acceptable:

  • Logs are backed up to NAS
  • Occasional duplicates preferable to non-functional system
  • Homelab context (not financial transactions)

Result:

$ kubectl get pods -n elastic-stack -l app=filebeat-filebeat

NAME                      READY   STATUS    RESTARTS   AGE
filebeat-filebeat-abc12   1/1     Running   0          2d
filebeat-filebeat-def34   1/1     Running   0          2d

Zero restarts since October 29, 2025!

Fix #3 - Elasticsearch Resource Sizing

The Problem:

Initial allocation was conservative guesswork:

  • Allocated: 2Gi RAM
  • Actual usage: 2.76Gi RAM
  • Java heap: 1g (50% of 2Gi)

This caused performance issues and risk of OOM kills.

The Analysis:

Elasticsearch needs memory for:

  1. Java heap - Core operations (indexing, search)
  2. OS file cache - Lucene segments (critical for performance)

Best practice: 50/50 split between heap and file cache.

The Fix:

Updated resource allocation to match reality:

# BEFORE (initial guess)
resources:
  requests:
    cpu: "1000m"
    memory: "2Gi"
  limits:
    cpu: "1000m"
    memory: "2Gi"

esJavaOpts: "-Xmx1g -Xms1g"

# AFTER (data-driven sizing)
# See: https://github.com/npecka/peckacyber/blob/main/guides/elasticsearch-k8s-optimization/elasticsearch-values.yaml
resources:
  requests:
    cpu: "1000m"
    memory: "4Gi"  # Doubled to accommodate actual usage
  limits:
    cpu: "1000m"
    memory: "4Gi"

esJavaOpts: "-Xmx2g -Xms2g"  # 2GB heap, 2GB for OS/Lucene cache

Applied via Helm upgrade:

helm upgrade elasticsearch elastic/elasticsearch \
  -f elasticsearch-values.yaml \
  -n elastic-stack
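
Once the upgrade rolls out, confirming the new heap took effect is a one-liner; heap_max_in_bytes should report roughly 2GB:

# filter_path trims the response to just the max-heap figure per node
kubectl exec elasticsearch-master-0 -n elastic-stack -- \
  curl -s -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_max_in_bytes&pretty"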

Result:

$ kubectl top pod elasticsearch-master-0 -n elastic-stack

NAME                     CPU    MEMORY
elasticsearch-master-0   823m   2760Mi  # Now within limits!

Stable operation with no OOM events.

Fix #4 - Index Lifecycle Management (ILM)

The Problem:

500+ indices accumulated with no cleanup strategy:

  • Daily indices never deleted
  • Storage growing unbounded
  • Performance degradation from too many indices

The Solution:

Enabled ILM in Filebeat configuration:

# See: https://github.com/npecka/peckacyber/blob/main/guides/elasticsearch-k8s-optimization/filebeat-values.yaml

# ILM and index template settings
setup.ilm.enabled: true
setup.ilm.rollover_alias: "filebeat"
setup.ilm.pattern: "{now/d}-000001"
setup.template.name: "filebeat"
setup.template.pattern: "logs-*"
setup.template.settings:
  index.number_of_shards: 1
  index.number_of_replicas: 0  # Single node, no replicas needed

Namespace-based index routing:

# See: https://github.com/npecka/peckacyber/blob/main/guides/elasticsearch-k8s-optimization/filebeat-values.yaml

indices:
  - index: "logs-kubernetes-%{+yyyy.MM.dd}"
    when.or:
      - not:
          has_fields: ['kubernetes.namespace']
  - index: "logs-security-%{[kubernetes.namespace]}-%{+yyyy.MM.dd}"
    when.or:
      - equals:
          kubernetes.namespace: "your-security-namespace-1"
      - equals:
          kubernetes.namespace: "your-security-namespace-2"
      - equals:
          kubernetes.namespace: "your-security-namespace-3"
  - index: "logs-app-%{[kubernetes.namespace]}-%{+yyyy.MM.dd}"

Benefits:

  • Organized indices - Security logs separate from app logs
  • Automated rollover - Daily indices managed automatically
  • Custom retention - Can set different retention per namespace
  • Better performance - Fewer indices to search across

Applied the fix:

# Update Filebeat configuration
helm upgrade filebeat elastic/filebeat \
  -f filebeat-values.yaml \
  -n elastic-stack

# Restart Filebeat to apply
kubectl rollout restart daemonset filebeat-filebeat -n elastic-stack
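
After the restart, you can verify that the policy exists (Filebeat's default policy is named "filebeat") and that new indices follow the expected naming:

# Check the ILM policy Filebeat created
curl -s -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_ilm/policy/filebeat?pretty"

# Newest indices should follow the logs-<type>-<namespace>-YYYY.MM.dd pattern
curl -s -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/indices/logs-*?v&s=creation.date:desc" | head -5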

Security Hardening

Overview

Applied least-privilege security settings to all components.

Pod Security Context

Elasticsearch:

# File reference: elasticsearch StatefulSet spec
podSecurityContext:
  fsGroup: 1000
  runAsUser: 1000

securityContext:
  capabilities:
    drop:
      - ALL
  runAsNonRoot: true
  runAsUser: 1000

Kibana:

# File reference: kibana Deployment spec
securityContext:
  capabilities:
    drop:
      - ALL
  runAsNonRoot: true
  runAsUser: 1000

podSecurityContext:
  fsGroup: 1000

Understanding Least Privilege Security

What we configured and why:

  1. runAsUser: 1000 and runAsNonRoot: true

    • Forces containers to run as UID 1000 (elasticsearch user), not root (UID 0)
    • Prevents attackers from gaining root access if they compromise the container
    • Default behavior: Many containers run as root by default, which is dangerous
  2. capabilities.drop: ALL

    • Linux capabilities are special privileges (like CAP_NET_ADMIN, CAP_SYS_ADMIN)
    • By default, containers get ~14 capabilities even when running as non-root
    • drop: ALL removes every capability, leaving only basic user permissions
    • Example: Without this, a compromised container might manipulate network interfaces or kill processes
  3. fsGroup: 1000

    • Sets file system group ownership to GID 1000
    • Ensures Elasticsearch can read/write to mounted volumes without needing root
    • Volumes mounted into the pod will have group ownership set to 1000
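
A quick spot-check that these settings are in effect on the running pod:

# Should report uid=1000 gid=1000, not root (uid=0)
kubectl exec elasticsearch-master-0 -n elastic-stack -- id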

Real-world impact:
If an attacker exploits a vulnerability in Elasticsearch, Kibana, or Filebeat:

  • Without these settings: Attacker runs as root with capabilities → can escape container, access host filesystem, network manipulation
  • With these settings: Attacker runs as unprivileged user 1000 with no capabilities → limited to application-level damage only

This is "defense in depth" - even if the application is compromised, the blast radius is minimal.


Results and Key Takeaways

Measurable Results

Before Optimization:

  • Filebeat: 100+ restarts, crash loop
  • Elasticsearch: 2Gi RAM (insufficient), OOM risk
  • Indices: 500+ unmanaged indices, no lifecycle
  • Backup: No strategy
  • Operations: Manual, reactive troubleshooting

After Implementation:

  • Filebeat: 0 restarts, stable
  • Elasticsearch: 4Gi RAM, stable operation
  • Indices: Organized by namespace with ILM
  • Backup: Velero + MinIO automated backups to NAS
  • Operations: Documented, reproducible

Resource Impact:

  • Eliminated: 100+ crashes per week
  • Prevented: OOM kills via proper memory allocation
  • Improved: Query performance through ILM
  • Saved: ~4 hours/week troubleshooting time

System Stats (Example Deployment):

  • Daily log volume: ~300MB/day
  • Retention period: 30 days
  • Storage used: ~12GB (with Elasticsearch overhead)
  • Uptime: Stable since fixes applied
  • Log sources: Multiple namespaces (kube-system, security monitoring, CI/CD, application workloads, etc.)

Key Takeaways

1. Monitor Before Optimizing
Running the system for 2-3 weeks revealed actual usage patterns. Initial resource guesses were 50% off - real data is essential.

2. One Change at a Time
Each fix was applied individually and validated. This made it easy to identify what worked and simplified rollback if needed.

3. Context Matters More Than "Best Practices"
The emptyDir solution isn't what production guides recommend, but it was perfect for my homelab where occasional duplicate logs are acceptable. Understand your constraints.

4. Document Everything
Creating detailed documentation of changes - what was changed, why, and what trade-offs were accepted - is invaluable for future troubleshooting and knowledge sharing.

5. AI as a Force Multiplier
Using Claude AI accelerated troubleshooting by correlating error messages across Kubernetes, Elasticsearch, and Filebeat domains simultaneously. However, I still made all decisions and validated everything in my environment.

AI helped with:

  • Root cause analysis (correlating errors with known issues)
  • Solution options (providing multiple approaches with trade-offs)
  • Configuration examples (YAML snippets adapted to my context)
  • Verification steps (commands to validate fixes)

I was responsible for:

  • Gathering diagnostic data
  • Evaluating trade-offs for my specific context
  • Making architectural decisions
  • Testing and validating all changes
  • Documenting what actually worked

AI Tips for Operations

How to Use AI Tools Effectively

1. Gather comprehensive diagnostics first:

kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
kubectl top nodes
kubectl top pods -n <namespace>
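
A small helper script (a sketch; the filename and event count are arbitrary) bundles these into one file for review before sharing:

# Bundle common diagnostics; review the output for secrets/IPs before sharing
NS=elastic-stack
OUT="diag-$(date +%Y%m%d-%H%M).txt"
{
  kubectl get pods -n "$NS"
  kubectl top pods -n "$NS"
  kubectl get events -n "$NS" --sort-by=.lastTimestamp | tail -n 20
} > "$OUT"
echo "Wrote ${OUT} - sanitize before pasting into an AI tool."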

2. Present specific problems, not vague questions:

  • ❌ "How do I configure Elasticsearch?"
  • ✅ "My Filebeat pod is crash-looping with this error: [paste error]"

3. Ask for trade-off analysis:

  • Request multiple solution options
  • Ask about pros/cons of each
  • Discuss homelab vs. production implications

4. Validate everything:

  • Test one change at a time
  • Verify it works in your environment
  • Don't trust suggestions blindly

5. Document your decisions:

  • Capture what you changed
  • Explain why you chose this approach
  • Note trade-offs accepted
  • Include verification steps

Troubleshooting

Common Issues

Issue 1: Elasticsearch pod stuck in pending

  • Symptoms: Pod shows "Pending" status indefinitely
  • Diagnosis:
    kubectl describe pod elasticsearch-master-0 -n elastic-stack
    kubectl get pvc -n elastic-stack
    kubectl top nodes
    
  • Solutions:
    • If PVC not binding: Check storage class exists and has available capacity
      kubectl get storageclass
      # Ensure your storage provisioner is running
      
    • If insufficient node resources: Either free up resources or reduce Elasticsearch memory requests in values file
      # Reduce memory in elasticsearch-values.yaml, then upgrade
      helm upgrade elasticsearch elastic/elasticsearch -f elasticsearch-values.yaml -n elastic-stack
      

Issue 2: Filebeat can't connect to Elasticsearch

  • Symptoms: Filebeat logs show "connection refused" or TLS errors
  • Diagnosis:
    kubectl logs daemonset/filebeat-filebeat -n elastic-stack --tail=50
    kubectl get secret elasticsearch-master-certs -n elastic-stack
    
  • Solutions:
    • If certificate secret missing: Elasticsearch may not be fully deployed yet. Wait for it to create the certs secret, or recreate the Filebeat deployment
      kubectl rollout restart daemonset/filebeat-filebeat -n elastic-stack
      
    • If wrong service name: Verify Elasticsearch service exists and update filebeat-values.yaml if needed
      kubectl get svc -n elastic-stack
      # Should show: elasticsearch-master
      

Issue 3: Kibana shows "Elasticsearch is not ready"

  • Symptoms: Kibana UI displays connection error or "waiting for Elasticsearch"
  • Diagnosis:
    kubectl logs deployment/kibana-kibana -n elastic-stack --tail=50
    kubectl exec elasticsearch-master-0 -n elastic-stack -- \
      curl -k -u "elastic:${ELASTIC_PASSWORD}" https://localhost:9200/_cluster/health
    
  • Solutions:
    • If service account token missing: Recreate the token as shown in Step 2
      # Create token and secret (see Step 2 for full commands)
      kubectl exec elasticsearch-master-0 -n elastic-stack -- \
        /usr/share/elasticsearch/bin/elasticsearch-service-tokens create elastic/kibana kibana-token
      
    • If Elasticsearch cluster not healthy: Wait for Elasticsearch to reach yellow/green status, or investigate Elasticsearch pod issues first
    • If authentication fails: Verify the token secret name matches what's referenced in the Kibana deployment

Future Considerations

After completing this deployment:

  1. Customize ILM policies - Set retention periods for your needs:

    # Example: Delete indices older than 30 days
    curl -k -u "elastic:${ELASTIC_PASSWORD}" -X PUT \
      "https://localhost:9200/_ilm/policy/logs_policy" -H 'Content-Type: application/json' -d'
    {
      "policy": {
        "phases": {
          "delete": {
            "min_age": "30d",
            "actions": { "delete": {} }
          }
        }
      }
    }'
    
  2. Set up ingress - Expose Kibana securely:

    • Configure TLS certificates
    • Set up authentication (LDAP, SAML, etc.)
    • Restrict access by IP
  3. Enhance backup strategy - Beyond Velero:

    • Consider Elasticsearch native snapshots for faster recovery
    • Test restore procedures regularly
    • See the upcoming dedicated backup guide for full details
  4. Consider Loki migration - If resource-constrained:

    • Loki uses ~80% fewer resources for simple log aggregation
    • Trade-off: Slower full-text search vs. Elasticsearch
    • Evaluate whether full-text search speed is critical for your use case
  5. Apply to other workloads - Use these principles for:

    • PostgreSQL, MongoDB, Redis deployments
    • Any stateful Kubernetes application
    • Resource sizing methodology applies universally

References and Additional Resources

Deployment File References

All configurations discussed in this guide are currently deployed and can be found:

Cluster (source of truth):

  • Elasticsearch: kubectl get statefulset elasticsearch-master -n elastic-stack -o yaml
  • Kibana: kubectl get deployment kibana-kibana -n elastic-stack -o yaml
  • Filebeat: kubectl get daemonset filebeat-filebeat -n elastic-stack -o yaml
  • Filebeat config: kubectl get configmap filebeat-filebeat-daemonset-config -n elastic-stack -o yaml

GitHub Repository:
All configuration files referenced in this guide are available at:
https://github.com/npecka/peckacyber/tree/main/guides/elasticsearch-k8s-optimization

Additional Resources

Related Guides on This Site:

  • Coming soon: "Kubernetes Backup Strategy: Velero + MinIO for Homelabs"
  • Coming soon: "Migrating from Elasticsearch to Loki for Cost Savings"
  • Coming soon: "AI-Assisted Kubernetes Operations: A Workflow Guide"

Security Considerations

What to Share with AI Tools:

Safe to share:

  • Error messages
  • Configuration structures (sanitized)
  • Resource metrics
  • Architecture diagrams

Never share:

  • Actual passwords or API keys
  • Internal IP addresses (use placeholders)
  • Hostnames revealing network structure
  • Sensitive log data

Example sanitization:

# Shared with AI:
password: "${ELASTICSEARCH_PASSWORD}"  # Reference only

# NOT shared:
password: "actualP@ssw0rd123"

Production Hardening:

For production deployments, additionally consider:

  1. Network policies - Restrict pod-to-pod communication
  2. RBAC - Limit service account permissions
  3. Secrets management - Use Vault or Sealed Secrets
  4. TLS everywhere - Inter-node communication encryption
  5. Audit logging - Track all Elasticsearch access
  6. Regular patching - Keep Elastic Stack up to date

Changelog

v1.0 - October 31, 2025 - Initial release

  • Complete deployment guide from initial setup through optimization
  • All configurations tested and validated in homelab
  • Documents real problems and proven solutions

Questions or Issues? Feel free to reach out via the contact form on peckacyber.com

Found this helpful? Subscribe to the newsletter for more guides.