Building and Optimizing Elasticsearch on Kubernetes: A Logging Stack Homelab Journey
Guide Overview
Guide: Building and Optimizing Elasticsearch on Kubernetes: A Logging Stack Homelab Journey
Category: Monitoring / Infrastructure Optimization
Difficulty: Easy
Estimated Time: 1-2 hours for full deployment and optimization
Cost: Free (open-source tools)
Note: This guide documents a complete Elasticsearch deployment from initial setup through optimization, with all configurations tested and validated in a live homelab environment.
What You'll Build
This guide walks through deploying a complete Elasticsearch logging stack in Kubernetes, then documents the real-world problems that emerged during operation and how they were systematically resolved.
By the end of this guide, you'll have:
- A working Elasticsearch, Kibana, and Filebeat deployment on Kubernetes
- Ready-to-go configurations for single-node homelab environments
- Solutions for common operational problems (crash loops, resource sizing, index management)
- A methodology for using AI tools to accelerate troubleshooting
- Tested configurations handling 40+ million documents across multiple namespaces
Brief Background on the ELK Stack
The ELK Stack (Elasticsearch, Logstash, Kibana) is a popular open-source solution for centralized logging and log analysis. In modern implementations, Filebeat often replaces Logstash as a more lightweight log shipper, making it the "EFK" stack.
Components:
- Elasticsearch: A distributed search and analytics engine that stores and indexes log data
- Filebeat: A lightweight log shipper that collects logs from various sources and forwards them to Elasticsearch
- Kibana: A visualization and exploration tool that provides a web interface for querying and visualizing data stored in Elasticsearch
This stack is particularly valuable in Kubernetes environments where logs from hundreds of pods across multiple nodes need to be aggregated, searched, and analyzed from a single location.
Use Cases for the ELK Stack
When to Use ELK/EFK
Ideal for:
- Complex log analysis - Full-text search across all log fields with powerful query language
- Multi-source aggregation - Collecting logs from diverse sources (Kubernetes, applications, infrastructure)
- Historical analysis - Investigating issues that occurred hours or days ago with fast search
- Security monitoring - Correlating security events across multiple systems
- Compliance - Retaining logs for audit requirements with search capabilities
- Dashboarding - Creating visualizations and dashboards in Kibana for team visibility
Homelab use cases:
- Learning Kubernetes logging patterns
- Troubleshooting application deployments
- Monitoring security tools and experiments
- Centralizing logs from multiple namespaces/projects
- Building skills relevant to enterprise environments
Production use cases:
- Application performance monitoring (APM)
- Security Information and Event Management (SIEM)
- Infrastructure monitoring and alerting
- Customer support troubleshooting
- Business analytics from application logs
When NOT to Use ELK
Consider alternatives if:
Resource-constrained environments:
- ELK requires significant resources (4GB+ RAM for Elasticsearch alone)
- Alternative: Grafana Loki uses ~80% fewer resources for simple log aggregation
- Trade-off: Loki is slower for full-text search but much lighter
Simple log viewing:
- If you just need to "tail" recent logs without complex queries
- Alternative: kubectl logs or simple file-based logging
- Trade-off: No centralization or historical search
Managed solutions preferred:
- If you don't want to maintain infrastructure
- Alternative: CloudWatch, Datadog, New Relic, Splunk Cloud
- Trade-off: Ongoing costs vs. self-hosted maintenance burden
Metric-focused monitoring:
- ELK is optimized for logs, not time-series metrics
- Alternative: Prometheus + Grafana for metrics
- Note: Many deployments use both (ELK for logs, Prometheus for metrics)
ELK vs Alternatives Comparison
| Feature | ELK/EFK | Loki | Splunk | CloudWatch |
|---|---|---|---|---|
| Cost | Free (self-hosted) | Free (self-hosted) | Paid | Pay-per-use |
| Resource Usage | High (4GB+) | Low (~500MB) | High | N/A (managed) |
| Full-Text Search | Excellent | Slower | Excellent | Good |
| Complexity | Moderate | Low | Moderate | Low |
| Best For | Complex queries | Simple aggregation | Enterprise | AWS workloads |
Prerequisites
Required:
- Running Kubernetes cluster (RKE, k3s, or any distribution)
- At least one node with 4GB+ RAM available for Elasticsearch
- Basic kubectl and Helm knowledge
- Understanding of Kubernetes concepts (StatefulSets, DaemonSets, Deployments, ConfigMaps)
Helpful but Optional:
- Familiarity with Elasticsearch concepts
- Experience with log aggregation systems
- Access to AI tools (Claude, ChatGPT) for troubleshooting assistance
Architecture Overview
This deployment creates a centralized logging pipeline for Kubernetes:
┌─────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Pod │ │ Pod │ │ Pod │ │
│ │ Logs │ │ Logs │ │ Logs │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ └─────────────┴──────────────┘ │
│ │ │
│ ┌──────▼───────┐ │
│ │ Filebeat │ (DaemonSet on each node) │
│ │ Log Shipper │ │
│ └──────┬───────┘ │
│ │ │
│ ┌──────▼────────┐ │
│ │ Elasticsearch │ (StatefulSet, single node) │
│ │ Storage & │ │
│ │ Indexing │ │
│ └──────┬────────┘ │
│ │ │
│ ┌──────▼────────┐ │
│ │ Kibana │ (Deployment) │
│ │ Visualization │ │
│ └───────────────┘ │
└─────────────────────────────────────────────────────────────┘
Data Flow:
- Filebeat runs on each node, tailing container logs
- Logs are shipped to Elasticsearch for indexing and storage
- Kibana provides a web UI for searching and visualizing logs
Hardware/Software Requirements
Minimum Requirements
- Cluster: Kubernetes cluster with at least one node
- Node Resources: 4GB+ RAM available for Elasticsearch
- Storage: Calculate based on your log volume (see formula below)
Software Versions
- Kubernetes: 1.24+ (any distribution)
- Elasticsearch: 8.5.1
- Kibana: 8.5.1
- Filebeat: 8.5.1
- Helm: 3.x
Storage Sizing Calculator
Formula:
Required Storage = (Daily Log Volume in GB × Retention Days) × 1.5
The 1.5 multiplier accounts for:
- Elasticsearch overhead (indices, mappings, shards): ~20%
- Growth buffer: ~30%
Example Calculation (this deployment):
- Daily log volume: ~300MB
- Retention period: 30 days
- Calculation: (0.3 GB × 30) × 1.5 = 13.5GB
- Allocated: 50Gi (provides 4x headroom for growth)
How to measure your daily log volume:
# After running 24 hours, check index sizes:
kubectl exec elasticsearch-master-0 -n elastic-stack -- \
curl -s -k -u "elastic:PASSWORD" \
"https://localhost:9200/_cat/indices?v&s=store.size:desc"
# Sum the sizes of indices from the last 24 hours
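If you'd rather script the arithmetic, here's a minimal sketch of the formula above (the DAILY_GB and RETENTION_DAYS values are placeholders; substitute your own measurements):
# Estimate required PVC size from daily volume and retention
DAILY_GB=0.3        # placeholder: measured daily log volume in GB
RETENTION_DAYS=30   # placeholder: desired retention in days
awk -v d="$DAILY_GB" -v r="$RETENTION_DAYS" \
  'BEGIN { printf "Recommended PVC size: %.1fGi\n", d * r * 1.5 }'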
Quick Reference:
- Small homelab (1-5 namespaces, low traffic): 20-30Gi
- Medium homelab (5-15 namespaces): 50-100Gi
- Large homelab (15+ namespaces, high traffic): 100-200Gi
Final Resource Usage (this deployment):
- Elasticsearch: 4GB RAM, 1 CPU
- Kibana: 2GB RAM, 1 CPU
- Filebeat: 100-200MB RAM per node
- Storage: 12GB used of 50Gi allocated (300MB/day × 30 days with overhead)
Quick Reference
This section provides essential commands and information for managing your Elasticsearch deployment.
Important Endpoints and Ports
- Elasticsearch: https://elasticsearch-master.elastic-stack.svc:9200
- Kibana: http://kibana-kibana.elastic-stack.svc:5601
- Namespace: elastic-stack
Default Credentials
# Get Elasticsearch password
kubectl get secret elasticsearch-master-credentials \
-n elastic-stack -o jsonpath='{.data.password}' | base64 -d
# Default username: elastic
Common kubectl Commands
Check pod status:
# All elastic-stack pods
kubectl get pods -n elastic-stack
# Watch for changes
kubectl get pods -n elastic-stack -w
# Check specific component
kubectl get pods -n elastic-stack -l app=elasticsearch-master
kubectl get pods -n elastic-stack -l app=kibana-kibana
kubectl get pods -n elastic-stack -l app=filebeat-filebeat
Check resource usage:
# Pod resource consumption
kubectl top pods -n elastic-stack
# Node resource consumption
kubectl top nodes
View logs:
# Elasticsearch logs
kubectl logs elasticsearch-master-0 -n elastic-stack
# Kibana logs
kubectl logs deployment/kibana-kibana -n elastic-stack
# Filebeat logs (specific pod)
kubectl logs <filebeat-pod-name> -n elastic-stack
# Follow logs in real-time
kubectl logs -f elasticsearch-master-0 -n elastic-stack
Access services locally:
# Port-forward Elasticsearch
kubectl port-forward svc/elasticsearch-master 9200:9200 -n elastic-stack
# Port-forward Kibana
kubectl port-forward svc/kibana-kibana 5601:5601 -n elastic-stack
# Then access:
# Elasticsearch: https://localhost:9200
# Kibana: http://localhost:5601
Elasticsearch API Commands
Cluster health:
# Get password first
ELASTIC_PASSWORD=$(kubectl get secret elasticsearch-master-credentials \
-n elastic-stack -o jsonpath='{.data.password}' | base64 -d)
# Check cluster health
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_cluster/health?pretty"
# Cluster stats
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_cluster/stats?pretty"
Index management:
# List all indices
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_cat/indices?v"
# List indices sorted by size
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_cat/indices?v&s=store.size:desc"
# List indices sorted by creation date
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_cat/indices?v&s=creation.date:desc"
# Delete specific index
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
-X DELETE "https://localhost:9200/index-name"
Node information:
# Node stats
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_cat/nodes?v"
# JVM memory usage
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_nodes/stats/jvm?pretty"
Helm Commands
Check current deployments:
# List Helm releases in namespace
helm list -n elastic-stack
# Get values for a release
helm get values elasticsearch -n elastic-stack
helm get values kibana -n elastic-stack
helm get values filebeat -n elastic-stack
Upgrade deployments:
# Upgrade Elasticsearch
helm upgrade elasticsearch elastic/elasticsearch \
-f elasticsearch-values.yaml \
-n elastic-stack
# Upgrade Kibana
helm upgrade kibana elastic/kibana \
-f kibana-values.yaml \
-n elastic-stack
# Upgrade Filebeat
helm upgrade filebeat elastic/filebeat \
-f filebeat-values.yaml \
-n elastic-stack
Rollback if needed:
# See release history
helm history elasticsearch -n elastic-stack
# Rollback to previous version
helm rollback elasticsearch -n elastic-stack
# Rollback to specific revision
helm rollback elasticsearch 2 -n elastic-stack
Quick Troubleshooting Commands
Pod won't start:
# Describe pod to see events
kubectl describe pod <pod-name> -n elastic-stack
# Check PVC status
kubectl get pvc -n elastic-stack
# Check node resources
kubectl top nodes
Performance issues:
# Check resource usage
kubectl top pods -n elastic-stack
# Check Elasticsearch heap usage
kubectl exec elasticsearch-master-0 -n elastic-stack -- \
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_nodes/stats/jvm?pretty"
# Check index count (too many can slow things down)
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_cat/indices?v" | wc -l
Restart components:
# Restart Elasticsearch (StatefulSet - restarts in order)
kubectl rollout restart statefulset/elasticsearch-master -n elastic-stack
# Restart Kibana
kubectl rollout restart deployment/kibana-kibana -n elastic-stack
# Restart Filebeat
kubectl rollout restart daemonset/filebeat-filebeat -n elastic-stack
Configuration File Locations
- Local values files: elasticsearch-values.yaml, kibana-values.yaml, filebeat-values.yaml
- See the References and Additional Resources section at the end of this guide for GitHub repository links
Elastic Deployment
Overview
Deploy Elasticsearch as a StatefulSet with persistent storage. This initial deployment uses conservative resource estimates that we'll adjust later based on actual usage. The values file below deploys Elasticsearch in single-node mode.
Instructions
1. Create namespace
kubectl create namespace elastic-stack
2. Add Elastic Helm repository
helm repo add elastic https://helm.elastic.co
helm repo update
3. Create Elasticsearch values file
Create elasticsearch-values.yaml:
---
clusterName: "elasticsearch"
nodeGroup: "master"
# Elasticsearch roles for single-node deployment
roles:
- master
- data
- data_content
- data_hot
- data_warm
- data_cold
- ingest
- ml
- remote_cluster_client
- transform
replicas: 1
minimumMasterNodes: 1
# Initial resource allocation (conservative estimate)
# NOTE: This will be increased later based on actual usage
resources:
requests:
cpu: "1000m"
memory: "2Gi" # INITIAL - will increase to 4Gi after monitoring
limits:
cpu: "1000m"
memory: "2Gi" # INITIAL - will increase to 4Gi after monitoring
esJavaOpts: "-Xmx1g -Xms1g" # INITIAL - 50% of 2Gi, will adjust later
# Persistent storage
# Use the storage calculator in the requirements section to determine your needs
# Formula: (Daily Log Volume GB × Retention Days) × 1.5
volumeClaimTemplate:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 50Gi # Adjust based on your log volume
# Security settings
secret:
enabled: true
password: "elastic" # Change in production
# Single-node optimization
antiAffinity: "hard"
protocol: https
httpPort: 9200
transportPort: 9300
# Security hardening
podSecurityContext:
fsGroup: 1000
runAsUser: 1000
securityContext:
capabilities:
drop:
- ALL
runAsNonRoot: true
runAsUser: 1000
sysctlInitContainer:
enabled: true
sysctlVmMaxMapCount: 262144
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 3
timeoutSeconds: 5
clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s"
4. Deploy Elasticsearch
helm install elasticsearch elastic/elasticsearch \
-f elasticsearch-values.yaml \
-n elastic-stack
5. Wait for Elasticsearch to be ready
kubectl get pods -n elastic-stack -w
Expected output:
NAME READY STATUS RESTARTS AGE
elasticsearch-master-0 1/1 Running 0 3m
Verification
Check Elasticsearch health:
# Get the password
ELASTIC_PASSWORD=$(kubectl get secret elasticsearch-master-credentials \
-n elastic-stack -o jsonpath='{.data.password}' | base64 -d)
# Port-forward to access Elasticsearch
kubectl port-forward svc/elasticsearch-master 9200:9200 -n elastic-stack &
# Check cluster health
curl -k -u "elastic:${ELASTIC_PASSWORD}" https://localhost:9200/_cluster/health?pretty
Expected response:
{
"cluster_name" : "elasticsearch",
"status" : "yellow",
"number_of_nodes" : 1,
"number_of_data_nodes" : 1
}
Note: Yellow status is expected for single-node clusters (no replicas possible).
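If you want to see exactly which shards are keeping the cluster yellow (typically replica shards that have nowhere to go on a single node), something like this works:
# Show shard allocation state; replica shards will appear UNASSIGNED on a single node
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"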
Kibana Deployment
Overview
Kibana provides the web interface for querying and visualizing logs stored in Elasticsearch.
Instructions
1. Create Kibana values file
Create kibana-values.yaml:
---
image: "docker.elastic.co/kibana/kibana"
imageTag: "8.5.1"
replicas: 1
# Resource allocation
resources:
requests:
memory: "1Gi"
cpu: "1000m"
limits:
memory: "2Gi"
cpu: "1000m"
# Elasticsearch connection
elasticsearchHosts: "https://elasticsearch-master:9200"
# Environment variables
extraEnvs:
- name: "ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES"
value: "/usr/share/kibana/config/certs/ca.crt"
- name: "SERVER_HOST"
value: "0.0.0.0"
- name: "NODE_OPTIONS"
value: "--max-old-space-size=1800"
secretMounts:
- name: elasticsearch-certs
secretName: elasticsearch-master-certs
path: /usr/share/kibana/config/certs
readOnly: true
# Security context
securityContext:
capabilities:
drop:
- ALL
runAsNonRoot: true
runAsUser: 1000
podSecurityContext:
fsGroup: 1000
# Service configuration
service:
type: ClusterIP
port: 5601
2. Create Kibana service account token
Create a service account token for Kibana to authenticate with Elasticsearch:
# Create a service account token in Elasticsearch
kubectl exec elasticsearch-master-0 -n elastic-stack -- \
/usr/share/elasticsearch/bin/elasticsearch-service-tokens create \
elastic/kibana kibana-token > /tmp/kibana-token.txt
# Extract the token value (4th field in output)
TOKEN=$(cat /tmp/kibana-token.txt | awk '{print $4}')
# Create Kubernetes secret with the token
kubectl create secret generic kibana-kibana-es-token \
-n elastic-stack \
--from-literal=token="${TOKEN}"
# Clean up temp file
rm /tmp/kibana-token.txt
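Optional sanity check before deploying Kibana - confirm the secret exists and actually holds a token (output is deliberately truncated):
# Verify the token secret was created and is non-empty
kubectl get secret kibana-kibana-es-token -n elastic-stack \
  -o jsonpath='{.data.token}' | base64 -d | head -c 20; echo " ...(truncated)"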
3. Deploy Kibana
helm install kibana elastic/kibana \
-f kibana-values.yaml \
-n elastic-stack
4. Wait for Kibana to be ready
kubectl get pods -n elastic-stack -w
Verification
Access Kibana:
kubectl port-forward svc/kibana-kibana 5601:5601 -n elastic-stack
Navigate to http://localhost:5601 in your browser. Login with:
- Username: elastic
- Password: (value from the ELASTIC_PASSWORD variable)
Filebeat Deployment
Overview
Filebeat runs as a DaemonSet on each node, collecting container logs and shipping them to Elasticsearch.
Instructions
1. Create Filebeat values file
Create filebeat-values.yaml:
---
image: "docker.elastic.co/beats/filebeat"
imageTag: "8.5.1"
imagePullPolicy: "IfNotPresent"
# DaemonSet configuration
daemonset:
enabled: true
deployment:
enabled: false
# Cluster role for reading pod metadata
clusterRoleRules:
- apiGroups:
- ""
resources:
- namespaces
- pods
- nodes
verbs:
- get
- list
- watch
- apiGroups:
- apps
resources:
- replicasets
verbs:
- get
- list
- watch
# Resource limits
resources:
requests:
cpu: "100m"
memory: "100Mi"
limits:
cpu: "200m"
memory: "200Mi"
# Environment variables
extraEnvs:
- name: ELASTICSEARCH_PASSWORD
valueFrom:
secretKeyRef:
name: elasticsearch-master-credentials
key: password
# Volume mounts for certificates
secretMounts:
- name: elasticsearch-master-certs
secretName: elasticsearch-master-certs
path: /usr/share/filebeat/certs/
- name: elasticsearch-credentials
secretName: elasticsearch-master-credentials
path: /usr/share/filebeat/secrets/
# Filebeat configuration
filebeatConfig:
filebeat.yml: |
filebeat.inputs:
- type: container
paths:
- /var/log/containers/*.log
processors:
- add_kubernetes_metadata:
host: ${NODE_NAME}
matchers:
- logs_path:
logs_path: "/var/log/containers/"
- drop_event:
when:
or:
# Drop filebeat's own logs
- equals:
kubernetes.namespace: "kube-system"
kubernetes.container.name: "filebeat"
# Drop metrics-server logs (noisy)
- equals:
kubernetes.namespace: "kube-system"
kubernetes.container.name: "metrics-server"
# Send to Elasticsearch
output.elasticsearch:
hosts: ['https://elasticsearch-master.elastic-stack.svc:9200']
protocol: "https"
username: "elastic"
password: "${ELASTICSEARCH_PASSWORD}"
ssl.certificate_authorities:
- /usr/share/filebeat/certs/ca.crt
indices:
- index: "logs-kubernetes-%{+yyyy.MM.dd}"
when.or:
- not:
has_fields: ['kubernetes.namespace']
- index: "logs-security-%{[kubernetes.namespace]}-%{+yyyy.MM.dd}"
when.or:
- equals:
kubernetes.namespace: "your-security-namespace-1"
- equals:
kubernetes.namespace: "your-security-namespace-2"
- equals:
kubernetes.namespace: "your-security-namespace-3"
- index: "logs-app-%{[kubernetes.namespace]}-%{+yyyy.MM.dd}"
# ILM and index template settings
setup.ilm.enabled: true
setup.ilm.rollover_alias: "filebeat"
setup.ilm.pattern: "{now/d}-000001"
setup.template.name: "filebeat"
setup.template.pattern: "logs-*"
setup.template.settings:
index.number_of_shards: 1
index.number_of_replicas: 0
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 3
timeoutSeconds: 15
2. Deploy Filebeat
helm install filebeat elastic/filebeat \
-f filebeat-values.yaml \
-n elastic-stack
3. Verify Filebeat pods
kubectl get pods -n elastic-stack -l app=filebeat-filebeat
Expected output (one pod per node):
NAME READY STATUS RESTARTS AGE
filebeat-filebeat-abc12 1/1 Running 0 2m
filebeat-filebeat-def34 1/1 Running 0 2m
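Optionally, skim Filebeat's own logs for connection or publish errors before moving on to the Verification step below (a rough check; adjust the grep pattern to taste):
# Look for obvious errors/warnings across Filebeat pods
kubectl logs -n elastic-stack -l app=filebeat-filebeat --tail=100 | grep -iE "error|warn" | head -20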
Verification
Check logs are flowing to Elasticsearch:
# Port-forward to Elasticsearch
kubectl port-forward svc/elasticsearch-master 9200:9200 -n elastic-stack &
# Check indices
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_cat/indices/logs-*?v"
Expected output:
health status index pri rep docs.count
yellow open logs-kubernetes-2025.10.29 1 0 12345
yellow open logs-app-default-2025.10.29 1 0 5678
View logs in Kibana:
- Navigate to Kibana (http://localhost:5601)
- Go to "Discover"
- Create index pattern: logs-*
- You should see logs flowing in
Baking Period and Monitoring
Overview
After initial deployment, I let the system run for several weeks to understand actual usage patterns and identify issues.
Monitoring Commands
Check resource usage:
# Node-level resource usage
kubectl top nodes
# Pod-level resource usage
kubectl top pods -n elastic-stack
# Elasticsearch memory usage specifically
kubectl exec elasticsearch-master-0 -n elastic-stack -- \
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_nodes/stats/jvm?pretty"
Check pod status and restarts:
# Watch for crash loops
kubectl get pods -n elastic-stack -w
# Check pod restart counts
kubectl get pods -n elastic-stack
Monitor Elasticsearch health:
# Cluster stats
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_cluster/stats?pretty"
# Index count and sizes
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_cat/indices?v&s=store.size:desc"
What I Observed (After 2-3 Weeks)
Resource Usage Reality:
$ kubectl top pods -n elastic-stack
NAME CPU MEMORY
elasticsearch-master-0 823m 2760Mi # Exceeding 2Gi limit!
kibana-kibana-xxx 200m 400Mi
filebeat-filebeat-xxx 50m 100Mi
filebeat-filebeat-yyy 45m 95Mi
Problems Identified:

1. Elasticsearch Memory Issues
   - Allocated: 2Gi
   - Actual usage: 2.76Gi
   - Risk: OOM kills, performance degradation

2. Filebeat Crash Loops

   $ kubectl get pods -n elastic-stack
   NAME                      READY   STATUS    RESTARTS   AGE
   filebeat-filebeat-abc12   1/1     Running   100+       3d

   Error in logs:
   cannot obtain lockfile: cannot start, data directory belongs to process with pid 8

3. Index Sprawl

   $ curl -k -u "elastic:${ELASTIC_PASSWORD}" \
     "https://localhost:9200/_cat/indices?v" | wc -l
   500+ indices   # No lifecycle management!

4. No Backup Strategy
   - Millions of documents stored
   - Growing data with no disaster recovery plan
   - Need automated backup solution
Utilizing AI to Enhance Deployment
Observing Findings with AI
At this point, I engaged Claude AI to help diagnose and fix these issues. This section documents that workflow and what we discovered.
The AI-Assisted Workflow
1. Gathered diagnostic data:
# Filebeat crash loop investigation
kubectl describe pod filebeat-filebeat-abc12 -n elastic-stack > filebeat-crash.txt
kubectl logs filebeat-filebeat-abc12 -n elastic-stack --tail=100 > filebeat-logs.txt
# Elasticsearch resource analysis
kubectl describe pod elasticsearch-master-0 -n elastic-stack > es-describe.txt
kubectl top pod elasticsearch-master-0 -n elastic-stack > es-usage.txt
# Index analysis
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
"https://localhost:9200/_cat/indices?v&s=creation.date:desc" > indices.txt
2. Presented problems to Claude AI:
- Shared error messages and logs (sanitized of credentials)
- Described environment constraints (homelab, single-node)
- Asked for multiple solution options with trade-offs
3. Evaluated solutions:
- Discussed pros/cons of each approach
- Considered homelab vs. production best practices
- Made decisions based on my specific context
4. Implemented fixes incrementally:
- One change at a time
- Validated each fix before moving to next
- Documented decisions and findings
Mitigating Findings Suggested by AI
The following sections detail each fix that was implemented based on AI-assisted troubleshooting.
Fix #1 - Backup Strategy (Velero + MinIO)
Overview:
Before making any risky configuration changes, I established a disaster recovery strategy using Velero to back up the Elasticsearch data to MinIO running on my NAS.
Implementation Note:
This guide focuses on Elasticsearch optimization, so I won't detail the full Velero/MinIO setup here. That will be covered in a dedicated backup guide. However, the key points:
- Velero installed in the cluster for Kubernetes backup/restore
- MinIO running on NAS as S3-compatible storage target
- Scheduled backups of the elastic-stack namespace and persistent volumes
- Tested restores to validate the backup chain works
Why this was critical:
With backups in place, I could confidently make configuration changes knowing I could recover from any mistakes. This enabled the emptyDir fix in the next step.
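For reference, the scheduled backups mentioned above look roughly like this (a sketch, assuming the Velero CLI is installed and a backup storage location pointing at MinIO already exists; the schedule name, cron expression, and TTL are illustrative):
# Nightly backup of the elastic-stack namespace, retained for 30 days
velero schedule create elastic-stack-nightly \
  --schedule="0 2 * * *" \
  --include-namespaces elastic-stack \
  --ttl 720h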
Verification:
# Check Velero backups exist
velero backup get
# Validate latest backup
velero backup describe <backup-name> --details
Fix #2 - Filebeat Crash Loop (emptyDir)
Root Cause:
Filebeat was using a hostPath volume for data persistence:
# BEFORE (problematic configuration)
volumes:
- name: data
hostPath:
path: /var/lib/filebeat-filebeat-elastic-stack-data
type: DirectoryOrCreate
When Filebeat containers crashed, lock files remained on the host filesystem, preventing new containers from starting.
The Fix:
Changed to emptyDir volume type:
# Apply the patch
kubectl patch daemonset filebeat-filebeat -n elastic-stack \
--type='json' \
-p='[{"op": "replace", "path": "/spec/template/spec/volumes/3", "value": {"name": "data", "emptyDir": {}}}]'
Current deployed configuration (in DaemonSet spec):
# AFTER (fixed configuration)
# File reference: kubectl get daemonset filebeat-filebeat -n elastic-stack -o yaml
volumes:
- name: data
emptyDir: {} # ← The fix!
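To confirm the patch took effect (assuming the volume is still named "data" in your DaemonSet):
# Verify the data volume is now an emptyDir
kubectl get daemonset filebeat-filebeat -n elastic-stack \
  -o jsonpath='{.spec.template.spec.volumes[?(@.name=="data")]}'; echo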
Trade-offs:
Pro:
- Clean restarts, no more crash loops
- Lock files cleared on pod restart
Con:
- Registry state (log file positions) lost on restart
- Potential duplicate logs after restart
Why acceptable:
- Logs are backed up to NAS
- Occasional duplicates preferable to non-functional system
- Homelab context (not financial transactions)
Result:
$ kubectl get pods -n elastic-stack -l app=filebeat-filebeat
NAME READY STATUS RESTARTS AGE
filebeat-filebeat-abc12 1/1 Running 0 2d
filebeat-filebeat-def34 1/1 Running 0 2d
Zero restarts since October 29, 2025!
Fix #3 - Elasticsearch Resource Sizing
The Problem:
Initial allocation was conservative guesswork:
- Allocated: 2Gi RAM
- Actual usage: 2.76Gi RAM
- Java heap: 1g (50% of 2Gi)
This caused performance issues and risk of OOM kills.
The Analysis:
Elasticsearch needs memory for:
- Java heap - Core operations (indexing, search)
- OS file cache - Lucene segments (critical for performance)
Best practice: 50/50 split between heap and file cache.
The Fix:
Updated resource allocation to match reality:
# BEFORE (initial guess)
resources:
requests:
cpu: "1000m"
memory: "2Gi"
limits:
cpu: "1000m"
memory: "2Gi"
esJavaOpts: "-Xmx1g -Xms1g"
# AFTER (data-driven sizing)
# See: https://github.com/npecka/peckacyber/blob/main/guides/elasticsearch-k8s-optimization/elasticsearch-values.yaml
resources:
requests:
cpu: "1000m"
memory: "4Gi" # Doubled to accommodate actual usage
limits:
cpu: "1000m"
memory: "4Gi"
esJavaOpts: "-Xmx2g -Xms2g" # 2GB heap, 2GB for OS/Lucene cache
Applied via Helm upgrade:
helm upgrade elasticsearch elastic/elasticsearch \
-f elasticsearch-values.yaml \
-n elastic-stack
Result:
$ kubectl top pod elasticsearch-master-0 -n elastic-stack
NAME CPU MEMORY
elasticsearch-master-0 823m 2760Mi # Now within limits!
Stable operation with no OOM events.
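An easy way to keep an eye on heap versus the new 2g ceiling going forward (heap.percent and heap.max are standard _cat/nodes columns):
# Heap and RAM usage per node; heap.percent should sit well below 100
kubectl exec elasticsearch-master-0 -n elastic-stack -- \
  curl -s -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent"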
Fix #4 - Index Lifecycle Management (ILM)
The Problem:
500+ indices accumulated with no cleanup strategy:
- Daily indices never deleted
- Storage growing unbounded
- Performance degradation from too many indices
The Solution:
Enabled ILM in Filebeat configuration:
# See: https://github.com/npecka/peckacyber/blob/main/guides/elasticsearch-k8s-optimization/filebeat-values.yaml
# ILM and index template settings
setup.ilm.enabled: true
setup.ilm.rollover_alias: "filebeat"
setup.ilm.pattern: "{now/d}-000001"
setup.template.name: "filebeat"
setup.template.pattern: "logs-*"
setup.template.settings:
index.number_of_shards: 1
index.number_of_replicas: 0 # Single node, no replicas needed
Namespace-based index routing:
# See: https://github.com/npecka/peckacyber/blob/main/guides/elasticsearch-k8s-optimization/filebeat-values.yaml
indices:
- index: "logs-kubernetes-%{+yyyy.MM.dd}"
when.or:
- not:
has_fields: ['kubernetes.namespace']
- index: "logs-security-%{[kubernetes.namespace]}-%{+yyyy.MM.dd}"
when.or:
- equals:
kubernetes.namespace: "your-security-namespace-1"
- equals:
kubernetes.namespace: "your-security-namespace-2"
- equals:
kubernetes.namespace: "your-security-namespace-3"
- index: "logs-app-%{[kubernetes.namespace]}-%{+yyyy.MM.dd}"
Benefits:
- Organized indices - Security logs separate from app logs
- Automated rollover - Daily indices managed automatically
- Custom retention - Can set different retention per namespace
- Better performance - Fewer indices to search across
Applied the fix:
# Update Filebeat configuration
helm upgrade filebeat elastic/filebeat \
-f filebeat-values.yaml \
-n elastic-stack
# Restart Filebeat to apply
kubectl rollout restart daemonset filebeat-filebeat -n elastic-stack
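After the restart, you can verify that the ILM policy and index template exist. The policy name "filebeat" below is an assumption (Filebeat's default); adjust it if yours differs:
# Check the ILM policy and index template created by Filebeat setup
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_ilm/policy/filebeat?pretty"
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  "https://localhost:9200/_cat/templates/filebeat*?v"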
Security Hardening
Overview
Applied least-privilege security settings to all components.
Pod Security Context
Elasticsearch:
# File reference: elasticsearch StatefulSet spec
podSecurityContext:
fsGroup: 1000
runAsUser: 1000
securityContext:
capabilities:
drop:
- ALL
runAsNonRoot: true
runAsUser: 1000
Kibana:
# File reference: kibana Deployment spec
securityContext:
capabilities:
drop:
- ALL
runAsNonRoot: true
runAsUser: 1000
podSecurityContext:
fsGroup: 1000
Understanding Least Privilege Security
What we configured and why:
1. runAsUser: 1000 and runAsNonRoot: true
   - Forces containers to run as UID 1000 (the elasticsearch user), not root (UID 0)
   - Prevents attackers from gaining root access if they compromise the container
   - Default behavior: many containers run as root by default, which is dangerous

2. capabilities.drop: ALL
   - Linux capabilities are special privileges (like CAP_NET_ADMIN, CAP_SYS_ADMIN)
   - By default, containers get ~14 capabilities even when running as non-root
   - drop: ALL removes every capability, leaving only basic user permissions
   - Example: without this, a compromised container might manipulate network interfaces or kill processes

3. fsGroup: 1000
   - Sets file system group ownership to GID 1000
   - Ensures Elasticsearch can read/write to mounted volumes without needing root
   - Volumes mounted into the pod will have group ownership set to 1000
Real-world impact:
If an attacker exploits a vulnerability in Elasticsearch, Kibana, or Filebeat:
- ❌ Without these settings: Attacker runs as root with capabilities → can escape container, access host filesystem, network manipulation
- ✅ With these settings: Attacker runs as unprivileged user 1000 with no capabilities → limited to application-level damage only
This is "defense in depth" - even if the application is compromised, the blast radius is minimal.
Results and Key Takeaways
Measurable Results
Before Optimization:
- Filebeat: 100+ restarts, crash loop
- Elasticsearch: 2Gi RAM (insufficient), OOM risk
- Indices: 500+ unmanaged indices, no lifecycle
- Backup: No strategy
- Operations: Manual, reactive troubleshooting
After Implementation:
- Filebeat: 0 restarts, stable
- Elasticsearch: 4Gi RAM, stable operation
- Indices: Organized by namespace with ILM
- Backup: Velero + MinIO automated backups to NAS
- Operations: Documented, reproducible
Resource Impact:
- Eliminated: 100+ crashes per week
- Prevented: OOM kills via proper memory allocation
- Improved: Query performance through ILM
- Saved: ~4 hours/week troubleshooting time
System Stats (Example Deployment):
- Daily log volume: ~300MB/day
- Retention period: 30 days
- Storage used: ~12GB (with Elasticsearch overhead)
- Uptime: Stable since fixes applied
- Log sources: Multiple namespaces (kube-system, security monitoring, CI/CD, application workloads, etc.)
Key Takeaways
1. Monitor Before Optimizing
Running the system for 2-3 weeks revealed actual usage patterns. Initial resource guesses were 50% off - real data is essential.
2. One Change at a Time
Each fix was applied individually and validated. This made it easy to identify what worked and simplified rollback if needed.
3. Context Matters More Than "Best Practices"
The emptyDir solution isn't what production guides recommend, but it was perfect for my homelab where occasional duplicate logs are acceptable. Understand your constraints.
4. Document Everything
Creating detailed documentation of changes - what was changed, why, and what trade-offs were accepted - is invaluable for future troubleshooting and knowledge sharing.
5. AI as a Force Multiplier
Using Claude AI accelerated troubleshooting by correlating error messages across Kubernetes, Elasticsearch, and Filebeat domains simultaneously. However, I still made all decisions and validated everything in my environment.
AI helped with:
- Root cause analysis (correlating errors with known issues)
- Solution options (providing multiple approaches with trade-offs)
- Configuration examples (YAML snippets adapted to my context)
- Verification steps (commands to validate fixes)
I was responsible for:
- Gathering diagnostic data
- Evaluating trade-offs for my specific context
- Making architectural decisions
- Testing and validating all changes
- Documenting what actually worked
AI Tips for Operations
How to Use AI Tools Effectively
1. Gather comprehensive diagnostics first:
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
kubectl top nodes
kubectl top pods -n <namespace>
2. Present specific problems, not vague questions:
- ❌ "How do I configure Elasticsearch?"
- ✅ "My Filebeat pod is crash-looping with this error: [paste error]"
3. Ask for trade-off analysis:
- Request multiple solution options
- Ask about pros/cons of each
- Discuss homelab vs. production implications
4. Validate everything:
- Test one change at a time
- Verify it works in your environment
- Don't trust suggestions blindly
5. Document your decisions:
- Capture what you changed
- Explain why you chose this approach
- Note trade-offs accepted
- Include verification steps
Troubleshooting
Common Issues
Issue 1: Elasticsearch pod stuck in pending
- Symptoms: Pod shows "Pending" status indefinitely
- Diagnosis:
  kubectl describe pod elasticsearch-master-0 -n elastic-stack
  kubectl get pvc -n elastic-stack
  kubectl top nodes
- Solutions:
  - If PVC not binding: Check storage class exists and has available capacity
    kubectl get storageclass
    # Ensure your storage provisioner is running
  - If insufficient node resources: Either free up resources or reduce Elasticsearch memory requests in the values file
    # Reduce memory in elasticsearch-values.yaml, then upgrade
    helm upgrade elasticsearch elastic/elasticsearch -f elasticsearch-values.yaml -n elastic-stack
Issue 2: Filebeat can't connect to Elasticsearch
- Symptoms: Filebeat logs show "connection refused" or TLS errors
- Diagnosis:
  kubectl logs daemonset/filebeat-filebeat -n elastic-stack --tail=50
  kubectl get secret elasticsearch-master-certs -n elastic-stack
- Solutions:
  - If certificate secret missing: Elasticsearch may not be fully deployed yet. Wait for it to create the certs secret, or restart the Filebeat DaemonSet
    kubectl rollout restart daemonset/filebeat-filebeat -n elastic-stack
  - If wrong service name: Verify the Elasticsearch service exists and update filebeat-values.yaml if needed
    kubectl get svc -n elastic-stack
    # Should show: elasticsearch-master
Issue 3: Kibana shows "Elasticsearch is not ready"
- Symptoms: Kibana UI displays connection error or "waiting for Elasticsearch"
- Diagnosis:
  kubectl logs deployment/kibana-kibana -n elastic-stack --tail=50
  kubectl exec elasticsearch-master-0 -n elastic-stack -- \
    curl -k -u "elastic:${ELASTIC_PASSWORD}" https://localhost:9200/_cluster/health
- Solutions:
  - If service account token missing: Recreate the token as shown in Step 2
    # Create token and secret (see Step 2 for full commands)
    kubectl exec elasticsearch-master-0 -n elastic-stack -- \
      /usr/share/elasticsearch/bin/elasticsearch-service-tokens create elastic/kibana kibana-token
  - If Elasticsearch cluster not healthy: Wait for Elasticsearch to reach yellow/green status, or investigate Elasticsearch pod issues first
  - If authentication fails: Verify the token secret name matches what's referenced in the Kibana deployment
Future Considerations
After completing this deployment:
1. Customize ILM policies - Set retention periods for your needs:

   # Example: Delete indices older than 30 days
   curl -k -u "elastic:${ELASTIC_PASSWORD}" -X PUT \
     "https://localhost:9200/_ilm/policy/logs_policy" \
     -H 'Content-Type: application/json' -d'
   {
     "policy": {
       "phases": {
         "delete": {
           "min_age": "30d",
           "actions": { "delete": {} }
         }
       }
     }
   }'

2. Set up ingress - Expose Kibana securely:
- Configure TLS certificates
- Set up authentication (LDAP, SAML, etc.)
- Restrict access by IP
3. Enhance backup strategy - Beyond Velero:
- Consider Elasticsearch native snapshots for faster recovery
- Test restore procedures regularly
- See the upcoming dedicated backup guide for full details
4. Consider Loki migration - If resource-constrained:
- Loki uses ~80% fewer resources for simple log aggregation
- Trade-off: Slower full-text search vs. Elasticsearch
- Evaluate whether full-text search speed is critical for your use case
5. Apply to other workloads - Use these principles for:
- PostgreSQL, MongoDB, Redis deployments
- Any stateful Kubernetes application
- Resource sizing methodology applies universally
References and Additional Resources
Deployment File References
All configurations discussed in this guide are currently deployed and can be found:
Cluster (source of truth):
- Elasticsearch: kubectl get statefulset elasticsearch-master -n elastic-stack -o yaml
- Kibana: kubectl get deployment kibana-kibana -n elastic-stack -o yaml
- Filebeat: kubectl get daemonset filebeat-filebeat -n elastic-stack -o yaml
- Filebeat config: kubectl get configmap filebeat-filebeat-daemonset-config -n elastic-stack -o yaml
GitHub Repository:
All configuration files referenced in this guide are available at:
https://github.com/npecka/peckacyber/tree/main/guides/elasticsearch-k8s-optimization
Additional Resources
- Elasticsearch on Kubernetes: https://www.elastic.co/guide/en/cloud-on-k8s/current/index.html
- Filebeat Configuration: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-reference-yml.html
- ILM Policies: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
- Kubernetes Best Practices: https://kubernetes.io/docs/concepts/configuration/overview/
- Claude AI: https://claude.ai/
Related Guides on This Site:
- Coming soon: "Kubernetes Backup Strategy: Velero + MinIO for Homelabs"
- Coming soon: "Migrating from Elasticsearch to Loki for Cost Savings"
- Coming soon: "AI-Assisted Kubernetes Operations: A Workflow Guide"
Security Considerations
What to Share with AI Tools:
Safe to share:
- Error messages
- Configuration structures (sanitized)
- Resource metrics
- Architecture diagrams
Never share:
- Actual passwords or API keys
- Internal IP addresses (use placeholders)
- Hostnames revealing network structure
- Sensitive log data
Example sanitization:
# Shared with AI:
password: "${ELASTICSEARCH_PASSWORD}" # Reference only
# NOT shared:
password: "actualP@ssw0rd123"
Production Hardening:
For production deployments, additionally consider:
- Network policies - Restrict pod-to-pod communication
- RBAC - Limit service account permissions
- Secrets management - Use Vault or Sealed Secrets
- TLS everywhere - Inter-node communication encryption
- Audit logging - Track all Elasticsearch access
- Regular patching - Keep Elastic Stack up to date
Changelog
v1.0 - October 31, 2025 - Initial release
- Complete deployment guide from initial setup through optimization
- All configurations tested and validated in homelab
- Documents real problems and proven solutions
Questions or Issues? Feel free to reach out via the contact form on peckacyber.com
Found this helpful? Subscribe to the newsletter for more guides.