A pod showing CrashLoopBackOff is a symptom. A node showing NotReady is a symptom. Root cause analysis (RCA) is the process of finding what actually caused those symptoms — and doing it fast enough to matter in a production incident.
The Core RCA Loop
Effective Kubernetes RCA follows a consistent loop regardless of the specific failure:
- Establish the timeline — when did the failure start?
- Identify the blast radius — what else is affected?
- Find the change vector — what was deployed, scaled, or modified?
- Correlate signals — do events, logs, and metrics agree?
- Confirm the cause — does reverting or patching the suspected cause fix it?
- Document and prevent — write the postmortem and add safeguards
Step 1: Establish the Timeline
# Cluster events sorted by time (last 1 hour by default)
kubectl get events -A --sort-by='.lastTimestamp'
# When did pods start failing?
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> | grep -A3 "Last State"
# Deployment rollout history: what changed and when?
kubectl rollout history deployment/<name> -n <ns>
kubectl rollout history deployment/<name> -n <ns> --revision=5
# Check HPA scaling events
kubectl describe hpa <name> -n <ns> | grep -A20 "Events"
Cross-reference the timestamp of the first CrashLoopBackOff event with the most recent deployment. Often the two fall within a few minutes of each other. If they do not, expand the search: check HPA activity, ConfigMap updates, and node events.
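As a rough sketch of that cross-reference, the helper below pulls the earliest back-off timestamp out of an event listing so it can be compared with the rollout time. The function name `first_backoff` and the two-column input format are assumptions made for illustration. It also assumes restart back-off shows up with the event reason `BackOff`, which image pull back-off shares, so read the event message before trusting the result.

```shell
# first_backoff: reads "<timestamp> <reason>" lines on stdin and prints the
# earliest timestamp whose reason is BackOff. RFC 3339 UTC timestamps sort
# correctly as plain text, so no date parsing is needed.
first_backoff() {
  awk '$2 == "BackOff" { print $1 }' | LC_ALL=C sort | head -n 1
}

# Hypothetical usage against a live cluster:
#   kubectl get events -n <ns> --no-headers \
#     -o custom-columns=TIME:.lastTimestamp,REASON:.reason | first_backoff
```

Compare the printed timestamp with the revision times from the rollout history to see whether the deploy is a plausible trigger.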
Step 2: Map the Blast Radius
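# Are other namespaces healthy? (pods in CrashLoopBackOff usually still report phase Running, so also scan the STATUS column of a plain "kubectl get pods -A")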
kubectl get pods -A --field-selector='status.phase!=Running'
# Is it isolated to one node?
kubectl get pods -n <ns> -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c
# Check node conditions
kubectl describe nodes | grep -E "(Name:|MemoryPressure|DiskPressure|PIDPressure|Ready)"
# Are downstream services affected?
kubectl get endpoints -n <ns> # any empty endpoints?
If failures are scattered across nodes, the problem is application-level (bad image, missing secret, quota exhaustion). If they concentrate on one node, suspect node-level resource exhaustion or a hardware fault. If it's cross-namespace, suspect a cluster-level component like CoreDNS, the API server, or a storage provisioner.
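The scattered / one-node / cross-namespace rule of thumb can be sketched as a small classifier. This is only a first guess under stated assumptions: the function name `classify_blast_radius` and the "namespace node" input format are hypothetical, and a single failing pod is treated as application-level because one data point cannot show concentration on a node.

```shell
# classify_blast_radius: reads "<namespace> <node>" pairs on stdin, one line
# per failing pod, and prints a rough guess at the failure scope.
classify_blast_radius() {
  awk '
    NF >= 2 { ns[$1] = 1; node[$2] = 1; total++ }
    END {
      n_ns = 0;   for (k in ns)   n_ns++
      n_node = 0; for (k in node) n_node++
      if (total == 0)                    print "no-failures"
      else if (n_node == 1 && total > 1) print "node-level"
      else if (n_ns > 1)                 print "cluster-level"
      else                               print "application-level"
    }'
}

# Hypothetical usage against a live cluster:
#   kubectl get pods -A --field-selector='status.phase!=Running' --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NODE:.spec.nodeName | classify_blast_radius
```

Treat the output as a pointer for where to look next, not as a verdict; unscheduled pods have no node assigned and will skew the count.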
Step 3: Find the Change Vector
The most common change vectors in Kubernetes incidents are:
- New image tag deployed with a regression
- ConfigMap or Secret modified (wrong key name, bad value)
- Resource limits lowered in a Helm values update
- HPA scaled down aggressively during low-traffic window, then traffic spiked
- Node drained for maintenance without respecting PodDisruptionBudget
- Certificate expiry (surprising but common; check NotBefore/NotAfter)
- Namespace quota tightened by platform team
- NetworkPolicy added that inadvertently blocks required traffic
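Whichever vector is involved, the useful question is which changes landed shortly before the incident. A minimal sketch of that filter follows, assuming GNU `date` (for `date -d`) and a hand-assembled change log with one "timestamp description" entry per line. The function name `changes_in_window` and the log format are hypothetical.

```shell
# changes_in_window INCIDENT_TIME WINDOW_MINUTES
# stdin: one change per line, "<RFC 3339 timestamp> <description...>"
# Prints the changes that landed at most WINDOW_MINUTES before the incident.
# Assumes GNU date; BSD/macOS date needs different flags.
changes_in_window() {
  local incident window ts desc t
  incident=$(date -u -d "$1" +%s) || return 1
  window=$(( $2 * 60 ))
  while read -r ts desc; do
    # Skip lines whose timestamp cannot be parsed.
    t=$(date -u -d "$ts" +%s 2>/dev/null) || continue
    if [ "$t" -le "$incident" ] && [ $(( incident - t )) -le "$window" ]; then
      printf '%s %s\n' "$ts" "$desc"
    fi
  done
}

# Hypothetical usage, with changes.txt collected from rollout, Helm and GitOps history:
#   changes_in_window 2024-05-01T14:32:00Z 10 < changes.txt
```

An empty result is informative too: it suggests widening the window or looking at vectors that leave no deploy record, such as certificate expiry.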
# Did a ConfigMap change recently?
kubectl describe configmap <name> | grep -iE "annotations|last-applied"
# GitOps — what did Argo/Flux apply?
kubectl get application -n argocd
argocd app history <app-name>
# Check recent image digests
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].imageID}'
# Helm release history
helm history <release> -n <namespace>
Step 4: Correlate Signals
A robust RCA requires at least two independent signals pointing to the same cause before you act. A single log line saying "out of memory" is not enough — confirm with the OOMKilled reason in kubectl describe and the container memory metric crossing the limit in Prometheus:
# Prometheus query for OOMKill events
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
# Memory usage vs limit for a deployment
container_memory_working_set_bytes{namespace="<ns>"}
/ container_spec_memory_limit_bytes{namespace="<ns>"}
# Error rate spike correlated with a deploy
rate(http_requests_total{status=~"5.."}[5m])
Correlate log timestamps with metric spikes. If the 5xx error rate spiked at 14:32 and the deployment rolled out at 14:31, that's strong evidence. If the spike happened at 14:32 but the deploy was at 13:00, look elsewhere.
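The two-signal rule from the OOM example at the start of this step can be made mechanical. The sketch below accepts an OOM diagnosis only when the termination reason and the memory metric agree. The function name `oom_confirmed` and the 95% default threshold are assumptions for illustration, not established cutoffs.

```shell
# oom_confirmed REASON WORKING_SET_BYTES LIMIT_BYTES [THRESHOLD_PCT]
# Succeeds only if both signals agree: the container's last termination
# reason is OOMKilled AND its working set reached THRESHOLD_PCT of the limit.
# THRESHOLD_PCT defaults to 95, which is an assumed value.
oom_confirmed() {
  local reason=$1 ws=$2 limit=$3 threshold=${4:-95}
  [ "$reason" = "OOMKilled" ] || return 1
  [ "$limit" -gt 0 ] || return 1   # a limit of 0 means no limit is set
  [ $(( ws * 100 )) -ge $(( limit * threshold )) ]
}

# Hypothetical usage: reason from the pod status, peak working set from Prometheus.
#   reason=$(kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
#   oom_confirmed "$reason" 530000000 536870912 && echo "OOM confirmed by two signals"
```

If only one of the two checks passes, keep investigating before raising the limit.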
Multi-Service Log Correlation
In a microservices architecture, a single user request touches multiple pods. Correlate across them using a request ID or trace ID:
# Tail logs from all pods in a deployment simultaneously
stern api-deployment -n production
# Filter by trace ID across all logs
stern -n production . | grep "trace_id=abc123"
# If using structured JSON logging
kubectl logs -l app=api -n production --since=5m | jq 'select(.trace_id == "abc123")'
Without a correlation ID, you're matching by timestamp — which requires all services to have clock synchronization and is error-prone. Invest in adding a request ID header early in your service mesh or API gateway.
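Once a correlation ID is present in every log line, rebuilding the path of a single request is a filter plus a sort. A minimal sketch, assuming plain-text logs that contain `trace_id=<id>` and input lines shaped as "pod timestamp message"; the function name `trace_timeline` and that line format are hypothetical.

```shell
# trace_timeline TRACE_ID
# stdin: lines of "<pod> <RFC 3339 timestamp> <message...>"
# Prints only the lines carrying the given trace ID, ordered by timestamp.
trace_timeline() {
  grep -F "trace_id=$1" | LC_ALL=C sort -k2,2
}

# Hypothetical usage: prefix each pod's timestamped log lines with the pod name.
#   for p in $(kubectl get pods -l app=api -n production -o name); do
#     kubectl logs "$p" -n production --timestamps --since=5m | sed "s|^|$p |"
#   done | trace_timeline abc123
```

The trace ID does the matching here; timestamps are used only for ordering, so small clock skew can reorder adjacent hops but cannot pull in unrelated requests.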
Certificate and Secret Expiry RCA
Expired certificates are a common, slow-burning root cause that affects multiple services simultaneously when it hits:
# Check TLS certificates in secrets
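# (tls.crt is base64-encoded PEM, so decode it and let openssl print the expiry date)
kubectl get secrets -n <namespace> -o json |
  jq -r '.items[] | select(.type=="kubernetes.io/tls") | .metadata.name + " " + .data["tls.crt"]' |
  while read -r name crt; do
    printf '%s expires: ' "$name"
    echo "$crt" | base64 -d | openssl x509 -noout -enddate
  done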
# For cert-manager managed certs
kubectl get certificates -A
kubectl describe certificate <name> -n <namespace>
# Check if a pod is failing TLS handshakes
kubectl logs <pod> | grep -iE "certificate|tls|ssl|x509"
Step 5: Confirm and Document
Before marking an incident resolved, confirm the fix holds: watch the pod restart count stop climbing (kubectl get pods -w) and verify the readiness probe is passing. Document the RCA in a postmortem while the timeline is fresh.
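To make "the restart count stops climbing" checkable, one option is to sample the count a few times and test that it never changes. The sketch below assumes one restart count per line on stdin; the function name `restarts_stable` and the sampling cadence in the usage comment are hypothetical choices.

```shell
# restarts_stable: reads restart-count samples on stdin, one per line.
# Succeeds if there is at least one sample and every sample equals the first,
# i.e. the container did not restart during the sampling period.
restarts_stable() {
  awk '
    NR == 1     { first = $1 }
    $1 != first { changed = 1 }
    END         { exit ((NR == 0 || changed) ? 1 : 0) }'
}

# Hypothetical usage: five samples, 30 seconds apart.
#   for i in 1 2 3 4 5; do
#     kubectl get pod <pod> -n <ns> \
#       -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'
#     sleep 30
#   done | restarts_stable && echo "restart count stable"
```

Pick a sampling period longer than the current back-off delay, otherwise a pod that is still crashing can look stable between restarts.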
Postmortem Template
A postmortem that actually prevents future incidents needs these sections:
- Impact: What broke, for how long, how many users affected, revenue impact
- Timeline: UTC timestamps — detection, escalation, mitigation, full resolution
- Root cause: One clear sentence (not a symptom — the underlying cause)
- Contributing factors: What made the root cause possible or harder to catch
- Fix applied: Exact change made, by whom, verified by what
- Action items: Preventive measures with owner, due date, and definition of done
Common Root Causes Reference
- OOMKilled — memory limit too low for actual workload peak
- ImagePullBackOff — wrong tag, private registry missing imagePullSecret, or network policy blocking registry access
- FailedMount — PVC not bound, wrong StorageClass, or secret key name typo
- Evicted — node disk or memory pressure; pods without resource requests are first to go
- Pending forever — no node matches affinity rules, taints not tolerated, or resource requests too large
- Readiness failing — app slow to start, health endpoint wrong, dependency service down
- High latency spike — HPA scale-out lag, cold-start JVM/Python, or GC pause under load
- Rolling update traffic drop — readiness probe path changed, new pods serving traffic before ready
- Certificate expiry — TLS cert expired; services fail mutual TLS or webhook calls
Using the Five Whys in Kubernetes RCA
The Five Whys technique forces you past the symptom to the systemic cause:
- Why did the service go down? — The pod was OOMKilled.
- Why was it OOMKilled? — Memory usage exceeded the 512Mi limit.
- Why did memory usage spike? — A new endpoint returns unbounded query results.
- Why was there no limit on results? — The query lacked pagination and passed code review.
- Why did code review miss it? — No automated check for unbounded queries exists.
The real action item isn't "raise the memory limit" — it's "add linting or integration tests that catch unbounded queries." Five Whys stops at the systemic fix, not the band-aid.
Accelerating RCA with Automation
Manual RCA on a complex cluster can take 30–90 minutes of jumping between commands, dashboards, and log streams. The signals are all there; the bottleneck is how fast a human can correlate them. AI-assisted tools can ingest the relevant signals together and surface a likely root cause far faster, letting SREs spend their time on the fix rather than the investigation.