A pod showing CrashLoopBackOff is a symptom. A node showing NotReady is a symptom. Root cause analysis (RCA) is the process of finding what actually caused those symptoms — and doing it fast enough to matter in a production incident.
The Core RCA Loop
Effective Kubernetes RCA follows a consistent loop regardless of the specific failure:
- Establish the timeline — when did the failure start?
- Identify the blast radius — what else is affected?
- Find the change vector — what was deployed, scaled, or modified?
- Correlate signals — do events, logs, and metrics agree?
- Confirm the cause — does reverting or patching the suspected cause fix it?
Step 1: Establish the Timeline
# Cluster events sorted by time (retained for 1 hour by default)
kubectl get events -A --sort-by='.lastTimestamp'
# When did pods start failing?
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> | grep -A3 "Last State"
# Deployment rollout history
kubectl rollout history deployment/<name> -n <ns>
kubectl rollout history deployment/<name> -n <ns> --revision=5
Cross-reference the timestamp of the first CrashLoopBackOff event with the most recent deployment. In the majority of incidents, they align.
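One way to line the two timelines up from the CLI (a sketch — `<ns>` is a placeholder, and `reason=BackOff` is the event reason Kubernetes emits for crash-looping containers):

```shell
# Newest ReplicaSet creation time approximates the last rollout
kubectl get rs -n <ns> --sort-by='.metadata.creationTimestamp'
# Earliest/latest BackOff events for the failing workload
kubectl get events -n <ns> --field-selector reason=BackOff --sort-by='.lastTimestamp'
```

If the first BackOff event lands minutes after the newest ReplicaSet's creation timestamp, the deployment is your prime suspect.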
Step 2: Map the Blast Radius
# Are other namespaces healthy?
kubectl get pods -A --field-selector='status.phase!=Running'
# Is it isolated to one node?
kubectl get pods -n <ns> -o wide | awk '{print $7}' | sort | uniq -c
# Check node conditions
kubectl describe nodes | grep -E "(Name:|MemoryPressure|DiskPressure|PIDPressure|Ready)"
If failures are scattered across nodes, the problem is likely application-level (bad image, missing secret). If they concentrate on one node, suspect node-level resource exhaustion or a hardware fault.
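To see the per-node concentration cluster-wide in one pipeline (a sketch assuming the default `kubectl` column layout — with `-A`, the NODE column shifts to field 8):

```shell
# Count failing pods per node, highest first; NR>1 skips the header row
kubectl get pods -A --field-selector='status.phase!=Running' -o wide \
  | awk 'NR>1 {print $8}' | sort | uniq -c | sort -rn
```

A single node dominating this output is a strong hint to pivot from application debugging to node debugging.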
Step 3: Find the Change Vector
The most common change vectors in Kubernetes incidents are:
- New image tag deployed with a regression
- ConfigMap or Secret modified (wrong key name, bad value)
- Resource limits lowered in a Helm values update
- HPA scaled down aggressively during low-traffic window, then traffic spiked
- Node drained for maintenance without respecting PodDisruptionBudget
- Certificate expiry (surprising but common — check NotBefore/NotAfter)
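The certificate check can be done directly from the cluster (a sketch — `<tls-secret>` and `<ns>` are placeholders for your serving certificate's Secret):

```shell
# Decode the TLS cert from its Secret and print the validity window
kubectl get secret <tls-secret> -n <ns> -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -dates   # prints notBefore= / notAfter=
```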
# Did a ConfigMap change recently?
kubectl describe configmap <name> | grep -iE "annotations|last-applied"
# GitOps — what did Argo/Flux apply?
kubectl get application -n argocd
argocd app history <app-name>
Step 4: Correlate Signals
A robust RCA requires at least two independent signals pointing to the same cause before you act. A single log line saying "out of memory" is not enough — confirm with the OOMKilled reason in kubectl describe and the container memory metric crossing the limit in Prometheus:
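The `kubectl` side of that confirmation can be pulled straight from pod status rather than eyeballed in `describe` output (a sketch — `<pod>` and `<ns>` are placeholders):

```shell
# Last termination reason for the pod's containers; "OOMKilled" confirms the kill
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```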
# Prometheus query for OOMKill events
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
# Or check cgroup memory stats on the node
# node-shell is a kubectl plugin (installable via krew); path shown is cgroup v1
kubectl node-shell <node> -- cat /sys/fs/cgroup/memory/.../memory.max_usage_in_bytes
Step 5: Confirm and Document
Before marking an incident resolved, confirm the fix holds: watch the pod restart count stop climbing (kubectl get pods -w) and verify the readiness probe is passing. Document the RCA in a postmortem while the timeline is fresh — include the exact sequence of events, the change that caused it, and the immediate fix applied.
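Both checks are scriptable (a sketch — placeholders as before; the jsonpath filter reads the pod's Ready condition):

```shell
# Watch restart counts; they should stop climbing
kubectl get pods -n <ns> -w
# Readiness condition should report "True" once the probe passes
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```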
Common Root Causes Reference
- OOMKilled — memory limit too low for workload
- ImagePullBackOff — wrong tag, private registry missing imagePullSecret
- FailedMount — PVC not bound, wrong StorageClass, or secret key name typo
- Evicted — node disk or memory pressure; no requests set
- Pending forever — no node matches affinity rules or resource requests
- Readiness failing — app slow to start, probe path wrong, dependency down
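To map the pods in front of you onto the reference list above, one hedged approach is to dump each container's waiting reason in one pass (the jsonpath range template is standard `kubectl`; the awk filter drops healthy pods):

```shell
# namespace <tab> pod <tab> waiting reason (e.g. CrashLoopBackOff, ImagePullBackOff)
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
  | awk -F'\t' '$3 != ""'
```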
Accelerating RCA with Automation
Manual RCA on a complex cluster takes 30–90 minutes of jumping between commands, dashboards, and log streams. The signals are all there — the bottleneck is human correlation speed. AI-assisted tools can ingest all relevant signals simultaneously and surface the root cause in seconds, letting SREs spend their time on the fix rather than the investigation.