A pod showing CrashLoopBackOff is a symptom. A node showing NotReady is a symptom. Root cause analysis (RCA) is the process of finding what actually caused those symptoms — and doing it fast enough to matter in a production incident.

The Core RCA Loop

Effective Kubernetes RCA follows a consistent loop regardless of the specific failure:

  1. Establish the timeline — when did the failure start?
  2. Identify the blast radius — what else is affected?
  3. Find the change vector — what was deployed, scaled, or modified?
  4. Correlate signals — do events, logs, and metrics agree?
  5. Confirm the cause — does reverting or patching the suspected cause fix it?

Step 1: Establish the Timeline

# Cluster events sorted by time (last 1 hour by default)
kubectl get events -A --sort-by='.lastTimestamp'

# When did pods start failing?
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> | grep -A3 "Last State"

# Deployment rollout history
kubectl rollout history deployment/<name> -n <ns>
kubectl rollout history deployment/<name> -n <ns> --revision=5

Cross-reference the timestamp of the first CrashLoopBackOff event with the most recent deployment. In the majority of incidents, they align.
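The cross-referencing step is mechanical enough to sketch in code. A minimal illustration, with made-up timestamps standing in for what you would pull from kubectl get events -o json and kubectl rollout history:

```python
# Correlate the first failure event with the most recent rollout before it.
# All timestamps here are illustrative, not from a real cluster.
from datetime import datetime

def parse(ts: str) -> datetime:
    # kubectl emits RFC 3339 timestamps, e.g. "2024-05-01T14:03:22Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

first_failure = parse("2024-05-01T14:03:22Z")   # first BackOff event

rollouts = {                                     # revision -> deploy time
    4: parse("2024-05-01T09:10:00Z"),
    5: parse("2024-05-01T14:01:50Z"),
}

# The prime suspect is the latest revision deployed before the failure began.
suspect = max((rev for rev, t in rollouts.items() if t <= first_failure),
              default=None)
print(suspect)  # 5 — deployed ~90 seconds before the first failure
```

If no revision precedes the failure, the change vector is probably not a rollout at all — look at ConfigMaps, certificates, or node-level changes instead.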

Step 2: Map the Blast Radius

# Are other namespaces healthy?
kubectl get pods -A --field-selector='status.phase!=Running,status.phase!=Succeeded'

# Is it isolated to one node?
kubectl get pods -n <ns> -o wide | awk 'NR>1 {print $7}' | sort | uniq -c

# Check node conditions
kubectl describe nodes | grep -E "(Name:|MemoryPressure|DiskPressure|PIDPressure|Ready)"  

If failures are scattered across nodes, the problem is application-level (bad image, missing secret). If they concentrate on one node, suspect node-level resource exhaustion or a hardware fault.
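That scattered-versus-concentrated heuristic can be expressed as a few lines of code. A sketch with invented pod and node names:

```python
# Node-concentration heuristic: failures clustered on one node point at the
# node; failures scattered across nodes point at the workload itself.
from collections import Counter

failing_pods = [  # (pod, node) pairs, as in `kubectl get pods -o wide`
    ("api-7d4f-abc", "node-2"),
    ("api-7d4f-def", "node-2"),
    ("worker-66b-xyz", "node-2"),
]

by_node = Counter(node for _, node in failing_pods)
node, count = by_node.most_common(1)[0]
if count == len(failing_pods):
    verdict = f"all failures on {node}: suspect node-level exhaustion"
else:
    verdict = "failures scattered: suspect application-level cause"
print(verdict)
```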

Step 3: Find the Change Vector

The most common change vectors in Kubernetes incidents are:

  • New image tag deployed with a regression
  • ConfigMap or Secret modified (wrong key name, bad value)
  • Resource limits lowered in a Helm values update
  • HPA scaled down aggressively during low-traffic window, then traffic spiked
  • Node drained for maintenance without respecting PodDisruptionBudget
  • Certificate expiry (surprising but common — check NotBefore/NotAfter)

# Did a ConfigMap change recently?
kubectl describe configmap <name> | grep -iE "annotations|last-applied"

# GitOps — what did Argo/Flux apply?
kubectl get application -n argocd
argocd app history <app-name>

Step 4: Correlate Signals

A robust RCA requires at least two independent signals pointing to the same cause before you act. A single log line saying "out of memory" is not enough — confirm with the OOMKilled reason in kubectl describe and the container memory metric crossing the limit in Prometheus:

# Prometheus query for OOMKill events
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

# Or check cgroup memory stats on the node (node-shell is a kubectl krew
# plugin; the path shown is cgroup v1 — on cgroup v2, read memory.peak)
kubectl node-shell <node> -- cat /sys/fs/cgroup/memory/.../memory.max_usage_in_bytes
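The two-independent-signals rule is simple enough to encode as a guard. A toy sketch — the signal names are illustrative, not a real API:

```python
# Require at least two independent signals before acting on a suspected cause.
def confirmed_root_cause(signals: dict) -> bool:
    """True only when two or more independent signals agree."""
    return sum(signals.values()) >= 2

evidence = {
    "log_says_oom": True,                # container log: "out of memory"
    "describe_reason_oomkilled": True,   # kubectl describe: Last State: OOMKilled
    "metric_crossed_limit": False,       # Prometheus: memory usage vs limit
}
print(confirmed_root_cause(evidence))  # True — two signals agree
```

A single log line fails this check by design; it forces you to go confirm with a second source before reverting or patching anything.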

Step 5: Confirm and Document

Before marking an incident resolved, confirm the fix holds: watch the pod restart count stop climbing (kubectl get pods -w) and verify the readiness probe is passing. Document the RCA in a postmortem while the timeline is fresh — include the exact sequence of events, the change that caused it, and the immediate fix applied.

Postmortem structure: Impact → Timeline → Root cause → Contributing factors → Fix applied → Action items to prevent recurrence.

Common Root Causes Reference

  • OOMKilled — memory limit too low for workload
  • ImagePullBackOff — wrong tag, private registry missing imagePullSecret
  • FailedMount — PVC not bound, wrong StorageClass, or secret key name typo
  • Evicted — node disk or memory pressure; no requests set
  • Pending forever — no node matches affinity rules or resource requests
  • Readiness failing — app slow to start, probe path wrong, dependency down
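The reference list above doubles as a triage table. A minimal lookup sketch (the cause strings simply restate the list; extend it with your own runbook entries):

```python
# Quick triage map: pod status reason -> first thing to check.
LIKELY_CAUSE = {
    "OOMKilled": "memory limit too low for the workload",
    "ImagePullBackOff": "wrong tag or missing imagePullSecret",
    "FailedMount": "PVC unbound, wrong StorageClass, or secret key typo",
    "Evicted": "node disk/memory pressure; no resource requests set",
}

def triage(reason: str) -> str:
    return LIKELY_CAUSE.get(reason, "unknown reason: read events and logs")

print(triage("OOMKilled"))
```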

Accelerating RCA with Automation

Manual RCA on a complex cluster takes 30–90 minutes of jumping between commands, dashboards, and log streams. The signals are all there — the bottleneck is human correlation speed. AI-assisted tools can ingest all relevant signals simultaneously and surface the root cause in seconds, letting SREs spend their time on the fix rather than the investigation.