Kubernetes Root Cause Analysis: Find the Real Problem Fast

A pod showing CrashLoopBackOff is a symptom. A node showing NotReady is a symptom. Root cause analysis (RCA) is the process of finding what actually caused those symptoms — and doing it fast enough to matter in a production incident.

The Core RCA Loop

Effective Kubernetes RCA follows a consistent loop regardless of the specific failure:

Establish the timeline — when did the failure start?
Identify the blast radius — what else is affected?
Find the change vector — what was deployed, scaled, or modified?
Correlate signals — do events, logs, and metrics agree?
Confirm the cause — does reverting or patching the suspected cause fix it?
Document and prevent — write the postmortem and add safeguards

Step 1: Establish the Timeline

# Cluster events sorted by time (the API server retains events for 1 hour by default)
kubectl get events -A --sort-by='.lastTimestamp'

# When did pods start failing?
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> -n <ns> | grep -A4 "Last State"

# Deployment rollout history — what changed and when?
kubectl rollout history deployment/<name> -n <ns>
kubectl rollout history deployment/<name> -n <ns> --revision=5

# Check HPA scaling events
kubectl describe hpa <name> -n <ns> | grep -A20 "Events"

Cross-reference the timestamp of the first CrashLoopBackOff event with the most recent deployment. When a rollout is the cause, the two usually fall within a few minutes of each other. If they do not, expand the search: check HPA activity, ConfigMap updates, and node events.
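
As a rough sketch of that cross-reference, the helper below picks the earliest BackOff event out of a three-column listing (timestamp, reason, object). The function name first_backoff, the sample pod names, and the column layout are assumptions made for this example; the layout is what kubectl get events -n <ns> -o custom-columns=TIME:.firstTimestamp,REASON:.reason,OBJECT:.involvedObject.name --no-headers would be expected to print.

```shell
# first_backoff: read "TIMESTAMP REASON OBJECT" lines on stdin and print the
# earliest line whose reason is BackOff (the kubelet's restart back-off event).
# RFC 3339 UTC timestamps in the same format sort correctly as plain strings.
first_backoff() {
  awk '$2 == "BackOff"' | LC_ALL=C sort -k1,1 | head -n 1
}

# Sample input standing in for real cluster output (hypothetical names).
sample_events='2024-05-01T14:33:10Z BackOff api-7d9f-abcde
2024-05-01T14:31:02Z ScalingReplicaSet api
2024-05-01T14:32:05Z BackOff api-7d9f-fghij'

printf '%s\n' "$sample_events" | first_backoff
```

With the sample data, the earliest BackOff line is the 14:32:05 one, about a minute after the ScalingReplicaSet event, which is the kind of alignment that points at the rollout.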

Step 2: Map the Blast Radius

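# Are other namespaces healthy? (a CrashLoopBackOff pod still reports phase Running, so scan the STATUS column too)
kubectl get pods -A --field-selector='status.phase!=Running,status.phase!=Succeeded'
kubectl get pods -A | grep -Ev 'Running|Completed'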

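# Is it isolated to one node? (read the node name directly; awk column positions shift when RESTARTS shows "5 (2m ago)")
kubectl get pods -n <ns> -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c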

# Check node conditions
kubectl describe nodes | grep -E "(Name:|MemoryPressure|DiskPressure|PIDPressure|Ready)"

# Are downstream services affected?
kubectl get endpoints -n <ns>  # any empty endpoints?

If failures are scattered across nodes, the problem is most likely application-level (bad image, missing secret, quota exhaustion). If they concentrate on one node, suspect node-level resource exhaustion or a hardware fault. If failures span namespaces, suspect a cluster-level component like CoreDNS, the API server, or a storage provisioner.
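
The node-versus-application rule of thumb can be sketched as a small filter. It assumes you have already produced one node name per failing pod (for example from the NODE column); the function name classify_blast_radius and the three-pod minimum are illustrative choices, not Kubernetes rules.

```shell
# classify_blast_radius: read one node name per failing pod on stdin and print
# a rough classification. The 3-pod minimum is an assumed cutoff for this
# sketch; with fewer failures the distribution says very little.
classify_blast_radius() {
  sort | uniq -c | awk '
    { total += $1; nodes++ }
    END {
      if (total < 3)       print "inconclusive: fewer than 3 failing pods"
      else if (nodes == 1) print "node-level suspect: all " total " failures on one node"
      else                 print "application-level suspect: failures on " nodes " nodes"
    }'
}

# Hypothetical node names standing in for real cluster output.
printf 'node-a\nnode-a\nnode-a\n' | classify_blast_radius
printf 'node-a\nnode-b\nnode-c\nnode-a\n' | classify_blast_radius
```

Treat the result as a hint only: a deployment whose replicas were all scheduled onto one node will look node-level even when the image is at fault.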

Step 3: Find the Change Vector

The most common change vectors in Kubernetes incidents are:

New image tag deployed with a regression
ConfigMap or Secret modified (wrong key name, bad value)
Resource limits lowered in a Helm values update
HPA scaled down aggressively during low-traffic window, then traffic spiked
Node drained for maintenance without respecting PodDisruptionBudget
Certificate expiry (surprising but common — check NotBefore/NotAfter)
Namespace quota tightened by platform team
NetworkPolicy added that inadvertently blocks required traffic

# Did a ConfigMap change recently?
kubectl describe configmap <name> -n <ns> | grep -iE "annotations|last-applied"

# GitOps — what did Argo/Flux apply?
kubectl get application -n argocd
argocd app history <app-name>

# Check recent image digests
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].imageID}'

# Helm release history
helm history <release> -n <namespace>
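
When a mutable tag such as latest is in play, comparing digests across replicas can show whether a new build slipped in. The sketch below assumes tab-separated "pod, imageID" lines, which kubectl get pods -n <ns> -l app=<app> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}' would be expected to produce; the function name digest_summary and the registry, pod, and digest values are made up for illustration.

```shell
# digest_summary: read "POD<TAB>IMAGE_ID" lines on stdin and count pods per
# distinct image digest. More than one output line means the replicas are not
# all running the same build.
digest_summary() {
  awk -F'\t' '{ print $2 }' | sort | uniq -c | sort -rn
}

# Hypothetical sample: two pods on one digest, one pod on another.
printf 'api-1\tregistry.example.com/api@sha256:1111\napi-2\tregistry.example.com/api@sha256:1111\napi-3\tregistry.example.com/api@sha256:2222\n' |
  digest_summary
```

A mixed result during a rollout is expected; a mixed result long after the rollout finished suggests the tag was re-pushed.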

Step 4: Correlate Signals

A robust RCA requires at least two independent signals pointing to the same cause before you act. A single log line saying "out of memory" is not enough — confirm with the OOMKilled reason in kubectl describe and the container memory metric crossing the limit in Prometheus:

# Prometheus query for OOMKill events
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

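# Memory usage vs limit, per container in a namespace (skips pod-level series and containers without a limit)
container_memory_working_set_bytes{namespace="<ns>", container!=""}
/ (container_spec_memory_limit_bytes{namespace="<ns>", container!=""} > 0)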

# Error rate spike correlated with a deploy
rate(http_requests_total{status=~"5.."}[5m])

Correlate log timestamps with metric spikes. If the 5xx error rate spiked at 14:32 and the deployment rolled out at 14:31, that's strong evidence. If the spike happened at 14:32 but the deploy was at 13:00, look elsewhere.
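
The timing check can be made mechanical. The sketch below converts two RFC 3339 UTC timestamps to epoch seconds and reports whether the deploy landed within a window before the spike. The function names to_epoch and correlate and the 300-second default window are assumptions for this example; pick a window that matches how long your rollouts take to reach traffic.

```shell
# to_epoch: convert an RFC 3339 UTC timestamp (2024-05-01T14:32:00Z) to epoch
# seconds using only POSIX awk, so it does not depend on GNU or BSD date flags.
to_epoch() {
  printf '%s\n' "$1" | awk -F'[-T:Z]' '{
    y = $1 + 0; m = $2 + 0; d = $3 + 0
    if (m <= 2) { y -= 1; m += 12 }   # treat Jan/Feb as months 13/14 of the prior year
    days = 365 * y + int(y / 4) - int(y / 100) + int(y / 400) + int((153 * (m - 3) + 2) / 5) + d - 719469
    printf "%d\n", days * 86400 + $4 * 3600 + $5 * 60 + $6
  }'
}

# correlate DEPLOY_TIME SPIKE_TIME [WINDOW_SECONDS]
# The deploy is a strong suspect only if it happened 0..WINDOW seconds before the spike.
correlate() {
  deploy=$(to_epoch "$1")
  spike=$(to_epoch "$2")
  window=${3:-300}
  gap=$((spike - deploy))
  if [ "$gap" -ge 0 ] && [ "$gap" -le "$window" ]; then
    echo "gap=${gap}s: deploy is a strong suspect"
  else
    echo "gap=${gap}s: look elsewhere"
  fi
}

correlate 2024-05-01T14:31:00Z 2024-05-01T14:32:00Z
correlate 2024-05-01T13:00:00Z 2024-05-01T14:32:00Z
```

The two calls mirror the examples in the text: a 60-second gap flags the deploy, while a 5520-second gap does not.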

Multi-Service Log Correlation

In a microservices architecture, a single user request touches multiple pods. Correlate across them using a request ID or trace ID:

# Tail logs from all pods in a deployment simultaneously
stern api-deployment -n production

# Filter by trace ID across all logs
stern -n production . | grep "trace_id=abc123"

# If using structured JSON logging
kubectl logs -l app=api -n production --since=5m --tail=-1 | jq 'select(.trace_id == "abc123")'

Without a correlation ID, you're matching by timestamp — which requires all services to have clock synchronization and is error-prone. Invest in adding a request ID header early in your service mesh or API gateway.
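
Once a trace ID is in every log line, rebuilding the path of one request is a filter plus a sort. The sketch below assumes lines shaped like "[pod/<name>/<container>] <timestamp> <message>", roughly what kubectl logs --prefix --timestamps emits, and a plain-text trace_id=<value> field; the function name trace_timeline and the sample lines are hypothetical.

```shell
# trace_timeline TRACE_ID: read prefixed, timestamped log lines on stdin, keep
# the ones carrying the given trace ID, and order them by timestamp (field 2).
# -w stops trace_id=abc123 from also matching trace_id=abc1234.
trace_timeline() {
  grep -Fw "trace_id=$1" | LC_ALL=C sort -k2,2
}

# Hypothetical sample lines from two services.
printf '%s\n' \
  '[pod/db-0/db] 2024-05-01T14:32:03Z trace_id=abc123 query failed' \
  '[pod/api-1/api] 2024-05-01T14:32:01Z trace_id=abc123 request received' \
  '[pod/api-1/api] 2024-05-01T14:32:02Z trace_id=zzz999 request received' |
  trace_timeline abc123
```

The ordering is only as good as the clocks behind the timestamps, so treat sub-second ordering across nodes with caution.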

Certificate and Secret Expiry RCA

Expired certificates are a common, slow-burning root cause that affects multiple services simultaneously when it hits:

# Check TLS certificates in secrets (decode each tls.crt and ask openssl for its expiry date)
kubectl get secrets -n <namespace> -o json |
  jq -r '.items[] | select(.type=="kubernetes.io/tls") | .metadata.name + " " + .data["tls.crt"]' |
  while read -r name crt; do
    printf '%s expires: ' "$name"
    echo "$crt" | base64 -d | openssl x509 -noout -enddate
  done

# For cert-manager managed certs
kubectl get certificates -A
kubectl describe certificate <name> -n <namespace>

# Check if a pod is failing TLS handshakes
kubectl logs <pod> | grep -iE "certificate|tls|ssl|x509"
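
Because expiry is slow-burning, it helps to flag certificates before they lapse rather than after. A minimal sketch, assuming openssl is installed and a single PEM certificate arrives on stdin; the function name expires_within and the 30-day default are illustrative choices.

```shell
# expires_within DAYS: read a PEM certificate on stdin and report whether it
# expires within DAYS days (default 30, an assumed warning threshold).
expires_within() {
  days=${1:-30}
  pem=$(cat)
  # Reject input that openssl cannot parse instead of reporting it as expiring.
  if ! printf '%s\n' "$pem" | openssl x509 -noout >/dev/null 2>&1; then
    echo "unreadable"
    return 2
  fi
  # -checkend exits 0 when the certificate is still valid N seconds from now.
  if printf '%s\n' "$pem" | openssl x509 -noout -checkend "$((days * 86400))" >/dev/null; then
    echo "ok"
  else
    echo "expiring"
  fi
}
```

One way to feed it would be kubectl get secret <name> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d | expires_within 30. Only the first certificate in the PEM is checked, so intermediates in a chain need a separate pass.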

Step 5: Confirm and Document

Before marking an incident resolved, confirm the fix holds: watch the pod restart count stop climbing (kubectl get pods -w) and verify the readiness probe is passing. Document the RCA in a postmortem while the timeline is fresh.
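
Watching restart counts by eye works, but a diff of two snapshots is harder to misread. The sketch below assumes two files of "pod restartCount" lines taken a few minutes apart, for instance from kubectl get pods -n <ns> -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount --no-headers; the function name restarts_climbing is hypothetical.

```shell
# restarts_climbing BEFORE AFTER: compare two "POD RESTARTS" snapshots and
# print every pod whose restart count went up between them.
# No output means no pod present in both snapshots restarted in between.
restarts_climbing() {
  awk 'NR == FNR { before[$1] = $2 + 0; next }
       ($1 in before) && ($2 + 0 > before[$1]) { print $1, "+" ($2 - before[$1]) }' "$1" "$2"
}

# Usage (file names are placeholders):
#   restarts_climbing snapshot-before.txt snapshot-after.txt
```

Pods that appear only in the second snapshot are ignored here, so a fix that replaces pods still needs a fresh pair of snapshots once the new pods are up.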

Postmortem Template

A postmortem that actually prevents future incidents needs these sections:

Impact: What broke, for how long, how many users affected, revenue impact
Timeline: UTC timestamps — detection, escalation, mitigation, full resolution
Root cause: One clear sentence (not a symptom — the underlying cause)
Contributing factors: What made the root cause possible or harder to catch
Fix applied: Exact change made, by whom, verified by what
Action items: Preventive measures with owner, due date, and definition of done

Keep the postmortem blameless: focus on system failures, not individual mistakes.

Common Root Causes Reference

OOMKilled — memory limit too low for actual workload peak
ImagePullBackOff — wrong tag, private registry missing imagePullSecret, or network policy blocking registry access
FailedMount — PVC not bound, wrong StorageClass, or secret key name typo
Evicted — node disk or memory pressure; pods without resource requests are first to go
Pending forever — no node matches affinity rules, taints not tolerated, or resource requests too large
Readiness failing — app slow to start, health endpoint wrong, dependency service down
High latency spike — HPA scale-out lag, cold-start JVM/Python, or GC pause under load
Rolling update traffic drop — readiness probe path changed, new pods serving traffic before ready
Certificate expiry — TLS cert expired; services fail mutual TLS or webhook calls

Using the Five Whys in Kubernetes RCA

The Five Whys technique forces you past the symptom to the systemic cause:

Why did the service go down? — The pod was OOMKilled.
Why was it OOMKilled? — Memory usage exceeded the 512Mi limit.
Why did memory usage spike? — A new endpoint returns unbounded query results.
Why was there no limit on results? — The query lacked pagination and passed code review.
Why did code review miss it? — No automated check for unbounded queries exists.

The real action item isn't "raise the memory limit" — it's "add linting or integration tests that catch unbounded queries." Five Whys stops at the systemic fix, not the band-aid.

Accelerating RCA with Automation

Manual RCA on a complex cluster can take 30–90 minutes of jumping between commands, dashboards, and log streams. The signals are all there; the bottleneck is how fast a human can correlate them. AI-assisted tools can ingest the relevant signals together and surface likely root causes much faster, letting SREs spend their time on the fix rather than the investigation.