Kubernetes is resilient by design, but when something breaks, the blast radius can be wide and the signal buried deep in logs, events, and metrics spread across dozens of components. This guide walks through a proven debugging methodology for the most common failure categories, with the exact commands needed at each step.
Start with Cluster-Wide Visibility
Before diving into individual pods, get a bird's-eye view of what's unhealthy:
# See all non-Running pods across every namespace
kubectl get pods -A --field-selector='status.phase!=Running'

# Spot nodes under pressure
kubectl get nodes
kubectl describe nodes | grep -A5 "Conditions:"

# Recent cluster events sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -40
This surfaces the scope of the incident immediately. A single bad node failing its health checks looks very different from a misconfigured deployment rolling out CrashLoopBackOff pods on every node.
Pod-Level Debugging Workflow
Once you've identified a problematic pod, follow this sequence:
1. Read the pod description
kubectl describe pod <pod-name> -n <namespace>
The Events section at the bottom is the most informative part. Look for:
- FailedScheduling — insufficient CPU/memory or node affinity conflicts
- ErrImagePull / ImagePullBackOff — bad image tag or missing registry credentials
- Liveness probe failed — app not responding on the expected port/path
- OOMKilled — container exceeded its memory limit
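As one example, ImagePullBackOff caused by missing registry credentials is usually resolved by attaching an imagePullSecrets entry to the pod spec. A minimal sketch — the secret name regcred and the image reference are placeholders, not values from any real cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
  - name: regcred   # created beforehand with: kubectl create secret docker-registry regcred ...
  containers:
  - name: app
    image: my-registry.example.com/app:1.4   # private image that needs the credential
```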
2. Fetch logs (including from previous crash)
# Current container logs
kubectl logs <pod-name> -n <namespace> --tail=100

# Logs from the last crashed container
kubectl logs <pod-name> -n <namespace> --previous --tail=200
3. Exec in for live inspection
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

# Inside the container:
env | grep -i db   # check env vars
curl localhost:8080/health
cat /etc/config/app.yaml
If the container image has no shell, launch an ephemeral debug container:
kubectl debug -it <pod-name> --image=busybox --target=<container-name>
Diagnosing Resource Exhaustion
Resource issues are among the most common — and most misdiagnosed — Kubernetes failures. A pod stuck in Pending may simply have resource requests that no node can satisfy:
# Visualize allocatable vs. requested per node
kubectl describe nodes | grep -A10 "Allocated resources"

# Top consumers right now
kubectl top pods -A --sort-by=memory
kubectl top nodes
If a node shows MemoryPressure=True, the kubelet will start evicting pods — starting with those that exceed their memory requests. Always set both requests and limits on critical workloads.
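A minimal sketch of what setting both looks like in a container spec (the specific values here are illustrative, not recommendations):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  containers:
  - name: app
    image: app:1.0
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"   # the scheduler only places the pod on a node with this much free
      limits:
        cpu: "500m"
        memory: "512Mi"   # exceeding this memory limit gets the container OOMKilled
```

Pods with requests equal to limits get the Guaranteed QoS class and are evicted last under node pressure.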
Network Debugging
Services not routing traffic is a classic Kubernetes head-scratcher. The most common cause is a label selector mismatch:
# Check if any pods are selected
kubectl get endpoints <service-name> -n <namespace>
# Empty "Endpoints: <none>" means the selector matches nothing
# Compare service selector vs pod labels
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels

If endpoints look correct but traffic still fails, check NetworkPolicy objects — a blanket deny-all policy will silently drop packets with no error message on either side.
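For reference, a blanket deny-all ingress policy looks like this (the standard pattern from the Kubernetes documentation, not anything specific to your cluster); if one exists in the namespace, every pod it selects drops inbound traffic unless another policy explicitly allows it:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}   # empty selector = every pod in the namespace
  policyTypes:
  - Ingress         # no ingress rules listed, so all ingress is denied
```

Run kubectl get networkpolicy -n <namespace> to check whether any such policy is in play before digging into kube-proxy or CNI internals.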
Reading the Audit Trail
For intermittent failures or issues that resolved themselves, cluster events expire after one hour by default. Persistent event storage (via tools like kube-state-metrics + Prometheus) or a dedicated events backend like Loki is essential for post-incident analysis. When investigating past incidents, check HPA scale events and PodDisruptionBudget violations — they rarely show up in standard pod logs but are captured in events.
Capture kubectl describe pod output to a file (kubectl describe pod <pod-name> -n <namespace> > incident.txt) before the pod is rescheduled and its history disappears.

When Manual Debugging Takes Too Long
Running through these steps manually under production pressure is slow and error-prone. KubeIntellect automates the entire sequence — it collects describe output, logs, metrics, and events, correlates them across the timeline, and surfaces the root cause with a one-line fix in under two seconds.