Kubernetes is resilient by design, but when something breaks, the blast radius can be wide and the signal buried deep in logs, events, and metrics spread across dozens of components. This guide walks through a proven debugging methodology for the most common failure categories, with the exact commands needed at each step.

Start with Cluster-Wide Visibility

Before diving into individual pods, get a bird's-eye view of what's unhealthy:

# See all non-Running pods across every namespace
# (note: Completed job pods show up too — their phase is Succeeded, not Running)
kubectl get pods -A --field-selector='status.phase!=Running'

# Spot nodes under pressure
kubectl get nodes
kubectl describe nodes | grep -A8 "Conditions:"

# Recent cluster events sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -40

This surfaces the scope of the incident immediately. A single bad node failing its health checks looks very different from a misconfigured deployment rolling out CrashLoopBackOff pods on every node.

Pod-Level Debugging Workflow

Once you've identified a problematic pod, follow this sequence:

1. Read the pod description

kubectl describe pod <pod-name> -n <namespace>

The Events section at the bottom is the most informative part. Look for:

  • FailedScheduling — insufficient CPU/memory or node affinity conflicts
  • ErrImagePull / ImagePullBackOff — bad image tag or missing registry credentials
  • Liveness probe failed — app not responding on the expected port/path
  • OOMKilled — container exceeded its memory limit
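
For the registry-credential case, the usual fix is an imagePullSecret referenced from the pod spec. A minimal sketch — the secret name, registry, and image below are placeholders, not values from this guide:

```yaml
# Create the credential secret first (all values illustrative):
#   kubectl create secret docker-registry regcred \
#     --docker-server=registry.example.com \
#     --docker-username=ci-bot --docker-password=<token>
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
    - name: regcred        # must exist in the same namespace as the pod
  containers:
    - name: app
      image: registry.example.com/team/app:1.4.2   # placeholder image
```

If the secret is missing or in the wrong namespace, the pod keeps cycling through ErrImagePull → ImagePullBackOff with an "unauthorized" message in its events.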

2. Fetch logs (including from previous crash)

# Current container logs
kubectl logs <pod-name> -n <namespace> --tail=100

# Logs from the last crashed container
kubectl logs <pod-name> -n <namespace> --previous --tail=200
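
Once logs are saved locally, triage can be scripted. A minimal, hypothetical helper — the marker list and file names are assumptions for illustration, not part of any standard tool:

```shell
# Hypothetical helper: scan saved pod logs for common fatal markers.
# In practice, feed it:  kubectl logs <pod-name> --previous > /tmp/crash.log
scan_fatal() {
  # -i: case-insensitive, -n: show line numbers, -E: extended regex
  grep -inE 'panic|fatal|out of memory|oom|segfault' "$1" | tail -5
}

# Demo on a synthetic log (contents are made up):
cat > /tmp/crash.log <<'EOF'
2024-05-01T12:00:01Z INFO  starting server on :8080
2024-05-01T12:00:09Z ERROR db connect refused
2024-05-01T12:00:10Z FATAL out of memory allocating 512MiB
EOF
scan_fatal /tmp/crash.log   # prints the FATAL line with its line number
```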

3. Exec in for live inspection

kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

# Inside the container:
env | grep -i db       # check env vars
curl localhost:8080/health
cat /etc/config/app.yaml

If the container image has no shell, launch an ephemeral debug container:

kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name>

Diagnosing Resource Exhaustion

Resource issues are among the most common — and most misdiagnosed — Kubernetes failures. A pod stuck in Pending may simply be requesting resources that no node can satisfy:

# Visualize allocatable vs. requested per node
kubectl describe nodes | grep -A10 "Allocated resources"

# Top consumers right now (requires the metrics-server add-on)
kubectl top pods -A --sort-by=memory
kubectl top nodes

If a node shows MemoryPressure=True, the kubelet will start evicting pods — beginning with those whose memory usage exceeds their requests (BestEffort pods, having no requests at all, go first). Always set both requests and limits on critical workloads.
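
Requests and limits are set per container. A minimal sketch — the workload name, image, and numbers are illustrative, not recommendations:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                      # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: example/api:1.0  # placeholder image
          resources:
            requests:             # what the scheduler reserves on a node
              cpu: 250m
              memory: 256Mi
            limits:               # hard ceiling; exceeding memory → OOMKilled
              memory: 512Mi
```

Omitting the CPU limit here is deliberate and common: CPU overuse is throttled, not fatal, while exceeding the memory limit kills the container.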

Network Debugging

A Service that doesn't route traffic is a classic Kubernetes head-scratcher. The most common cause is a label selector mismatch:

# Check if any pods are selected
kubectl get endpoints <service-name> -n <namespace>
# "<none>" under ENDPOINTS means the selector matches nothing (or no matching pod is Ready)

# Compare service selector vs pod labels
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels
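
To make the comparison mechanical, here is a small offline sketch. It assumes you've flattened both the selector (the jsonpath command prints JSON) and the pod's labels into comma-separated key=value form; the sample values are hypothetical:

```shell
# Sketch: diff a Service selector against one pod's label set, offline.
check_selector() {
  selector="$1"; pod_labels="$2"
  for kv in $(echo "$selector" | tr ',' ' '); do
    case ",$pod_labels," in
      *",$kv,"*) echo "match: $kv" ;;
      *)         echo "MISSING: $kv" ;;  # this pod will not back the Service
    esac
  done
}

# Hypothetical mismatch: the Service wants tier=frontend, the pod is backend.
check_selector 'app=web,tier=frontend' 'app=web,tier=backend,pod-template-hash=abc'
# prints "match: app=web" then "MISSING: tier=frontend"
```

Every selector key must match for the pod to appear in the endpoints; one MISSING line is enough to explain an empty Service.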

If endpoints look correct but traffic still fails, check NetworkPolicy objects — a blanket deny-all policy will silently drop packets with no error message on either side.
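
A deny-all policy has a characteristic shape — if something like this exists in the namespace, every pod it selects needs an explicit allow rule (the names below are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress     # illustrative name
  namespace: prod                # illustrative namespace
spec:
  podSelector: {}                # empty selector = every pod in the namespace
  policyTypes:
    - Ingress                    # all inbound traffic dropped unless allowed
```

List candidates with kubectl get networkpolicy -n <namespace>. Note that NetworkPolicy is enforced only by CNI plugins that support it (Calico, Cilium, and others) — on a CNI without support, the object exists but silently does nothing, which is its own debugging trap.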

Reading the Audit Trail

For intermittent failures or issues that resolved themselves, note that cluster events expire after one hour by default. Persistent event storage — shipping events to a backend such as Loki or Elasticsearch via an event exporter — is essential for post-incident analysis. When investigating past incidents, check HPA scale events and PodDisruptionBudget violations — they rarely show up in standard pod logs but are captured in events.

Pro tip: Save your debugging session with kubectl describe pod <pod-name> -n <namespace> > incident.txt before the pod is rescheduled and its history disappears.

When Manual Debugging Takes Too Long

Running through these steps manually under production pressure is slow and error-prone. KubeIntellect automates the entire sequence — it collects describe output, logs, metrics, and events, correlates them across the timeline, and surfaces the root cause with a one-line fix in under two seconds.