Kubernetes is resilient by design, but when something breaks, the blast radius can be wide and the signal buried deep in logs, events, and metrics spread across dozens of components. This guide walks through a proven debugging methodology for the most common failure categories, with the exact commands needed at each step.

Start with Cluster-Wide Visibility

Before diving into individual pods, get a bird's-eye view of what's unhealthy:

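# See pods whose phase is not Running, across every namespace.
# Completed Job pods show up here, while CrashLoopBackOff pods usually do not:
# their phase stays Running between restarts, so also scan the STATUS column
# of a plain "kubectl get pods -A".
kubectl get pods -A --field-selector='status.phase!=Running'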

# Spot nodes under pressure
kubectl get nodes
kubectl describe nodes | grep -A8 "Conditions:"

# Recent cluster events sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -40

This surfaces the scope of the incident immediately. A single bad node failing its health checks looks very different from a misconfigured deployment rolling out CrashLoopBackOff pods on every node.
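
One gap in the phase-based filter is that a pod in CrashLoopBackOff usually still reports phase Running. A minimal sketch of a stricter filter follows. The function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods -A --no-headers (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Reads `kubectl get pods -A --no-headers` output on stdin and prints
# namespace, name, status and ready count for pods that look unhealthy:
# any status other than Running, or Running with fewer ready containers
# than expected. Completed pods are skipped.
unhealthy_pods() {
  awk '{
    split($3, ready, "/")              # READY column, e.g. "1/2"
    if ($4 == "Completed") next
    if ($4 != "Running" || ready[1] != ready[2])
      print $1 "\t" $2 "\t" $4 "\t" $3
  }'
}
```

With cluster access it could be used as: kubectl get pods -A --no-headers | unhealthy_pods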

Pod-Level Debugging Workflow

Once you've identified a problematic pod, follow this sequence:

1. Read the pod description

kubectl describe pod <pod-name> -n <namespace>

The Events section at the bottom is the most informative part. Look for:

  • FailedScheduling: insufficient CPU/memory, node affinity conflicts, or an unbound PVC
  • ErrImagePull / ImagePullBackOff: bad image tag or missing registry credentials
  • Unhealthy ("Liveness probe failed"): app not responding on the expected port/path
  • OOMKilled: container exceeded its memory limit (reported under the container's Last State, not in Events)
  • FailedMount: volume could not be attached or mounted, or a referenced Secret, ConfigMap, or key is missing

2. Fetch logs (including from previous crash)

# Current container logs
kubectl logs <pod-name> -n <namespace> --tail=100

# Logs from the last crashed container
kubectl logs <pod-name> -n <namespace> --previous --tail=200

# Follow logs in real time
kubectl logs <pod-name> -n <namespace> -f

3. Exec in for live inspection

kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

# Inside the container:
env | grep -i db       # check env vars
curl localhost:8080/health
cat /etc/config/app.yaml

If the container image has no shell, launch an ephemeral debug container:

kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name>

Debugging with Ephemeral Containers

Distroless and scratch-based images have no shell, making traditional exec impossible. Ephemeral debug containers solve this without modifying the pod spec:

# Attach a debug container to a running pod
kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot --target=<container-name>

# nicolaka/netshoot includes: curl, dig, nslookup, tcpdump,
# netstat, ping, traceroute, ss, iperf3, and more

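Because every container in a pod shares one network namespace, the debug container can inspect the application's open ports, run DNS lookups, and trace network connections without any changes to the application image. The --target flag additionally places it in the target container's process namespace, so that container's processes are visible too.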

# Copy a running pod to a new pod with a busybox debug container added
kubectl debug <pod-name> -n <namespace> -it --copy-to=debug-pod --image=busybox --share-processes

Diagnosing Resource Exhaustion

Resource issues are among the most common — and most misdiagnosed — Kubernetes failures. A pod showing Pending may simply have requests that no node can satisfy:

# Visualize allocatable vs. requested per node
kubectl describe nodes | grep -A10 "Allocated resources"

# Top consumers right now
kubectl top pods -A --sort-by=memory
kubectl top nodes

# Check namespace-level resource quotas
kubectl describe resourcequota -n <namespace>

If a node shows MemoryPressure=True, the kubelet will start evicting pods — starting with those that exceed their memory requests. Always set both requests and limits on critical workloads.

# Find pods with at least one container that sets no resource requests (eviction risk)
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(any(.spec.containers[]; .resources.requests == null)) |
  [.metadata.namespace, .metadata.name] | @tsv'
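
To spot nodes drifting toward MemoryPressure before the kubelet starts evicting, the output of kubectl top nodes can be filtered by memory percentage. This is only a sketch: nodes_over_memory is a hypothetical name, the threshold is whatever you pass in (80 below is arbitrary), and it assumes the usual column order NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%.

```shell
# Reads `kubectl top nodes --no-headers` output on stdin and prints the
# nodes whose MEMORY% column is at or above the given threshold.
nodes_over_memory() {
  threshold=$1
  awk -v limit="$threshold" '{
    pct = $5
    sub(/%/, "", pct)                  # "91%" -> "91"
    if (pct + 0 >= limit + 0) print $1 "\t" $5
  }'
}
```

With cluster access and metrics-server installed: kubectl top nodes --no-headers | nodes_over_memory 80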

Network Debugging

Services not routing traffic is a classic Kubernetes head-scratcher. The most common cause is a label selector mismatch:

# Check if any pods are selected
kubectl get endpoints <service-name> -n <namespace>
# <none> in the ENDPOINTS column means the selector matches no ready pods

# Compare service selector vs pod labels
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels

# Test connectivity from inside the cluster
kubectl exec -it <any-pod> -n <namespace> -- curl -v http://<service>.<namespace>.svc.cluster.local:<port>/health
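
The selector rule itself is simple: a Service selects a pod only when every key=value pair in its selector appears among the pod's labels. The sketch below checks that rule offline against the comma-separated label string printed by --show-labels. The name selector_matches is hypothetical, and the sketch assumes plain equality selectors written by hand as key=value pairs.

```shell
# Returns 0 when every key=value pair in the selector (first argument)
# is present in the pod's label string (second argument), else 1.
# Both arguments are comma-separated lists such as "app=web,tier=fe".
selector_matches() {
  selector=$1
  labels=",$2,"                        # pad so each label is delimited by commas
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                  # this pair is present, keep checking
      *) IFS=$old_ifs; return 1 ;;     # one missing pair means no match
    esac
  done
  IFS=$old_ifs
  return 0
}
```

For example, selector_matches 'app=web,tier=fe' 'app=web,pod-template-hash=5d9c,tier=fe' succeeds, while the same selector against 'app=web' fails because tier=fe is missing.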

If endpoints look correct but traffic still fails, check NetworkPolicy objects — a blanket deny-all policy will silently drop packets with no error message on either side.

# List all NetworkPolicies in a namespace
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <name> -n <namespace>

# Check if a specific port is blocked
kubectl exec -it <pod> -n <namespace> -- nc -zv <target-ip> <port>

DNS Debugging

DNS failures inside a cluster often look like connection timeouts or connection refused errors, making them hard to distinguish from routing failures. Always test DNS explicitly first:

# Run a one-shot DNS test pod
kubectl run dnstest --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default

# Test a specific service DNS name
kubectl run dnstest --image=busybox:1.28 --rm -it --restart=Never -- nslookup <service>.<namespace>.svc.cluster.local

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Read CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

Common DNS failure causes: CoreDNS pods in CrashLoopBackOff, a pod whose dnsPolicy bypasses cluster DNS (Default uses the node's resolver, and None relies entirely on the pod's own dnsConfig), or a NetworkPolicy blocking port 53 (UDP and TCP) from pods to CoreDNS.

# Verify DNS config inside a pod
kubectl exec -it <pod-name> -n <namespace> -- cat /etc/resolv.conf
# Should show: nameserver 10.96.0.10 (or cluster DNS IP)
# and: search <namespace>.svc.cluster.local svc.cluster.local cluster.local
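
The resolv.conf check can be scripted so it is easy to repeat across pods. This is a rough sketch under stated assumptions: check_resolv_conf is a hypothetical helper, and it assumes the default cluster domain cluster.local, which your cluster may have changed.

```shell
# Reads a resolv.conf on stdin. Prints a line for each problem found and
# returns 1 if any were found, 0 otherwise.
check_resolv_conf() {
  conf=$(cat)
  status=0
  # At least one nameserver entry must be present
  if ! printf '%s\n' "$conf" | grep -Eq '^nameserver[[:space:]]+[^[:space:]]+'; then
    echo "no nameserver entry"
    status=1
  fi
  # The search path should include the cluster's service domain
  if ! printf '%s\n' "$conf" | grep -Eq '^search[[:space:]].*svc\.cluster\.local'; then
    echo "search path lacks svc.cluster.local"
    status=1
  fi
  return $status
}
```

With cluster access: kubectl exec <pod-name> -n <namespace> -- cat /etc/resolv.conf | check_resolv_conf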

Storage and PersistentVolume Debugging

Storage issues manifest as pods stuck in Pending (PVC not bound) or as FailedMount events (volume exists but can't be attached):

# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# Find PVs and their binding status
kubectl get pv
kubectl describe pv <pv-name>

# Check storage class provisioner
kubectl get storageclass
kubectl describe storageclass <name>

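If a PVC is stuck in Pending with no events, the provisioner is likely not running; check for the provisioner's pod in the relevant namespace. A PVC whose storage class uses volumeBindingMode: WaitForFirstConsumer also stays Pending until a pod that uses it is scheduled, which is expected. Not all storage drivers support the ReadWriteMany access mode, so check the driver documentation before requesting it.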

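# PVC stuck in Terminating? The pvc-protection finalizer holds it while a pod still uses it,
# so delete or reschedule that pod first. Force-removing finalizers is a last resort
# and risks data loss if the volume is still in use.
kubectl patch pvc <name> -n <namespace> -p '{"metadata":{"finalizers":null}}'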

Debugging Multi-Container Pods

Pods with multiple containers (sidecars, init containers) require specifying the container name in most kubectl commands:

# List containers in a pod
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].name}'

# Logs from a specific container
kubectl logs <pod-name> -c <container-name> --previous

# Exec into a specific container
kubectl exec -it <pod-name> -c <sidecar-name> -- /bin/sh

# Check init container status
kubectl describe pod <pod-name> | grep -A20 "Init Containers:"

# Init container logs
kubectl logs <pod-name> -c <init-container-name>

Init container failures: If an init container exits with a non-zero code, the main containers never start. The pod status shows Init:CrashLoopBackOff or Init:Error. Always check init container logs first when the main app never appears to start.

Node-Level Debugging

When a node goes NotReady or pods on a specific node all fail simultaneously, investigate at the node level:

# Check node conditions in detail
kubectl describe node <node-name>

# Get a shell on the node: SSH, the node-shell krew plugin (shown here),
# or the built-in "kubectl debug node/<node-name> -it --image=busybox"
kubectl node-shell <node-name>

# On the node — check kubelet
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager

# Check container runtime
systemctl status containerd
crictl ps  # list running containers

# Disk usage
df -h
du -sh /var/log/pods/* | sort -hr | head -20

# Memory
free -m
cat /proc/meminfo

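Disk pressure is particularly sneaky: the node's disk fills up with old pod logs and unused image layers until the kubelet starts evicting pods. Configure log rotation, and set imagePullPolicy: Always only where needed, since re-pulling a mutable tag can leave superseded layers on disk until image garbage collection removes them.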
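
A quick way to see which filesystems are closest to full is to filter df output by usage. The sketch below is illustrative only: filesystems_over is a hypothetical name, the 85 in the usage line is an arbitrary threshold rather than a kubelet default, and mount points containing spaces are not handled.

```shell
# Reads `df -P` output on stdin (header line included) and prints the
# mount point and usage of every filesystem at or above the threshold.
filesystems_over() {
  threshold=$1
  awk -v limit="$threshold" 'NR > 1 {
    used = $5
    sub(/%/, "", used)                 # "91%" -> "91"
    if (used + 0 >= limit + 0) print $6 "\t" $5
  }'
}
```

On the node: df -P | filesystems_over 85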

Debugging StatefulSets

With the default OrderedReady pod management policy, a StatefulSet creates its pods sequentially, so a stuck pod at ordinal N blocks all pods from ordinal N+1 onward:

# Check StatefulSet status
kubectl describe statefulset <name> -n <namespace>

# Individual pod status (ordered 0, 1, 2...)
kubectl get pods -n <namespace> -l app=<label> --sort-by='.metadata.name'

# PVC per pod (each pod gets its own)
kubectl get pvc -n <namespace> -l app=<label>

# If a pod is stuck, check its specific PVC
kubectl describe pvc <pvc-name-for-pod-0> -n <namespace>
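
Since the lowest stuck ordinal is the one holding everything else back, it helps to find it mechanically. A minimal sketch, assuming namespaced kubectl get pods --no-headers output (NAME, READY, STATUS, ...) and pod names that end in -<ordinal>; first_stuck_pod is a hypothetical name.

```shell
# Reads `kubectl get pods --no-headers` output for one StatefulSet on
# stdin and prints the name of the lowest-ordinal pod that is not
# Running with all containers ready. Prints nothing if all are healthy.
first_stuck_pod() {
  awk '{
    n = split($1, parts, "-")
    ordinal = parts[n] + 0             # trailing number in the pod name
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) {
      if (!found || ordinal < min) { min = ordinal; name = $1; found = 1 }
    }
  } END { if (found) print name }'
}
```

With cluster access: kubectl get pods -n <namespace> -l app=<label> --no-headers | first_stuck_pod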

Reading the Audit Trail

For intermittent failures or issues that resolved themselves, remember that cluster events expire after one hour by default (the API server's --event-ttl). Persistent event storage, for example an event exporter that ships events to a log backend such as Loki, is essential for post-incident analysis. When investigating past incidents, check HPA scale events and PodDisruptionBudget violations; they rarely show up in standard pod logs but are captured in events.

# Warning events in a namespace, sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp' --field-selector='type=Warning'

# Watch events in real time during a deploy
kubectl get events -n <namespace> -w

Pro tip: Save your debugging session with kubectl describe pod <pod-name> -n <namespace> > incident.txt before the pod is rescheduled and its history disappears. Events are gone after 1 hour, so export them to persistent storage if you need them for a postmortem.
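
Capturing that evidence can be wrapped in one function so nothing is forgotten under pressure. This is a sketch, not a prescribed tool: snapshot_pod and the output file names are hypothetical, and the 500-line log tail is an arbitrary choice.

```shell
# Saves describe output, current and previous logs, and the pod's events
# into a directory, then prints the directory name.
# Usage: snapshot_pod <namespace> <pod-name> [output-dir]
snapshot_pod() {
  namespace=$1
  pod=$2
  outdir=${3:-incident-$(date +%Y%m%d-%H%M%S)}
  mkdir -p "$outdir" || return 1
  kubectl describe pod "$pod" -n "$namespace" > "$outdir/describe.txt"
  kubectl logs "$pod" -n "$namespace" --all-containers --tail=500 \
    > "$outdir/logs.txt" 2>&1
  # --previous fails when no container has restarted, so tolerate the error
  kubectl logs "$pod" -n "$namespace" --all-containers --previous --tail=500 \
    > "$outdir/logs-previous.txt" 2>&1 || true
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp > "$outdir/events.txt"
  echo "$outdir"
}
```

For example, snapshot_pod production api-7c9f would write the four files into a timestamped incident-... directory in the current working directory.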

Debugging Stuck Rollouts

A deployment rollout that never completes leaves the Deployment split between old and new ReplicaSets and stalls any pipeline waiting on it. The most common causes are readiness probe failures on the new version or insufficient cluster capacity:

# Check rollout status
kubectl rollout status deployment/<name> -n <namespace>

# See what the new pods are doing
kubectl get pods -n <namespace> | grep -v Running
kubectl describe pod <new-pod-name> -n <namespace>

# Check rollout history
kubectl rollout history deployment/<name> -n <namespace>

# Emergency rollback
kubectl rollout undo deployment/<name> -n <namespace>

# Pause a rollout to investigate, then resume it when done
kubectl rollout pause deployment/<name> -n <namespace>
kubectl rollout resume deployment/<name> -n <namespace>
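
The status check and the rollback can be combined so a stalled rollout is reverted after a bounded wait. Treat this as one possible sketch: rollout_or_undo is a hypothetical helper, the 120s default is arbitrary, and an automatic rollback may not suit every team's process.

```shell
# Waits for a rollout to finish and rolls back if it does not complete
# within the timeout.
# Usage: rollout_or_undo <namespace> <deployment> [timeout, default 120s]
rollout_or_undo() {
  namespace=$1
  deployment=$2
  timeout=${3:-120s}
  if kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"; then
    echo "rollout complete"
  else
    echo "rollout did not finish within $timeout; rolling back"
    kubectl rollout undo "deployment/$deployment" -n "$namespace"
  fi
}
```

For example, rollout_or_undo production api 300s waits up to five minutes before reverting to the previous revision.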

When Manual Debugging Takes Too Long

Running through these steps manually under production pressure is slow and error-prone. KubeIntellect automates the entire sequence — it collects describe output, logs, metrics, and events, correlates them across the timeline, and surfaces the root cause with a one-line fix in under two seconds. Ask it in plain English: "Why is my api pod crashing in production?"