Kubernetes is resilient by design, but when something breaks, the blast radius can be wide and the signal buried deep in logs, events, and metrics spread across dozens of components. This guide walks through a proven debugging methodology for the most common failure categories, with the exact commands needed at each step.
Start with Cluster-Wide Visibility
Before diving into individual pods, get a bird's-eye view of what's unhealthy:
# See all non-Running pods across every namespace
kubectl get pods -A --field-selector='status.phase!=Running'
# Spot nodes under pressure
kubectl get nodes
kubectl describe nodes | grep -A5 "Conditions:"
# Recent cluster events sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -40
This surfaces the scope of the incident immediately. A single bad node failing its health checks looks very different from a misconfigured deployment rolling out CrashLoopBackOff pods on every node.
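One caveat with filtering on phase: a pod in CrashLoopBackOff can still report phase Running, and finished Jobs show up as non-Running noise. A minimal sketch of a stricter filter follows, assuming the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS). The function name `unhealthy_pods` is hypothetical.

```shell
# Reads `kubectl get pods -A --no-headers` output on stdin.
# Assumed columns: NAMESPACE NAME READY STATUS ...
# Prints namespace/name, STATUS, and READY for every pod that is neither
# Completed nor Running with all of its containers ready.
unhealthy_pods() {
  awk '
    {
      split($3, ready, "/")                       # READY looks like "1/2"
      completed = ($4 == "Completed")
      healthy   = ($4 == "Running" && ready[1] == ready[2])
      if (!completed && !healthy) print $1 "/" $2 "\t" $4 "\t" $3
    }'
}

# Usage (against a live cluster):
#   kubectl get pods -A --no-headers | unhealthy_pods
```

Only the first four columns are read, so the variable-width RESTARTS column (for example `12 (2m ago)`) does not disturb the parsing.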
Pod-Level Debugging Workflow
Once you've identified a problematic pod, follow this sequence:
1. Read the pod description
kubectl describe pod <pod-name> -n <namespace>
The Events section at the bottom is the most informative part. Look for:
FailedScheduling: insufficient CPU/memory or node affinity conflicts
ErrImagePull / ImagePullBackOff: bad image tag or missing registry credentials
Liveness probe failed: app not responding on the expected port/path
OOMKilled: container exceeded its memory limit
FailedMount: PVC not bound or secret key missing
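The signals listed above can be turned into a quick first-pass triage. The sketch below scans `kubectl describe pod` output for those strings and prints one hint per match. The function name `event_hints` and the hint wording are illustrative, not kubectl output.

```shell
# Reads `kubectl describe pod` output (or any event text) on stdin and
# prints one hint for each recognised failure signal, without duplicates.
event_hints() {
  awk '
    /FailedScheduling/              { print "scheduling: check CPU/memory requests and node affinity" }
    /ErrImagePull|ImagePullBackOff/ { print "image: check the image tag and registry credentials" }
    /Liveness probe failed/         { print "probe: check the probe port and path" }
    /OOMKilled/                     { print "memory: raise the limit or reduce usage" }
    /FailedMount/                   { print "volume: check the PVC binding and secret keys" }
  ' | sort -u
}

# Usage (against a live cluster):
#   kubectl describe pod <pod-name> -n <namespace> | event_hints
```

This is plain string matching, so it only narrows down where to look; the Events section itself remains the source of truth.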
2. Fetch logs (including from previous crash)
# Current container logs
kubectl logs <pod-name> -n <namespace> --tail=100
# Logs from the last crashed container
kubectl logs <pod-name> -n <namespace> --previous --tail=200
# Follow logs in real time
kubectl logs <pod-name> -n <namespace> -f
3. Exec in for live inspection
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Inside the container:
env | grep -i db                # check env vars
curl localhost:8080/health
cat /etc/config/app.yaml
If the container image has no shell, launch an ephemeral debug container:
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name>
Debugging with Ephemeral Containers
Distroless and scratch-based images have no shell, making traditional exec impossible. Ephemeral debug containers solve this without modifying the pod spec:
# Attach a debug container to a running pod
kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot --target=<container-name>
# nicolaka/netshoot includes: curl, dig, nslookup, tcpdump,
# netstat, ping, traceroute, ss, iperf3, and more
The debug container shares the pod's network namespace, and with --target it also shares the target container's process namespace, so you can inspect open ports, run DNS lookups, and trace network connections without any changes to the application image.
# Copy a running pod to a new pod with debug tools added
kubectl debug <pod-name> -it --copy-to=debug-pod --image=busybox --share-processes
Diagnosing Resource Exhaustion
Resource issues are among the most common — and most misdiagnosed — Kubernetes failures. A pod showing Pending may simply have requests that no node can satisfy:
# Visualize allocatable vs. requested per node
kubectl describe nodes | grep -A10 "Allocated resources"
# Top consumers right now
kubectl top pods -A --sort-by=memory
kubectl top nodes
# Check namespace-level resource quotas
kubectl describe resourcequota -n <namespace>
If a node shows MemoryPressure=True, the kubelet will start evicting pods — starting with those that exceed their memory requests. Always set both requests and limits on critical workloads.
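# Find pods with at least one container that sets no resource requests (eviction risk)
# any(...) lists each pod once, even when several of its containers lack requests
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(any(.spec.containers[]; .resources.requests == null))
  | [.metadata.namespace, .metadata.name]
  | @tsv'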
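To decide whether a Pending pod's requests can fit on a node, the quantities from `kubectl describe nodes` have to be compared in a common unit. The helpers below are a minimal sketch: the function names are hypothetical, and only the millicore suffix `m` and the binary suffixes `Ki`, `Mi`, and `Gi` are handled (decimal suffixes such as `M` or `G` are not).

```shell
# Convert a CPU quantity ("500m", "2", "0.5") to millicores.
cpu_to_millicores() {
  printf '%s\n' "$1" | awk '
    /m$/ { sub(/m$/, ""); print $0 + 0; next }
         { print $0 * 1000 }'
}

# Convert a memory quantity ("512Mi", "2Gi", "1048576Ki", or plain bytes) to MiB.
mem_to_mib() {
  printf '%s\n' "$1" | awk '
    /Ki$/ { sub(/Ki$/, ""); print $0 / 1024; next }
    /Mi$/ { sub(/Mi$/, ""); print $0 + 0;    next }
    /Gi$/ { sub(/Gi$/, ""); print $0 * 1024; next }
          { print $0 / 1048576 }'
}

# fits <request> <allocatable> <already-requested>
# All three are whole numbers in the same unit. Succeeds when the request
# fits into what is left on the node.
fits() {
  [ $(( $2 - $3 )) -ge "$1" ]
}

# Usage: a 500m request on a 4-CPU node that already has 3600m requested
#   fits "$(cpu_to_millicores 500m)" "$(cpu_to_millicores 4)" 3600 || echo "will not schedule here"
```

The scheduler compares against requests, not live usage, which is why a node can look idle in `kubectl top nodes` and still reject the pod.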
Network Debugging
Services not routing traffic is a classic Kubernetes head-scratcher. The most common cause is a label selector mismatch:
# Check if any pods are selected
kubectl get endpoints <service-name> -n <namespace>
# An ENDPOINTS value of <none> means the selector matches nothing
# Compare service selector vs pod labels
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels
# Test connectivity from inside the cluster
kubectl exec -it <any-pod> -n <namespace> -- curl -v http://<service>.<namespace>.svc.cluster.local:<port>/health
If endpoints look correct but traffic still fails, check NetworkPolicy objects: a blanket deny-all policy will silently drop packets with no error message on either side.
# List all NetworkPolicies in a namespace
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <name> -n <namespace>
# Check if a specific port is blocked
kubectl exec -it <pod> -- nc -zv <target-ip> <port>
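A Service selects a pod only when every key=value pair in its selector appears among the pod's labels. That subset check can be sketched as below, assuming both sides have already been written as comma-separated key=value lists (the shape `--show-labels` prints; the jsonpath selector output may need converting first). The function name `selector_matches` is hypothetical.

```shell
# selector_matches <selector> <labels>
# Both arguments are comma-separated key=value lists.
# Succeeds when every selector pair is present in the labels;
# otherwise prints each missing pair and fails.
selector_matches() {
  awk -v sel="$1" -v labels="$2" 'BEGIN {
    n = split(labels, have, ",")
    for (i = 1; i <= n; i++) present[have[i]] = 1
    m = split(sel, want, ",")
    for (i = 1; i <= m; i++) {
      if (!(want[i] in present)) { print "missing: " want[i]; bad = 1 }
    }
    exit bad
  }'
}

# Usage:
#   selector_matches "app=web,tier=api" "app=web,tier=frontend,pod-template-hash=5d9c"
```

Extra labels on the pod are fine; a single selector pair with a different value is enough to leave the Service with no endpoints.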
DNS Debugging
DNS failures inside a cluster often look like connection timeouts or connection refused errors, making them hard to distinguish from routing failures. Always test DNS explicitly first:
# Run a one-shot DNS test pod
kubectl run dnstest --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default
# Test a specific service DNS name
kubectl run dnstest --image=busybox:1.28 --rm -it --restart=Never -- nslookup <service>.<namespace>.svc.cluster.local
# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Read CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
Common DNS failure causes: CoreDNS pods in CrashLoopBackOff, the pod's dnsPolicy set to None without a working dnsConfig, or a NetworkPolicy blocking port 53 (UDP, and TCP for larger responses) from pods to CoreDNS.
# Verify DNS config inside a pod
kubectl exec -it <pod-name> -- cat /etc/resolv.conf
# Should show: nameserver 10.96.0.10 (or cluster DNS IP)
# and: search <namespace>.svc.cluster.local svc.cluster.local cluster.local
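The resolv.conf expectations above can be checked mechanically. The sketch below reads the file on stdin and reports what is missing. The function name `check_resolv_conf` is hypothetical, and the cluster domain defaults to `cluster.local`, which is an assumption that may not hold on every cluster.

```shell
# check_resolv_conf [cluster-domain]
# Reads a pod's /etc/resolv.conf on stdin. Prints "ok" when at least one
# nameserver is set and the search list contains svc.<cluster-domain>;
# otherwise prints each problem and fails.
check_resolv_conf() {
  awk -v domain="${1:-cluster.local}" '
    $1 == "nameserver" { ns++ }
    $1 == "search" {
      for (i = 2; i <= NF; i++) if ($i == ("svc." domain)) svc = 1
    }
    END {
      if (!ns)  { print "no nameserver entry"; bad = 1 }
      if (!svc) { print "search list lacks svc." domain; bad = 1 }
      if (!bad) print "ok"
      exit bad
    }'
}

# Usage (against a live cluster):
#   kubectl exec <pod-name> -n <namespace> -- cat /etc/resolv.conf | check_resolv_conf
```

A pod that passes this check but still cannot resolve names points away from pod configuration and toward CoreDNS itself or a NetworkPolicy.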
Storage and PersistentVolume Debugging
Storage issues manifest as pods stuck in Pending (PVC not bound) or as FailedMount events (volume exists but can't be attached):
# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
# Find PVs and their binding status
kubectl get pv
kubectl describe pv <pv-name>
# Check storage class provisioner
kubectl get storageclass
kubectl describe storageclass <name>
If a PVC is stuck in Pending with no events, the provisioner is likely not running. Check for the provisioner's pod in the relevant namespace. For ReadWriteMany access mode, not all storage drivers support it — check the driver documentation.
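# PVC stuck in Terminating? Removing the finalizers forces the deletion through.
# Treat this as a last resort: first confirm that no pod still mounts the claim.
kubectl patch pvc <name> -n <namespace> -p '{"metadata":{"finalizers":null}}'
Debugging Multi-Container Pods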
Pods with multiple containers (sidecars, init containers) require specifying the container name in most kubectl commands:
# List containers in a pod
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].name}'
# Logs from a specific container
kubectl logs <pod-name> -c <container-name> --previous
# Exec into a specific container
kubectl exec -it <pod-name> -c <sidecar-name> -- /bin/sh
# Check init container status
kubectl describe pod <pod-name> | grep -A20 "Init Containers:"
# Init container logs
kubectl logs <pod-name> -c <init-container-name>
A failing init container leaves the pod in Init:CrashLoopBackOff or Init:Error. Always check init container logs first when the main app never appears to start.
Node-Level Debugging
When a node goes NotReady or pods on a specific node all fail simultaneously, investigate at the node level:
# Check node conditions in detail
kubectl describe node <node-name>
# SSH to the node (or use node-shell plugin)
kubectl node-shell <node-name>
# On the node: check kubelet
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager
# Check container runtime
systemctl status containerd
crictl ps                       # list running containers
# Disk usage
df -h
du -sh /var/log/pods/* | sort -hr | head -20
# Memory
free -m
cat /proc/meminfo
Disk pressure is particularly sneaky: the disk fills up gradually with old pod logs and unused image layers. Configure log rotation and set imagePullPolicy: Always only where needed.
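Because the disk fills gradually, it helps to flag mounts before the kubelet starts evicting. The sketch below filters `df -P` output by usage percentage. The function name `full_mounts` is hypothetical, and the 85 percent default is an arbitrary early-warning value chosen for illustration, not a kubelet setting.

```shell
# full_mounts [threshold-percent]
# Reads `df -P` output on stdin and prints the mount point and usage of
# every filesystem at or above the threshold (default 85).
full_mounts() {
  awk -v limit="${1:-85}" '
    NR > 1 {                                  # skip the header row
      use = $5; sub(/%$/, "", use)            # "91%" -> "91"
      if (use + 0 >= limit + 0) print $6 "\t" $5
    }'
}

# Usage (on the node):
#   df -P | full_mounts 80
```

The `-P` flag keeps each filesystem on a single line, which is what makes the fixed column positions safe to rely on.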
Debugging StatefulSets
StatefulSets create pods in order, one at a time, under the default OrderedReady pod management policy: a stuck pod at ordinal N blocks every pod from ordinal N+1 onward.
# Check StatefulSet status
kubectl describe statefulset <name> -n <namespace>
# Individual pod status (ordered 0, 1, 2...)
kubectl get pods -n <namespace> -l app=<label> --sort-by='.metadata.name'
# PVC per pod (each pod gets its own)
kubectl get pvc -n <namespace> -l app=<label>
# If a pod is stuck, check its specific PVC
kubectl describe pvc <pvc-name-for-pod-0>
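Since the lowest-ordinal unhealthy pod is the one holding everything else back, it is worth finding it by numeric ordinal; a plain name sort can place web-10 ahead of web-2. A minimal sketch, assuming single-namespace `kubectl get pods --no-headers` output (columns NAME, READY, STATUS) and a hypothetical function name `first_blocked_pod`:

```shell
# Reads `kubectl get pods --no-headers` lines for one StatefulSet on stdin.
# Assumed columns: NAME READY STATUS ...
# Prints the name and STATUS of the lowest-ordinal pod that is not Running
# with all containers ready; prints nothing when every pod is healthy.
first_blocked_pod() {
  awk '
    {
      ordinal = $1; sub(/^.*-/, "", ordinal)      # "web-10" -> "10"
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) {
        if (best == "" || ordinal + 0 < best + 0) { best = ordinal; line = $1 "\t" $3 }
      }
    }
    END { if (line != "") print line }'
}

# Usage (against a live cluster):
#   kubectl get pods -n <namespace> -l app=<label> --no-headers | first_blocked_pod
```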
Reading the Audit Trail
For intermittent failures or issues that resolved themselves, keep in mind that cluster events expire after one hour by default. Persistent event storage, for example an event exporter shipping to a dedicated backend such as Loki, is essential for post-incident analysis. When investigating past incidents, check HPA scale events and PodDisruptionBudget violations: they rarely show up in standard pod logs but are captured in events.
# Warning events sorted by time (oldest first, within the retention window)
kubectl get events -n <namespace> --sort-by='.lastTimestamp' --field-selector='type=Warning'
# Watch events in real time during a deploy
kubectl get events -n <namespace> -w
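Because events age out, capturing state while the incident is still live pays off later. The sketch below bundles the describe output, previous-container logs, and namespace events into one directory. The function name `snapshot_pod`, the file names, and the default directory name are all assumptions made for illustration.

```shell
# snapshot_pod <pod> <namespace> [output-dir]
# Saves describe output, logs of the previous container instance, and the
# namespace events into a directory, then prints the directory name.
snapshot_pod() {
  dir="${3:-incident-$(date +%Y%m%dT%H%M%S)}"
  mkdir -p "$dir" || return 1
  kubectl describe pod "$1" -n "$2" > "$dir/describe.txt"
  # --previous fails when the container has never restarted; keep the error text
  kubectl logs "$1" -n "$2" --previous > "$dir/previous.log" 2>&1 || true
  kubectl get events -n "$2" --sort-by='.lastTimestamp' > "$dir/events.txt"
  echo "$dir"
}

# Usage (against a live cluster):
#   snapshot_pod api-7f9c prod
```

This is a local stopgap for a single incident, not a substitute for persistent event storage.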
Run kubectl describe pod <pod-name> -n <namespace> > incident.txt before the pod is rescheduled and its history disappears. Events are gone after 1 hour, so export them to persistent storage if you need them for a postmortem.
Debugging Stuck Rollouts
A deployment rollout that never completes blocks future deployments. The most common causes are readiness probe failures on the new version or insufficient cluster capacity:
# Check rollout status
kubectl rollout status deployment/<name> -n <namespace>
# See what the new pods are doing
kubectl get pods -n <namespace> | grep -v Running
kubectl describe pod <new-pod-name> -n <namespace>
# Check rollout history
kubectl rollout history deployment/<name> -n <namespace>
# Emergency rollback
kubectl rollout undo deployment/<name> -n <namespace>
# Pause a rollout to investigate, then resume it
kubectl rollout pause deployment/<name> -n <namespace>
kubectl rollout resume deployment/<name> -n <namespace>
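The status check and the emergency rollback can be combined so that a rollout that never completes does not sit unnoticed. This is a sketch, not a recommended policy: the function name `rollout_or_undo` and the 120s default are assumptions, and an automatic undo may not suit every workload.

```shell
# rollout_or_undo <deployment> <namespace> [timeout]
# Waits for the rollout to finish. If `kubectl rollout status` gives up
# after the timeout (default 120s), rolls back to the previous revision.
rollout_or_undo() {
  if kubectl rollout status "deployment/$1" -n "$2" --timeout="${3:-120s}"; then
    echo "rollout complete"
  else
    echo "rollout did not finish, rolling back"
    kubectl rollout undo "deployment/$1" -n "$2"
  fi
}

# Usage (against a live cluster):
#   rollout_or_undo api prod 300s
```

Without `--timeout`, `kubectl rollout status` keeps waiting, so setting it is what turns a silent stall into an actionable failure.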
When Manual Debugging Takes Too Long
Running through these steps manually under production pressure is slow and error-prone. KubeIntellect automates the entire sequence — it collects describe output, logs, metrics, and events, correlates them across the timeline, and surfaces the root cause with a one-line fix in under two seconds. Ask it in plain English: "Why is my api pod crashing in production?"