This playbook covers the failure scenarios that account for the vast majority of Kubernetes production incidents. Each section follows the same structure: symptom → diagnostic commands → common causes → fix. Bookmark it for your next on-call shift.
Playbook 1: Pod Stuck in Pending
kubectl describe pod <pod-name> -n <namespace> # Read the Events section
- Insufficient cpu / memory: No node has enough capacity. Scale the node group or reduce resources.requests.
kubectl describe nodes | grep -A8 "Allocated resources"
- Unbound PVC: The PersistentVolumeClaim is not provisioned.
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
Check that the StorageClass exists and the provisioner is healthy.
- Taint/toleration mismatch: Nodes have taints the pod doesn't tolerate.
kubectl describe nodes | grep Taints
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.tolerations}'
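For the "Insufficient cpu" case, it helps to remember that the scheduler compares resource requests, not live usage: a pod fits a node only if its request is no larger than the node's allocatable CPU minus the requests already placed there. A minimal sketch of that arithmetic follows. The function names to_millicores and cpu_fits are hypothetical, and the inputs are assumed to be copied by hand from the "Allocated resources" output.

```shell
# Convert a Kubernetes CPU quantity ("500m", "2", "0.5") to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;                                        # already millicores
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;  # whole or fractional cores
  esac
}

# cpu_fits ALLOCATABLE ALREADY_REQUESTED POD_REQUEST
# Hypothetical helper: reports whether the pod's CPU request fits on the node.
cpu_fits() {
  free=$(( $(to_millicores "$1") - $(to_millicores "$2") ))
  need=$(to_millicores "$3")
  if [ "$need" -le "$free" ]; then
    echo "fits: need ${need}m, free ${free}m"
  else
    echo "insufficient cpu: need ${need}m, free ${free}m"
    return 1
  fi
}
```

For example, cpu_fits 4 3500m 750m reports a shortfall even if the node is idle, because 3500m of its 4 cores are already reserved by requests.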
Playbook 2: ImagePullBackOff
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Events"
- Wrong tag: Verify the image exists.
# For Docker Hub
docker manifest inspect <image>:<tag>
# Or just check the registry UI
- Private registry — missing secret:
# Create registry secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<token> \
  -n <namespace>
# Reference in pod spec
spec:
  imagePullSecrets:
  - name: regcred
- Network / firewall: The node cannot reach the registry. Test from the node with
curl -I https://<registry>/v2/.
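The three causes above usually leave different fingerprints in the pull error shown in the pod's Events. The sketch below maps an error message to one of them. The function name classify_pull_error and the exact patterns are assumptions for illustration: the wording varies by container runtime and registry, and "pull access denied" can also mean the repository name is wrong, so treat the result as a hint rather than a verdict.

```shell
# classify_pull_error MESSAGE
# Hypothetical helper: guesses which ImagePullBackOff cause a message points to.
classify_pull_error() {
  case "$1" in
    *"manifest unknown"*|*"not found"*)
      echo "wrong-tag" ;;        # the image or tag does not exist in the registry
    *unauthorized*|*"authentication required"*|*"pull access denied"*)
      echo "missing-secret" ;;   # registry rejected the (absent) credentials
    *"i/o timeout"*|*"no such host"*|*"connection refused"*)
      echo "network" ;;          # node could not reach the registry
    *)
      echo "unknown" ;;
  esac
}
```

Pass it the "Failed to pull image" message copied from the Events output.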
Playbook 3: Service Not Routing Traffic
# Step 1: Check endpoints
kubectl get endpoints <service> -n <namespace>
# ENDPOINTS showing <none> = the selector matches no ready pods
# Step 2: Compare selector vs pod labels
kubectl get svc <service> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels
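The comparison in Step 2 is easy to get wrong by eye, since a Service selects a pod only if every selector pair appears in the pod's labels. A minimal sketch of that check follows. The function name labels_match is hypothetical, and both arguments are assumed to be written in the comma-separated key=value form that --show-labels prints.

```shell
# labels_match SELECTOR LABELS
# Hypothetical helper: succeeds only if every key=value pair in SELECTOR
# is present in LABELS. Prints the first missing pair otherwise.
labels_match() {
  selector=$1
  labels=",$2,"                 # pad with commas so each pair matches whole
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do     # split the selector on commas
    case "$labels" in
      *",$pair,"*) ;;           # pair found, keep checking
      *) IFS=$old_ifs; echo "missing: $pair"; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}
```

The cluster can do the same test for you: kubectl get pods -n <namespace> -l app=web lists only the pods that a selector of app=web would match.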
# Step 3: Exec and curl from inside cluster
kubectl exec -it <any-pod> -n <namespace> -- curl -v http://<service>.<namespace>.svc.cluster.local:<port>/health
# Step 4: Check NetworkPolicy
kubectl get networkpolicy -n <namespace>
kubectl get svc <name> -o jsonpath='{.spec.clusterIP}'
Playbook 4: Node NotReady
kubectl describe node <node-name> # Check Conditions and Events sections
- kubelet not running:
# SSH to node
systemctl status kubelet
journalctl -u kubelet -n 50
- Disk pressure:
df -h
# Find largest directories
du -sh /var/log/pods/* | sort -hr | head -20
- Memory pressure:
free -m
# Check for memory hogs
ps aux --sort=-%mem | head -15
- Container runtime down:
systemctl status containerd
systemctl restart containerd
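To decide which of the checks above to start with, the node's conditions can be filtered down to the unhealthy ones. The sketch below reads "Type Status" pairs on standard input and prints only the problems: Ready when it is anything other than True, and any other condition (DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable) when it is True. The function name node_problems is hypothetical.

```shell
# node_problems: reads "Type Status" lines on stdin, prints unhealthy conditions.
# Feed it with, for example:
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{"\n"}{end}'
node_problems() {
  while read -r type status; do
    case "$type" in
      Ready) [ "$status" = "True" ] || echo "Ready=$status" ;;    # healthy only when True
      *)     [ "$status" = "True" ] && echo "$type=$status" ;;    # pressure flags are bad when True
    esac
  done
  return 0
}
```

Ready=Unknown typically means the kubelet has stopped reporting, which points at the kubelet and container runtime checks first.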
Playbook 5: RBAC Permission Denied
# Test what a service account can do
kubectl auth can-i list pods -n <namespace> --as=system:serviceaccount:<namespace>:<sa-name>
# List bindings that reference the service account (all namespaces)
kubectl get rolebindings,clusterrolebindings -A -o wide | grep <sa-name>
# Inspect a role's rules
kubectl describe clusterrole <role-name>
For a pod getting 403 Forbidden when calling the Kubernetes API, check the pod's serviceAccountName and ensure the Role or ClusterRole bound to that service account grants the verbs the app needs on the resource and API group it is trying to access.
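When several verbs are in question, it can be quicker to loop over them than to run can-i by hand. A rough sketch follows, with hypothetical function names sa_user and check_verbs. It assumes kubectl is on the PATH and that your own user is allowed to impersonate service accounts.

```shell
# sa_user NAMESPACE SERVICE_ACCOUNT
# Builds the user name that Kubernetes assigns to a service account.
sa_user() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# check_verbs NAMESPACE SERVICE_ACCOUNT RESOURCE VERB...
# Hypothetical helper: prints yes/no for each verb on the given resource.
check_verbs() {
  ns=$1; sa=$2; resource=$3
  shift 3
  for verb in "$@"; do
    answer=$(kubectl auth can-i "$verb" "$resource" -n "$ns" --as="$(sa_user "$ns" "$sa")")
    printf '%-8s %s\n' "$verb" "$answer"
  done
}
```

A call such as check_verbs <namespace> <sa-name> pods get list watch shows at a glance which verb the bound Role is missing.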
Playbook 6: Deployment Rollout Stuck
kubectl rollout status deployment/<name> -n <namespace>
# Check why new pods won't start
kubectl get pods -n <namespace> | grep -v Running
kubectl describe pod <new-pod> -n <namespace>
# Roll back if needed
kubectl rollout undo deployment/<name> -n <namespace>
# Check rollout history
kubectl rollout history deployment/<name> -n <namespace>
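Stuck rollouts are most often caused by one of three things: new pods failing readiness checks (bad health endpoint in the new image); maxUnavailable: 0 combined with insufficient capacity to schedule new pods; or a PodDisruptionBudget blocking the eviction of old pods during a concurrent node drain. The Deployment controller itself does not honor PodDisruptionBudgets during a rolling update, so a PDB only comes into play when something else, such as a drain, is evicting pods at the same time.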
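The maxUnavailable: 0 case is easier to reason about with numbers. During a rolling update the Deployment may run at most replicas + maxSurge pods and must keep at least replicas - maxUnavailable available. Percentages are resolved against the replica count, rounding up for maxSurge and down for maxUnavailable. A minimal sketch of that arithmetic, with hypothetical function names resolve and rollout_bounds:

```shell
# resolve VALUE REPLICAS up|down
# Turns "25%" or "3" into an absolute pod count.
resolve() {
  case "$1" in
    *%)
      pct=${1%\%}
      if [ "$3" = up ]; then
        echo $(( (pct * $2 + 99) / 100 ))   # round up (maxSurge)
      else
        echo $(( pct * $2 / 100 ))          # round down (maxUnavailable)
      fi ;;
    *) echo "$1" ;;
  esac
}

# rollout_bounds REPLICAS MAX_SURGE MAX_UNAVAILABLE
rollout_bounds() {
  surge=$(resolve "$2" "$1" up)
  unavail=$(resolve "$3" "$1" down)
  echo "max_pods=$(( $1 + surge )) min_available=$(( $1 - unavail ))"
}
```

With rollout_bounds 4 1 0 the result is max_pods=5 min_available=4: no old pod can be removed until a fifth pod is scheduled and ready, so a cluster with no room for that extra pod stalls the rollout.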
Playbook 7: HPA Not Scaling
kubectl describe hpa <name> -n <namespace>
# Check "Conditions" and "Events"
# Metrics server running?
kubectl top pods -n <namespace>
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
Common causes: metrics-server not installed, custom metrics adapter not working, replica count already at maxReplicas, or target metric name doesn't match the one the app exports.
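To tell whether the HPA is stuck or simply has nothing to do, it helps to work out the replica count it should be asking for. The documented rule is desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue), clamped to the minReplicas and maxReplicas bounds. The sketch below is a simplified version: the function name hpa_desired is hypothetical, it takes integer metric values only, and it ignores the tolerance band and stabilization window that the real controller applies.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC MIN_REPLICAS MAX_REPLICAS
hpa_desired() {
  desired=$(( ($1 * $2 + $3 - 1) / $3 ))    # integer ceil(current * metric / target)
  [ "$desired" -lt "$4" ] && desired=$4     # clamp to minReplicas
  [ "$desired" -gt "$5" ] && desired=$5     # clamp to maxReplicas
  echo "$desired"
}
```

For instance, hpa_desired 8 90 60 2 10 yields 10 rather than 12: the HPA wants more replicas but is capped, which from the outside looks like it is not scaling.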
Quick Reference: Status → Playbook
- Pending → Playbook 1
- ImagePullBackOff / ErrImagePull → Playbook 2
- Service unreachable → Playbook 3
- NotReady node → Playbook 4
- 403 Forbidden in pod logs → Playbook 5
- Deployment stuck progressing → Playbook 6
- HPA not reacting → Playbook 7
- CrashLoopBackOff → CrashLoopBackOff guide