This playbook covers the failure scenarios that account for the vast majority of Kubernetes production incidents. Each section follows the same structure: symptom → diagnostic commands → common causes → fix. Bookmark it for your next on-call shift.

Playbook 1: Pod Stuck in Pending

kubectl describe pod <pod-name> -n <namespace>
# Read the Events section
  • Insufficient cpu / memory: No node has enough capacity. Scale the node group or reduce resources.requests.
    kubectl describe nodes | grep -A8 "Allocated resources"
  • Unbound PVC: The PersistentVolumeClaim is not provisioned.
    kubectl get pvc -n <namespace>
    kubectl describe pvc <pvc-name>
    Check that the StorageClass exists and the provisioner is healthy.
  • Taint/toleration mismatch: Nodes have taints the pod doesn't tolerate.
    kubectl describe nodes | grep Taints
    kubectl get pod <pod> -o jsonpath='{.spec.tolerations}'
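
The taint/toleration comparison can be sketched offline. TAINT below mimics the key=value:Effect format printed by kubectl describe nodes; all three values are made-up samples — substitute the real output of the two commands above:

```shell
# Compare one node taint against one pod toleration (sample values).
TAINT="dedicated=gpu:NoSchedule"
TOLERATION_KEY="dedicated"
TOLERATION_VALUE="gpu"
key_value=${TAINT%%:*}     # strip the effect -> "dedicated=gpu"
key=${key_value%%=*}       # -> "dedicated"
value=${key_value#*=}      # -> "gpu"
if [ "$key" = "$TOLERATION_KEY" ] && [ "$value" = "$TOLERATION_VALUE" ]; then
  echo "pod tolerates this taint"
else
  echo "pod does NOT tolerate this taint"
fi
```

With real data, repeat the comparison for every taint on the candidate node; a single untolerated NoSchedule taint is enough to keep the pod Pending.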

Playbook 2: ImagePullBackOff

kubectl describe pod <pod-name> | grep -A5 "Events"
  • Wrong tag: Verify the image exists.
    # For Docker Hub
    docker manifest inspect <image>:<tag>
    
    # Or just check the registry UI
  • Private registry — missing secret:
    # Create registry secret
    kubectl create secret docker-registry regcred \
      --docker-server=<registry> \
      --docker-username=<user> \
      --docker-password=<token> \
      -n <namespace>
    
    # Reference in pod spec
    spec:
      imagePullSecrets:
      - name: regcred
  • Network / firewall: The node cannot reach the registry. Test from the node with curl -I https://<registry>/v2/.
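
Putting the pieces together, a minimal sketch of a pod that pulls from a private registry — the registry host, image name, and pod name below are placeholders; only the regcred secret name matches the command above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-image-demo          # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/team/app:1.0.0   # placeholder image
  imagePullSecrets:
  - name: regcred                   # must exist in the same namespace
```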

Playbook 3: Service Not Routing Traffic

# Step 1: Check endpoints
kubectl get endpoints <service> -n <namespace>
# "Endpoints: <none>" means no Ready pods match the selector

# Step 2: Compare selector vs pod labels
kubectl get svc <service> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels

# Step 3: Exec and curl from inside cluster
kubectl exec -it <any-pod> -n <namespace> -- \
  curl -v http://<service>.<namespace>.svc.cluster.local:<port>/health

# Step 4: Check NetworkPolicy
kubectl get networkpolicy -n <namespace>
DNS failures look identical to routing failures. Distinguish them by curling
the ClusterIP directly instead of the DNS name:
kubectl get svc <name> -o jsonpath='{.spec.clusterIP}'
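
The Step 2 selector match can be reasoned through offline: a Service routes to a pod only if every key=value pair in the selector appears among the pod's labels. SELECTOR and POD_LABELS below are made-up samples:

```shell
# Check whether every selector pair appears in the pod's label set.
SELECTOR="app=web,tier=frontend"
POD_LABELS="app=web,tier=backend,env=prod"
match="yes"
IFS=','
for pair in $SELECTOR; do
  case ",$POD_LABELS," in
    *",$pair,"*) ;;        # this pair is present in the labels
    *) match="no" ;;       # a missing pair means the Service won't route
  esac
done
unset IFS
echo "selector matches pod labels: $match"   # prints "no" for these samples
```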

Playbook 4: Node NotReady

kubectl describe node <node-name>
# Check Conditions and Events sections
  • kubelet not running:
    # SSH to node
    systemctl status kubelet
    journalctl -u kubelet -n 50
  • Disk pressure:
    df -h
    # Find largest directories
    du -sh /var/log/pods/* | sort -hr | head -20
  • Memory pressure:
    free -m
    # Check for memory hogs
    ps aux --sort=-%mem | head -15
  • Container runtime down:
    systemctl status containerd
    systemctl restart containerd
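
The disk-pressure condition itself is just a threshold check: by default the kubelet marks the node DiskPressure when nodefs available space drops below 10% (the eviction-hard nodefs.available<10% default). A sketch with sample numbers:

```shell
# Reproduce the kubelet's default disk-pressure decision offline.
total_kb=100000000   # 100 GB filesystem (sample)
avail_kb=8000000     # 8 GB free (sample)
pct=$(( avail_kb * 100 / total_kb ))
if [ "$pct" -lt 10 ]; then
  echo "DiskPressure: only ${pct}% available"
else
  echo "ok: ${pct}% available"
fi
```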

Playbook 5: RBAC Permission Denied

# Test what a service account can do
kubectl auth can-i list pods \
  --as=system:serviceaccount:<namespace>:<sa-name>

# List role bindings in namespace
kubectl get rolebindings,clusterrolebindings -A | grep <sa-name>

# Inspect a role's rules
kubectl describe clusterrole <role-name>

For a pod getting 403 Forbidden when calling the Kubernetes API, check the pod's serviceAccountName and ensure the bound Role has the correct verbs for the resource group the app is trying to access.
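
If the verbs turn out to be missing, a minimal Role/RoleBinding pair looks like the sketch below; app-sa, pod-reader, and the default namespace are placeholders for your own names:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
- apiGroups: [""]                    # "" = core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: app-sa                       # the SA your pod runs as
  namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```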

Playbook 6: Deployment Rollout Stuck

kubectl rollout status deployment/<name> -n <namespace>

# Check why new pods won't start
kubectl get pods -n <namespace> | grep -v Running
kubectl describe pod <new-pod>

# Roll back if needed
kubectl rollout undo deployment/<name> -n <namespace>

# Check rollout history
kubectl rollout history deployment/<name>

Stuck rollouts are almost always caused by one of three things:
  • New pods failing readiness checks (bad health endpoint in the new image).
  • maxUnavailable: 0 combined with insufficient capacity to schedule new pods.
  • A PodDisruptionBudget blocking eviction of the old pods.
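
For the capacity case, the relevant knobs live in the deployment's update strategy; an illustrative fragment (values are examples, not a recommendation):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take an old pod down first...
      maxSurge: 1         # ...so the cluster needs headroom for 1 extra pod
```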

Playbook 7: HPA Not Scaling

kubectl describe hpa <name> -n <namespace>
# Check "Conditions" and "Events"

# Metrics server running?
kubectl top pods -n <namespace>
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"

Common causes: metrics-server not installed, a broken custom metrics adapter, the replica count already at maxReplicas, or a target metric name that doesn't match the one the app exports.
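
When the metrics pipeline is healthy, the HPA's scaling decision follows the documented formula desired = ceil(currentReplicas * currentMetric / targetMetric). A sketch with sample numbers, using integer math to emulate ceil():

```shell
# Worked example of the HPA scaling formula.
current_replicas=3
current_cpu=90   # observed average utilization, percent (sample)
target_cpu=60    # HPA target utilization, percent (sample)
desired=$(( (current_replicas * current_cpu + target_cpu - 1) / target_cpu ))
echo "desired replicas: $desired"   # ceil(3 * 90 / 60) = 5
```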

Quick Reference: Status → Playbook

  • Pending → Playbook 1
  • ImagePullBackOff / ErrImagePull → Playbook 2
  • Service unreachable → Playbook 3
  • NotReady node → Playbook 4
  • 403 Forbidden in pod logs → Playbook 5
  • Deployment stuck progressing → Playbook 6
  • HPA not reacting → Playbook 7
  • CrashLoopBackOff → see the CrashLoopBackOff guide