This playbook covers the failure scenarios that account for the vast majority of Kubernetes production incidents. Each section follows the same structure: symptom → diagnostic commands → common causes → fix. Bookmark it for your next on-call shift.
Playbook 1: Pod Stuck in Pending
kubectl describe pod <pod-name> -n <namespace> # Read the Events section
- Insufficient cpu / memory: No node has enough capacity. Scale the node group or reduce resources.requests.
kubectl describe nodes | grep -A8 "Allocated resources"
kubectl top nodes
- Unbound PVC: The PersistentVolumeClaim is not provisioned.
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name>
Check that the StorageClass exists and the provisioner is healthy.
- Taint/toleration mismatch: Nodes have taints the pod doesn't tolerate.
kubectl describe nodes | grep Taints
kubectl get pod <pod> -o jsonpath='{.spec.tolerations}'
- Node affinity rules too restrictive: The pod requires labels no node has.
kubectl get pod <pod> -o jsonpath='{.spec.affinity}' | jq .
kubectl get nodes --show-labels
Playbook 2: ImagePullBackOff
kubectl describe pod <pod-name> | grep -A5 "Events"
- Wrong tag: Verify the image exists.
# For Docker Hub
docker manifest inspect <image>:<tag>
# For Azure Container Registry
az acr repository show-tags --name <registry> --repository <repo>
- Private registry — missing secret:
# Create registry secret
kubectl create secret docker-registry regcred --docker-server=<registry> --docker-username=<user> --docker-password=<token> -n <namespace>
# Reference in pod spec
spec:
  imagePullSecrets:
  - name: regcred
- Network / firewall: The node cannot reach the registry. Test from the node with curl -I https://<registry>/v2/.
- Rate limiting: Docker Hub has pull rate limits. Use authenticated pulls or a registry mirror.
Playbook 3: Service Not Routing Traffic
# Step 1: Check endpoints
kubectl get endpoints <service> -n <namespace>
# "Endpoints: <none>" = selector matches nothing
# Step 2: Compare selector vs pod labels
kubectl get svc <service> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels
# Step 3: Exec and curl from inside cluster
kubectl exec -it <any-pod> -n <namespace> -- curl -v http://<service>.<namespace>.svc.cluster.local:<port>/health
# Step 4: Check NetworkPolicy
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <name>
kubectl get svc <name> -o jsonpath='{.spec.clusterIP}'
Playbook 4: Node NotReady
kubectl describe node <node-name> # Check Conditions and Events sections
- kubelet not running:
# SSH to node
systemctl status kubelet
journalctl -u kubelet -n 50
- Disk pressure:
df -h
# Find largest directories
du -sh /var/log/pods/* | sort -hr | head -20
# Clean up old images
crictl rmi --prune
- Memory pressure:
free -m
# Check for memory hogs
ps aux --sort=-%mem | head -15
- Container runtime down:
systemctl status containerd
systemctl restart containerd
# Verify
crictl ps
Playbook 5: RBAC Permission Denied
# Test what a service account can do
kubectl auth can-i list pods --as=system:serviceaccount:<namespace>:<sa-name> -n <namespace>
# List role bindings that reference the service account (-o wide shows the subjects)
kubectl get rolebindings,clusterrolebindings -A -o wide | grep <sa-name>
# Inspect a role's rules
kubectl describe clusterrole <role-name>
# Find which cluster role bindings reference a user
kubectl get clusterrolebindings -o json | jq -r '.items[] | select(.subjects[]?.name == "<username>") | .metadata.name'
For a pod getting 403 Forbidden when calling the Kubernetes API, check the pod's serviceAccountName and ensure the bound Role has the correct verbs for the resource group the app is trying to access.
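As an illustration of binding a Role with the right verbs, the sketch below grants read access to pods and then re-runs the permission check. The helper name grant_pod_reader and the object names pod-reader and pod-reader-binding are hypothetical; adjust the verbs and resources to whatever the 403 response says was denied.

```shell
# grant_pod_reader NAMESPACE SERVICE_ACCOUNT
# Hypothetical helper: "pod-reader" and "pod-reader-binding" are illustrative names.
grant_pod_reader() {
  ns=$1
  sa=$2
  # Role with read-only verbs on pods (core API group)
  kubectl create role pod-reader --verb=get,list,watch --resource=pods -n "$ns" || return 1
  # Bind the Role to the service account the pod runs as
  kubectl create rolebinding pod-reader-binding --role=pod-reader \
    --serviceaccount="$ns:$sa" -n "$ns" || return 1
  # Re-check the permission for that service account
  kubectl auth can-i list pods --as="system:serviceaccount:$ns:$sa" -n "$ns"
}
```

Run it as grant_pod_reader <namespace> <sa-name>. If the final check still prints no, the pod is probably running under a different service account than the one you bound.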
Playbook 6: Deployment Rollout Stuck
kubectl rollout status deployment/<name> -n <namespace>
# Check why new pods won't start
kubectl get pods -n <namespace> | grep -v Running
kubectl describe pod <new-pod>
# Check if PodDisruptionBudget is blocking eviction of old pods
kubectl get pdb -n <namespace>
kubectl describe pdb <name>
# Roll back if needed
kubectl rollout undo deployment/<name> -n <namespace>
# Check rollout history
kubectl rollout history deployment/<name>
Stuck rollouts are almost always caused by: new pods failing readiness checks (bad health endpoint in new image), maxUnavailable: 0 combined with insufficient capacity to schedule new pods, or a PodDisruptionBudget blocking the eviction of old pods.
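To see why maxUnavailable: 0 depends on spare capacity, it helps to work out how many pods must fit at the peak of a rollout. With nothing allowed to go unavailable, the Deployment can only progress by surging, so the cluster needs room for replicas plus maxSurge pods. The sketch below does that arithmetic; the function names are hypothetical, and it relies on the documented rule that a percentage maxSurge is rounded up.

```shell
# resolve_surge REPLICAS MAX_SURGE
# Prints the absolute surge count; percentages are rounded up.
resolve_surge() {
  case $2 in
    *%) pct=${2%\%}; echo $(( ($1 * pct + 99) / 100 )) ;;
    *)  echo "$2" ;;
  esac
}

# rollout_peak_pods REPLICAS MAX_SURGE
# Prints the number of pods that must be schedulable at the same time.
rollout_peak_pods() {
  echo $(( $1 + $(resolve_surge "$1" "$2") ))
}
```

For example, rollout_peak_pods 10 25% prints 13: if the nodes only have room for 10 of these pods, the three surge pods stay Pending and the rollout stalls.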
Playbook 7: HPA Not Scaling
kubectl describe hpa <name> -n <namespace> # Check "Conditions" and "Events"
# Metrics server running?
kubectl top pods -n <namespace>
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
# What metric is the HPA targeting?
kubectl get hpa <name> -o yaml | grep -A10 "metrics:"
Common causes: metrics-server not installed, custom metrics adapter not working, minReplicas already at maxReplicas, or target metric name doesn't match the one the app exports. For CPU-based HPA, ensure the pod has CPU requests set, because HPA uses requests as the denominator when computing utilization.
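To make the role of CPU requests concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), where utilization is average usage divided by the request. The function name and argument order are assumptions for illustration, and the sketch leaves out the tolerance band, stabilization windows, and min/max clamping that the real controller applies.

```shell
# hpa_desired_replicas CURRENT_REPLICAS AVG_USAGE_MILLICORES REQUEST_MILLICORES TARGET_PERCENT
# Hypothetical helper: integer-only version of the documented formula.
hpa_desired_replicas() {
  current=$1
  usage_m=$2
  request_m=$3
  target_pct=$4
  # Without a CPU request there is no denominator, so utilization is undefined
  if [ "$request_m" -le 0 ]; then
    echo "no CPU request set: utilization is undefined" >&2
    return 1
  fi
  # ceil(current * (usage / request * 100) / target), kept in integers
  num=$(( current * usage_m * 100 ))
  den=$(( request_m * target_pct ))
  echo $(( (num + den - 1) / den ))
}
```

For example, hpa_desired_replicas 4 450 500 60 models four pods averaging 450m against a 500m request (90% utilization) with a 60% target, and prints 6.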
Playbook 8: DNS Resolution Failures
# Test DNS from inside a pod
kubectl run dnstest --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default
# Test a specific service
kubectl run dnstest --image=busybox:1.28 --rm -it --restart=Never -- nslookup <service>.<namespace>.svc.cluster.local
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Read CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Check resolv.conf inside a failing pod
kubectl exec -it <pod> -- cat /etc/resolv.conf
Common causes: CoreDNS pods in CrashLoopBackOff, a NetworkPolicy blocking UDP port 53 from pods to CoreDNS (10.96.0.10 typically), or dnsPolicy: None set without a valid dnsConfig.
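When reading a failing pod's resolv.conf, the two things worth checking first are whether a nameserver is present and whether the cluster search domains were injected. The sketch below automates that check. The function name is hypothetical, and it assumes the default cluster domain cluster.local; pods that use hostNetwork or dnsPolicy: Default legitimately lack the cluster search domains, so treat a warning as a hint rather than a verdict.

```shell
# check_resolv_conf: reads a resolv.conf on stdin and reports what is missing.
# Hypothetical helper; assumes the default cluster domain "cluster.local".
check_resolv_conf() {
  awk '
    $1 == "nameserver" { ns++ }
    $1 == "search" {
      for (i = 2; i <= NF; i++) if ($i ~ /svc\.cluster\.local$/) svc = 1
    }
    END {
      if (!ns)  { print "no nameserver entry"; bad = 1 }
      if (!svc) { print "no svc.cluster.local search domain"; bad = 1 }
      if (!bad) print "ok"
      exit bad + 0
    }'
}
```

Typical use is kubectl exec <pod> -- cat /etc/resolv.conf | check_resolv_conf, which exits non-zero when either piece is missing.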
Playbook 9: PersistentVolume Issues
# PVC stuck in Pending
kubectl describe pvc <name> -n <namespace>
# Events tell you: no PV matches, provisioner not running, etc.
# Check available PVs
kubectl get pv
kubectl describe pv <name> # check claim, status, access modes
# StorageClass provisioner healthy?
kubectl get storageclass
kubectl get pods -n kube-system | grep -i provisioner
# PVC stuck in Terminating (finalizer issue)
kubectl patch pvc <name> -n <namespace> -p '{"metadata":{"finalizers":null}}'
# PV stuck in Released (not reclaimed)
kubectl patch pv <name> -p '{"spec":{"claimRef":null}}'
Playbook 10: Ingress Not Routing
# Check Ingress controller is running
kubectl get pods -n ingress-nginx # or ingress-system
# Inspect the Ingress resource
kubectl describe ingress <name> -n <namespace>
# Check Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50
# Verify backend service and endpoints
kubectl get svc <backend-service> -n <namespace>
kubectl get endpoints <backend-service> -n <namespace>
# Test directly bypassing DNS
curl -H "Host: yourdomain.com" http://<ingress-controller-ip>/path
Common causes: Ingress class annotation mismatch (check kubernetes.io/ingress.class or ingressClassName), path regex not matching actual URL, TLS secret missing, or backend service port name/number mismatch.
Playbook 11: Resource Quota Exceeded
# Check namespace quota usage
kubectl describe resourcequota -n <namespace>
# Shows: used vs hard limits for CPU, memory, pod count, etc.
# Find top CPU consumers
kubectl top pods -n <namespace> --sort-by=cpu
# Find pods without resource requests (no quota enforcement)
kubectl get pods -n <namespace> -o json | jq -r '
.items[] | select(
.spec.containers[].resources.requests == null
) | .metadata.name'
# Temporarily raise quota (then fix root cause)
kubectl patch resourcequota <name> -n <namespace> --type=merge -p '{"spec":{"hard":{"requests.memory":"16Gi"}}}'
Playbook 12: CronJob Not Running
# Check CronJob status and last schedule
kubectl describe cronjob <name> -n <namespace>
# List Jobs created by the CronJob
kubectl get jobs -n <namespace> -l app=<name>
# Check if a Job is still running (blocking next run)
kubectl get pods -n <namespace> -l job-name=<job>
# Check for failed Jobs (Jobs have no status.failed field selector, so filter with jq)
kubectl get jobs -n <namespace> -o json | jq -r '.items[] | select((.status.failed // 0) > 0) | .metadata.name'
# Look at the failed pod logs
kubectl logs -n <namespace> -l job-name=<job>
Common causes: schedule expression wrong (validate at crontab.guru), concurrencyPolicy: Forbid blocking new runs while old job runs, startingDeadlineSeconds too short causing missed schedules, or the CronJob is suspended (spec.suspend: true).
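The fields behind those causes can be pulled out in one call instead of scanning the full describe output. The sketch below is one way to do it with a jsonpath template; the function name cronjob_quick_check is hypothetical, and fields that are unset on the CronJob simply print as empty values.

```shell
# cronjob_quick_check NAME NAMESPACE
# Hypothetical helper: prints the spec fields that most often stop a CronJob from firing.
cronjob_quick_check() {
  kubectl get cronjob "$1" -n "$2" -o jsonpath='schedule={.spec.schedule}{"\n"}suspend={.spec.suspend}{"\n"}concurrencyPolicy={.spec.concurrencyPolicy}{"\n"}startingDeadlineSeconds={.spec.startingDeadlineSeconds}{"\n"}lastScheduleTime={.status.lastScheduleTime}{"\n"}'
}
```

An empty lastScheduleTime alongside suspend=true, or a lastScheduleTime that stopped advancing while concurrencyPolicy=Forbid, points at the corresponding cause above.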
Playbook 13: Certificate / TLS Issues
# Check cert-manager certificates
kubectl get certificates -A
kubectl describe certificate <name> -n <namespace>
# Manual TLS cert expiry check
kubectl get secret <tls-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
# Check if webhook cert is expired (breaks kubectl apply)
kubectl get validatingwebhookconfigurations
kubectl describe validatingwebhookconfiguration <name>
# Rotate a cert-manager certificate manually
kubectl delete secret <tls-secret> -n <namespace>
# cert-manager will re-issue automatically
Expired admission webhook certificates are particularly impactful: they break kubectl apply across the entire cluster. If kubectl apply times out with a webhook error, check webhook certificate expiry first.
Playbook 14: ConfigMap / Secret Drift
# Compare what the pod sees vs what the ConfigMap has
kubectl exec -it <pod> -- env | grep <KEY>
kubectl get configmap <name> -o jsonpath='{.data.<key>}'
# When did the ConfigMap last change?
kubectl get configmap <name> -o yaml | grep -E "last-applied|creationTimestamp"
# Pods don't pick up ConfigMap changes automatically (for env vars)
# You must trigger a rollout after updating a ConfigMap
kubectl rollout restart deployment/<name> -n <namespace>
# For volume-mounted ConfigMaps, changes propagate within ~60s
# Force immediate update:
kubectl rollout restart deployment/<name>
Quick Reference: Status → Playbook
- Pending → Playbook 1
- ImagePullBackOff / ErrImagePull → Playbook 2
- Service unreachable → Playbook 3
- Node NotReady → Playbook 4
- 403 Forbidden in pod logs → Playbook 5
- Deployment stuck progressing → Playbook 6
- HPA not reacting → Playbook 7
- DNS nslookup fails → Playbook 8
- PVC stuck Pending or Terminating → Playbook 9
- Ingress returning 404/502 → Playbook 10
- FailedCreate (quota exceeded) → Playbook 11
- CronJob never fires → Playbook 12
- TLS handshake failures → Playbook 13
- App missing env var after ConfigMap change → Playbook 14
- CrashLoopBackOff → CrashLoopBackOff guide