K8s Troubleshooting Playbook: Step-by-Step Fixes

This playbook covers 14 failure scenarios that come up repeatedly in Kubernetes production incidents. Each section follows the same structure: symptom → diagnostic commands → common causes → fix. Bookmark it for your next on-call shift.

Playbook 1: Pod Stuck in Pending

kubectl describe pod <pod-name> -n <namespace>
# Read the Events section

Insufficient cpu / memory: No node has enough capacity. Scale the node group or reduce resources.requests.
```
kubectl describe nodes | grep -A8 "Allocated resources"
kubectl top nodes
```
Unbound PVC: The PersistentVolumeClaim is not provisioned.
```
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
```
Check that the StorageClass exists and the provisioner is healthy.

Taint/toleration mismatch: Nodes have taints the pod doesn't tolerate.

kubectl describe nodes | grep Taints
kubectl get pod <pod> -o jsonpath='{.spec.tolerations}'

Node affinity rules too restrictive: The pod requires labels no node has.

kubectl get pod <pod> -o jsonpath='{.spec.affinity}' | jq .
kubectl get nodes --show-labels
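
For the "Insufficient cpu / memory" case, keep in mind that the scheduler compares requests against each node's allocatable capacity, not live usage, so a node that looks idle in kubectl top nodes can still be full. The sketch below shows that arithmetic for CPU. The function names cpu_to_millicores and cpu_request_fits are hypothetical helpers, not part of kubectl, and the inputs are values you would read from the "Allocated resources" output yourself.

```shell
# Hypothetical helpers (not part of kubectl): check whether a CPU request
# fits into what is left of a node's allocatable CPU.

# Convert a Kubernetes CPU quantity ("250m", "2", "1.5") to millicores.
cpu_to_millicores() (
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v cores="$1" 'BEGIN { printf "%d\n", cores * 1000 + 0.5 }' ;;
  esac
)

# usage: cpu_request_fits <allocatable> <already-requested> <new-request>
# Exit status 0 means the new request fits on the node.
cpu_request_fits() (
  free=$(( $(cpu_to_millicores "$1") - $(cpu_to_millicores "$2") ))
  [ "$free" -ge "$(cpu_to_millicores "$3")" ]
)
```

For example, cpu_request_fits 4 3500m 750m fails because only 500m of the node's 4 cores is still unrequested, regardless of how little CPU the node is actually using.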

Playbook 2: ImagePullBackOff

kubectl describe pod <pod-name> | grep -A5 "Events"

Wrong tag: Verify the image exists.

# For Docker Hub
docker manifest inspect <image>:<tag>

# For Azure Container Registry
az acr repository show-tags --name <registry> --repository <repo>

Private registry — missing secret:

# Create registry secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<token> \
  -n <namespace>

# Reference in pod spec
spec:
  imagePullSecrets:
  - name: regcred

Network / firewall: The node cannot reach the registry. Test from the node with curl -I https://<registry>/v2/.
Rate limiting: Docker Hub has pull rate limits. Use authenticated pulls or a registry mirror.
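
The curl check for the network / firewall case is easy to misread, because a registry that requires authentication answers the /v2/ probe with 401 even though it is perfectly reachable. The sketch below encodes that interpretation. It assumes the registry implements the standard registry HTTP API under /v2/; classify_registry_status and probe_registry are hypothetical helper names.

```shell
# Hypothetical helper: interpret the HTTP status of GET https://<registry>/v2/.
classify_registry_status() (
  case "$1" in
    200|401) echo "reachable" ;;    # 401 only means the registry wants credentials
    000)     echo "unreachable" ;;  # curl prints 000 when no HTTP response arrived
    *)       echo "unexpected status $1" ;;
  esac
)

# usage: probe_registry <registry-host>   (run this on the node itself)
probe_registry() (
  code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "https://$1/v2/") || true
  classify_registry_status "${code:-000}"
)
```

An "unreachable" result points at DNS, routing, or firewall rules on the node. A "reachable" result moves the investigation back to tags and pull secrets.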

Playbook 3: Service Not Routing Traffic

# Step 1: Check endpoints
kubectl get endpoints <service> -n <namespace>
# "Endpoints: <none>" = selector matches nothing

# Step 2: Compare selector vs pod labels
kubectl get svc <service> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels

# Step 3: Exec and curl from inside cluster
kubectl exec -it <any-pod> -n <namespace> -- \
  curl -v http://<service>.<namespace>.svc.cluster.local:<port>/health

# Step 4: Check NetworkPolicy
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <name>

DNS failures look identical to routing failures. To tell them apart, curl the ClusterIP directly instead of the DNS name:

kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.clusterIP}'
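
The DNS-versus-routing comparison can be scripted as two requests from the same client pod, one by service name and one by ClusterIP. This is a minimal sketch: check_service and classify_service_failure are hypothetical helpers, and the client pod's image is assumed to ship curl. Any HTTP response counts as "connected", since only connectivity is being tested.

```shell
# Hypothetical helpers. Results are "ok" or "fail" for the request by DNS name
# and the request by ClusterIP, in that order.
classify_service_failure() (
  case "$1/$2" in
    ok/ok)     echo "reachable" ;;
    fail/ok)   echo "dns" ;;          # name lookup is the problem
    fail/fail) echo "routing" ;;      # endpoints, NetworkPolicy, or the app itself
    *)         echo "inconclusive" ;;
  esac
)

# usage: check_service <client-pod> <namespace> <service> <port>
# Assumes the client pod's image ships curl.
check_service() (
  pod=$1 ns=$2 svc=$3 port=$4
  ip=$(kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.clusterIP}')
  by_name=fail by_ip=fail
  kubectl exec "$pod" -n "$ns" -- curl -s -o /dev/null --max-time 5 \
    "http://$svc.$ns.svc.cluster.local:$port/" && by_name=ok
  kubectl exec "$pod" -n "$ns" -- curl -s -o /dev/null --max-time 5 \
    "http://$ip:$port/" && by_ip=ok
  classify_service_failure "$by_name" "$by_ip"
)
```

A result of "dns" sends you to Playbook 8. A result of "routing" means the endpoint and NetworkPolicy checks above are the right place to keep looking.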

Playbook 4: Node NotReady

kubectl describe node <node-name>
# Check Conditions and Events sections

kubelet not running:

# SSH to node
systemctl status kubelet
journalctl -u kubelet -n 50

Disk pressure:

df -h
# Find largest directories
du -sh /var/log/pods/* | sort -hr | head -20
# Clean up old images
crictl rmi --prune

Memory pressure:

free -m
# Check for memory hogs
ps aux --sort=-%mem | head -15

Container runtime down:

systemctl status containerd
systemctl restart containerd
# Verify
crictl ps

Playbook 5: RBAC Permission Denied

# Test what a service account can do
kubectl auth can-i list pods -n <namespace> \
  --as=system:serviceaccount:<namespace>:<sa-name>

# List bindings in all namespaces that reference the service account
# (-o wide adds the subject columns that grep needs)
kubectl get rolebindings,clusterrolebindings -A -o wide \
  | grep <sa-name>

# Inspect a role's rules
kubectl describe clusterrole <role-name>

# Find what roles a user has
kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | select(any(.subjects[]?; .name == "<username>")) | .metadata.name'

For a pod getting 403 Forbidden when calling the Kubernetes API, check the pod's serviceAccountName and ensure the bound Role or ClusterRole grants the right verbs on the resource and API group the app is trying to access.
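
When the missing permission is known, the fix is usually a narrowly scoped Role plus a RoleBinding for the pod's service account. The sketch below grants read access to pods as an example. The helper name grant_pod_read and the Role name pod-reader are illustrative assumptions; swap in the verbs and resources your app actually needs.

```shell
# Hypothetical helper: give a service account read access to pods in its own
# namespace, then confirm the permission. The Role name "pod-reader" is illustrative.
# usage: grant_pod_read <namespace> <service-account>
grant_pod_read() (
  ns=$1 sa=$2
  kubectl create role pod-reader -n "$ns" \
    --verb=get,list,watch --resource=pods &&
  kubectl create rolebinding "${sa}-pod-reader" -n "$ns" \
    --role=pod-reader --serviceaccount="${ns}:${sa}" &&
  kubectl auth can-i list pods -n "$ns" \
    --as="system:serviceaccount:${ns}:${sa}"
)
```

The final can-i call should print "yes" once the binding exists. If it still prints "no", the pod is probably running under a different service account than expected.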

Playbook 6: Deployment Rollout Stuck

kubectl rollout status deployment/<name> -n <namespace>

# Check why new pods won't start
kubectl get pods -n <namespace> | grep -v Running
kubectl describe pod <new-pod>

# Check PodDisruptionBudgets: they do not block the rollout itself,
# but they can stall a node drain that is running at the same time
kubectl get pdb -n <namespace>
kubectl describe pdb <name>

# Roll back if needed
kubectl rollout undo deployment/<name> -n <namespace>

# Check rollout history
kubectl rollout history deployment/<name> -n <namespace>

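Stuck rollouts are almost always caused by new pods failing readiness checks (bad health endpoint in the new image) or by maxUnavailable: 0 combined with insufficient capacity to schedule new pods. A PodDisruptionBudget does not block a Deployment rollout directly, because the Deployment controller does not go through the Eviction API, but it can stall a concurrent node drain and leave less schedulable capacity for the new pods.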
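
The capacity problem is easier to see with numbers. When maxSurge and maxUnavailable are given as percentages, Kubernetes rounds maxSurge up and maxUnavailable down. The sketch below applies that rounding; rollout_bounds is a hypothetical helper that takes plain integer percentages.

```shell
# Hypothetical helper: pod-count bounds during a rolling update.
# usage: rollout_bounds <replicas> <maxSurge-percent> <maxUnavailable-percent>
rollout_bounds() (
  replicas=$1 surge_pct=$2 unavail_pct=$3
  surge=$(( (replicas * surge_pct + 99) / 100 ))   # percentage rounds up
  unavail=$(( replicas * unavail_pct / 100 ))      # percentage rounds down
  echo "max_pods=$(( replicas + surge )) min_available=$(( replicas - unavail ))"
)
```

For instance, rollout_bounds 4 25 0 prints max_pods=5 min_available=4: no old pod may be removed until a fifth pod has been scheduled and become Ready, so the rollout stalls if the cluster has no room for that extra pod.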

Playbook 7: HPA Not Scaling

kubectl describe hpa <name> -n <namespace>
# Check "Conditions" and "Events"

# Metrics server running?
kubectl top pods -n <namespace>
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"

# What metric is the HPA targeting?
kubectl get hpa <name> -o yaml | grep -A10 "metrics:"

Common causes: metrics-server not installed, custom metrics adapter not working, the replica count already at maxReplicas, or a target metric name that doesn't match the one the app exports. For a CPU-based HPA, ensure every container in the pod sets CPU requests, because the HPA uses requests as the denominator when computing utilization.
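
To sanity-check what the HPA should be doing, it helps to redo its core calculation by hand: utilization is usage divided by request, and the desired replica count is the ceiling of currentReplicas × currentUtilization / targetUtilization. The sketch below uses integer arithmetic and deliberately ignores the tolerance band, min/max bounds, and stabilization windows. Both function names are hypothetical.

```shell
# Hypothetical helpers that mirror the basic HPA arithmetic.

# usage: cpu_utilization_pct <usage-millicores> <request-millicores>
cpu_utilization_pct() (
  echo $(( $1 * 100 / $2 ))
)

# usage: hpa_desired_replicas <current-replicas> <current-utilization-pct> <target-utilization-pct>
# Integer ceiling of currentReplicas * current / target.
hpa_desired_replicas() (
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
)
```

With 4 replicas at 90% utilization against a 60% target, hpa_desired_replicas 4 90 60 prints 6. If the request is missing there is no denominator, which is why the HPA reports the metric as unavailable instead of scaling.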

Playbook 8: DNS Resolution Failures

# Test DNS from inside a pod
kubectl run dnstest --image=busybox:1.28 --rm -it \
  --restart=Never -- nslookup kubernetes.default

# Test a specific service
kubectl run dnstest --image=busybox:1.28 --rm -it \
  --restart=Never -- nslookup <service>.<namespace>.svc.cluster.local

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Read CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Check resolv.conf inside a failing pod
kubectl exec -it <pod> -- cat /etc/resolv.conf

Common causes: CoreDNS pods in CrashLoopBackOff, a NetworkPolicy blocking port 53 (UDP and TCP) from pods to CoreDNS (the cluster DNS Service IP, often 10.96.0.10), or dnsPolicy: None set without a valid dnsConfig.
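
If a restrictive egress NetworkPolicy turns out to be the cause, one possible fix is an additional policy that explicitly allows DNS. The sketch below prints such a manifest. It assumes CoreDNS runs in kube-system with the k8s-app=kube-dns label used in the commands above, and that namespaces carry the automatic kubernetes.io/metadata.name label (present on recent Kubernetes versions). The helper name and the policy name allow-dns-egress are illustrative.

```shell
# Hypothetical helper: print a NetworkPolicy that lets every pod in a namespace
# reach CoreDNS on port 53 over UDP and TCP.
# usage: print_allow_dns_policy <namespace> | kubectl apply -f -
print_allow_dns_policy() (
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
)
```

Review the printed manifest before applying it. NetworkPolicies are additive, so this only opens DNS and leaves other egress rules as they are.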

Playbook 9: PersistentVolume Issues

# PVC stuck in Pending
kubectl describe pvc <name> -n <namespace>
# Events tell you: no PV matches, provisioner not running, etc.

# Check available PVs
kubectl get pv
kubectl describe pv <name>  # check claim, status, access modes

# StorageClass provisioner healthy?
kubectl get storageclass
kubectl get pods -n kube-system | grep -i provisioner

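# PVC stuck in Terminating (finalizer issue)
# The kubernetes.io/pvc-protection finalizer clears on its own once no pod uses
# the claim, so check for pods still mounting it before removing finalizers by hand
kubectl patch pvc <name> -n <namespace> \
  -p '{"metadata":{"finalizers":null}}'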

# PV stuck in Released (not reclaimed)
kubectl patch pv <name> -p '{"spec":{"claimRef":null}}'

Playbook 10: Ingress Not Routing

# Check Ingress controller is running
kubectl get pods -n ingress-nginx   # or ingress-system

# Inspect the Ingress resource
kubectl describe ingress <name> -n <namespace>

# Check Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50

# Verify backend service and endpoints
kubectl get svc <backend-service> -n <namespace>
kubectl get endpoints <backend-service> -n <namespace>

# Test directly bypassing DNS
curl -H "Host: yourdomain.com" http://<ingress-controller-ip>/path

Common causes: Ingress class annotation mismatch (check kubernetes.io/ingress.class or ingressClassName), path regex not matching actual URL, TLS secret missing, or backend service port name/number mismatch.
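
The ingress class mismatch is quick to rule out by comparing the Ingress's ingressClassName with the IngressClass objects that exist in the cluster. A minimal sketch follows; ingress_class_exists is a hypothetical helper, and it does not inspect the legacy kubernetes.io/ingress.class annotation.

```shell
# Hypothetical helper: does the Ingress reference an IngressClass that exists?
# usage: ingress_class_exists <ingress-name> <namespace>
ingress_class_exists() (
  wanted=$(kubectl get ingress "$1" -n "$2" -o jsonpath='{.spec.ingressClassName}')
  if [ -z "$wanted" ]; then
    echo "no ingressClassName set"
    exit 1
  fi
  # jsonpath prints the class names separated by spaces.
  for class in $(kubectl get ingressclass -o jsonpath='{.items[*].metadata.name}'); do
    if [ "$class" = "$wanted" ]; then
      echo "ingressClassName $wanted exists"
      exit 0
    fi
  done
  echo "ingressClassName $wanted matches no IngressClass"
  exit 1
)
```

When no ingressClassName is set, routing depends on a default IngressClass or the legacy annotation, so check those next.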

Playbook 11: Resource Quota Exceeded

# Check namespace quota usage
kubectl describe resourcequota -n <namespace>
# Shows: used vs hard limits for CPU, memory, pod count, etc.

# Find top CPU consumers
kubectl top pods -n <namespace> --sort-by=cpu

# Find pods with a container that sets no resource requests
# (a CPU/memory quota rejects such pods unless a LimitRange supplies defaults)
kubectl get pods -n <namespace> -o json | jq -r '
  .items[] | select(
    any(.spec.containers[]; .resources.requests == null)
  ) | .metadata.name'

# Temporarily raise quota (then fix root cause)
kubectl patch resourcequota <name> -n <namespace> \
  --type=merge -p '{"spec":{"hard":{"requests.memory":"16Gi"}}}'

Playbook 12: CronJob Not Running

# Check CronJob status and last schedule
kubectl describe cronjob <name> -n <namespace>

# List Jobs created by the CronJob
# (assumes the jobTemplate labels its Jobs with app=<name>)
kubectl get jobs -n <namespace> -l app=<name>

# Check if a Job is still running (blocking next run)
kubectl get pods -n <namespace> -l job-name=<job>

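# Check for failed Jobs
# (Jobs only support status.successful as a status field selector, so filter with jq)
kubectl get jobs -n <namespace> -o json | \
  jq -r '.items[] | select((.status.failed // 0) > 0) | .metadata.name'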

# Look at the failed pod logs
kubectl logs -n <namespace> -l job-name=<job>

Common causes: schedule expression wrong (validate at crontab.guru), concurrencyPolicy: Forbid blocking new runs while an old job is still running, startingDeadlineSeconds too short causing missed schedules, or the CronJob is suspended (spec.suspend: true).
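
To separate "the schedule never fires" from "the job itself is broken", it can help to check the suspend flag and then run the job template once by hand. The sketch below does both. trigger_cronjob is a hypothetical helper, and the -manual-<timestamp> suffix is just an illustrative naming choice.

```shell
# Hypothetical helper: refuse to run if the CronJob is suspended, otherwise
# create a one-off Job from its template.
# usage: trigger_cronjob <cronjob-name> <namespace>
trigger_cronjob() (
  name=$1 ns=$2
  suspended=$(kubectl get cronjob "$name" -n "$ns" -o jsonpath='{.spec.suspend}')
  if [ "$suspended" = "true" ]; then
    echo "cronjob/$name is suspended; resume it with:"
    echo "kubectl patch cronjob $name -n $ns -p '{\"spec\":{\"suspend\":false}}'"
    exit 1
  fi
  # Run the job template once, outside the schedule, to test it in isolation.
  kubectl create job "${name}-manual-$(date +%s)" --from="cronjob/${name}" -n "$ns"
)
```

If the manual Job succeeds, the template is fine and the problem lies in the schedule, concurrencyPolicy, or startingDeadlineSeconds.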

Playbook 13: Certificate / TLS Issues

# Check cert-manager certificates
kubectl get certificates -A
kubectl describe certificate <name> -n <namespace>

# Manual TLS cert expiry check
kubectl get secret <tls-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -dates

# Check if webhook cert is expired (breaks kubectl apply)
kubectl get validatingwebhookconfigurations
kubectl describe validatingwebhookconfiguration <name>

# Rotate a cert-manager certificate manually
kubectl delete secret <tls-secret> -n <namespace>
# cert-manager will re-issue automatically

Expired admission webhook certificates are particularly impactful: when the webhook's failurePolicy is Fail, every request it matches is rejected, which can break kubectl apply across the entire cluster. If kubectl apply fails or times out with a webhook error, check webhook certificate expiry first.
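
Expiry is easier to catch before it causes an outage. The sketch below wraps openssl's -checkend option so a certificate can be tested against a warning window. It assumes the certificate in question (for example a webhook's serving certificate) is available as PEM, typically from a TLS secret; cert_expires_within is a hypothetical helper name.

```shell
# Hypothetical helper. Reads a PEM certificate on stdin. Exit status 0 means the
# certificate expires within the given number of days or has already expired.
# An unreadable certificate is also reported as expiring.
# usage: ... | cert_expires_within <days>
cert_expires_within() (
  # openssl -checkend exits 0 when the certificate is still valid after N seconds.
  ! openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null 2>&1
)
```

A possible use is to pipe the decoded tls.crt from a secret into cert_expires_within 30 and alert when it succeeds.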

Playbook 14: ConfigMap / Secret Drift

# Compare what the pod sees vs what the ConfigMap has
kubectl exec -it <pod> -- env | grep <KEY>
kubectl get configmap <name> -o jsonpath='{.data.<key>}'

# When did the ConfigMap last change?
kubectl get configmap <name> -n <namespace> -o yaml --show-managed-fields | \
  grep -E "creationTimestamp|time:"

# Pods don't pick up ConfigMap changes automatically (for env vars)
# You must trigger a rollout after updating a ConfigMap
kubectl rollout restart deployment/<name> -n <namespace>

# For volume-mounted ConfigMaps, changes usually propagate within about a minute
# (subPath mounts are the exception: they never update in place)
# Force immediate update:
kubectl rollout restart deployment/<name> -n <namespace>
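
The env-var comparison at the top of this playbook can be wrapped into a single check that reports drift explicitly. This is a minimal sketch: config_drift is a hypothetical helper, it assumes the container image provides printenv, and ConfigMap keys containing dots would need escaping in the jsonpath expression.

```shell
# Hypothetical helper: compare an env var inside a running pod with the
# current value of a ConfigMap key.
# usage: config_drift <pod> <namespace> <configmap> <key> <env-var>
config_drift() (
  pod=$1 ns=$2 cm=$3 key=$4 var=$5
  in_pod=$(kubectl exec "$pod" -n "$ns" -- printenv "$var") || true
  in_cm=$(kubectl get configmap "$cm" -n "$ns" -o jsonpath="{.data.$key}") || true
  if [ "$in_pod" = "$in_cm" ]; then
    echo "in sync"
  else
    echo "drift: pod has '$in_pod', ConfigMap has '$in_cm'"
    exit 1
  fi
)
```

A drift result after a ConfigMap update means the pod started before the change and needs the rollout restart shown above.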

Quick Reference: Status → Playbook

Pending → Playbook 1
ImagePullBackOff / ErrImagePull → Playbook 2
Service unreachable → Playbook 3
Node NotReady → Playbook 4
403 Forbidden in pod logs → Playbook 5
Deployment stuck progressing → Playbook 6
HPA not reacting → Playbook 7
DNS nslookup fails → Playbook 8
PVC stuck Pending or Terminating → Playbook 9
Ingress returning 404/502 → Playbook 10
FailedCreate — quota exceeded → Playbook 11
CronJob never fires → Playbook 12
TLS handshake failures → Playbook 13
App missing env var after ConfigMap change → Playbook 14
CrashLoopBackOff → CrashLoopBackOff guide