CrashLoopBackOff is Kubernetes telling you: "the container crashed, I restarted it, it crashed again, and I'm now slowing down retries to avoid thrashing the node." Kubernetes itself is fine — your container is the problem. Here's how to find and fix it systematically.
Step 1: Check the Restart Count and State
kubectl get pod <pod-name> -n <namespace>
# NAME               READY   STATUS             RESTARTS   AGE
# api-7d8f9c-xk2p9   0/1     CrashLoopBackOff   14         22m
A restart count of 14 in 22 minutes tells you the crash is fast and consistent, not intermittent. That points to a startup failure rather than a runtime bug. Low restart counts (1–3) with longer intervals suggest a runtime fault that only triggers under certain conditions.
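To make the "slowing down retries" behaviour concrete, here is a minimal sketch of the restart delay schedule. It assumes the kubelet's documented defaults (a 10-second initial delay that doubles after each crash, capped at 300 seconds); the function name backoff_delay is made up for illustration, and a cluster that tunes these values will behave differently.

```shell
#!/bin/sh
# Sketch: approximate CrashLoopBackOff delay before restart number N (1-based).
# Assumes the default schedule: 10s initial delay, doubling, capped at 300s.
backoff_delay() {
  n=$1
  delay=10
  i=1
  while [ "$i" -lt "$n" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  # Clamp to the cap once doubling overshoots it.
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}

# Print the delay before each of the first 8 restarts.
for n in 1 2 3 4 5 6 7 8; do
  echo "restart $n: wait $(backoff_delay "$n")s"
done
```

Under these assumptions the delay flattens at the cap from the sixth restart onward, so the restart count grows more slowly the longer the loop runs.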
Step 2: Read the Previous Container's Logs
kubectl logs <pod-name> -n <namespace> --previous
# If the pod has multiple containers:
kubectl logs <pod-name> -n <namespace> --previous -c <container-name>
# Limit output to the final lines
kubectl logs <pod-name> -n <namespace> --previous --tail=100
This is the most important command. The container is dead, so you need --previous. Common patterns to look for:
- Configuration error: Error: config file not found, environment variable DB_HOST is required
- Port conflict: listen tcp :8080: bind: address already in use
- Missing dependency: dial tcp: connection refused on startup
- Panic / fatal error: stack trace in Go, Python traceback, Java exception
- Permission denied: can't write to mounted volume or read a file
If logs are empty, the container is exiting before writing anything — look at the exit code instead.
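A rough first pass over the patterns above can be scripted. The sketch below reads a log from stdin and prints a category; the function name classify_crash_log, the category labels, and the match strings are all illustrative assumptions, and real applications will need their own patterns.

```shell
#!/bin/sh
# Sketch: map a crash log (read from stdin) to one of the failure patterns
# listed above. The match strings are illustrative, not exhaustive.
classify_crash_log() {
  log=$(cat)
  if [ -z "$log" ]; then
    # Nothing was written before exit: fall back to the exit code.
    echo "empty-log"
    return
  fi
  case "$log" in
    *"address already in use"*)             echo "port-conflict" ;;
    *"connection refused"*)                 echo "missing-dependency" ;;
    *[Pp]"ermission denied"*)               echo "permission-denied" ;;
    *"not found"*|*"is required"*)          echo "configuration-error" ;;
    *"panic:"*|*"Traceback"*|*"Exception"*) echo "panic-or-fatal-error" ;;
    *)                                      echo "unknown" ;;
  esac
}

# Demo with a canned log line.
printf 'listen tcp :8080: bind: address already in use\n' | classify_crash_log
```

In practice the input would come from the previous container, for example: kubectl logs <pod-name> -n <namespace> --previous --tail=100 | classify_crash_log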
Step 3: Read the Exit Code
kubectl describe pod <pod-name> -n <namespace>
# Look for:
#   Last State:  Terminated
#     Reason:    OOMKilled (or Error)
#     Exit Code: 137
Common exit codes and their meaning:
- 0: container exited cleanly (but shouldn't have; check your CMD)
- 1: general application error
- 2: shell built-in misuse or invalid argument
- 126: command cannot be executed (permission denied)
- 127: command not found (wrong entrypoint)
- 137: killed by SIGKILL (128 + 9), most often OOMKilled by the kernel / cgroup; confirm with the Reason field
- 139: segmentation fault (128 + 11, SIGSEGV)
- 143: terminated by SIGTERM (128 + 15); the process received a termination signal and did not trap it
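The codes above 128 in this list follow the shell convention of 128 plus the signal number, so they can be decoded mechanically. The helper below is a minimal sketch; the name explain_exit_code and the message wording are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: explain a container exit code. Codes above 128 follow the shell
# convention "128 + signal number"; the message wording is illustrative.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly" ;;
    1)   echo "general application error" ;;
    126) echo "command cannot be executed" ;;
    127) echo "command not found" ;;
    *)
      # Linux signal numbers run from 1 to 64, hence the 129..192 range.
      if [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}

# Decode the three signal-based codes from the list.
for c in 137 139 143; do
  echo "$c: $(explain_exit_code "$c")"
done
```

Signal 9 is SIGKILL, 11 is SIGSEGV, and 15 is SIGTERM, which matches the 137, 139, and 143 entries.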
Fixing OOMKilled (Exit 137)
The container exceeded its resources.limits.memory. Find the peak usage:
# Current usage
kubectl top pod <pod-name> -n <namespace> --containers
# Historical peak (requires Prometheus)
container_memory_working_set_bytes{namespace="<ns>",pod=~"<name>.*"}
# Check the current limit
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].resources.limits.memory}'
Then raise the limit in your Deployment:
# deployment.yaml
containers:
  - name: api
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "1Gi" # was 512Mi
This example keeps requests lower than limits for memory. Setting requests equal to limits (for both CPU and memory, on every container) gives the pod the Guaranteed QoS class, which makes it less likely to be evicted under node pressure, but it also leaves no burst headroom above the request, so a single spike past that value kills the container.
Fixing Bad Configuration / Missing Secrets
If the app exits because it can't find a required environment variable or config file:
# Verify all referenced secrets exist
kubectl get secret <secret-name> -n <namespace>
# Check what env vars are actually injected (only works while the container is up between crashes)
kubectl exec <pod-name> -n <namespace> -- env | sort
# For configmap mounts, confirm the key exists
kubectl get configmap <name> -o yaml
# Compare what the pod expects vs what exists
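kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].env}' | jq .
A common trap: the secret exists but the key name is wrong. Your pod spec references DB_PASS but the secret has db_password. If the reference is optional or the secret is loaded with envFrom, the container starts without the variable the app expects and crashes immediately; a required secretKeyRef pointing at a missing key instead leaves the pod in CreateContainerConfigError.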
# Check exact keys in a secret (base64 decoded)
kubectl get secret <name> -n <namespace> -o jsonpath='{.data}' | jq 'to_entries[] | {key: .key, value: (.value | @base64d)}'
Fixing Liveness Probe Misconfiguration
A liveness probe that fires too early is a classic trap. If your app takes 20 seconds to initialize but the probe starts at 10 seconds, Kubernetes kills the container before it's ready:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30 # was 10; gives the app time to boot
  periodSeconds: 10
  failureThreshold: 3
Distinguish between liveness and readiness probes:
- Liveness probe failure → container is killed and restarted (causes CrashLoopBackOff)
- Readiness probe failure → container stays running but is removed from Service endpoints (no restart)
For slow-starting applications, use a Startup probe instead of a long initialDelaySeconds:
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30 # up to 30 × 10s = 5 minutes to start
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  # Startup probe must pass first, so no initialDelaySeconds needed
  periodSeconds: 10
  failureThreshold: 3
Fixing Init Container Failures (Init:CrashLoopBackOff)
Init containers run sequentially before the main containers start. If one fails, the whole pod is stuck:
# Check init container status
kubectl describe pod <pod-name> | grep -A15 "Init Containers:"
# Get init container logs
kubectl logs <pod-name> -c <init-container-name>
kubectl logs <pod-name> -c <init-container-name> --previous
Common init container failure patterns:
- Database not ready: init container runs psql -c "SELECT 1" but the database pod isn't ready yet. Fix: add a loop with exponential backoff, or use a proper init container that polls with until.
- Migration script fails: schema migration throws an error. Check the init container logs for the specific SQL error.
- Permission error on volume: init container writes a file but the main container runs as a different UID. Fix with securityContext.fsGroup.
# Init container that waits for a service to be ready
initContainers:
  - name: wait-for-db
    image: busybox:1.28
    command: ['sh', '-c', 'until nc -z db-service 5432; do sleep 2; done']
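The until loop above polls at a fixed interval and never gives up. As a hedged alternative, the "loop with exponential backoff" fix could look like the sketch below; retry_with_backoff is a hypothetical helper, not something the busybox image ships, and the attempt count and base delay are arbitrary choices.

```shell
#!/bin/sh
# Sketch: retry a command with exponential backoff, then give up.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Note: plain sh has no local variables, so the wrapped command should not
# reuse the names max_attempts, delay, or attempt.
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example for an init container (not run here):
#   retry_with_backoff 6 1 nc -z db-service 5432
# waits 1s, 2s, 4s, 8s, 16s between the six attempts.
```

Returning a non-zero status after the last attempt makes the init container fail visibly, so the failure shows up in kubectl describe instead of hanging forever.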
Fixing Resource Quota Exhaustion
Namespace-level ResourceQuotas can prevent pods from starting once the namespace has consumed its allocated CPU or memory:
# Check namespace quota usage
kubectl describe resourcequota -n <namespace>
# Output shows: used vs hard limits
# requests.cpu: 3800m / 4000m ← nearly exhausted
# requests.memory: 7680Mi / 8Gi ← close to the limit (8Gi = 8192Mi)
# Find which pods are consuming the most
kubectl top pods -n <namespace> --sort-by=memory
A pod that exceeds namespace quota will show FailedCreate in the ReplicaSet events rather than CrashLoopBackOff — check kubectl describe replicaset if the pod never appears.
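Whether a new pod will fit is simple arithmetic on the used and hard values. The sketch below checks a CPU request against the remaining quota; to_millicores and cpu_request_fits are hypothetical helper names, and the parser only handles whole cores ("4") and millicores ("3800m"), not fractional values.

```shell
#!/bin/sh
# Sketch: check whether a new CPU request fits in the remaining namespace quota.
# to_millicores handles only whole cores ("4") and millicores ("3800m");
# fractional values such as "0.5" are out of scope for this sketch.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo "$(( $1 * 1000 ))" ;;
  esac
}

# Usage: cpu_request_fits USED HARD REQUEST  (returns 0 if the request fits)
cpu_request_fits() {
  used=$(to_millicores "$1")
  hard=$(to_millicores "$2")
  request=$(to_millicores "$3")
  [ $((used + request)) -le "$hard" ]
}

if cpu_request_fits 3800m 4 500m; then
  echo "fits"
else
  echo "would exceed quota"   # 3800m + 500m > 4000m
fi
```

With 3800m of 4000m already used, a pod requesting 500m is rejected at creation time, while one requesting 200m would still be admitted.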
Fixing a Bad Entrypoint (Exit 0 or Exit 127)
If a container's entrypoint exits immediately (exit 0), Kubernetes will restart it indefinitely. This happens with script containers that complete their task and exit cleanly. For one-shot jobs, use a Job object instead of a Deployment. For persistent processes, make sure the command doesn't return:
# Wrong: nginx daemonizes and the command exits immediately
command: ["nginx"]
# Correct: nginx runs in the foreground
command: ["nginx", "-g", "daemon off;"]
# Exit 127 = command not found (wrong binary path)
command: ["/usr/local/bin/myapp"] # binary is at /app/myapp
# Verify in the image
docker run --rm <image> which myapp
docker run --rm <image> ls /usr/local/bin/
Debugging Race Conditions and Startup Ordering
Microservices that connect to each other on startup frequently crash because Service A starts before Service B is ready. Kubernetes doesn't guarantee pod startup order across deployments. Solutions:
- Init containers: Poll for the dependency before the main container starts
- Retry logic in the app: The application should retry failed connections with exponential backoff rather than crashing immediately
- Readiness gates: Block traffic to a pod until external conditions are met
# Check if a dependency service has ready endpoints
kubectl get endpoints <dependency-service> -n <namespace>
# Watch pod readiness over time
kubectl get pods -n <namespace> -w
Still Stuck? Escalation Path
If you've worked through all the above and the crash is still unclear, the root cause is often a subtle interaction — a startup race condition, a secret that exists but has a wrong key name, or a resource quota at namespace level blocking the pod. These take time to correlate manually. KubeIntellect ingests all of this simultaneously and surfaces the root cause without the grep marathon.