How to Fix CrashLoopBackOff in Kubernetes

CrashLoopBackOff is Kubernetes telling you: "the container crashed, I restarted it, it crashed again, and I'm now slowing down retries to avoid thrashing the node." Kubernetes itself is fine — your container is the problem. Here's how to find and fix it systematically.
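The "slowing down retries" part can be made concrete. The sketch below assumes the kubelet defaults described in the Kubernetes documentation (the delay starts at 10 seconds, doubles after each crash, and is capped at 300 seconds); the function name backoff_schedule is purely illustrative.

```shell
# Print the approximate delay (in seconds) before each of the first N restarts.
# Assumed defaults: 10s initial delay, doubling each time, capped at 300s.
# The ( ... ) body runs in a subshell so the variables stay local.
backoff_schedule() (
  restarts=$1; delay=10; cap=300; out=""; i=0
  while [ "$i" -lt "$restarts" ]; do
    out="${out:+$out }$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
    i=$((i + 1))
  done
  echo "$out"
)

backoff_schedule 7   # prints: 10 20 40 80 160 300 300
```

Once the cap is reached, a pod that keeps crashing is retried roughly every five minutes, which is why a fix can appear to take a while to "kick in" unless you delete the pod.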

Step 1: Check the Restart Count and State

kubectl get pod <pod-name> -n <namespace>

# NAME                     READY  STATUS             RESTARTS   AGE
# api-7d8f9c-xk2p9         0/1    CrashLoopBackOff   14         22m

A restart count of 14 in 22 minutes tells you the crash is fast and consistent, not intermittent. That points to a startup failure rather than a runtime bug. Low restart counts (1–3) with longer intervals suggest a runtime fault that only triggers under certain conditions.

Step 2: Read the Previous Container's Logs

kubectl logs <pod-name> -n <namespace> --previous

# If the pod has multiple containers:
kubectl logs <pod-name> -n <namespace> --previous -c <container-name>

# Limit output to the final lines
kubectl logs <pod-name> -n <namespace> --previous --tail=100

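This is the most important command. By the time you look, the crashed container has usually been replaced, so plain kubectl logs shows the new (often empty) instance; --previous shows the output of the one that died. Common patterns to look for: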

Configuration error: Error: config file not found, environment variable DB_HOST is required
Port conflict: listen tcp :8080: bind: address already in use
Missing dependency: dial tcp: connection refused on startup
Panic / fatal error: stack trace in Go, Python traceback, Java exception
Permission denied: can't write to mounted volume or read a file

If logs are empty, the container is exiting before writing anything — look at the exit code instead.
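The patterns listed above can be turned into a rough first-pass filter. This is a minimal sketch, not a complete classifier: the function name classify_crash_log and the category labels are made up for illustration, and the match strings are only the examples from the list.

```shell
# Read log lines on stdin and print a failure category for each line that
# matches one of the example patterns. Unmatched lines are ignored.
classify_crash_log() {
  while IFS= read -r line; do
    case "$line" in
      *"address already in use"*)        echo "port-conflict" ;;
      *"connection refused"*)            echo "missing-dependency" ;;
      *[Pp]"ermission denied"*)          echo "permission-denied" ;;
      *"not found"*|*"is required"*)     echo "configuration-error" ;;
      *panic:*|*Traceback*|*Exception*)  echo "panic-or-fatal" ;;
    esac
  done
}
```

A possible way to use it is to pipe the previous container's logs through it and count the hits: kubectl logs <pod-name> -n <namespace> --previous --tail=100 | classify_crash_log | sort | uniq -c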

Step 3: Read the Exit Code

kubectl describe pod <pod-name> -n <namespace>

# Look for:
# Last State:     Terminated
#   Reason:       OOMKilled   (or Error)
#   Exit Code:    137

Common exit codes and their meaning:

0: container exited cleanly, but a long-running service should not exit (check your CMD)
1: general application error
2: shell built-in misuse or invalid argument
126: command cannot be executed (permission denied)
127: command not found (wrong entrypoint)
137: killed by SIGKILL (128 + 9); usually OOMKilled by the kernel / cgroup, so confirm with the Reason field
139: segmentation fault (128 + 11, SIGSEGV)
143: terminated by SIGTERM (128 + 15), for example after a failed liveness probe or a pod deletion
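Exit codes above 128 follow the usual shell convention of 128 plus the signal number, which is how 137, 139 and 143 map to signals 9, 11 and 15. A small sketch of that arithmetic (the function name describe_exit_code is illustrative):

```shell
# Describe a container exit code. Codes above 128 mean the process was
# killed by signal number (code - 128).
describe_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application exit status $code"
  fi
}

describe_exit_code 137   # prints: killed by signal 9
```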

Fixing OOMKilled (Exit 137)

The container exceeded its resources.limits.memory. Find the peak usage:

# Current usage
kubectl top pod <pod-name> -n <namespace> --containers

# Historical peak over the last 24h (requires Prometheus; adjust the window as needed)
max_over_time(container_memory_working_set_bytes{namespace="<ns>",pod=~"<name>.*"}[24h])

# Check the current limit
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].resources.limits.memory}'
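To compare the usage figure with the limit, both values need to be in the same unit. The helper below is a sketch that handles only the Ki, Mi and Gi suffixes and plain byte counts; the function names are hypothetical.

```shell
# Convert a Kubernetes memory quantity with a binary suffix to bytes.
# Only Ki, Mi, Gi and plain byte counts are handled in this sketch.
mem_to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;
  esac
}

# Integer percentage (rounded down) of the limit that is currently in use.
mem_usage_percent() {
  used=$(mem_to_bytes "$1"); limit=$(mem_to_bytes "$2")
  echo $(( used * 100 / limit ))
}

mem_usage_percent 480Mi 512Mi   # prints: 93
```

A container that sits above roughly 90% of its limit in steady state has very little room for a spike, which is a reasonable hint that the limit rather than the application needs adjusting.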

Then raise the limit in your Deployment:

# deployment.yaml
containers:
- name: api
  resources:
    requests:
      memory: "256Mi"
    limits:
      memory: "1Gi"   # was 512Mi

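Requests and limits do different jobs: the scheduler reserves the request, while the kernel enforces the limit. Setting the memory request below the limit, as above, gives the pod the Burstable QoS class; it can absorb spikes up to the limit but is evicted sooner under node memory pressure. Setting requests equal to limits (for both CPU and memory, in every container) gives the Guaranteed QoS class, which is the least likely to be evicted. In both cases a spike above the limit still gets the container OOMKilled.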
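The QoS class follows mechanically from the requests and limits. The sketch below covers the single-container case only and compares quantities as plain strings (so "1" and "1000m" would not be treated as equal); the function name qos_class is illustrative.

```shell
# QoS class for a single-container pod.
# Arguments: cpu_request cpu_limit mem_request mem_limit ("" means unset).
# An unset request falls back to the limit, mirroring Kubernetes defaulting.
qos_class() {
  cpu_req=${1:-$2}; cpu_lim=$2; mem_req=${3:-$4}; mem_lim=$4
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo "BestEffort"
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
       [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo "Guaranteed"
  else
    echo "Burstable"
  fi
}

qos_class "" "" 256Mi 1Gi   # the Deployment above: prints Burstable
```

On a live pod the class Kubernetes actually assigned is available in .status.qosClass.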

Fixing Bad Configuration / Missing Secrets

If the app exits because it can't find a required environment variable or config file:

# Verify all referenced secrets exist
kubectl get secret <secret-name> -n <namespace>

# Check what env vars are actually injected (only works while the container is running)
kubectl exec <pod-name> -n <namespace> -- env | sort

# For configmap mounts, confirm the key exists
kubectl get configmap <name> -o yaml

# Compare what the pod expects vs what exists
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].env}' | jq .

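A common trap: the secret exists but the key name is wrong. Your pod spec references DB_PASS but the secret has db_password. With a secretKeyRef the container never starts and the pod reports CreateContainerConfigError; if the reference is marked optional: true, or the secret is loaded with envFrom, the app simply never sees DB_PASS and crashes immediately.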

# Check exact keys in a secret (base64 decoded)
kubectl get secret <name> -n <namespace> -o jsonpath='{.data}' | jq 'to_entries[] | {key: .key, value: (.value | @base64d)}'
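Key-name mismatches are easy to check mechanically. A minimal sketch, assuming you already know which keys the pod spec expects; the function name missing_keys is hypothetical, and DB_HOST is used here only as a second example key.

```shell
# Print every expected key that is absent from the actual key list.
# Both arguments are whitespace-separated lists; matching is case-sensitive.
missing_keys() {
  expected=$1
  actual=" $(printf '%s ' $2)"   # normalise newlines/spaces to single spaces
  for key in $expected; do
    case "$actual" in
      *" $key "*) ;;             # key is present, nothing to report
      *) echo "$key" ;;
    esac
  done
}

missing_keys "DB_HOST DB_PASS" "DB_HOST db_password"   # prints: DB_PASS
```

In practice the second argument could come from the cluster, for example "$(kubectl get secret <name> -n <namespace> -o jsonpath='{.data}' | jq -r 'keys[]')", which lists key names without decoding any secret values.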

Fixing Liveness Probe Misconfiguration

A liveness probe that fires too early is a classic trap. If your app takes 20 seconds to initialize but the probe starts at 10 seconds, the early checks fail; with a short periodSeconds or a low failureThreshold, Kubernetes kills the container before it's ready:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # was 10 — give the app time to boot
  periodSeconds: 10
  failureThreshold: 3

Distinguish between liveness and readiness probes:

Liveness probe failure → container is killed and restarted (causes CrashLoopBackOff)
Readiness probe failure → container stays running but is removed from Service endpoints (no restart)

For slow-starting applications, use a Startup probe instead of a long initialDelaySeconds:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # up to 30 × 10s = 5 minutes to start
  periodSeconds: 10

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  # Startup probe must pass first, so no initialDelaySeconds needed
  periodSeconds: 10
  failureThreshold: 3
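The time a probe allows before the kubelet acts is simple arithmetic, as in the "30 × 10s" comment above. The sketch below gives a rough upper bound only: it ignores timeoutSeconds and kubelet scheduling jitter, and the function name probe_budget_seconds is illustrative.

```shell
# Rough upper bound, in seconds, on how long a container may keep failing a
# probe: initialDelaySeconds + failureThreshold * periodSeconds.
probe_budget_seconds() {
  initial_delay=$1; period=$2; failure_threshold=$3
  echo $(( initial_delay + period * failure_threshold ))
}

probe_budget_seconds 0 10 30   # startup probe above: prints 300
probe_budget_seconds 30 10 3   # earlier liveness probe: prints 60
```

If the application's worst-case boot time is close to or above the result, raise failureThreshold on the startup probe rather than stretching the liveness probe.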

Fixing Init Container Failures (Init:CrashLoopBackOff)

Init containers run sequentially before the main containers start. If one fails, the whole pod is stuck:

# Check init container status
kubectl describe pod <pod-name> | grep -A15 "Init Containers:"

# Get init container logs
kubectl logs <pod-name> -c <init-container-name>
kubectl logs <pod-name> -c <init-container-name> --previous

Common init container failure patterns:

Database not ready: init container runs psql -c "SELECT 1" but the database pod isn't ready yet. Fix: poll in an until loop, ideally with exponential backoff, so the init container waits instead of failing.
Migration script fails: schema migration throws an error. Check the init container logs for the specific SQL error.
Permission error on volume: init container writes a file but the main container runs as a different UID. Fix with securityContext.fsGroup.

# Init container that waits for a service to be ready
initContainers:
- name: wait-for-db
  image: busybox:1.28
  command: ['sh', '-c', 'until nc -z db-service 5432; do sleep 2; done']
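The until loop above waits forever if the database never comes up. A bounded variant with exponential backoff might look like the sketch below; the function name wait_with_backoff and its argument order are assumptions made for this example.

```shell
# Retry a check with doubling delays until it succeeds or MAX_TRIES attempts
# have failed. Returns 0 on success and 1 when it gives up.
# Usage: wait_with_backoff MAX_TRIES FIRST_DELAY MAX_DELAY command [args...]
wait_with_backoff() {
  max_tries=$1; delay=$2; max_delay=$3; shift 3
  try=1
  until "$@"; do
    if [ "$try" -ge "$max_tries" ]; then
      echo "giving up after $try attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt "$max_delay" ]; then delay=$max_delay; fi
    try=$((try + 1))
  done
}
```

Inside the init container's sh -c script it could be called as wait_with_backoff 10 1 30 nc -z db-service 5432. Giving up with a non-zero exit makes the failure visible in the init container's status and logs instead of leaving the pod silently stuck in Init.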

Fixing Resource Quota Exhaustion

Namespace-level ResourceQuotas do not crash running containers, but they can stop replacement pods from being created once the namespace has used up its CPU or memory allocation:

# Check namespace quota usage
kubectl describe resourcequota -n <namespace>

# Output shows used vs. hard limits, for example:
# requests.cpu: 3800m / 4000m      ← nearly exhausted
# requests.memory: 7680Mi / 8Gi    ← only 512Mi left

# Find which pods are consuming the most
kubectl top pods -n <namespace> --sort-by=memory

A pod that exceeds namespace quota will show FailedCreate in the ReplicaSet events rather than CrashLoopBackOff — check kubectl describe replicaset if the pod never appears.
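Whether a new pod fits into the remaining quota is a simple comparison once the CPU values are in millicores. A sketch under the assumption that quantities are either whole cores ("2") or millicores ("500m"); fractional values such as "0.5" are not handled, and the function names are illustrative.

```shell
# Convert a CPU quantity ("500m" or whole cores like "2") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# Succeed (exit 0) when REQUEST still fits between USED and HARD.
# Usage: fits_in_quota USED HARD REQUEST
fits_in_quota() {
  used=$(cpu_to_millicores "$1")
  hard=$(cpu_to_millicores "$2")
  request=$(cpu_to_millicores "$3")
  [ $(( used + request )) -le "$hard" ]
}

if fits_in_quota 3800m 4000m 500m; then echo "fits"; else echo "rejected"; fi   # prints: rejected
```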

Fixing a Bad Entrypoint (Exit 0 or Exit 127)

If a container's entrypoint exits immediately (exit 0), a Deployment will restart it indefinitely, because its pods always use restartPolicy: Always. This happens with script containers that complete their task and exit cleanly. For one-shot jobs, use a Job object instead of a Deployment. For persistent processes, make sure the command doesn't return:

# Wrong — nginx daemonizes and command exits immediately
command: ["nginx"]

# Correct — nginx runs in foreground
command: ["nginx", "-g", "daemon off;"]

# Exit 127 = command not found — wrong binary path
command: ["/usr/local/bin/myapp"]  # binary is at /app/myapp

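# Verify in the image (override any ENTRYPOINT so the check itself runs; needs a shell in the image)
docker run --rm --entrypoint sh <image> -c 'command -v myapp'
docker run --rm --entrypoint ls <image> /usr/local/bin/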

Debugging Race Conditions and Startup Ordering

Microservices that connect to each other on startup frequently crash because Service A starts before Service B is ready. Kubernetes doesn't guarantee pod startup order across deployments. Solutions:

Init containers: Poll for the dependency before the main container starts
Retry logic in the app: The application should retry failed connections with exponential backoff rather than crashing immediately
Readiness gates: Block traffic to a pod until external conditions are met

# Check if a dependency service has ready endpoints
kubectl get endpoints <dependency-service> -n <namespace>

# Watch pod readiness over time
kubectl get pods -n <namespace> -w

Still Stuck? Escalation Path

If you've worked through all the above and the crash is still unclear, the root cause is often a subtle interaction — a startup race condition, a secret that exists but has a wrong key name, or a resource quota at namespace level blocking the pod. These take time to correlate manually. KubeIntellect ingests all of this simultaneously and surfaces the root cause without the grep marathon.