# Lab 11: Troubleshooting

## Objectives

- ✅ Debug pod failures (CrashLoopBackOff, Pending, etc.)
- ✅ Inspect events and logs
- ✅ Debug networking (DNS, connectivity)
- ✅ Troubleshoot performance issues
- ✅ Use kubectl tools effectively

## Prerequisites

- All previous labs complete
- ~2 hours
## Scenario 1: Pod in Pending State

Create a broken deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: broken-pending
spec:
  replicas: 1
  selector:
    matchLabels:
      app: broken
  template:
    metadata:
      labels:
        app: broken
    spec:
      containers:
      - name: app
        image: nonexistent/image:v999
        resources:
          requests:
            memory: "50Gi"  # Impossible to allocate
            cpu: "100"
```
Deploy and diagnose:

```bash
kubectl apply -f broken-pending.yaml

# Check pod status
kubectl get pods -l app=broken
# STATUS: Pending

# Debug steps
kubectl describe pod <pod-name>
# Look for the Events section, which will show
# "Insufficient memory" or "Insufficient cpu"

# Check node resources
kubectl describe nodes
# See allocatable and allocated memory/CPU

# Check events cluster-wide
kubectl get events --all-namespaces

# Fix: reduce the resource requests
kubectl patch deployment broken-pending -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"requests":{"memory":"256Mi","cpu":"100m"}}}]}}}}'

# The pod should now be scheduled (it will still fail to pull the
# nonexistent image, but it is no longer Pending)
watch kubectl get pods -l app=broken
```
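Why the pod was Pending: the scheduler places a pod only on a node whose allocatable resources cover the pod's requests. A minimal Python sketch of that fit check (quantity parsing is simplified to the suffixes used in this lab, and the 8Gi/4-core node is an assumed example, not from the lab cluster):

```python
def parse_memory(q: str) -> int:
    """Convert a Kubernetes memory quantity like '50Gi' or '256Mi' to bytes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return int(q[:-len(suffix)]) * factor
    return int(q)  # plain bytes

def parse_cpu(q: str) -> float:
    """Convert a CPU quantity like '100m' (millicores) or '2' to cores."""
    return int(q[:-1]) / 1000 if q.endswith("m") else float(q)

def fits(requests: dict, allocatable: dict) -> bool:
    """True if the pod's requests fit within a node's allocatable resources."""
    return (parse_memory(requests["memory"]) <= parse_memory(allocatable["memory"])
            and parse_cpu(requests["cpu"]) <= parse_cpu(allocatable["cpu"]))

# The broken deployment asks for 50Gi / 100 cores; a typical lab node has far less.
node = {"memory": "8Gi", "cpu": "4"}
print(fits({"memory": "50Gi", "cpu": "100"}, node))    # False -> Pending
print(fits({"memory": "256Mi", "cpu": "100m"}, node))  # True  -> schedulable
```

This is why the fix is to lower the requests rather than restart the pod: nothing was broken at runtime, the pod simply could not be placed.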
## Scenario 2: CrashLoopBackOff

Create an app that crashes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-app
spec:
  containers:
  - name: app
    image: busybox
    command:
    - /bin/sh
    - -c
    - |
      echo "App starting"
      exit 1  # Immediate exit = crash
  restartPolicy: Always
```
Deploy and debug:

```bash
kubectl apply -f crashloop-app.yaml

# Check status
kubectl get pods crashloop-app -w
# STATUS: CrashLoopBackOff

# View logs
kubectl logs crashloop-app
# Shows: "App starting", then the container exits

# View logs from the previous (crashed) container instance
kubectl logs crashloop-app --previous

# Check the restart count
kubectl describe pod crashloop-app
# Shows: Restart Count: 5, Last State: Terminated (exit code 1)

# Fix: make the app not crash
kubectl delete pod crashloop-app
# Edit the command to exit 0 (or run a long-lived process) and redeploy
```
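The "BackOff" in CrashLoopBackOff refers to the kubelet waiting longer between each restart attempt: the delay starts small, doubles per crash, and is capped (commonly 10s doubling up to 5 minutes, reset after the container runs cleanly for a while — exact values are a kubelet implementation detail). A sketch of that schedule, assuming those defaults:

```python
def restart_delays(crashes: int, base: int = 10, cap: int = 300) -> list:
    """Back-off delay in seconds before each restart: doubles per crash, capped."""
    return [min(base * 2**i, cap) for i in range(crashes)]

print(restart_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

This is why a crash-looping pod can sit "doing nothing" for minutes between attempts — it is not stuck, it is waiting out the back-off.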
## Scenario 3: Image Pull Failure

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: imagepull-failed
spec:
  containers:
  - name: app
    image: private.registry.com/secret:v1
    # No credentials provided
```
Debug:

```bash
kubectl apply -f imagepull-failed.yaml

# Check status
kubectl get pods imagepull-failed
# STATUS: ImagePullBackOff

# View events
kubectl describe pod imagepull-failed
# Shows: Failed to pull image "private.registry.com/..."
# Cause: no image pull secret is configured for the private registry

# Fix: create an image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=private.registry.com \
  --docker-username=user \
  --docker-password=pass

# Attach it to the default service account so new pods use it
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets":[{"name":"regcred"}]}'
# Note: existing pods are not updated; delete and recreate the pod
```
## Scenario 4: Service Not Routing Traffic

```yaml
apiVersion: v1
kind: Service
metadata:
  name: broken-service
spec:
  ports:
  - port: 80
    targetPort: 5000
  selector:
    app: api-wrong-label  # Doesn't match the pod
---
apiVersion: v1
kind: Pod
metadata:
  name: api-pod
  labels:
    app: api-correct-label  # Different label
spec:
  containers:
  - name: app
    image: YOUR_USERNAME/myapp:1.0.0
    ports:
    - containerPort: 5000
```
Debug:

```bash
kubectl apply -f broken-service.yaml

# Check the service
kubectl get svc broken-service

# Check endpoints (will be empty)
kubectl get endpoints broken-service
# ENDPOINTS: <none>  <- Problem!

# Diagnose: check the service selector
kubectl describe svc broken-service
# Selector: app=api-wrong-label

# Check pod labels
kubectl get pods --show-labels
# api-pod has label: app=api-correct-label

# They don't match! Fix the service selector
kubectl patch svc broken-service \
  -p '{"spec":{"selector":{"app":"api-correct-label"}}}'

# Endpoints should now populate
kubectl get endpoints broken-service
# ENDPOINTS: 10.x.x.x:5000
```
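The matching rule behind empty endpoints is simple: a Service selects a pod only if every key/value pair in the Service's selector is present in the pod's labels (extra pod labels are fine). A minimal sketch:

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True if every selector key/value appears in the pod's labels."""
    return all(pod_labels.get(k) == v for k, v in selector.items())

pod_labels = {"app": "api-correct-label"}
print(selector_matches({"app": "api-wrong-label"}, pod_labels))    # False -> no endpoints
print(selector_matches({"app": "api-correct-label"}, pod_labels))  # True  -> endpoint added
```

So `kubectl get endpoints` is the fastest way to tell label mismatches apart from genuine network problems: no endpoints means the selector never matched, so no traffic was ever routed.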
## Scenario 5: Network Connectivity Issues

Test connectivity:

```bash
# Start a throwaway client pod
kubectl run pod1 --image=curlimages/curl --rm -it -- /bin/sh
# Inside the shell: curl http://<target-pod-ip>:5000

# Check whether service DNS works
kubectl run test --image=curlimages/curl --rm -it -- \
  curl http://broken-service
# If this fails: DNS or service issue

# Debug DNS resolution
kubectl run -it --image=busybox --restart=Never debug -- nslookup broken-service
# If this fails: CoreDNS issue

# Check network policies, which may be blocking traffic
kubectl get networkpolicy

# Test without network policies (destructive - only do this in a lab cluster!)
kubectl delete networkpolicy --all
# Retest connectivity
```
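When `nslookup broken-service` fails but a fully qualified name works, the problem is usually the resolver's search path rather than CoreDNS itself. The names a pod can use for a Service follow a fixed pattern (assuming the common `cluster.local` cluster domain; clusters can be configured with a different one):

```python
def service_dns_names(service: str, namespace: str = "default",
                      cluster_domain: str = "cluster.local") -> list:
    """Names a pod can use to reach a Service, shortest to fully qualified."""
    return [
        service,                                        # same namespace only
        f"{service}.{namespace}",                       # cross-namespace
        f"{service}.{namespace}.svc",
        f"{service}.{namespace}.svc.{cluster_domain}",  # FQDN
    ]

print(service_dns_names("broken-service"))
```

If the FQDN resolves but the short name doesn't, check the pod's `/etc/resolv.conf` search domains; if neither resolves, look at the CoreDNS pods in `kube-system`.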
## Scenario 6: Resource Exhaustion

Simulate high CPU:

```bash
# Create a pod that burns CPU
kubectl run stress --image=polinux/stress -- stress --cpu 4

# Monitor (requires metrics-server)
watch kubectl top pods

# Or query metrics directly (if Prometheus is deployed)
curl http://prometheus:9090/api/v1/query?query=rate\(container_cpu_usage_seconds_total\[5m\]\)

# If the pod is OOMKilled:
kubectl describe pod stress
# Last State: Terminated (OOMKilled)

# Check limits
kubectl get pod stress -o jsonpath='{.spec.containers[0].resources}'

# Fix: increase the limits. Note that `kubectl run` created a bare pod,
# whose resources cannot be edited in place - add limits to its manifest
# and recreate it. For a deployment, you can patch in place:
kubectl set resources deployment stress \
  --limits=cpu=2,memory=1Gi
```
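CPU and memory limits fail differently: exceeding a CPU limit only throttles the container, while exceeding a memory limit gets the container killed by the kernel (the OOMKilled state above). A sketch of that distinction, with the byte figures as illustrative assumptions:

```python
from typing import Optional

def is_oom_killed(usage_bytes: int, limit_bytes: Optional[int]) -> bool:
    """A container whose memory usage exceeds its limit is OOM-killed.
    With no limit set, it competes for node memory instead of being killed
    at a fixed threshold."""
    return limit_bytes is not None and usage_bytes > limit_bytes

Gi = 1024**3
print(is_oom_killed(int(1.2 * Gi), 1 * Gi))  # True  -> Last State: OOMKilled
print(is_oom_killed(int(0.8 * Gi), 1 * Gi))  # False -> running (possibly throttled)
```

This is why raising the memory limit fixes OOMKilled loops, while a CPU-bound pod like `stress` just runs slower under its CPU limit rather than crashing.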
## Scenario 7: Configuration Errors

ConfigMap/Secret not found:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: config-error
spec:
  containers:
  - name: app
    image: YOUR_USERNAME/myapp:1.0.0
    env:
    - name: PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-secret  # Doesn't exist
          key: password
```
Debug:

```bash
kubectl apply -f config-error.yaml

# The pod fails to start (STATUS: CreateContainerConfigError)
kubectl describe pod config-error
# Events: secret "db-secret" not found

# Create the missing secret
kubectl create secret generic db-secret \
  --from-literal=password=mysecret

# Recreate the pod
kubectl delete pod config-error
kubectl apply -f config-error.yaml
```
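The failure happens at container-creation time: the kubelet resolves each `secretKeyRef` before starting the container, and a missing Secret (or key) aborts the start and surfaces in the pod's events. A minimal Python sketch of that resolution step (the function name and dict shapes are illustrative, not a real API):

```python
def resolve_env(env_spec: list, secrets: dict) -> dict:
    """Resolve secretKeyRef env entries; a missing Secret or key raises,
    mirroring how container creation fails with CreateContainerConfigError."""
    resolved = {}
    for var in env_spec:
        ref = var["valueFrom"]["secretKeyRef"]
        secret = secrets.get(ref["name"])
        if secret is None:
            raise KeyError(f'secret "{ref["name"]}" not found')
        resolved[var["name"]] = secret[ref["key"]]
    return resolved

spec = [{"name": "PASSWORD",
         "valueFrom": {"secretKeyRef": {"name": "db-secret", "key": "password"}}}]

try:
    resolve_env(spec, {})  # before the fix: db-secret missing
except KeyError as e:
    print(e)

print(resolve_env(spec, {"db-secret": {"password": "mysecret"}}))  # after the fix
```

Note that creating the Secret alone is not enough here: env vars are resolved only at container start, which is why the pod must be deleted and re-applied.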
## Scenario 8: Using Tools Effectively

Key debugging commands:

```bash
# Comprehensive pod info
kubectl describe pod <name>

# Live logs
kubectl logs -f <pod>

# Previous container's logs (if it crashed)
kubectl logs <pod> --previous

# All events, oldest first
kubectl get events --sort-by='.lastTimestamp'

# Pod resource usage
kubectl top pods

# Node resource usage
kubectl top nodes

# Full pod YAML
kubectl get pod <name> -o yaml

# Extra pod detail (node, IP)
kubectl get pod <name> -o wide

# Execute a command in a pod
kubectl exec -it <pod> -- /bin/sh

# Port-forward for debugging
kubectl port-forward <pod> 8000:5000

# Watch pods
watch kubectl get pods

# Stream events in real time
kubectl get events -w
```
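`--sort-by='.lastTimestamp'` simply orders events chronologically by that field so the most recent failures land at the bottom of the output — which works because RFC 3339 timestamps in the same format sort correctly as strings. A sketch with made-up sample events:

```python
# Illustrative events (the reasons and timestamps are invented for the example)
events = [
    {"lastTimestamp": "2024-05-01T10:05:00Z", "reason": "BackOff"},
    {"lastTimestamp": "2024-05-01T10:01:00Z", "reason": "Pulling"},
    {"lastTimestamp": "2024-05-01T10:03:00Z", "reason": "Failed"},
]

# Same ordering kubectl applies: lexicographic sort on the timestamp field
for e in sorted(events, key=lambda e: e["lastTimestamp"]):
    print(e["lastTimestamp"], e["reason"])
```

Without sorting, `kubectl get events` output order is not guaranteed, which is why the sorted form is the habit worth building.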
## Validation

Successfully debug all scenarios:

- Fix the Pending pod (reduce resource requests)
- Fix CrashLoopBackOff (make the app exit cleanly)
- Fix the image pull failure (add an image pull secret)
- Fix service routing (correct the selector labels)
- Fix network connectivity (check DNS, review network policies)
- Fix resource exhaustion (increase limits)
- Fix configuration errors (create the missing secret)
## Challenge (Optional)

Create a multi-failure scenario:

```bash
# Deployment with multiple issues:
# - Wrong image tag
# - Missing ConfigMap
# - Resource limits too low
# - Wrong service selector
```

Task: Identify and fix all issues using kubectl tools.
## Cleanup

```bash
kubectl delete pod/crashloop-app
kubectl delete pod/imagepull-failed
kubectl delete pod/config-error
kubectl delete pod/stress
kubectl delete pod/api-pod
kubectl delete deployment/broken-pending
kubectl delete svc/broken-service
kubectl delete secret/regcred secret/db-secret
```
Congratulations! You've completed all 11 labs and mastered DevOps from Docker to troubleshooting! 🎉

## Next Steps

- Review Interview Prep
- Practice with Interview Questions
- Build your own multi-cluster deployment
- Teach these concepts to others