Lab 10: Observability & Monitoring

Objectives

  • ✅ Deploy Prometheus (metrics collection)
  • ✅ Deploy Grafana (dashboards)
  • ✅ Deploy Loki (log aggregation)
  • ✅ Query metrics with PromQL
  • ✅ Create custom dashboards
  • ✅ Set up alerting rules

Prerequisites

  • Lab 09 complete
  • Helm configured
  • ~2 hours

Step 1: Deploy Prometheus

Add Prometheus Helm repo:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Create prometheus-values.yaml:

prometheus:
  prometheusSpec:
    retention: 7d
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "500m"

    # Scrape Kubernetes components
    serviceMonitorSelectorNilUsesHelmValues: false

grafana:
  enabled: true
  adminPassword: admin123  # lab-only credential; use a secret for anything real

alertmanager:
  enabled: true

Install:

helm install prometheus prometheus-community/kube-prometheus-stack \
  -f prometheus-values.yaml \
  -n monitoring \
  --create-namespace

# Verify
kubectl get pods -n monitoring

# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &

# Port-forward to Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &

Step 2: Access Grafana

Open a browser:

  • Grafana: http://localhost:3000
  • Username: admin
  • Password: admin123

The kube-prometheus-stack chart usually provisions a Prometheus datasource automatically. If it is missing, add it manually:

  1. Go to Configuration → Data Sources
  2. Click "Add Data Source"
  3. Select Prometheus
  4. URL: http://prometheus-operated:9090
  5. Click "Save & Test"

Step 3: Create Custom Dashboard

# Check available metrics (quote the URL so the shell leaves '?' alone)
curl 'http://localhost:9090/api/v1/query?query=up'

# Query CPU usage (let curl URL-encode the PromQL expression)
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total[5m])'
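If you script these queries, the fiddly part is URL-encoding PromQL metacharacters. A minimal standard-library Python sketch of the same request URL (endpoint and expression taken from the curl examples above):

```python
from urllib.parse import urlencode

# URL-encode a PromQL expression for the Prometheus HTTP query API
expr = "rate(container_cpu_usage_seconds_total[5m])"
params = urlencode({"query": expr})
url = f"http://localhost:9090/api/v1/query?{params}"

# Brackets and parentheses are percent-encoded, so the URL is shell-safe
print(url)
```

`urlencode` handles the `(`, `)`, `[`, `]` characters that otherwise need backslash-escaping in the shell.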

In Grafana:

  1. Create new Dashboard
  2. Add Panel
  3. Query: rate(container_cpu_usage_seconds_total[5m])
  4. Format: Graph
  5. Title: "CPU Usage"

Step 4: Deploy Loki for Logs

Add Loki repo:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

Install Loki:

helm install loki grafana/loki-stack \
  -n monitoring \
  --set promtail.enabled=true \
  --set prometheus.enabled=false

# Verify
kubectl get pods -n monitoring | grep loki

Add Loki datasource to Grafana:

  1. Configuration → Data Sources
  2. Add Data Source
  3. Select Loki
  4. URL: http://loki:3100

Step 5: Query Logs in Grafana

Create new dashboard panel:

  1. Add Panel → Logs
  2. Query: {namespace="default"}
  3. Run query

View logs:

# Port-forward to Loki, then query it directly
kubectl port-forward -n monitoring svc/loki 3100:3100 &
curl -G 'http://localhost:3100/loki/api/v1/query' \
  --data-urlencode 'query={namespace="default"}'

Step 6: Create Alert Rules

Create alert-rules.yaml. Use a PrometheusRule resource — the Prometheus Operator watches these directly, so no ConfigMap mounting or manual reload is needed:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-alerts
  namespace: monitoring
  labels:
    release: prometheus  # must match the Helm release name so the operator selects it
spec:
  groups:
  - name: kubernetes.rules
    interval: 30s
    rules:
    - alert: HighCPUUsage
      expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
      for: 5m
      annotations:
        summary: "High CPU usage detected"
        description: "Pod {{ $labels.pod }} is using more than 80% of one CPU core"

    - alert: HighMemoryUsage
      expr: container_memory_usage_bytes / 1073741824 > 0.8
      for: 5m
      annotations:
        summary: "High memory usage"
        description: "Pod {{ $labels.pod }} is using more than 80% of 1 GiB"

    - alert: PodRestarts
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
      annotations:
        summary: "Pod restarting frequently"
        description: "Pod {{ $labels.pod }} restarted {{ $value }} times in 1h"

Apply:

kubectl apply -f alert-rules.yaml

# The operator picks up PrometheusRule resources automatically; no restart needed
kubectl get prometheusrule -n monitoring
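The `rate(...[5m]) > 0.8` expression in HighCPUUsage is just a per-second rate computed from a cumulative counter. A small Python illustration of the arithmetic (the sample values are made up):

```python
# Two samples of a cumulative CPU-seconds counter, taken 300s (5m) apart
t1, v1 = 0, 100.0      # time (s), cpu-seconds consumed so far
t2, v2 = 300, 370.0

# rate() ~ counter increase / elapsed time = fraction of one core in use
cpu_rate = (v2 - v1) / (t2 - t1)

print(cpu_rate)          # 0.9
print(cpu_rate > 0.8)    # True -> HighCPUUsage fires once this holds for 5m
```

The `for: 5m` clause means the condition must stay true for the whole window before the alert fires, which filters out short CPU spikes.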

Step 7: Monitor Your Application

Deploy test app with metrics:

kubectl create deployment myapp --image=YOUR_USERNAME/myapp:1.0.0 -n default

# Wait for metrics to be scraped (~1-2 min)

# Query app metrics (returns data only once Prometheus is scraping the app's /metrics endpoint)
curl 'http://localhost:9090/api/v1/query?query=myapp_requests_total'
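Prometheus will not scrape the app on its own: with `serviceMonitorSelectorNilUsesHelmValues: false` from Step 1, it picks up any ServiceMonitor in the cluster. A hypothetical sketch, assuming you also create a Service labeled `app: myapp` that exposes the metrics port under the name `http`:

```yaml
# Hypothetical ServiceMonitor for the test app; the Service name, label,
# and port name are assumptions, not part of the chart defaults
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: default
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: http
    path: /metrics
    interval: 30s
```

After applying it, the target should appear under Status → Targets in the Prometheus UI within a scrape interval or two.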

Step 8: Dashboard Best Practices

Create comprehensive dashboard:

Row 1: Infrastructure
  - Node CPU Usage (graph)
  - Node Memory Usage (graph)
  - Pod Count (gauge)

Row 2: Application
  - Request Rate (graph)
  - Request Latency p95 (graph)
  - Error Rate (graph)

Row 3: System Health
  - Pod Restart Count (table)
  - Node Status (stat)
  - Disk Usage (gauge)

Steps:

  1. Create new dashboard
  2. Add rows
  3. Add panels with queries from above
  4. Customize colors, thresholds, aliases
  5. Save dashboard

Step 9: Log Analysis

Query patterns in Loki:

# All errors
{namespace="default"} |= "ERROR"

# Logs from specific pod
{pod=~"myapp-.*"}

# Error count over time (per-second rate of matching lines)
rate({namespace="default"} |= "ERROR" [5m])
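The log-rate query above boils down to counting matching lines in the window and dividing by its length. A conceptual Python illustration (the log lines are made up):

```python
# Count ERROR lines in a 5-minute window, then convert to a per-second rate
window_seconds = 300
log_lines = [
    "INFO request handled",
    "ERROR db timeout",
    "INFO request handled",
    "ERROR db timeout",
    "ERROR cache miss storm",
]

error_count = sum(1 for line in log_lines if "ERROR" in line)
errors_per_second = error_count / window_seconds

print(error_count)  # 3
```

The `|= "ERROR"` filter is a plain substring match, exactly like the `"ERROR" in line` check here; use `|~` for regex matching instead.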

Validation

# Prometheus scraping metrics
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
curl 'http://localhost:9090/api/v1/query?query=up'
# Should return series

# Grafana dashboards accessible
# http://localhost:3000 (login: admin/admin123)

# Alerts configured
# Open http://localhost:9090 → Status → Rules and confirm kubernetes.rules appears

# Loki collecting logs (loki-stack runs Loki as a StatefulSet)
kubectl logs -f -n monitoring statefulset/loki

Challenge (Optional)

Implement custom metrics in your app:

from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

request_count = Counter('myapp_requests_total', 'Total requests')
request_duration = Histogram('myapp_request_duration_seconds', 'Request duration')

@app.route('/api/data')
def data():
    request_count.inc()
    with request_duration.time():
        # Do work, then return a response (not the view function itself)
        return jsonify({"status": "ok"})

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

Cleanup

helm uninstall prometheus -n monitoring
helm uninstall loki -n monitoring
kubectl delete namespace monitoring

Next: Lab 11: Troubleshooting