Chapter 28: Alerting & Incident Management

The goal of alerting is to notify the right people at the right time about issues that require action.

Good vs Bad Alerts
=================
BAD Alerts (The Alert Storm):
──────────────────────────────
✗ "Disk usage > 0%" - Too sensitive
✗ "CPU > 1%" - Noise
✗ "Any 5xx" without context - Unactionable
✗ "User logged in" - Not an issue
Result: On-call engineers ignore all alerts!
─────────────────────────────────────────
GOOD Alerts:
──────────────────────────────
✓ "Error rate > 1% for 5 minutes" → Action: Investigate
✓ "P99 latency > 2s for 5 min" → Action: Scale/check DB
✓ "Disk > 90%" → Action: Clean up files
✓ "Payment failed > 10%" → Action: Page someone!
✓ "SLO breach imminent" → Action: Stop features
Result: Engineers act on every alert!
Alert Design Principles
======================
1. ACTIONABLE
───────────
Can someone do something about it?
✗ "Queue has messages" (normal!)
✓ "Queue depth > 1000 for 10 min" (action: scale workers)
2. RELEVANT
──────────
Does it matter to the business?
✗ "Debug log created"
✓ "Order checkout failed"
3. TIMELY
─────────
Arrives soon enough to act on, without firing on transient blips
✗ Alert on every spike
✓ Alert after sustained issue (5+ minutes)
4. CLEAR
───────
The person receiving it knows exactly what to do
✗ "Service health degraded"
✓ "Checkout service error rate >5%, likely DB issue"
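A single rule can embody all four principles at once. Here is a sketch in Prometheus-style rule syntax; the `checkout_requests_total` metric name and runbook URL are illustrative, not from a real system:

```yaml
# Sketch: one alert satisfying all four design principles.
# Metric name and runbook URL are hypothetical.
- alert: CheckoutErrorRateHigh
  # RELEVANT: checkout failures hit revenue directly
  expr: |
    sum(rate(checkout_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(checkout_requests_total[5m])) > 0.05
  # TIMELY: only fires after the problem is sustained
  for: 5m
  labels:
    severity: critical
  annotations:
    # CLEAR: the receiver knows what is broken
    summary: "Checkout error rate above 5%"
    # ACTIONABLE: the runbook says what to do next
    runbook_url: "https://wiki.example.com/runbooks/checkout-errors"
```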

Example Alert Thresholds
========================

| Metric         | Warning Threshold | Critical Threshold | Action             |
|----------------|-------------------|--------------------|--------------------|
| Error Rate     | > 0.5%            | > 1%               | Page on-call       |
| P99 Latency    | > 1s              | > 2s               | Investigate        |
| CPU Usage      | > 70%             | > 90%              | Scale              |
| Memory Usage   | > 75%             | > 90%              | Investigate        |
| Disk Usage     | > 80%             | > 90%              | Clean up           |
| Queue Depth    | > 500             | > 1000             | Scale workers      |
| DB Connections | > 70%             | > 90%              | Check slow queries |
| 5xx Errors/min | > 10              | > 50               | Page immediately   |
| Availability   | < 99.9%           | < 99%              | Page immediately   |
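Each warning/critical pair in the table maps naturally onto two rules for the same metric. For disk usage, a sketch using node_exporter's filesystem metrics (assuming node_exporter is deployed) might look like:

```yaml
# Warning at 80%, critical at 90%, matching the threshold table.
# Assumes node_exporter filesystem metrics are being scraped.
- alert: DiskUsageWarning
  expr: |
    (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.80
  for: 10m
  labels:
    severity: warning
- alert: DiskUsageCritical
  expr: |
    (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.90
  for: 5m
  labels:
    severity: critical
```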
Alert Severity Levels
====================
P1 - CRITICAL (Page immediately)
─────────────────────────────────
• Service completely down
• Data loss or corruption
• Security breach
• Revenue impact
Response: Call on-call engineer immediately!
─────────────────────────────────
P2 - HIGH (Respond within 1 hour)
─────────────────────────────────
• Major feature broken
• Error rate > 5%
• Performance severely degraded
Response: Notify team, start investigation
─────────────────────────────────
P3 - MEDIUM (Respond within 4 hours)
─────────────────────────────────
• Minor feature broken
• Performance degraded
• Error rate > 1%
Response: Add to todo list
─────────────────────────────────
P4 - LOW (Informational)
─────────────────────────────────
• Non-critical warnings
• Capacity planning
• Upcoming SLO breaches
Response: Review during work hours
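Severity labels are what let the alerting pipeline treat P1 and P4 differently. A hedged Alertmanager routing sketch (receiver names, webhook URL, and PagerDuty key are placeholders) could route critical alerts to a pager and everything else to chat:

```yaml
# alertmanager.yml (sketch): route by the severity label.
# Receiver names and credentials below are illustrative.
route:
  receiver: slack-notifications   # default for non-critical alerts
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall
      repeat_interval: 1h         # keep paging until acknowledged
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<your-pagerduty-key>"
  - name: slack-notifications
    slack_configs:
      - api_url: "<your-slack-webhook-url>"
        channel: "#alerts"
```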

alerts.yaml
```yaml
groups:
  - name: api-service
    rules:
      # Critical: High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      # Warning: High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s (threshold: 2s)"

      # Critical: Service down
      - alert: ServiceDown
        expr: up{job="api-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "Service has been down for more than 1 minute"

      # Warning: High memory
      - alert: HighMemoryUsage
        expr: |
          (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.pod }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
```

Incident Lifecycle
=================
DETECT ──▶ ACKNOWLEDGE ──▶ DIAGNOSE ──▶ MITIGATE ──▶ RESOLVE ──▶ POST-MORTEM
   ▲                                                                  │
   └────────────────────── Continue monitoring ───────────────────────┘
Step 1: DETECT
─────────────────
Alert fires → Notification sent
Who sees it? On-call engineer
Goal: Acknowledge quickly (< 15 min)
─────────────────
Step 2: ACKNOWLEDGE
─────────────────
Engineer accepts incident
Actions:
• Acknowledge alert
• Create incident ticket
• Notify stakeholders if needed
─────────────────
Step 3: DIAGNOSE
─────────────────
Find root cause
Tools:
• Logs
• Metrics
• Traces
• Dashboards
─────────────────
Step 4: MITIGATE
─────────────────
Stop the bleeding!
Options:
• Rollback
• Scale up
• Disable feature
• Restart service
Goal: Restore service ASAP
─────────────────
Step 5: RESOLVE
─────────────────
Complete fix
Actions:
• Deploy fix
• Verify resolution
• Update incident ticket
• Close incident
─────────────────
Step 6: POST-MORTEM
─────────────────
Learn and improve
Questions:
• What happened?
• Why?
• How to prevent?
• What to do better next time?
Document for team learning!
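The post-mortem questions above can be captured in a lightweight template, kept alongside the runbooks. The structure below is one common shape (all fields illustrative):

```markdown
# Post-Mortem: <incident title>

## Summary
One paragraph: what broke, for how long, who was affected.

## Timeline
- 14:02  Alert fired (HighErrorRate)
- 14:05  On-call acknowledged
- ...

## Root Cause
Why it happened. Keep it blameless: focus on systems, not people.

## What Went Well / What Went Poorly

## Action Items
- [ ] Preventive fix (owner, due date)
- [ ] Update runbook and alert thresholds if needed
```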

A runbook is a document that describes the steps to handle a specific alert or incident.

Example Runbook: High Error Rate
===============================
# Runbook: High Error Rate
## Symptoms
- Error rate > 1% for 5+ minutes
- 5xx HTTP responses increasing
## Impact
- Users experiencing failures
- Potential revenue impact
## Diagnosis Steps
1. Check which endpoints are failing
```
kubectl logs -l app=api --tail=100 | grep "5[0-9][0-9]"
```
2. Check database connectivity
```
kubectl exec -it api-pod -- nc -zv db 5432
```
3. Check recent deployments
```
kubectl rollout history deployment/api
```
4. Check external dependencies
- Stripe API status
- AWS service health
## Mitigation Steps
1. If DB issue:
```
kubectl rollout restart deployment/api
```
2. If dependency issue:
- Enable circuit breaker
- Route traffic to backup
3. If deployment issue:
```
kubectl rollout undo deployment/api
```
## Escalation
- If unresolved in 30 min: Escalate to team lead
- If data loss: Escalate to VP Engineering
## Post-Incident
- Update this runbook if new symptoms found
- Create ticket for preventive work

On-Call Guidelines
=================
┌─────────────────────────────────────────────────────────────┐
│ 1. ROTATION │
│ ─────────────────────────────────────────────────────────│
│ • Primary + Secondary on-call │
│ • Rotate weekly or bi-weekly │
│ • Hand-off during business hours │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. ESCALATION │
│ ─────────────────────────────────────────────────────────│
│ • Clear escalation path │
│ • Auto-escalate after timeout │
│ • Multiple contact methods │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. ALERT VOLUME │
│ ─────────────────────────────────────────────────────────│
│ • Target: < 10 alerts per shift │
│ • Triage alerts: Should this page? │
│ • Fix alerts, not just symptoms! │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 4. COMPENSATION │
│ ─────────────────────────────────────────────────────────│
│ • On-call = extra pay or time off │
│ • Respect off-hours │
│ • No meetings during on-call │
└─────────────────────────────────────────────────────────────┘

Key Takeaways
=============
1. Actionable alerts - Only alert on things requiring action
2. Clear thresholds - Define warning and critical levels
3. Severity levels - P1 (critical) to P4 (low)
4. Runbooks - Document what to do for each alert
5. Incident process - Detect → Acknowledge → Diagnose → Mitigate → Resolve → Post-mortem
6. On-call - Clear rotation, escalation, compensation

Next: Chapter 29: Distributed Tracing