# Chapter 28: Alerting & Incident Management
*Proactive Monitoring and Response*
## 28.1 Alert Principles
The goal of alerting is to notify the right people at the right time about issues that require action.
### Good vs Bad Alerts

**Bad alerts (the alert storm):**

- ✗ "Disk usage > 0%" - too sensitive
- ✗ "CPU > 1%" - noise
- ✗ "Any 5xx" without context - unactionable
- ✗ "User logged in" - not an issue

Result: on-call engineers ignore all alerts!

**Good alerts:**

- ✓ "Error rate > 1% for 5 minutes" → action: investigate
- ✓ "P99 latency > 2s for 5 min" → action: scale / check the DB
- ✓ "Disk > 90%" → action: clean up files
- ✓ "Payment failures > 10%" → action: page someone!
- ✓ "SLO breach imminent" → action: stop feature work

Result: engineers act on every alert!
### The Four Principles of Alert Design
**1. Actionable** - can someone do something about it?

- ✗ "Queue has messages" (normal!)
- ✓ "Queue depth > 1000 for 10 min" (action: scale workers)

**2. Relevant** - does it matter to the business?

- ✗ "Debug log created"
- ✓ "Order checkout failed"

**3. Timely** - not too many, not too few

- ✗ Alert on every spike
- ✓ Alert after a sustained issue (5+ minutes)

**4. Clear** - the person receiving it knows what to do

- ✗ "Service health degraded"
- ✓ "Checkout service error rate > 5%, likely DB issue"
Section titled “28.2 Alert Thresholds”Common Alert Rules
| Metric | Warning Threshold | Critical Threshold | Action |
|---|---|---|---|
| Error Rate | > 0.5% | > 1% | Page on-call |
| P99 Latency | > 1s | > 2s | Investigate |
| CPU Usage | > 70% | > 90% | Scale |
| Memory Usage | > 75% | > 90% | Investigate |
| Disk Usage | > 80% | > 90% | Clean up |
| Queue Depth | > 500 | > 1000 | Scale workers |
| DB Connections | > 70% | > 90% | Check slow queries |
| 5xx Errors/min | > 10 | > 50 | Page immediately |
| Availability | < 99.9% | < 99% | Page immediately |
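As a worked example, the "Disk Usage" row above translates into a warning/critical rule pair. This sketch assumes node_exporter metrics (`node_filesystem_avail_bytes`, `node_filesystem_size_bytes`); adjust the filesystem filters for your hosts.

```yaml
# Warning/critical pair for the Disk Usage thresholds in the table.
groups:
  - name: disk
    rules:
      - alert: DiskUsageWarning
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
             / node_filesystem_size_bytes{fstype!="tmpfs"}) > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk > 80% on {{ $labels.instance }} - clean up files"
      - alert: DiskUsageCritical
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
             / node_filesystem_size_bytes{fstype!="tmpfs"}) > 0.90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk > 90% on {{ $labels.instance }}"
```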
### Alert Severity Levels
**P1 - Critical (page immediately)**

- Service completely down
- Data loss or corruption
- Security breach
- Revenue impact

Response: call the on-call engineer immediately!

**P2 - High (respond within 1 hour)**

- Major feature broken
- Error rate > 5%
- Performance severely degraded

Response: notify the team, start investigating.

**P3 - Medium (respond within 4 hours)**

- Minor feature broken
- Performance degraded
- Error rate > 1%

Response: add to the todo list.

**P4 - Low (informational)**

- Non-critical warnings
- Capacity planning
- Upcoming SLO breaches

Response: review during work hours.
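These severities only matter if the alert router treats them differently. A hedged Alertmanager sketch that pages for critical alerts and sends the rest to chat; the receiver names, PagerDuty key, and Slack webhook are placeholders.

```yaml
# Route critical severity to a pager; everything else to a channel.
route:
  receiver: slack-low-priority      # default for P3/P4-style alerts
  group_by: [alertname, service]
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall    # P1/P2: page a human
      repeat_interval: 1h
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<your-pagerduty-key>"   # placeholder
  - name: slack-low-priority
    slack_configs:
      - channel: "#alerts"
        api_url: "<your-slack-webhook>"       # placeholder
```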
Section titled “28.3 Alert Configuration”Prometheus Alert Rules Example
```yaml
groups:
  - name: api-service
    rules:
      # Critical: High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      # Warning: High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s (threshold: 2s)"

      # Critical: Service down
      - alert: ServiceDown
        expr: up{job="api-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "Service has been down for more than 1 minute"

      # Warning: High memory
      - alert: HighMemoryUsage
        expr: |
          (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.pod }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
```
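Note that `HighErrorRate` (critical) and `HighLatency` (warning) will often fire together during the same outage. Alertmanager inhibition can suppress the warnings while the critical alert is firing, so on-call gets one page instead of a storm; a minimal sketch, assuming alerts carry a `service` label:

```yaml
# While a critical alert fires for a service, suppress its warning-level
# alerts so a single incident produces a single page.
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: [service]   # only inhibit warnings for the same service
```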
Section titled “28.4 Incident Management”Incident Lifecycle
Section titled “Incident Lifecycle” Incident Lifecycle =================
```
DETECT → ACKNOWLEDGE → DIAGNOSE → MITIGATE → RESOLVE → POST-MORTEM
   ▲                                                        │
   └──────────────── continue monitoring ───────────────────┘
```
### Incident Response Steps

**Step 1: Detect** - alert fires, notification sent

- Who sees it? The on-call engineer
- Goal: acknowledge quickly (< 15 min)

**Step 2: Acknowledge** - engineer accepts the incident

- Acknowledge the alert
- Create an incident ticket
- Notify stakeholders if needed

**Step 3: Diagnose** - find the root cause

- Tools: logs, metrics, traces, dashboards

**Step 4: Mitigate** - stop the bleeding!

- Options: rollback, scale up, disable the feature, restart the service
- Goal: restore service ASAP

**Step 5: Resolve** - complete the fix

- Deploy the fix
- Verify resolution
- Update the incident ticket
- Close the incident

**Step 6: Post-mortem** - learn and improve

- What happened? Why?
- How do we prevent it?
- What should we do better next time?

Document it for team learning!
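One way to make step 6 concrete is a structured post-mortem record that every incident fills in the same way. The fields below are a common pattern, not a standard, and all values are invented for illustration.

```yaml
# Post-mortem template (illustrative fields and values).
incident:
  id: INC-1234
  title: "Checkout errors due to exhausted DB connection pool"
  severity: P1
  detected_at: "2024-05-01T14:02:00Z"
  resolved_at: "2024-05-01T14:47:00Z"
timeline:
  - "14:02 HighErrorRate alert fired"
  - "14:05 On-call acknowledged, incident opened"
  - "14:20 Root cause identified: connection pool exhausted"
  - "14:35 Mitigated by restarting API pods"
root_cause: "Connection leak introduced in release v2.41"
action_items:
  - "Add pool-utilization alert (warning at 70%)"
  - "Add connection-leak regression test to CI"
```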
## 28.5 Runbooks

A runbook is a document that describes the steps to handle a specific alert or incident.
### Example Runbook: High Error Rate

**Symptoms**

- Error rate > 1% for 5+ minutes
- 5xx HTTP responses increasing

**Impact**

- Users experiencing failures
- Potential revenue impact

**Diagnosis Steps**

1. Check which endpoints are failing:
   ```shell
   kubectl logs -l app=api --tail=100 | grep "5[0-9][0-9]"
   ```
2. Check database connectivity:
   ```shell
   kubectl exec -it api-pod -- nc -zv db 5432
   ```
3. Check recent deployments:
   ```shell
   kubectl rollout history deployment/api
   ```
4. Check external dependencies:
   - Stripe API status
   - AWS service health

**Mitigation Steps**

1. If it is a DB connection issue, restart the pods:
   ```shell
   kubectl rollout restart deployment/api
   ```
2. If it is a dependency issue:
   - Enable the circuit breaker
   - Route traffic to a backup
3. If it is a deployment issue, roll back:
   ```shell
   kubectl rollout undo deployment/api
   ```

**Escalation**

- If unresolved in 30 min: escalate to the team lead
- If data loss: escalate to the VP of Engineering

**Post-Incident**

- Update this runbook if new symptoms were found
- Create a ticket for preventive work
## 28.6 On-Call Best Practices
**1. Rotation**

- Primary + secondary on-call
- Rotate weekly or bi-weekly
- Hand off during business hours

**2. Escalation**

- Clear escalation path
- Auto-escalate after a timeout
- Multiple contact methods

**3. Alert Volume**

- Target: < 10 alerts per shift
- Triage alerts: should this page?
- Fix alerts, not just symptoms!

**4. Compensation**

- On-call = extra pay or time off
- Respect off-hours
- No meetings during on-call
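The alert-volume target is easier to hit when related alerts are grouped and throttled at the router rather than sent one by one. A hedged Alertmanager sketch; the intervals are illustrative starting points to tune for your team, not recommendations.

```yaml
# Grouping and throttling to keep pages per shift low.
route:
  receiver: oncall
  group_by: [alertname, service]   # one notification per failing service
  group_wait: 30s                  # wait briefly to batch related alerts
  group_interval: 5m               # new alerts in a group: at most every 5m
  repeat_interval: 4h              # re-notify for unresolved alerts every 4h
```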
## Summary

- Actionable alerts - Only alert on things requiring action
- Clear thresholds - Define warning and critical levels
- Severity levels - P1 (critical) to P4 (low)
- Runbooks - Document what to do for each alert
- Incident process - Detect → Acknowledge → Diagnose → Mitigate → Resolve → Post-mortem
- On-call - Clear rotation, escalation, compensation