Chapter 28: Alerting & Incident Management

The goal of alerting is to notify the right people at the right time about issues that require action.

Good vs Bad Alerts
=================
BAD Alerts (The Alert Storm):
──────────────────────────────
✗ "Disk usage > 0%" - Too sensitive
✗ "CPU > 1%" - Noise
✗ "Any 5xx" without context - Unactionable
✗ "User logged in" - Not an issue
Result: On-call engineers ignore all alerts!
─────────────────────────────────────────
GOOD Alerts:
──────────────────────────────
✓ "Error rate > 1% for 5 minutes" → Action: Investigate
✓ "P99 latency > 2s for 5 min" → Action: Scale/check DB
✓ "Disk > 90%" → Action: Clean up files
✓ "Payment failed > 10%" → Action: Page someone!
✓ "SLO breach imminent" → Action: Stop features
Result: Engineers act on every alert!
Alert Design Principles
======================
1. ACTIONABLE
───────────
Can someone do something about it?
✗ "Queue has messages" (normal!)
✓ "Queue depth > 1000 for 10 min" (action: scale workers)
2. RELEVANT
──────────
Does it matter to the business?
✗ "Debug log created"
✓ "Order checkout failed"
3. TIMELY
─────────
Arrives soon enough to act on, without firing on transient blips
✗ Alert on every spike
✓ Alert after sustained issue (5+ minutes)
4. CLEAR
───────
The person receiving it knows exactly what to do
✗ "Service health degraded"
✓ "Checkout service error rate >5%, likely DB issue"
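A single rule can embody all four principles at once. Here is a sketch in Prometheus-style rule syntax; the `checkout_requests_total` metric name and runbook URL are illustrative, not from a real system:

```yaml
# Sketch: one alert satisfying all four design principles.
# Metric name and runbook URL are hypothetical.
- alert: CheckoutErrorRateHigh
  # RELEVANT: checkout failures hit revenue directly
  expr: |
    sum(rate(checkout_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(checkout_requests_total[5m])) > 0.05
  # TIMELY: only fires after the problem is sustained
  for: 5m
  labels:
    severity: critical
  annotations:
    # CLEAR: the receiver knows what is broken
    summary: "Checkout error rate above 5%"
    # ACTIONABLE: the runbook says what to do next
    runbook_url: "https://wiki.example.com/runbooks/checkout-errors"
```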

Example Alert Thresholds
========================

| Metric         | Warning Threshold | Critical Threshold | Action             |
|----------------|-------------------|--------------------|--------------------|
| Error Rate     | > 0.5%            | > 1%               | Page on-call       |
| P99 Latency    | > 1s              | > 2s               | Investigate        |
| CPU Usage      | > 70%             | > 90%              | Scale              |
| Memory Usage   | > 75%             | > 90%              | Investigate        |
| Disk Usage     | > 80%             | > 90%              | Clean up           |
| Queue Depth    | > 500             | > 1000             | Scale workers      |
| DB Connections | > 70%             | > 90%              | Check slow queries |
| 5xx Errors/min | > 10              | > 50               | Page immediately   |
| Availability   | < 99.9%           | < 99%              | Page immediately   |
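Each warning/critical pair in the table maps naturally onto two rules for the same metric. For disk usage, a sketch using node_exporter's filesystem metrics (assuming node_exporter is deployed) might look like:

```yaml
# Warning at 80%, critical at 90%, matching the threshold table.
# Assumes node_exporter filesystem metrics are being scraped.
- alert: DiskUsageWarning
  expr: |
    (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.80
  for: 10m
  labels:
    severity: warning
- alert: DiskUsageCritical
  expr: |
    (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.90
  for: 5m
  labels:
    severity: critical
```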
Alert Severity Levels
====================
P1 - CRITICAL (Page immediately)
─────────────────────────────────
• Service completely down
• Data loss or corruption
• Security breach
• Revenue impact
Response: Call on-call engineer immediately!
─────────────────────────────────
P2 - HIGH (Respond within 1 hour)
─────────────────────────────────
• Major feature broken
• Error rate > 5%
• Performance severely degraded
Response: Notify team, start investigation
─────────────────────────────────
P3 - MEDIUM (Respond within 4 hours)
─────────────────────────────────
• Minor feature broken
• Performance degraded
• Error rate > 1%
Response: Add to todo list
─────────────────────────────────
P4 - LOW (Informational)
─────────────────────────────────
• Non-critical warnings
• Capacity planning
• Upcoming SLO breaches
Response: Review during work hours
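Severity labels are what let the alerting pipeline treat P1 and P4 differently. A hedged Alertmanager routing sketch (receiver names, webhook URL, and PagerDuty key are placeholders) could route critical alerts to a pager and everything else to chat:

```yaml
# alertmanager.yml (sketch): route by the severity label.
# Receiver names and credentials below are illustrative.
route:
  receiver: slack-notifications   # default for non-critical alerts
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall
      repeat_interval: 1h         # keep paging until acknowledged
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<your-pagerduty-key>"
  - name: slack-notifications
    slack_configs:
      - api_url: "<your-slack-webhook-url>"
        channel: "#alerts"
```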

alerts.yaml
```yaml
groups:
  - name: api-service
    rules:
      # Critical: High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      # Warning: High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s (threshold: 2s)"

      # Critical: Service down
      - alert: ServiceDown
        expr: up{job="api-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "Service has been down for more than 1 minute"

      # Warning: High memory
      - alert: HighMemoryUsage
        expr: |
          (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.pod }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
```

Incident Lifecycle
=================
DETECT ──▶ ACKNOWLEDGE ──▶ DIAGNOSE ──▶ MITIGATE ──▶ RESOLVE ──▶ POST-MORTEM
   ▲                                                                  │
   └────────────────────── Continue monitoring ───────────────────────┘
Step 1: DETECT
─────────────────
Alert fires → Notification sent
Who sees it? On-call engineer
Goal: Acknowledge quickly (< 15 min)
─────────────────
Step 2: ACKNOWLEDGE
─────────────────
Engineer accepts incident
Actions:
• Acknowledge alert
• Create incident ticket
• Notify stakeholders if needed
─────────────────
Step 3: DIAGNOSE
─────────────────
Find root cause
Tools:
• Logs
• Metrics
• Traces
• Dashboards
─────────────────
Step 4: MITIGATE
─────────────────
Stop the bleeding!
Options:
• Rollback
• Scale up
• Disable feature
• Restart service
Goal: Restore service ASAP
─────────────────
Step 5: RESOLVE
─────────────────
Complete fix
Actions:
• Deploy fix
• Verify resolution
• Update incident ticket
• Close incident
─────────────────
Step 6: POST-MORTEM
─────────────────
Learn and improve
Questions:
• What happened?
• Why?
• How to prevent?
• What to do better next time?
Document for team learning!
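The post-mortem questions above can be captured in a lightweight template, kept alongside the runbooks. The structure below is one common shape (all fields illustrative):

```markdown
# Post-Mortem: <incident title>

## Summary
One paragraph: what broke, for how long, who was affected.

## Timeline
- 14:02  Alert fired (HighErrorRate)
- 14:05  On-call acknowledged
- ...

## Root Cause
Why it happened. Keep it blameless: focus on systems, not people.

## What Went Well / What Went Poorly

## Action Items
- [ ] Preventive fix (owner, due date)
- [ ] Update runbook and alert thresholds if needed
```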

A runbook is a document that describes the steps to handle a specific alert or incident.

Example Runbook: High Error Rate
===============================
# Runbook: High Error Rate
## Symptoms
- Error rate > 1% for 5+ minutes
- 5xx HTTP responses increasing
## Impact
- Users experiencing failures
- Potential revenue impact
## Diagnosis Steps
1. Check which endpoints are failing
```
kubectl logs -l app=api --tail=100 | grep "5[0-9][0-9]"
```
2. Check database connectivity
```
kubectl exec -it api-pod -- nc -zv db 5432
```
3. Check recent deployments
```
kubectl rollout history deployment/api
```
4. Check external dependencies
- Stripe API status
- AWS service health
## Mitigation Steps
1. If DB issue:
```
kubectl rollout restart deployment/api
```
2. If dependency issue:
- Enable circuit breaker
- Route traffic to backup
3. If deployment issue:
```
kubectl rollout undo deployment/api
```
## Escalation
- If unresolved in 30 min: Escalate to team lead
- If data loss: Escalate to VP Engineering
## Post-Incident
- Update this runbook if new symptoms found
- Create ticket for preventive work

On-Call Guidelines
=================
┌─────────────────────────────────────────────────────────────┐
│ 1. ROTATION │
│ ─────────────────────────────────────────────────────────│
│ • Primary + Secondary on-call │
│ • Rotate weekly or bi-weekly │
│ • Hand-off during business hours │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. ESCALATION │
│ ─────────────────────────────────────────────────────────│
│ • Clear escalation path │
│ • Auto-escalate after timeout │
│ • Multiple contact methods │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. ALERT VOLUME │
│ ─────────────────────────────────────────────────────────│
│ • Target: < 10 alerts per shift │
│ • Triage alerts: Should this page? │
│ • Fix alerts, not just symptoms! │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 4. COMPENSATION │
│ ─────────────────────────────────────────────────────────│
│ • On-call = extra pay or time off │
│ • Respect off-hours │
│ • No meetings during on-call │
└─────────────────────────────────────────────────────────────┘

Key Takeaways
=============
1. Actionable alerts - Only alert on things requiring action
2. Clear thresholds - Define warning and critical levels
3. Severity levels - P1 (critical) to P4 (low)
4. Runbooks - Document what to do for each alert
5. Incident process - Detect → Acknowledge → Diagnose → Mitigate → Resolve → Post-mortem
6. On-call - Clear rotation, escalation, compensation

Next: Chapter 29: Distributed Tracing