
Incident Management

Incident Management is the practice of responding to and resolving outages and service disruptions effectively.

┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Lifecycle │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Incident Lifecycle │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Detect │───▶│ Respond │───▶│ Resolve │───▶│ Learn │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ │ │
│ │ - Monitoring - Triage - Fix - Post-Mortem │ │
│ │ - Alerting - Escalate - Verify - Process Improve│ │
│ │ - Reporting - Communicate - Notify - Document │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Key Metrics: │
│ ✓ MTTR - Mean Time To Recovery │
│ ✓ MTTD - Mean Time To Detect │
│ ✓ MTBF - Mean Time Between Failures │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
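As a rough sketch, these metrics can be computed directly from incident timestamps. The record format below (start, detected, resolved tuples) is an assumption for illustration, not a standard schema:

```python
from datetime import datetime

# Hypothetical incident records: (issue start, detected, resolved)
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 2), datetime(2024, 1, 8, 14, 20)),
]

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = mean_minutes([d - s for s, d, _ in incidents])   # start -> detection
mttr = mean_minutes([r - s for s, _, r in incidents])   # start -> recovery
starts = sorted(s for s, _, _ in incidents)
mtbf = mean_minutes([b - a for a, b in zip(starts, starts[1:])])  # gap between incidents

print(mttd, mttr, mtbf)  # 3.0 25.0 10320.0
```

In practice these figures usually come straight from your alerting platform, but the arithmetic is the same.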
┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Severity Levels │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ SEV1 - Critical │ Complete outage affecting all users │
│ │ - Response: < 15 min │
│ │ - Examples: Complete service down, data loss │
│ └─────────────────┘ │
│ │
│ ┌─────────────────┐ │
│ │ SEV2 - High │ Major feature unavailable │
│ │ - Response: < 30 min │
│ │ - Examples: Payment system down, login failure │
│ └─────────────────┘ │
│ │
│ ┌─────────────────┐ │
│ │ SEV3 - Medium │ Partial functionality affected │
│ │ - Response: < 2 hours │
│ │ - Examples: Slow performance, intermittent errors │
│ └─────────────────┘ │
│ │
│ ┌─────────────────┐ │
│ │ SEV4 - Low │ Minor issues, workarounds available │
│ │ - Response: < 24 hours │
│ │ - Examples: UI bugs, documentation errors │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
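The severity table above maps directly to response-time SLAs, which makes it easy to check compliance in code. A minimal sketch (the mapping mirrors the table; the function name is our own):

```python
# Severity level -> response-time SLA in minutes, per the table above
RESPONSE_SLA_MINUTES = {"SEV1": 15, "SEV2": 30, "SEV3": 120, "SEV4": 1440}

def sla_breached(severity: str, minutes_to_respond: float) -> bool:
    """True if the response time exceeded the SLA for this severity."""
    return minutes_to_respond > RESPONSE_SLA_MINUTES[severity]

print(sla_breached("SEV1", 20))  # True: 20 min > 15 min SLA
print(sla_breached("SEV3", 90))  # False: within the 2-hour window
```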
┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Response Workflow │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Detection & Triage │ │
│ │ │ │
│ │ - Monitor detects issue (alert) │ │
│ │ - On-call engineer acknowledges │ │
│ │ - Initial assessment and severity assignment │ │
│ │ - Create incident ticket │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 2. Response & Communication │ │
│ │ │ │
│ │ - Notify stakeholders │ │
│ │ - Create incident channel (Slack/Teams) │ │
│ │ - Begin root cause analysis │ │
│ │ - Regular status updates │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 3. Resolution │ │
│ │ │ │
│ │ - Implement fix │ │
│ │ - Verify resolution │ │
│ │ - Monitor for recurrence │ │
│ │ - Confirm service restoration │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 4. Post-Incident │ │
│ │ │ │
│ │ - Schedule post-mortem │ │
│ │ - Document lessons learned │ │
│ │ - Update runbooks │ │
│ │ - Implement improvements │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
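The four workflow stages above can be modeled as a tiny state machine over an incident record. This is only a sketch of the idea; field names are illustrative, not any particular incident tool's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The four workflow stages above, in order
STAGES = ["triage", "response", "resolution", "post-incident"]

@dataclass
class Incident:
    title: str
    severity: str
    stage: str = STAGES[0]
    timeline: list = field(default_factory=list)

    def advance(self) -> str:
        """Move to the next stage and record when the transition happened."""
        i = STAGES.index(self.stage)
        if i < len(STAGES) - 1:
            self.stage = STAGES[i + 1]
        self.timeline.append((datetime.now(timezone.utc), self.stage))
        return self.stage

inc = Incident("Payment API errors", "SEV2")
inc.advance()  # -> 'response'
inc.advance()  # -> 'resolution'
```

The timestamped timeline doubles as the raw material for the post-mortem's "Timeline (UTC)" section.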
# runbook-database-connection.md
# Database Connection Issues
## Symptoms
- High connection count
- Connection timeouts
- "Too many connections" errors
## Diagnosis
1. Check current connections:

    ```bash
    psql -h $DB_HOST -U $DB_USER -c "SELECT count(*) FROM pg_stat_activity;"
    ```

2. Check active queries:

    ```bash
    psql -h $DB_HOST -U $DB_USER -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"
    ```

3. Check connection limits:

    ```bash
    psql -h $DB_HOST -U $DB_USER -c "SHOW max_connections;"
    ```

## Mitigation
1. Identify long-running queries:

    ```bash
    psql -h $DB_HOST -U $DB_USER -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;"
    ```

2. Terminate problematic queries (if safe), substituting the offending backend's pid:

    ```bash
    psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<pid>);"
    ```

3. Scale the database connection pool
4. Check for application connection leaks
5. Restart application pods (if needed)
6. Increase connection pool size temporarily
## Escalation
- SEV1: @oncall-platform
- SEV2: @platform-team
## Post-Mortem Template
```markdown
# Post-Mortem: [Incident Title]
## Summary
Brief description of what happened and impact.
## Impact
- Users affected: [Number]
- Duration: [Start time] - [End time]
- Severity: [SEV level]
## Timeline (UTC)
- [Time] - Alert triggered
- [Time] - Incident acknowledged
- [Time] - Root cause identified
- [Time] - Fix deployed
- [Time] - Incident resolved
## Root Cause
Detailed explanation of what caused the incident.
## What Went Well
- Detection worked quickly
- Team responded effectively
## What Could Be Improved
- Alert thresholds too sensitive
- Documentation outdated
## Action Items
- [ ] Update alert threshold - @john - due: 2024-01-15
- [ ] Update runbook - @jane - due: 2024-01-20
- [ ] Implement circuit breaker - @team - due: 2024-02-01
## Lessons Learned
Key takeaways from this incident.
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Management Tools │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PagerDuty: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ - On-call scheduling │ │
│ │ - Alert routing │ │
│ │ - Incident response │ │
│ │ - Analytics │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ Opsgenie: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ - Alert management │ │
│ │ - On-call schedules │ │
│ │ - Integration with many tools │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ VictorOps: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ - Incident lifecycle management │ │
│ │ - On-call rotation │ │
│ │ - ChatOps integration │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ FireHydrant: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ - Incident management platform │ │
│ │ - Post-mortem automation │ │
│ │ - Status pages │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
# status-page-template.md
# Status Page Components
## Components
- API Service: Operational
- Web Application: Operational
- Mobile App: Operational
- Payment Processing: Operational
- Database: Operational
## Current Status
🟢 All Systems Operational
## Past Incidents
### [Date] - Scheduled Maintenance
**Status:** Completed
**Duration:** 30 minutes
**Description:** Database upgrades completed successfully.
### [Date] - Payment Processing Issue
**Status:** Resolved
**Duration:** 2 hours 15 minutes
**Impact:** Some users experienced payment failures
**Root Cause:** Database connection pool exhaustion
**Resolution:** Applied fix and monitoring improvements
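A status page like the one above is typically driven by a machine-readable payload. As a sketch, the component list can be serialized to JSON; the field names here are assumptions, not a specific status-page provider's API:

```python
import json

# Component states from the status page template above
components = {
    "API Service": "operational",
    "Web Application": "operational",
    "Mobile App": "operational",
    "Payment Processing": "operational",
    "Database": "operational",
}

# Overall status is "all systems operational" only if every component is
overall = ("all_systems_operational"
           if all(s == "operational" for s in components.values())
           else "degraded")

payload = json.dumps({"status": overall, "components": components}, indent=2)
print(payload)
```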
# incident-response-automation.yaml

```yaml
# Illustrative pipeline: the `alert` trigger and the classifier/status-page
# actions are placeholders for your own integrations, not published actions.
name: Auto-Response Pipeline
on:
  alert:
    types: [triggered]
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - name: Classify Alert
        id: classify
        uses: actions/alert-classifier@v1
        with:
          alert: ${{ toJson(github.event.alert) }}
      - name: Create Slack Channel
        if: steps.classify.outputs.severity == 'high'
        uses: peter-evans/create-or-update-comment@v1
        with:
          channel-id: ${{ secrets.INCIDENT_CHANNEL }}
          message: "New incident triggered: ${{ steps.classify.outputs.title }}"
      - name: Page On-Call
        if: steps.classify.outputs.severity == 'critical'
        uses: pagerduty/pagerduty-create-incident@v1
        with:
          service-id: ${{ secrets.PAGERDUTY_SERVICE_ID }}
          title: ${{ steps.classify.outputs.title }}
          urgency: high
      - name: Update Status Page
        if: steps.classify.outputs.severity != 'low'
        uses: burnett01/statuspage-update@v1
        with:
          status: investigating
          message: ${{ steps.classify.outputs.title }}
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Metrics │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Response Metrics: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ MTTD - Mean Time To Detect │ │
│ │ - Time from issue start to alert trigger │ │
│ │ - Target: < 5 minutes │ │
│ │ │ │
│ │ MTTA - Mean Time To Acknowledge │ │
│ │ - Time from alert to engineer acknowledging │ │
│ │ - Target: < 5 minutes │ │
│ │ │ │
│ │ MTTR - Mean Time To Resolve │ │
│ │ - Time from alert to service restoration │ │
│ │ - Target: < 30 minutes for SEV1 │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ Quality Metrics: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ MTBF - Mean Time Between Failures │ │
│ │ - Average time between incidents │ │
│ │ │ │
│ │ False Positive Rate │ │
│ │ - Percentage of alerts that aren't real issues │ │
│ │ │ │
│ │ Repeat Incident Rate │ │
│ │ - Incidents caused by same issue within 30 days │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
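The quality metrics above fall out of a simple alert/incident log. A sketch under an assumed schema (the `real_issue` flag and `cause` field are our own labels):

```python
from datetime import date

# False positive rate: share of alerts that weren't real issues
alerts = [
    {"id": 1, "real_issue": True},
    {"id": 2, "real_issue": False},
    {"id": 3, "real_issue": True},
    {"id": 4, "real_issue": False},
]
false_positive_rate = sum(not a["real_issue"] for a in alerts) / len(alerts)

# Repeat incident rate: same root cause recurring within 30 days
incidents = [
    {"cause": "pool-exhaustion", "date": date(2024, 1, 5)},
    {"cause": "pool-exhaustion", "date": date(2024, 1, 20)},  # repeat (15 days later)
    {"cause": "disk-full", "date": date(2024, 2, 1)},
]

def is_repeat(inc, earlier):
    """True if an earlier incident shares this root cause within 30 days."""
    return any(e["cause"] == inc["cause"] and (inc["date"] - e["date"]).days <= 30
               for e in earlier)

repeats = sum(is_repeat(inc, incidents[:i]) for i, inc in enumerate(incidents))
repeat_rate = repeats / len(incidents)
print(false_positive_rate, repeat_rate)  # 0.5 0.3333333333333333
```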
┌─────────────────────────────────────────────────────────────────────────────┐
│ On-Call Best Practices │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Clear Escalation Paths │
│ - Define who to contact at each level │
│ - Document in runbooks │
│ │
│ 2. Reasonable Rotation │
│ - Rotate on-call weekly │
│ - Avoid handovers on Mondays/Fridays │
│ - Allow for time off after on-call │
│ │
│ 3. Alert Quality │
│ - Tune alerts to reduce noise │
│ - Use composite alerts │
│ - Set appropriate thresholds │
│ │
│ 4. Tools and Access │
│ - Ensure VPN/access from anywhere │
│ - Mobile-friendly dashboards │
│ - Quick access to runbooks │
│ │
│ 5. Post-On-Call Review │
│ - Review any incidents handled │
│ - Update documentation │
│ - Rest and recover │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
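The rotation guidance above (weekly shifts, no Monday/Friday handovers) can be sketched as a schedule generator that hands over midweek. Engineer names here are placeholders:

```python
from datetime import date, timedelta

def weekly_rotation(engineers, first_wednesday, weeks):
    """Yield (engineer, shift_start) pairs, one per week, handing over on Wednesdays."""
    assert first_wednesday.weekday() == 2, "start the rotation on a Wednesday"
    for w in range(weeks):
        yield engineers[w % len(engineers)], first_wednesday + timedelta(weeks=w)

# Jan 3, 2024 is a Wednesday; three engineers rotate over four weeks
schedule = list(weekly_rotation(["alice", "bob", "carol"], date(2024, 1, 3), 4))
print(schedule[0])  # ('alice', datetime.date(2024, 1, 3))
print(schedule[3])  # ('alice', datetime.date(2024, 1, 24))
```

A real rotation also needs overrides for time off, which is where tools like PagerDuty or Opsgenie earn their keep.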

In this chapter, you learned:

  • Incident Lifecycle: Detection, response, resolution, learning
  • Severity Levels: SEV1-4 classification
  • Response Process: Triage, communication, fix, post-mortem
  • Runbooks: Documentation and automation
  • Post-Mortems: Template and best practices
  • Tools: PagerDuty, Opsgenie, status pages
  • Metrics: MTTD, MTTR, MTBF
  • On-Call: Best practices for on-call engineers

You have completed the DevOps Tools Complete Guide. Here’s what was covered:

  • Docker fundamentals and architecture
  • Images, containers, networking
  • Dockerfiles, Docker Compose
  • Best practices
  • Kubernetes architecture
  • Pods, Deployments, Services
  • ConfigMaps, Secrets, Storage
  • Advanced topics (Ingress, HPA, Helm)
  • Infrastructure as Code concepts
  • Terraform basics, variables, state
  • Modules and best practices
  • CloudFormation
  • Configuration management
  • Playbooks, roles, inventory
  • Advanced features
  • CI/CD pipeline concepts
  • Jenkins, GitHub Actions, GitLab CI
  • GitOps, Platform Engineering
  • SRE, Observability
  • Chaos Engineering, Service Mesh
  • Container Security, Incident Management