
Chapter 99: Incident Management

Overview

Incident management is a core discipline in DevOps and SRE practice: responding to service disruptions efficiently, minimizing their impact, and preventing recurrence. This chapter covers severity levels, the incident response process, on-call best practices, communication during incidents, and post-mortems, with the aim of building a culture of continuous improvement. A solid grasp of incident management is essential for SRE roles, and being able to discuss it concretely in interviews signals operational maturity.

99.1 Severity Levels

Severity Classification

┌─────────────────────────────────────────────────────────────────────────┐
│                     INCIDENT SEVERITY LEVELS                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │  SEV1 - CRITICAL                                                │   │
│   │  ┌─────────────────────────────────────────────────────────┐   │   │
│   │  │ • Complete service outage                                │   │   │
│   │  │ • All customers affected                                │   │   │
│   │  │ • Revenue impact                                       │   │   │
│   │  │ • Data loss potential                                  │   │   │
│   │  │ • Response: Immediate, all hands on deck               │   │   │
│   │  │ • Target resolution: < 1 hour                         │   │   │
│   │  │ • Communication: Executive and customer alerts         │   │   │
│   │  └─────────────────────────────────────────────────────────┘   │   │
│   │                                                                  │   │
│   │  SEV2 - HIGH                                                    │   │
│   │  ┌─────────────────────────────────────────────────────────┐   │   │
│   │  │ • Major feature unavailable                            │   │   │
│   │  │ • Significant customer impact                          │   │   │
│   │  │ • Critical functionality impaired                      │   │   │
│   │  │ • Response: Urgent, senior engineers                  │   │   │
│   │  │ • Target resolution: < 4 hours                        │   │   │
│   │  │ • Communication: Status page update                   │   │   │
│   │  └─────────────────────────────────────────────────────────┘   │   │
│   │                                                                  │   │
│   │  SEV3 - MEDIUM                                                  │   │
│   │  ┌─────────────────────────────────────────────────────────┐   │   │
│   │  │ • Moderate impact                                     │   │   │
│   │  │ • Workaround available                                │   │   │
│   │  │ • Secondary features affected                         │   │   │
│   │  │ • Response: During business hours                     │   │   │
│   │  │ • Target resolution: < 24 hours                      │   │   │
│   │  │ • Communication: Team notification                   │   │   │
│   │  └─────────────────────────────────────────────────────────┘   │   │
│   │                                                                  │   │
│   │  SEV4 - LOW                                                     │   │
│   │  ┌─────────────────────────────────────────────────────────┐   │   │
│   │  │ • Minor issue                                         │   │   │
│   │  │ • Cosmetic or documentation issues                     │   │   │
│   │  │ • Can wait for next business day                       │   │   │
│   │  │ • Response: Schedule for next sprint                   │   │   │
│   │  │ • Target resolution: < 1 week                         │   │   │
│   │  │ • Communication: Ticket tracking only                  │   │   │
│   │  └─────────────────────────────────────────────────────────┘   │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│   Severity Determination Factors:                                       │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │  • Customer impact (how many affected)                         │   │
│   │  • Revenue impact                                             │   │
│   │  • Data integrity/safety                                     │   │
│   │  • Recovery complexity                                        │   │
│   │  • Time to implement workaround                               │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
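The determination factors above can be turned into a first-pass classifier. The following is a minimal sketch, not a standard: the `ImpactAssessment` fields and the percentage thresholds (90 and 25) are assumptions chosen for illustration, while the resolution targets are copied from the severity table.

```python
from dataclasses import dataclass
from datetime import timedelta

# Resolution targets taken from the severity table above.
RESOLUTION_TARGETS = {
    "SEV1": timedelta(hours=1),
    "SEV2": timedelta(hours=4),
    "SEV3": timedelta(hours=24),
    "SEV4": timedelta(weeks=1),
}


@dataclass
class ImpactAssessment:
    """Hypothetical inputs that loosely mirror the determination factors."""
    customers_affected_pct: float  # 0-100, share of customers affected
    core_function_down: bool       # major feature or whole service unavailable
    data_at_risk: bool             # potential data loss or integrity problem
    workaround_available: bool     # customers can keep working another way


def classify_severity(a: ImpactAssessment) -> str:
    """Map an assessment to SEV1-SEV4 (thresholds are illustrative assumptions)."""
    if a.data_at_risk or (a.core_function_down and a.customers_affected_pct >= 90):
        return "SEV1"
    if a.core_function_down or a.customers_affected_pct >= 25:
        return "SEV2"
    if a.customers_affected_pct > 0:
        # Without a workaround, err on the side of the higher severity.
        return "SEV3" if a.workaround_available else "SEV2"
    return "SEV4"


if __name__ == "__main__":
    outage = ImpactAssessment(100, True, False, False)
    sev = classify_severity(outage)
    print(sev, "target resolution:", RESOLUTION_TARGETS[sev])
```

A rule-based function like this is only a starting point for triage; the responder on the scene should always be able to raise or lower the severity.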

99.2 Incident Response Process

Response Workflow

┌─────────────────────────────────────────────────────────────────────────┐
│                    INCIDENT RESPONSE WORKFLOW                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                     DETECTION                                    │   │
│   │  ┌─────────────────────────────────────────────────────────┐   │   │
│   │  │ • Alert from monitoring                                  │   │   │
│   │  │ • Customer report                                        │   │   │
│   │  │ • Automated detection                                    │   │   │
│   │  └─────────────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────┬───────────────────────────────────────┘   │
│                                 │                                        │
│                                 ▼                                        │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                    TRIAGE                                        │   │
│   │  ┌─────────────────────────────────────────────────────────┐   │   │
│   │  │ • Assess severity                                      │   │   │
│   │  │ • Identify affected systems                            │   │   │
│   │  │ • Assign incident commander                            │   │   │
│   │  │ • Initial communication                                │   │   │
│   │  └─────────────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────┬───────────────────────────────────────┘   │
│                                 │                                        │
│                                 ▼                                        │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                    MITIGATION                                    │   │
│   │  ┌─────────────────────────────────────────────────────────┐   │   │
│   │  │ • Immediate actions to reduce impact                    │   │   │
│   │  │ • Apply workarounds                                    │   │   │
│   │  │ • Customer communication                                │   │   │
│   │  │ • Regular status updates                               │   │   │
│   │  └─────────────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────┬───────────────────────────────────────┘   │
│                                 │                                        │
│                                 ▼                                        │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                    RESOLUTION                                    │   │
│   │  ┌─────────────────────────────────────────────────────────┐   │   │
│   │  │ • Fix root cause                                       │   │   │
│   │  │ • Verify resolution                                    │   │   │
│   │  │ • Final customer communication                          │   │   │
│   │  │ • Close incident                                       │   │   │
│   │  └─────────────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────┬───────────────────────────────────────┘   │
│                                 │                                        │
│                                 ▼                                        │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                    POST-INCIDENT                                 │   │
│   │  ┌─────────────────────────────────────────────────────────┐   │   │
│   │  │ • Schedule post-mortem                                 │   │   │
│   │  │ • Document timeline                                    │   │   │
│   │  │ • Identify action items                                │   │   │
│   │  │ • Implement improvements                                │   │   │
│   │  └─────────────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
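The workflow above is effectively a linear state machine. As a minimal sketch, assuming an incident moves through the five phases strictly in order (real tooling often allows moving back, for example from Resolution to Mitigation after a failed fix), it could be modeled like this. The `Incident` class and its method names are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    DETECTION = 1
    TRIAGE = 2
    MITIGATION = 3
    RESOLUTION = 4
    POST_INCIDENT = 5


class Incident:
    """Tracks which workflow phase an incident is in (illustrative only)."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECTION
        self.history = [Phase.DETECTION]

    def advance(self) -> Phase:
        """Move to the next phase; refuse to go past the final one."""
        if self.phase is Phase.POST_INCIDENT:
            raise ValueError("incident is already in its final phase")
        self.phase = Phase(self.phase.value + 1)
        self.history.append(self.phase)
        return self.phase


if __name__ == "__main__":
    incident = Incident("Checkout API returning errors")
    while incident.phase is not Phase.POST_INCIDENT:
        print(incident.advance().name)
```

Keeping the phase history makes it straightforward to reconstruct a timeline later for the post-mortem.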

99.3 On-Call Best Practices

On-Call Responsibilities

# ============================================================
# ON-CALL BEST PRACTICES
# ============================================================

# Preparation
- Review active issues before shift
- Test alerting and escalation
- Know escalation contacts
- Review runbooks
- Ensure access to all tools
- Keep a backup device charged

# During On-Call
- Acknowledge alerts quickly (< 15 min)
- Be present and responsive
- Update status page proactively
- Document actions taken
- Hand over to the next shift

# Handover Process
- Document current issues
- Share context for ongoing work
- Highlight items to watch
- Confirm the next on-call engineer has acknowledged the handover

# Self-Care
- Get adequate sleep
- Take regular breaks
- Escalate when needed
- Use protected time after on-call
- Limit on-call frequency

# Escalation Path
- L1: On-call engineer
- L2: Senior engineer/team lead
- L3: Engineering manager
- L4: VP/Director
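To illustrate how the escalation path might be automated, the sketch below picks who to page based on how long an alert has gone unacknowledged. The 15-minute step reuses the acknowledgement target listed above; applying that same timeout at every level is an assumption, as are the function names.

```python
# Escalation path as listed above, lowest level first.
ESCALATION_PATH = [
    "On-call engineer",
    "Senior engineer/team lead",
    "Engineering manager",
    "VP/Director",
]

# Assumption: escalate one level for every 15 minutes without acknowledgement.
ACK_TIMEOUT_MIN = 15


def escalation_level(minutes_unacknowledged: int) -> int:
    """Return the 1-based escalation level (L1-L4) to page."""
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    level = minutes_unacknowledged // ACK_TIMEOUT_MIN + 1
    return min(level, len(ESCALATION_PATH))  # never go past the top level


def who_to_page(minutes_unacknowledged: int) -> str:
    """Return the role to page for an alert unacknowledged this long."""
    return ESCALATION_PATH[escalation_level(minutes_unacknowledged) - 1]


if __name__ == "__main__":
    for minutes in (0, 15, 30, 45):
        print(f"{minutes:>2} min -> L{escalation_level(minutes)}: {who_to_page(minutes)}")
```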

Incident Commander Role

┌─────────────────────────────────────────────────────────────────────────┐
│                   INCIDENT COMMANDER (IC) ROLE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Primary Responsibilities:                                            │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                                                                  │   │
│   │  1. COORDINATE RESPONSE                                         │   │
│   │     • Lead incident response efforts                            │   │
│   │     • Assign roles to responders                                │   │
│   │     • Ensure proper resources are engaged                      │   │
│   │                                                                  │   │
│   │  2. COMMUNICATE                                                 │   │
│   │     • Provide status updates                                     │   │
│   │     • Coordinate with stakeholders                              │   │
│   │     • Update status page                                        │   │
│   │     • Brief executive team                                       │   │
│   │                                                                  │   │
│   │  3. DECIDE                                                       │   │
│   │     • Make critical decisions                                    │   │
│   │     • Authorize emergency changes                               │   │
│   │     • Decide on escalation                                      │   │
│   │     • Determine when to declare incident closed                 │   │
│   │                                                                  │   │
│   │  4. DOCUMENT                                                     │   │
│   │     • Maintain incident timeline                                │   │
│   │     • Track actions taken                                       │   │
│   │     • Ensure post-mortem will be done                          │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│   IC Should NOT:                                                       │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │  • Try to fix the issue personally (delegate)                  │   │
│   │  • Ignore communication (prioritize it)                         │   │
│   │  • Hesitate to escalate                                        │   │
│   │  • Make decisions without input (consult team)                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

99.4 Communication During Incidents

Communication Templates

# Initial Alert (SEV1/2)
**Incident**: [Brief title]
**Severity**: [SEV1 or SEV2]
**Status**: Investigating
**Impact**: [Customer impact description]
**Commander**: [Name]
**Updates**: Will post every [X] minutes

# Status Update
**Incident**: [Title]
**Status**: In Progress - Mitigating
**Update**: [What we're doing]
**Next Update**: [Time]
**ETA**: [If known]

# Resolution
**Incident**: [Title]
**Status**: Resolved
**Duration**: [X] hours
**Root Cause**: [Brief]
**Action Items**: [Link to ticket]
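Filling in templates by hand during an incident is error-prone, so teams often generate them. The following is a small sketch that renders the Status Update template above from structured data. The `StatusUpdate` class and `render_status_update` function are hypothetical names, and the ETA line is omitted when no ETA is known.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusUpdate:
    """Fields of the Status Update template (names are illustrative)."""
    title: str
    status: str
    update: str
    next_update: str
    eta: Optional[str] = None  # leave as None when no ETA is known


def render_status_update(u: StatusUpdate) -> str:
    """Render the update in the same layout as the template above."""
    lines = [
        f"**Incident**: {u.title}",
        f"**Status**: {u.status}",
        f"**Update**: {u.update}",
        f"**Next Update**: {u.next_update}",
    ]
    if u.eta:
        lines.append(f"**ETA**: {u.eta}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_status_update(StatusUpdate(
        title="Checkout API errors",
        status="In Progress - Mitigating",
        update="Rolling back the latest deployment",
        next_update="14:30 UTC",
    )))
```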

99.5 Post-Mortem Template

Comprehensive Post-Mortem

# Incident Post-Mortem

## Metadata
- **Incident ID**: INC-2024-001
- **Date**: January 15, 2024
- **Duration**: 2 hours 34 minutes
- **Severity**: SEV1
- **Author**: [Name]
- **Contributors**: [Names]

## Summary
[Brief paragraph describing what happened]

## Impact
- Customers affected: [Number or percentage]
- Revenue impact: [If applicable]
- Data impact: [If any]
- Internal impact: [Team downtime, etc.]

## Root Cause
[Detailed explanation of why the incident occurred]
[Technical details of the failure]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 10:00 | Alert triggered |
| 10:05 | On-call acknowledged |
| 10:15 | SEV1 declared |
| 10:30 | Root cause identified |
| 11:45 | Mitigation applied |
| 12:30 | Service restored |
| 12:34 | Incident closed |

## Detection
- How was the incident detected?
- Time from failure to detection: [X] minutes

## Response
- Time to acknowledge: [X] minutes
- Time to mitigate: [X] minutes
- Time to resolve: [X] minutes

## What Went Well
- [Item 1]
- [Item 2]

## What Could Be Improved
- [Item 1]
- [Item 2]

## Action Items
| ID | Description | Owner | Due Date |
|----|-------------|-------|----------|
| 1 | Fix root cause | @name | Jan 30 |
| 2 | Add monitoring | @name | Feb 5 |
| 3 | Update runbook | @name | Feb 10 |

## Lessons Learned
[Any broader learnings for the team]
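The Response metrics in the template can be derived directly from the timeline. Here is a minimal sketch using the sample timeline above. It assumes all events fall on the same day, and that "time to resolve" is measured to service restoration rather than to incident closure; both are assumptions, and the dictionary keys are illustrative.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


def response_metrics(timeline: dict) -> dict:
    """Compute response metrics, all measured from the first alert."""
    start = timeline["alert"]
    return {
        "time_to_acknowledge_min": minutes_between(start, timeline["acknowledged"]),
        "time_to_mitigate_min": minutes_between(start, timeline["mitigated"]),
        "time_to_resolve_min": minutes_between(start, timeline["restored"]),
        "total_duration_min": minutes_between(start, timeline["closed"]),
    }


if __name__ == "__main__":
    # Sample timeline from the template above (UTC).
    sample = {
        "alert": "10:00",
        "acknowledged": "10:05",
        "mitigated": "11:45",
        "restored": "12:30",
        "closed": "12:34",
    }
    for name, value in response_metrics(sample).items():
        print(f"{name}: {value}")
```

For the sample timeline, the total duration comes out to 154 minutes, which matches the 2 hours 34 minutes recorded in the Metadata section.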

99.6 Interview Questions

Incident Management Interview Questions

Q1: What is the incident management process?

A1:
- Detection → Triage → Mitigation → Resolution → Post-Incident
- Severity classification based on impact
- Clear roles: IC, Scribe, Tech Lead
- Communication throughout
- Post-mortem after resolution

Q2: How do you determine incident severity?

A2:
- Customer impact (how many affected)
- Revenue/data impact
- Business criticality
- Time to recovery expectation
- Workaround availability

Q3: What is the role of an Incident Commander?

A3:
- Coordinates response
- Communicates status
- Makes decisions
- Documents timeline
- NOT fixing issue personally, but coordinating

Q4: How do you handle being on-call?

A4:
- Prepare before shift
- Acknowledge quickly
- Document actions
- Escalate when needed
- Proper handover
- Self-care afterward

Q5: What should a post-mortem include?

A5:
- What happened (summary)
- Impact (customers, revenue)
- Root cause analysis
- Timeline
- What went well
- What could improve
- Action items with owners
- Blameless focus

Q6: How do you reduce alert fatigue?

A6:
- Tune alert thresholds
- Remove duplicate alerts
- Create compound alerts
- Ensure alerts are actionable
- Regular alert reviews
- Automate where possible

Q7: What is the difference between incident and problem management?

A7:
- Incident: Short-term response to restore service
- Problem: Long-term fix to prevent recurrence
- Incidents are symptoms, problems are root causes
- Problem management often involves multiple incidents

Q8: How do you communicate during an incident?

A8:
- Status page updates (for customers)
- Internal chat channel (for responders)
- Regular updates (every X minutes)
- Executive briefings (for SEV1/2)
- Clear, concise, factual
- Set expectations for next update

Q9: What is a "war room" and when do you use one?

A9:
- Virtual or physical room for incident response
- All responders join
- Used for major incidents (SEV1)
- Enables real-time collaboration
- IC coordinates from war room
- Reduces communication overhead

Q10: How do you prevent on-call burnout?

A10:
- Fair rotation schedules
- Protected time after on-call
- Limit consecutive on-call shifts
- Clear escalation paths
- Good runbooks to reduce manual work
- Manager support
- Self-care and boundaries
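One of the alert-fatigue answers in Q6, removing duplicate alerts, lends itself to a short illustration. The sketch below suppresses repeats of the same alert for the same service within a time window. The `Alert` fields and the 5-minute default window are assumptions; production alerting tools typically offer their own grouping and inhibition settings.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    """A simplified alert record (fields are illustrative)."""
    service: str
    name: str
    timestamp: float  # seconds since some reference point


def deduplicate(alerts: List[Alert], window_s: float = 300.0) -> List[Alert]:
    """Keep an alert only if the same (service, name) pair was not
    already kept within the previous `window_s` seconds."""
    last_kept: Dict[Tuple[str, str], float] = {}
    kept: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.name)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp - previous >= window_s:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


if __name__ == "__main__":
    raw = [
        Alert("api", "HighLatency", 0),
        Alert("api", "HighLatency", 60),   # duplicate, suppressed
        Alert("db", "DiskFull", 30),
        Alert("api", "HighLatency", 400),  # outside the window, kept
    ]
    for alert in deduplicate(raw):
        print(alert)
```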

Summary

Severity: SEV1-4 based on customer and business impact
Process: Detect → Triage → Mitigate → Resolve → Post-Incident
Roles: Incident Commander, Tech Lead, Scribe
Communication: Regular updates, status page
Post-Mortem: Blameless, action items, continuous improvement

Next Chapter

Chapter 100: Career Development

Last Updated: February 2026