
Incident Management

Incident management is a core discipline in DevOps and SRE practice: responding to service disruptions quickly, minimizing their impact, and preventing recurrence. This chapter covers incident response processes, severity levels, on-call best practices, communication strategies, post-mortems, and building a culture of continuous improvement. A solid grasp of incident management is essential for SRE roles, and discussing it fluently in interviews demonstrates operational maturity.


┌─────────────────────────────────────────────────────────────────────────┐
│ INCIDENT SEVERITY LEVELS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ SEV1 - CRITICAL │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ • Complete service outage │ │ │
│ │ │ • All customers affected │ │ │
│ │ │ • Revenue impact │ │ │
│ │ │ • Data loss potential │ │ │
│ │ │ • Response: Immediate, all hands on deck │ │ │
│ │ │ • Target resolution: < 1 hour │ │ │
│ │ │ • Communication: Executive and customer alerts │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ SEV2 - HIGH │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ • Major feature unavailable │ │ │
│ │ │ • Significant customer impact │ │ │
│ │ │ • Critical functionality impaired │ │ │
│ │ │ • Response: Urgent, senior engineers │ │ │
│ │ │ • Target resolution: < 4 hours │ │ │
│ │ │ • Communication: Status page update │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ SEV3 - MEDIUM │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ • Moderate impact │ │ │
│ │ │ • Workaround available │ │ │
│ │ │ • Secondary features affected │ │ │
│ │ │ • Response: During business hours │ │ │
│ │ │ • Target resolution: < 24 hours │ │ │
│ │ │ • Communication: Team notification │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ SEV4 - LOW │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ • Minor issue │ │ │
│ │ │ • Cosmetic or documentation issues │ │ │
│ │ │ • Can wait for next business day │ │ │
│ │ │ • Response: Schedule for next sprint │ │ │
│ │ │ • Target resolution: < 1 week │ │ │
│ │ │ • Communication: Ticket tracking only │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Severity Determination Factors: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Customer impact (how many affected) │ │
│ │ • Revenue impact │ │
│ │ • Data integrity/safety │ │
│ │ • Recovery complexity │ │
│ │ • Time to implement workaround │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
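The severity factors above can be turned into a simple triage helper. The following is a minimal sketch, not a standard: the `ImpactAssessment` fields and every numeric threshold (90%, 25%, 5%) are assumptions chosen for illustration, and each organization should substitute its own. Only the target resolution times come from the table above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Target resolution times in hours, taken from the severity table above.
TARGET_RESOLUTION_HOURS = {
    Severity.SEV1: 1,
    Severity.SEV2: 4,
    Severity.SEV3: 24,
    Severity.SEV4: 168,  # one week
}


@dataclass
class ImpactAssessment:
    """Hypothetical triage inputs; field names are illustrative only."""
    customers_affected_pct: float  # 0-100
    core_feature_down: bool
    data_loss_risk: bool
    workaround_available: bool


def classify(impact: ImpactAssessment) -> Severity:
    """Map an impact assessment to a severity level.

    All thresholds below are assumptions, not industry standards.
    """
    if impact.data_loss_risk or impact.customers_affected_pct >= 90:
        return Severity.SEV1
    if (impact.core_feature_down and not impact.workaround_available) \
            or impact.customers_affected_pct >= 25:
        return Severity.SEV2
    if impact.core_feature_down or impact.customers_affected_pct >= 5:
        return Severity.SEV3
    return Severity.SEV4
```

For example, `classify(ImpactAssessment(10, True, False, True))` yields `Severity.SEV3`, because a workaround exists and fewer than 25% of customers are affected. In practice, severity is a judgment call made at triage; a helper like this is best treated as a starting point that the incident commander can override.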

┌─────────────────────────────────────────────────────────────────────────┐
│ INCIDENT RESPONSE WORKFLOW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ DETECTION │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ • Alert from monitoring │ │ │
│ │ │ • Customer report │ │ │
│ │ │ • Automated detection │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────┬───────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ TRIAGE │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ • Assess severity │ │ │
│ │ │ • Identify affected systems │ │ │
│ │ │ • Assign incident commander │ │ │
│ │ │ • Initial communication │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────┬───────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ MITIGATION │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ • Immediate actions to reduce impact │ │ │
│ │ │ • Apply workarounds │ │ │
│ │ │ • Customer communication │ │ │
│ │ │ • Regular status updates │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────┬───────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ RESOLUTION │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ • Fix root cause │ │ │
│ │ │ • Verify resolution │ │ │
│ │ │ • Final customer communication │ │ │
│ │ │ • Close incident │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────┬───────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ POST-INCIDENT │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ • Schedule post-mortem │ │ │
│ │ │ • Document timeline │ │ │
│ │ │ • Identify action items │ │ │
│ │ │ • Implement improvements │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
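The workflow above is strictly sequential, which makes it easy to model as a small state machine. This is an illustrative sketch only: the `Incident` class and its `advance` method are hypothetical names, and real incident tooling typically allows reopening or moving backward, which this example deliberately omits.

```python
from enum import Enum


class Phase(Enum):
    DETECTION = "detection"
    TRIAGE = "triage"
    MITIGATION = "mitigation"
    RESOLUTION = "resolution"
    POST_INCIDENT = "post-incident"


# Each phase may only advance to the next one in the workflow diagram.
NEXT_PHASE = {
    Phase.DETECTION: Phase.TRIAGE,
    Phase.TRIAGE: Phase.MITIGATION,
    Phase.MITIGATION: Phase.RESOLUTION,
    Phase.RESOLUTION: Phase.POST_INCIDENT,
}


class Incident:
    """Hypothetical incident record that tracks its current phase."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECTION
        self.history = [Phase.DETECTION]

    def advance(self) -> Phase:
        """Move to the next phase, or raise if the workflow is complete."""
        next_phase = NEXT_PHASE.get(self.phase)
        if next_phase is None:
            raise ValueError(f"{self.title!r} is already in the final phase")
        self.phase = next_phase
        self.history.append(next_phase)
        return next_phase
```

Keeping a `history` list alongside the current phase gives the post-mortem author a ready-made record of when each transition happened, provided timestamps are added in a real implementation.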

# ============================================================
# ON-CALL BEST PRACTICES
# ============================================================
# Preparation
- Review active issues before shift
- Test alerting and escalation
- Know escalation contacts
- Review runbooks
- Ensure access to all tools
- Keep a backup device charged
# During On-Call
- Acknowledge alerts quickly (< 15 min)
- Be present and responsive
- Update status page proactively
- Document actions taken
- Hand over to the next shift
# Handover Process
- Document current issues
- Share context for ongoing work
- Highlight items to watch
- Confirm the next on-call engineer has acknowledged the handover
# Self-Care
- Get adequate sleep
- Take regular breaks
- Escalate when needed
- Use protected time after on-call
- Limit on-call frequency
# Escalation Path
- L1: On-call engineer
- L2: Senior engineer/team lead
- L3: Engineering manager
- L4: VP/Director
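
The escalation path above is usually driven by how long an alert has gone unacknowledged. The sketch below assumes a fixed 15-minute step per level, reusing the 15-minute acknowledgement target from the list above; the `escalation_level` function and the step size are illustrative assumptions, and real paging tools let each level have its own timeout.

```python
# Escalation levels copied from the list above.
ESCALATION_PATH = [
    "L1: On-call engineer",
    "L2: Senior engineer/team lead",
    "L3: Engineering manager",
    "L4: VP/Director",
]


def escalation_level(minutes_unacknowledged: int, step_minutes: int = 15) -> str:
    """Return who should be paged after the given unacknowledged time.

    Assumes one escalation step every `step_minutes` (an illustrative
    default), capped at the last level in the path.
    """
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    index = min(minutes_unacknowledged // step_minutes, len(ESCALATION_PATH) - 1)
    return ESCALATION_PATH[index]
```

With the default step, an alert that has been ignored for 20 minutes goes to the senior engineer or team lead, and anything past 45 minutes stays with the VP/Director level.
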
┌─────────────────────────────────────────────────────────────────────────┐
│ INCIDENT COMMANDER (IC) ROLE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Primary Responsibilities: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ 1. COORDINATE RESPONSE │ │
│ │ • Lead incident response efforts │ │
│ │ • Assign roles to responders │ │
│ │ • Ensure proper resources are engaged │ │
│ │ │ │
│ │ 2. COMMUNICATE │ │
│ │ • Provide status updates │ │
│ │ • Coordinate with stakeholders │ │
│ │ • Update status page │ │
│ │ • Brief executive team │ │
│ │ │ │
│ │ 3. DECIDE │ │
│ │ • Make critical decisions │ │
│ │ • Authorize emergency changes │ │
│ │ • Decide on escalation │ │
│ │ • Determine when to declare incident closed │ │
│ │ │ │
│ │ 4. DOCUMENT │ │
│ │ • Maintain incident timeline │ │
│ │ • Track actions taken │ │
│ │ • Ensure a post-mortem is scheduled │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ IC Should NOT: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Try to fix the issue personally (delegate) │ │
│ │ • Ignore communication (prioritize it) │ │
│ │ • Hesitate to escalate │ │
│ │ • Make decisions without input (consult team) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘

# Initial Alert (SEV1/2)
**Incident**: [Brief title]
**Severity**: [SEV1 or SEV2]
**Status**: Investigating
**Impact**: [Customer impact description]
**Commander**: [Name]
**Updates**: Will post every [X] minutes
# Status Update
**Incident**: [Title]
**Status**: In Progress - Mitigating
**Update**: [What we're doing]
**Next Update**: [Time]
**ETA**: [If known]
# Resolution
**Incident**: [Title]
**Status**: Resolved
**Duration**: [X] hours
**Root Cause**: [Brief]
**Action Items**: [Link to ticket]
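
Filling in these templates by hand under pressure is error-prone, so teams often generate them. The following is a minimal sketch that renders the status update template above; `StatusUpdate` and `render_status_update` are hypothetical names, and the default ETA of "Unknown" is an assumption for cases where no estimate exists yet.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    """Fields mirror the Status Update template above."""
    title: str
    status: str
    update: str
    next_update: str
    eta: str = "Unknown"  # assumed default when no estimate is available


def render_status_update(msg: StatusUpdate) -> str:
    """Render a status update in the markdown layout of the template."""
    lines = [
        f"**Incident**: {msg.title}",
        f"**Status**: {msg.status}",
        f"**Update**: {msg.update}",
        f"**Next Update**: {msg.next_update}",
        f"**ETA**: {msg.eta}",
    ]
    return "\n".join(lines)
```

A call such as `render_status_update(StatusUpdate("Checkout errors", "In Progress - Mitigating", "Rolling back the last deploy", "14:30 UTC"))` produces a five-line message that can be pasted into a chat channel or status page.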

# Incident Post-Mortem
## Metadata
- **Incident ID**: INC-2024-001
- **Date**: January 15, 2024
- **Duration**: 2 hours 34 minutes
- **Severity**: SEV1
- **Author**: [Name]
- **Contributors**: [Names]
## Summary
[Brief paragraph describing what happened]
## Impact
- Customers affected: [Number or percentage]
- Revenue impact: [If applicable]
- Data impact: [If any]
- Internal impact: [Team downtime, etc.]
## Root Cause
[Detailed explanation of why the incident occurred]
[Technical details of the failure]
## Timeline (UTC)
| Time | Event |
|------|-------|
| 10:00 | Alert triggered |
| 10:05 | On-call acknowledged |
| 10:15 | SEV1 declared |
| 10:30 | Root cause identified |
| 11:45 | Mitigation applied |
| 12:30 | Service restored |
| 12:34 | Incident closed |
## Detection
- How was the incident detected?
- Time from failure to detection: [X] minutes
## Response
- Time to acknowledge: [X] minutes
- Time to mitigate: [X] minutes
- Time to resolve: [X] minutes
## What Went Well
- [Item 1]
- [Item 2]
## What Could Be Improved
- [Item 1]
- [Item 2]
## Action Items
| ID | Description | Owner | Due Date |
|----|-------------|-------|----------|
| 1 | Fix root cause | @name | Jan 30 |
| 2 | Add monitoring | @name | Feb 5 |
| 3 | Update runbook | @name | Feb 10 |
## Lessons Learned
[Any broader learnings for the team]
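
The response metrics in the template can be derived directly from the timeline rather than estimated from memory. The sketch below uses the example timeline above; the function names are illustrative, and it assumes all events fall on the same UTC day and that the clock starts at the first alert (the template's "time from failure to detection" would need the actual failure time, which the example timeline does not record).

```python
from datetime import datetime

# Timeline copied from the example post-mortem above (times in UTC).
TIMELINE = {
    "Alert triggered": "10:00",
    "On-call acknowledged": "10:05",
    "SEV1 declared": "10:15",
    "Root cause identified": "10:30",
    "Mitigation applied": "11:45",
    "Service restored": "12:30",
    "Incident closed": "12:34",
}


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def response_metrics(timeline: dict[str, str]) -> dict[str, int]:
    """Compute response metrics in minutes, measured from the first alert."""
    start = timeline["Alert triggered"]
    return {
        "time_to_acknowledge": minutes_between(start, timeline["On-call acknowledged"]),
        "time_to_mitigate": minutes_between(start, timeline["Mitigation applied"]),
        "time_to_resolve": minutes_between(start, timeline["Service restored"]),
        "total_duration": minutes_between(start, timeline["Incident closed"]),
    }
```

For the example timeline this gives 5 minutes to acknowledge, 105 to mitigate, 150 to resolve, and a total of 154 minutes, which matches the 2 hours 34 minutes recorded in the metadata.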

┌─────────────────────────────────────────────────────────────────────────┐
│ INCIDENT MANAGEMENT INTERVIEW QUESTIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ Q1: What is the incident management process? │
│ A1: │
│ - Detection → Triage → Mitigation → Resolution → Post-Incident │
│ - Severity classification based on impact │
│ - Clear roles: IC, Scribe, Tech Lead │
│ - Communication throughout │
│ - Post-mortem after resolution │
├─────────────────────────────────────────────────────────────────────────┤
│ Q2: How do you determine incident severity? │
│ A2: │
│ - Customer impact (how many affected) │
│ - Revenue/data impact │
│ - Business criticality │
│ - Time to recovery expectation │
│ - Workaround availability │
├─────────────────────────────────────────────────────────────────────────┤
│ Q3: What is the role of an Incident Commander? │
│ A3: │
│ - Coordinates response │
│ - Communicates status │
│ - Makes decisions │
│ - Documents timeline │
│ - NOT fixing issue personally, but coordinating │
├─────────────────────────────────────────────────────────────────────────┤
│ Q4: How do you handle being on-call? │
│ A4: │
│ - Prepare before shift │
│ - Acknowledge quickly │
│ - Document actions │
│ - Escalate when needed │
│ - Proper handover │
│ - Self-care afterward │
├─────────────────────────────────────────────────────────────────────────┤
│ Q5: What should a post-mortem include? │
│ A5: │
│ - What happened (summary) │
│ - Impact (customers, revenue) │
│ - Root cause analysis │
│ - Timeline │
│ - What went well │
│ - What could improve │
│ - Action items with owners │
│ - Blameless focus │
├─────────────────────────────────────────────────────────────────────────┤
│ Q6: How do you reduce alert fatigue? │
│ A6: │
│ - Tune alert thresholds │
│ - Remove duplicate alerts │
│ - Create compound alerts │
│ - Ensure alerts are actionable │
│ - Regular alert reviews │
│ - Automate where possible │
├─────────────────────────────────────────────────────────────────────────┤
│ Q7: What is the difference between incident and problem management? │
│ A7: │
│ - Incident: Short-term response to restore service │
│ - Problem: Long-term fix to prevent recurrence │
│ - Incidents are symptoms, problems are root causes │
│ - Problem management often involves multiple incidents │
├─────────────────────────────────────────────────────────────────────────┤
│ Q8: How do you communicate during an incident? │
│ A8: │
│ - Status page updates (for customers) │
│ - Internal chat channel (for responders) │
│ - Regular updates (every X minutes) │
│ - Executive briefings (for SEV1/2) │
│ - Clear, concise, factual │
│ - Set expectations for next update │
├─────────────────────────────────────────────────────────────────────────┤
│ Q9: What is a "war room" and when do you use one? │
│ A9: │
│ - Virtual or physical room for incident response │
│ - All responders join │
│ - Used for major incidents (SEV1) │
│ - Enables real-time collaboration │
│ - IC coordinates from war room │
│ - Reduces communication overhead │
├─────────────────────────────────────────────────────────────────────────┤
│ Q10: How do you prevent on-call burnout? │
│ A10: │
│ - Fair rotation schedules │
│ - Protected time after on-call │
│ - Limit consecutive on-call shifts │
│ - Clear escalation paths │
│ - Good runbooks to reduce manual work │
│ - Manager support │
│ - Self-care and boundaries │
└─────────────────────────────────────────────────────────────────────────┘
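
Of the alert-fatigue techniques listed under Q6, removing duplicate alerts is the most mechanical and lends itself to a short example. This is a minimal sketch under stated assumptions: the `Alert` fields, the `(service, name)` grouping key, and the 300-second suppression window are all illustrative choices rather than the behavior of any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; field names are illustrative."""
    service: str
    name: str
    timestamp: int  # seconds since epoch


def deduplicate(alerts: list[Alert], window_seconds: int = 300) -> list[Alert]:
    """Drop repeats of the same (service, name) alert inside a time window.

    An alert is kept only if no alert with the same key was kept within
    the previous `window_seconds`. Output is ordered by timestamp.
    """
    last_kept: dict[tuple[str, str], int] = {}
    kept: list[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.name)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept
```

The window is measured from the last alert that was actually delivered, so a continuously firing alert still pages once per window rather than being silenced indefinitely. Compound alerts, which fire only when several conditions hold together, would need a separate rule layer on top of this.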

  • Severity: SEV1-4 based on customer and business impact
  • Process: Detect → Triage → Mitigate → Resolve → Post-Incident
  • Roles: Incident Commander, Tech Lead, Scribe
  • Communication: Regular updates, status page
  • Post-Mortem: Blameless, action items, continuous improvement



Last Updated: February 2026