Chapter 99: Incident Management
Overview
Incident management is a critical discipline in DevOps and SRE practices, focusing on responding to service disruptions efficiently while minimizing impact and preventing recurrence. This chapter covers incident response processes, severity levels, on-call best practices, communication strategies, post-mortems, and building a culture of continuous improvement. Understanding incident management is essential for SRE roles and demonstrates operational maturity in interviews.
99.1 Severity Levels
Severity Classification
Section titled “Severity Classification”┌─────────────────────────────────────────────────────────────────────────┐│ INCIDENT SEVERITY LEVELS │├─────────────────────────────────────────────────────────────────────────┤│ ││ ┌─────────────────────────────────────────────────────────────────┐ ││ │ SEV1 - CRITICAL │ ││ │ ┌─────────────────────────────────────────────────────────┐ │ ││ │ │ • Complete service outage │ │ ││ │ │ • All customers affected │ │ ││ │ │ • Revenue impact │ │ ││ │ │ • Data loss potential │ │ ││ │ │ • Response: Immediate, all hands on deck │ │ ││ │ │ • Target resolution: < 1 hour │ │ ││ │ │ • Communication: Executive and customer alerts │ │ ││ │ └─────────────────────────────────────────────────────────┘ │ ││ │ │ ││ │ SEV2 - HIGH │ ││ │ ┌─────────────────────────────────────────────────────────┐ │ ││ │ │ • Major feature unavailable │ │ ││ │ │ • Significant customer impact │ │ ││ │ │ • Critical functionality impaired │ │ ││ │ │ • Response: Urgent, senior engineers │ │ ││ │ │ • Target resolution: < 4 hours │ │ ││ │ │ • Communication: Status page update │ │ ││ │ └─────────────────────────────────────────────────────────┘ │ ││ │ │ ││ │ SEV3 - MEDIUM │ ││ │ ┌─────────────────────────────────────────────────────────┐ │ ││ │ │ • Moderate impact │ │ ││ │ │ • Workaround available │ │ ││ │ │ • Secondary features affected │ │ ││ │ │ • Response: During business hours │ │ ││ │ │ • Target resolution: < 24 hours │ │ ││ │ │ • Communication: Team notification │ │ ││ │ └─────────────────────────────────────────────────────────┘ │ ││ │ │ ││ │ SEV4 - LOW │ ││ │ ┌─────────────────────────────────────────────────────────┐ │ ││ │ │ • Minor issue │ │ ││ │ │ • Cosmetic or documentation issues │ │ ││ │ │ • Can wait for next business day │ │ ││ │ │ • Response: Schedule for next sprint │ │ ││ │ │ • Target resolution: < 1 week │ │ ││ │ │ • Communication: Ticket tracking only │ │ ││ │ └─────────────────────────────────────────────────────────┘ │ ││ │ │ ││ 
Severity determination factors:

- Customer impact (how many affected)
- Revenue impact
- Data integrity/safety
- Recovery complexity
- Time to implement a workaround

99.2 Incident Response Process
Response Workflow
**Detection**
- Alert from monitoring
- Customer report
- Automated detection

**Triage**
- Assess severity
- Identify affected systems
- Assign an incident commander
- Initial communication

**Mitigation**
- Immediate actions to reduce impact
- Apply workarounds
- Customer communication
- Regular status updates

**Resolution**
- Fix the root cause
- Verify the resolution
- Final customer communication
- Close the incident
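The detect → triage → mitigate → resolve flow can be modeled as a small state machine that only moves forward and records each transition, which is roughly what incident tracking tools do. A hedged sketch (the `Incident` class is illustrative; stage names follow the workflow above):

```python
class Incident:
    """Tracks an incident through the workflow stages above."""

    STAGES = ["detection", "triage", "mitigation", "resolution", "post-incident"]

    def __init__(self, title: str):
        self.title = title
        self.stage = "detection"          # every incident starts at detection
        self.timeline: list[tuple[str, str]] = []  # (stage, note) entries

    def advance(self, note: str = "") -> str:
        """Move to the next stage, recording it in the timeline."""
        i = self.STAGES.index(self.stage)
        if i == len(self.STAGES) - 1:
            raise ValueError("incident already in post-incident stage")
        self.stage = self.STAGES[i + 1]
        self.timeline.append((self.stage, note))
        return self.stage

inc = Incident("API outage")
inc.advance("SEV1 declared, IC assigned")  # -> triage
inc.advance("rolled back deploy")          # -> mitigation
```

Forcing transitions through one method gives a timeline for free, which later feeds the post-mortem.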
**Post-Incident**
- Schedule the post-mortem
- Document the timeline
- Identify action items
- Implement improvements

99.3 On-Call Best Practices
On-Call Responsibilities
# Preparation
- Review active issues before the shift
- Test alerting and escalation
- Know escalation contacts
- Review runbooks
- Ensure access to all tools
- Keep a backup device charged

# During On-Call
- Acknowledge alerts quickly (< 15 min)
- Be present and responsive
- Update the status page proactively
- Document actions taken
- Hand over to the next shift

# Handover Process
- Document current issues
- Share context for ongoing work
- Highlight items to watch
- Confirm the next on-call has acknowledged

# Self-Care
- Get adequate sleep
- Take regular breaks
- Escalate when needed
- Use protected time after on-call
- Limit on-call frequency

# Escalation Path
- L1: On-call engineer
- L2: Senior engineer / team lead
- L3: Engineering manager
- L4: VP / Director
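The L1 → L4 escalation path above is typically automated by the paging system: if an alert is not acknowledged within a window, the page moves to the next level. A minimal sketch under stated assumptions (the 15-minute window reuses the acknowledgment guideline above; a real pager lets each level have its own window):

```python
from datetime import datetime, timedelta

# Escalation levels from the path above.
ESCALATION_PATH = [
    ("L1", "on-call engineer"),
    ("L2", "senior engineer / team lead"),
    ("L3", "engineering manager"),
    ("L4", "VP / director"),
]

# "Acknowledge alerts quickly (< 15 min)" -- assumed per-level window.
ACK_WINDOW = timedelta(minutes=15)

def current_escalation_level(alert_time: datetime, now: datetime,
                             acknowledged: bool) -> str:
    """Return which level should be paged, given time since the alert fired."""
    if acknowledged:
        return ESCALATION_PATH[0][0]  # acknowledged: stays with the on-call
    windows_elapsed = int((now - alert_time) / ACK_WINDOW)
    level = min(windows_elapsed, len(ESCALATION_PATH) - 1)
    return ESCALATION_PATH[level][0]
```

So an unacknowledged alert pages L1 immediately, L2 after 15 minutes, and caps out at L4 rather than escalating past the top of the path.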
Incident Commander Role

Primary responsibilities:

**1. Coordinate response**
- Lead incident response efforts
- Assign roles to responders
- Ensure proper resources are engaged

**2. Communicate**
- Provide status updates
- Coordinate with stakeholders
- Update the status page
- Brief the executive team

**3. Decide**
- Make critical decisions
- Authorize emergency changes
- Decide on escalation
- Determine when to declare the incident closed

**4. Document**
- Maintain the incident timeline
- Track actions taken
- Ensure the post-mortem will be done

The IC should NOT:

- Try to fix the issue personally (delegate)
- Ignore communication (prioritize it)
- Hesitate to escalate
- Make decisions without input (consult the team)

99.4 Communication During Incidents
Communication Templates
# Initial Alert (SEV1/2)

**Incident**: [Brief title]
**Severity**: SEV1
**Status**: Investigating
**Impact**: [Customer impact description]
**Commander**: [Name]
**Updates**: Will post every [X] minutes

# Status Update

**Incident**: [Title]
**Status**: In Progress - Mitigating
**Update**: [What we're doing]
**Next Update**: [Time]
**ETA**: [If known]

# Resolution

**Incident**: [Title]
**Status**: Resolved
**Duration**: [X] hours
**Root Cause**: [Brief]
**Action Items**: [Link to ticket]
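Templates like these are easy to fill programmatically, which keeps updates consistent when responders are under pressure. A small sketch (field names mirror the status update template above; the helper function itself is an assumption):

```python
def format_status_update(incident: str, status: str, update: str,
                         next_update: str, eta: str = "unknown") -> str:
    """Render a status update using the template fields above."""
    return (
        f"**Incident**: {incident}\n"
        f"**Status**: {status}\n"
        f"**Update**: {update}\n"
        f"**Next Update**: {next_update}\n"
        f"**ETA**: {eta}"
    )

msg = format_status_update(
    incident="Checkout errors",
    status="In Progress - Mitigating",
    update="Rolling back the 14:02 deploy",
    next_update="15:00 UTC",
)
```

Wiring a helper like this into a chatbot or status-page API means the IC only supplies the changing fields and the cadence promise ("Next Update") is never forgotten.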
99.5 Post-Mortem Template

Comprehensive Post-Mortem
Section titled “Comprehensive Post-Mortem”# Incident Post-Mortem
## Metadata

- **Incident ID**: INC-2024-001
- **Date**: January 15, 2024
- **Duration**: 2 hours 34 minutes
- **Severity**: SEV1
- **Author**: [Name]
- **Contributors**: [Names]

## Summary

[Brief paragraph describing what happened]

## Impact

- Customers affected: [Number or percentage]
- Revenue impact: [If applicable]
- Data impact: [If any]
- Internal impact: [Team downtime, etc.]

## Root Cause

[Detailed explanation of why the incident occurred]
[Technical details of the failure]

## Timeline (UTC)

| Time  | Event                 |
|-------|-----------------------|
| 10:00 | Alert triggered       |
| 10:05 | On-call acknowledged  |
| 10:15 | SEV1 declared         |
| 10:30 | Root cause identified |
| 11:45 | Mitigation applied    |
| 12:30 | Service restored      |
| 12:34 | Incident closed       |

## Detection

- How was the incident detected?
- Time from failure to detection: [X] minutes

## Response

- Time to acknowledge: [X] minutes
- Time to mitigate: [X] minutes
- Time to resolve: [X] minutes
## What Went Well

- [Item 1]
- [Item 2]

## What Could Be Improved

- [Item 1]
- [Item 2]

## Action Items

| ID | Description    | Owner | Due Date |
|----|----------------|-------|----------|
| 1  | Fix root cause | @name | Jan 30   |
| 2  | Add monitoring | @name | Feb 5    |
| 3  | Update runbook | @name | Feb 10   |
## Lessons Learned

[Any broader learnings for the team]
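The detection and response metrics in the template fall out of the timeline mechanically, so they are worth computing rather than eyeballing. A sketch that derives them from named events like the example timeline's rows (the `minutes_between` helper and the event labels are illustrative assumptions):

```python
from datetime import datetime

def minutes_between(timeline: dict[str, str], start: str, end: str) -> int:
    """Minutes between two named events; times are 'HH:MM' strings (same day)."""
    fmt = "%H:%M"
    t0 = datetime.strptime(timeline[start], fmt)
    t1 = datetime.strptime(timeline[end], fmt)
    return int((t1 - t0).total_seconds() // 60)

# Events taken from the example timeline above.
timeline = {
    "alert": "10:00",
    "acknowledged": "10:05",
    "mitigated": "11:45",
    "restored": "12:30",
}

time_to_acknowledge = minutes_between(timeline, "alert", "acknowledged")  # 5
time_to_mitigate = minutes_between(timeline, "alert", "mitigated")        # 105
time_to_resolve = minutes_between(timeline, "alert", "restored")          # 150
```

Tracking these numbers across incidents is what turns individual post-mortems into trend data (MTTA/MTTR) for the continuous-improvement loop.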
99.6 Interview Questions

**Q1: What is the incident management process?**
- Detection → Triage → Mitigation → Resolution → Post-Incident
- Severity classification based on impact
- Clear roles: IC, Scribe, Tech Lead
- Communication throughout
- Post-mortem after resolution

**Q2: How do you determine incident severity?**
- Customer impact (how many affected)
- Revenue/data impact
- Business criticality
- Time-to-recovery expectation
- Workaround availability

**Q3: What is the role of an Incident Commander?**
- Coordinates the response
- Communicates status
- Makes decisions
- Documents the timeline
- Does NOT fix the issue personally; coordinates instead

**Q4: How do you handle being on-call?**
- Prepare before the shift
- Acknowledge quickly
- Document actions
- Escalate when needed
- Proper handover
- Self-care afterward

**Q5: What should a post-mortem include?**
- What happened (summary)
- Impact (customers, revenue)
- Root cause analysis
- Timeline
- What went well
- What could improve
- Action items with owners
- Blameless focus

**Q6: How do you reduce alert fatigue?**
- Tune alert thresholds
- Remove duplicate alerts
- Create compound alerts
- Ensure alerts are actionable
- Regular alert reviews
- Automate where possible

**Q7: What is the difference between incident and problem management?**
- Incident: short-term response to restore service
- Problem: long-term fix to prevent recurrence
- Incidents are symptoms; problems are root causes
- Problem management often spans multiple incidents

**Q8: How do you communicate during an incident?**
- Status page updates (for customers)
- Internal chat channel (for responders)
- Regular updates (every X minutes)
- Executive briefings (for SEV1/2)
- Clear, concise, factual
- Set expectations for the next update

**Q9: What is a "war room" and when do you use one?**
- Virtual or physical room for incident response
- All responders join
- Used for major incidents (SEV1)
- Enables real-time collaboration
- The IC coordinates from the war room
- Reduces communication overhead

**Q10: How do you prevent on-call burnout?**
- Fair rotation schedules
- Protected time after on-call
- Limit consecutive on-call shifts
- Clear escalation paths
- Good runbooks to reduce manual work
- Manager support
- Self-care and boundaries

Summary
Section titled “Summary”- Severity: SEV1-4 based on customer and business impact
- Process: Detect → Triage → Mitigate → Resolve → Post-Incident
- Roles: Incident Commander, Tech Lead, Scribe
- Communication: Regular updates, status page
- Post-Mortem: Blameless, action items, continuous improvement
Next Chapter
Chapter 100: Career Development
Last Updated: February 2026