Chapter 99: Incident Management
Overview
Incident management is a critical discipline in DevOps and SRE practices, focusing on responding to service disruptions efficiently while minimizing impact and preventing recurrence. This chapter covers incident response processes, severity levels, on-call best practices, communication strategies, post-mortems, and building a culture of continuous improvement. Understanding incident management is essential for SRE roles and demonstrates operational maturity in interviews.
99.1 Severity Levels
Severity Classification

SEV1 - CRITICAL
- Complete service outage
- All customers affected
- Revenue impact
- Data loss potential
- Response: Immediate, all hands on deck
- Target resolution: < 1 hour
- Communication: Executive and customer alerts

SEV2 - HIGH
- Major feature unavailable
- Significant customer impact
- Critical functionality impaired
- Response: Urgent, senior engineers
- Target resolution: < 4 hours
- Communication: Status page update

SEV3 - MEDIUM
- Moderate impact
- Workaround available
- Secondary features affected
- Response: During business hours
- Target resolution: < 24 hours
- Communication: Team notification

SEV4 - LOW
- Minor issue
- Cosmetic or documentation issues
- Can wait for next business day
- Response: Schedule for next sprint
- Target resolution: < 1 week
- Communication: Ticket tracking only
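As a rough illustration, the four levels above can be captured as a lookup table so that tooling can check whether an open incident has reached its target resolution time. This is a minimal Python sketch, not a prescribed implementation: the names `SeverityPolicy`, `POLICIES`, and `is_past_target` are hypothetical, and the durations simply mirror the targets listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    target_resolution: timedelta
    communication: str


# Targets and communication channels copied from the severity levels above.
POLICIES = {
    "SEV1": SeverityPolicy("Critical", timedelta(hours=1), "Executive and customer alerts"),
    "SEV2": SeverityPolicy("High", timedelta(hours=4), "Status page update"),
    "SEV3": SeverityPolicy("Medium", timedelta(hours=24), "Team notification"),
    "SEV4": SeverityPolicy("Low", timedelta(weeks=1), "Ticket tracking only"),
}


def is_past_target(severity: str, opened_at: datetime, now: datetime) -> bool:
    """Return True once an open incident has reached its target resolution time."""
    return now - opened_at >= POLICIES[severity].target_resolution


if __name__ == "__main__":
    opened = datetime(2024, 1, 15, 10, 0)  # illustrative timestamp
    print(is_past_target("SEV1", opened, datetime(2024, 1, 15, 10, 45)))  # False
    print(is_past_target("SEV1", opened, datetime(2024, 1, 15, 11, 30)))  # True
```

Keeping the targets in one table means a change to a level's target only has to be made in one place.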
Severity Determination Factors:
- Customer impact (how many affected)
- Revenue impact
- Data integrity/safety
- Recovery complexity
- Time to implement workaround

99.2 Incident Response Process
Response Workflow

1. DETECTION
- Alert from monitoring
- Customer report
- Automated detection

2. TRIAGE
- Assess severity
- Identify affected systems
- Assign incident commander
- Initial communication

3. MITIGATION
- Immediate actions to reduce impact
- Apply workarounds
- Customer communication
- Regular status updates

4. RESOLUTION
- Fix root cause
- Verify resolution
- Final customer communication
- Close incident

5. POST-INCIDENT
- Schedule post-mortem
- Document timeline
- Identify action items
- Implement improvements

99.3 On-Call Best Practices
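Before turning to on-call practice, the five-stage workflow from 99.2 can be sketched as a small state machine that moves an incident forward one stage at a time. This is an illustrative Python sketch only: the `Stage` enum and `Incident` class are hypothetical names, and the strictly linear progression is a simplifying assumption that real incidents do not always follow.

```python
from enum import Enum


class Stage(Enum):
    DETECTION = 1
    TRIAGE = 2
    MITIGATION = 3
    RESOLUTION = 4
    POST_INCIDENT = 5


class Incident:
    """Tracks which workflow stage an incident is in (hypothetical helper)."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.DETECTION
        self.history = [Stage.DETECTION]

    def advance(self) -> Stage:
        """Move to the next stage; raise if the workflow is already finished."""
        if self.stage is Stage.POST_INCIDENT:
            raise ValueError("incident is already in the post-incident stage")
        self.stage = Stage(self.stage.value + 1)
        self.history.append(self.stage)
        return self.stage


if __name__ == "__main__":
    incident = Incident("Example incident")  # placeholder title
    while incident.stage is not Stage.POST_INCIDENT:
        incident.advance()
    # Prints: DETECTION -> TRIAGE -> MITIGATION -> RESOLUTION -> POST_INCIDENT
    print(" -> ".join(stage.name for stage in incident.history))
```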
On-Call Responsibilities

# ============================================================
# ON-CALL BEST PRACTICES
# ============================================================

# Preparation
- Review active issues before shift
- Test alerting and escalation
- Know escalation contacts
- Review runbooks
- Ensure access to all tools
- Keep a backup device charged

# During On-Call
- Acknowledge alerts quickly (< 15 min)
- Be present and responsive
- Update status page proactively
- Document actions taken
- Hand over to next shift

# Handover Process
- Document current issues
- Share context for ongoing work
- Highlight items to watch
- Confirm next on-call acknowledged

# Self-Care
- Get adequate sleep
- Take regular breaks
- Escalate when needed
- Use protected time after on-call
- Limit on-call frequency

# Escalation Path
- L1: On-call engineer
- L2: Senior engineer/team lead
- L3: Engineering manager
- L4: VP/Director

Incident Commander Role
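The escalation path listed above can be modelled as an ordered list. The Python sketch below assumes, purely for illustration, that an alert moves up one level each time a 15-minute acknowledgement window passes unanswered; that per-level timeout and the function name `current_escalation_level` are assumptions, not rules stated in this chapter.

```python
from datetime import timedelta

# Escalation path from the on-call section, lowest level first.
ESCALATION_PATH = [
    "On-call engineer",
    "Senior engineer/team lead",
    "Engineering manager",
    "VP/Director",
]

# Assumption: every level gets the same 15-minute acknowledgement window.
ACK_WINDOW = timedelta(minutes=15)


def current_escalation_level(unacknowledged_for: timedelta) -> str:
    """Return who should be paged after an alert has gone unacknowledged this long."""
    windows_elapsed = max(0, unacknowledged_for // ACK_WINDOW)  # whole windows only
    level = min(windows_elapsed, len(ESCALATION_PATH) - 1)  # stay at the top once reached
    return ESCALATION_PATH[level]


if __name__ == "__main__":
    print(current_escalation_level(timedelta(minutes=5)))   # On-call engineer
    print(current_escalation_level(timedelta(minutes=40)))  # Engineering manager
```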
Primary responsibilities of the Incident Commander (IC):

1. COORDINATE RESPONSE
- Lead incident response efforts
- Assign roles to responders
- Ensure proper resources are engaged

2. COMMUNICATE
- Provide status updates
- Coordinate with stakeholders
- Update status page
- Brief executive team

3. DECIDE
- Make critical decisions
- Authorize emergency changes
- Decide on escalation
- Determine when to declare incident closed

4. DOCUMENT
- Maintain incident timeline
- Track actions taken
- Ensure post-mortem will be done

The IC should NOT:
- Try to fix the issue personally (delegate)
- Ignore communication (prioritize it)
- Hesitate to escalate
- Make decisions without input (consult team)

99.4 Communication During Incidents
Communication Templates

# Initial Alert (SEV1/2)
**Incident**: [Brief title]
**Severity**: SEV1
**Status**: Investigating
**Impact**: [Customer impact description]
**Commander**: [Name]
**Updates**: Will post every [X] minutes

# Status Update
**Incident**: [Title]
**Status**: In Progress - Mitigating
**Update**: [What we're doing]
**Next Update**: [Time]
**ETA**: [If known]
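The status update template above can also be filled in programmatically, so that the "Next Update" time is always computed rather than typed by hand. A minimal Python sketch, assuming a hypothetical `render_status_update` helper, timestamps in UTC, and a 30-minute update interval chosen only for illustration:

```python
from datetime import datetime, timedelta
from typing import Optional


def render_status_update(
    title: str,
    update: str,
    now: datetime,
    interval: timedelta = timedelta(minutes=30),  # illustrative default, not from the chapter
    eta: Optional[str] = None,
) -> str:
    """Fill in the status update template; the ETA line is added only if known."""
    next_update = now + interval
    lines = [
        f"**Incident**: {title}",
        "**Status**: In Progress - Mitigating",
        f"**Update**: {update}",
        f"**Next Update**: {next_update:%H:%M} UTC",
    ]
    if eta is not None:
        lines.append(f"**ETA**: {eta}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_status_update(
        "Example incident",                # placeholder title
        "Rolling back the latest deploy",  # placeholder update text
        now=datetime(2024, 1, 15, 10, 15),
    ))
```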
# Resolution
**Incident**: [Title]
**Status**: Resolved
**Duration**: [X] hours
**Root Cause**: [Brief]
**Action Items**: [Link to ticket]

99.5 Post-Mortem Template
Comprehensive Post-Mortem

# Incident Post-Mortem

## Metadata
- **Incident ID**: INC-2024-001
- **Date**: January 15, 2024
- **Duration**: 2 hours 34 minutes
- **Severity**: SEV1
- **Author**: [Name]
- **Contributors**: [Names]

## Summary
[Brief paragraph describing what happened]

## Impact
- Customers affected: [Number or percentage]
- Revenue impact: [If applicable]
- Data impact: [If any]
- Internal impact: [Team downtime, etc.]

## Root Cause
[Detailed explanation of why the incident occurred]
[Technical details of the failure]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 10:00 | Alert triggered |
| 10:05 | On-call acknowledged |
| 10:15 | SEV1 declared |
| 10:30 | Root cause identified |
| 11:45 | Mitigation applied |
| 12:30 | Service restored |
| 12:34 | Incident closed |

## Detection
- How was the incident detected?
- Time from failure to detection: [X] minutes

## Response
- Time to acknowledge: [X] minutes
- Time to mitigate: [X] minutes
- Time to resolve: [X] minutes
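The response metrics in this template can be derived directly from the timeline table. The Python sketch below uses the example timestamps from the template; measuring every metric from the "Alert triggered" entry and treating "Service restored" as the resolution time are assumptions made for illustration, and `minutes_between` is a hypothetical helper.

```python
from datetime import datetime

# Example timeline entries from the template (all times UTC, same day).
TIMELINE = {
    "Alert triggered": "10:00",
    "On-call acknowledged": "10:05",
    "Mitigation applied": "11:45",
    "Service restored": "12:30",
}


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


alert = TIMELINE["Alert triggered"]
time_to_acknowledge = minutes_between(alert, TIMELINE["On-call acknowledged"])
time_to_mitigate = minutes_between(alert, TIMELINE["Mitigation applied"])
time_to_resolve = minutes_between(alert, TIMELINE["Service restored"])

print(time_to_acknowledge, time_to_mitigate, time_to_resolve)  # 5 105 150
```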
## What Went Well
- [Item 1]
- [Item 2]

## What Could Be Improved
- [Item 1]
- [Item 2]

## Action Items
| ID | Description | Owner | Due Date |
|----|-------------|-------|----------|
| 1 | Fix root cause | @name | Jan 30 |
| 2 | Add monitoring | @name | Feb 5 |
| 3 | Update runbook | @name | Feb 10 |

## Lessons Learned
[Any broader learnings for the team]

99.6 Interview Questions
Q1: What is the incident management process?
A1:
- Detection → Triage → Mitigation → Resolution → Post-Incident
- Severity classification based on impact
- Clear roles: IC, Scribe, Tech Lead
- Communication throughout
- Post-mortem after resolution

Q2: How do you determine incident severity?
A2:
- Customer impact (how many affected)
- Revenue/data impact
- Business criticality
- Time to recovery expectation
- Workaround availability

Q3: What is the role of an Incident Commander?
A3:
- Coordinates response
- Communicates status
- Makes decisions
- Documents timeline
- NOT fixing issue personally, but coordinating

Q4: How do you handle being on-call?
A4:
- Prepare before shift
- Acknowledge quickly
- Document actions
- Escalate when needed
- Proper handover
- Self-care afterward

Q5: What should a post-mortem include?
A5:
- What happened (summary)
- Impact (customers, revenue)
- Root cause analysis
- Timeline
- What went well
- What could improve
- Action items with owners
- Blameless focus

Q6: How do you reduce alert fatigue?
A6:
- Tune alert thresholds
- Remove duplicate alerts
- Create compound alerts
- Ensure alerts are actionable
- Regular alert reviews
- Automate where possible

Q7: What is the difference between incident and problem management?
A7:
- Incident: Short-term response to restore service
- Problem: Long-term fix to prevent recurrence
- Incidents are symptoms, problems are root causes
- Problem management often involves multiple incidents

Q8: How do you communicate during an incident?
A8:
- Status page updates (for customers)
- Internal chat channel (for responders)
- Regular updates (every X minutes)
- Executive briefings (for SEV1/2)
- Clear, concise, factual
- Set expectations for next update

Q9: What is a "war room" and when do you use one?
A9:
- Virtual or physical room for incident response
- All responders join
- Used for major incidents (SEV1)
- Enables real-time collaboration
- IC coordinates from war room
- Reduces communication overhead

Q10: How do you prevent on-call burnout?
A10:
- Fair rotation schedules
- Protected time after on-call
- Limit consecutive on-call shifts
- Clear escalation paths
- Good runbooks to reduce manual work
- Manager support
- Self-care and boundaries

Summary
- Severity: SEV1-4 based on customer and business impact
- Process: Detect → Triage → Mitigate → Resolve → Post-Incident
- Roles: Incident Commander, Tech Lead, Scribe
- Communication: Regular updates, status page
- Post-Mortem: Blameless, action items, continuous improvement
Next Chapter
Chapter 100: Career Development
Last Updated: February 2026