
Incident Management

Incident Management is the practice of responding to and resolving outages and service disruptions effectively.

┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Lifecycle │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Incident Lifecycle │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Detect │───▶│ Respond │───▶│ Resolve │───▶│ Learn │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ │ │
│ │ - Monitoring - Triage - Fix - Post-Mortem │ │
│ │ - Alerting - Escalate - Verify - Process Improve│ │
│ │ - Reporting - Communicate - Notify - Document │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Key Metrics: │
│ ✓ MTTR - Mean Time To Recovery │
│ ✓ MTTD - Mean Time To Detect │
│ ✓ MTBF - Mean Time Between Failures │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
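As a rough sketch, these metrics can be computed directly from incident timestamps. The record format below (start, detected, resolved tuples) is an assumption for illustration, not a standard schema:

```python
from datetime import datetime

# Hypothetical incident records: (issue start, detected, resolved)
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 2), datetime(2024, 1, 8, 14, 20)),
]

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = mean_minutes([d - s for s, d, _ in incidents])   # start -> detection
mttr = mean_minutes([r - s for s, _, r in incidents])   # start -> recovery
starts = sorted(s for s, _, _ in incidents)
mtbf = mean_minutes([b - a for a, b in zip(starts, starts[1:])])  # gap between incidents

print(mttd, mttr, mtbf)  # 3.0 25.0 10320.0
```

In practice these figures usually come straight from your alerting platform, but the arithmetic is the same.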
┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Severity Levels │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ SEV1 - Critical │ Complete outage affecting all users │
│ │ - Response: < 15 min │
│ │ - Examples: Complete service down, data loss │
│ └─────────────────┘ │
│ │
│ ┌─────────────────┐ │
│ │ SEV2 - High │ Major feature unavailable │
│ │ - Response: < 30 min │
│ │ - Examples: Payment system down, login failure │
│ └─────────────────┘ │
│ │
│ ┌─────────────────┐ │
│ │ SEV3 - Medium │ Partial functionality affected │
│ │ - Response: < 2 hours │
│ │ - Examples: Slow performance, intermittent errors │
│ └─────────────────┘ │
│ │
│ ┌─────────────────┐ │
│ │ SEV4 - Low │ Minor issues, workarounds available │
│ │ - Response: < 24 hours │
│ │ - Examples: UI bugs, documentation errors │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
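The severity table above maps directly to response-time SLAs, which makes it easy to check compliance in code. A minimal sketch (the mapping mirrors the table; the function name is our own):

```python
# Severity level -> response-time SLA in minutes, per the table above
RESPONSE_SLA_MINUTES = {"SEV1": 15, "SEV2": 30, "SEV3": 120, "SEV4": 1440}

def sla_breached(severity: str, minutes_to_respond: float) -> bool:
    """True if the response time exceeded the SLA for this severity."""
    return minutes_to_respond > RESPONSE_SLA_MINUTES[severity]

print(sla_breached("SEV1", 20))  # True: 20 min > 15 min SLA
print(sla_breached("SEV3", 90))  # False: within the 2-hour window
```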
┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Response Workflow │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Detection & Triage │ │
│ │ │ │
│ │ - Monitor detects issue (alert) │ │
│ │ - On-call engineer acknowledges │ │
│ │ - Initial assessment and severity assignment │ │
│ │ - Create incident ticket │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 2. Response & Communication │ │
│ │ │ │
│ │ - Notify stakeholders │ │
│ │ - Create incident channel (Slack/Teams) │ │
│ │ - Begin root cause analysis │ │
│ │ - Regular status updates │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 3. Resolution │ │
│ │ │ │
│ │ - Implement fix │ │
│ │ - Verify resolution │ │
│ │ - Monitor for recurrence │ │
│ │ - Confirm service restoration │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 4. Post-Incident │ │
│ │ │ │
│ │ - Schedule post-mortem │ │
│ │ - Document lessons learned │ │
│ │ - Update runbooks │ │
│ │ - Implement improvements │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
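The four workflow stages above can be modeled as a tiny state machine over an incident record. This is only a sketch of the idea; field names are illustrative, not any particular incident tool's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The four workflow stages above, in order
STAGES = ["triage", "response", "resolution", "post-incident"]

@dataclass
class Incident:
    title: str
    severity: str
    stage: str = STAGES[0]
    timeline: list = field(default_factory=list)

    def advance(self) -> str:
        """Move to the next stage and record when the transition happened."""
        i = STAGES.index(self.stage)
        if i < len(STAGES) - 1:
            self.stage = STAGES[i + 1]
        self.timeline.append((datetime.now(timezone.utc), self.stage))
        return self.stage

inc = Incident("Payment API errors", "SEV2")
inc.advance()  # -> 'response'
inc.advance()  # -> 'resolution'
```

The timestamped timeline doubles as the raw material for the post-mortem's "Timeline (UTC)" section.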
# runbook-database-connection.md
# Database Connection Issues
## Symptoms
- High connection count
- Connection timeouts
- "Too many connections" errors
## Diagnosis
1. Check current connections:

    ```bash
    psql -h $DB_HOST -U $DB_USER -c "SELECT count(*) FROM pg_stat_activity;"
    ```

2. Check active queries:

    ```bash
    psql -h $DB_HOST -U $DB_USER -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"
    ```

3. Check connection limits:

    ```bash
    psql -h $DB_HOST -U $DB_USER -c "SHOW max_connections;"
    ```

## Mitigation
1. Identify long-running queries:

    ```bash
    psql -h $DB_HOST -U $DB_USER -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;"
    ```

2. Terminate problematic queries (if safe), substituting the offending backend's pid:

    ```bash
    psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<pid>);"
    ```

3. Scale the database connection pool
4. Check for application connection leaks
5. Restart application pods (if needed)
6. Increase connection pool size temporarily
## Escalation
- SEV1: @oncall-platform
- SEV2: @platform-team
## Post-Mortem Template
```markdown
# Post-Mortem: [Incident Title]
## Summary
Brief description of what happened and impact.
## Impact
- Users affected: [Number]
- Duration: [Start time] - [End time]
- Severity: [SEV level]
## Timeline (UTC)
- [Time] - Alert triggered
- [Time] - Incident acknowledged
- [Time] - Root cause identified
- [Time] - Fix deployed
- [Time] - Incident resolved
## Root Cause
Detailed explanation of what caused the incident.
## What Went Well
- Detection worked quickly
- Team responded effectively
## What Could Be Improved
- Alert thresholds too sensitive
- Documentation outdated
## Action Items
- [ ] Update alert threshold - @john - due: 2024-01-15
- [ ] Update runbook - @jane - due: 2024-01-20
- [ ] Implement circuit breaker - @team - due: 2024-02-01
## Lessons Learned
Key takeaways from this incident.
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Management Tools │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PagerDuty: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ - On-call scheduling │ │
│ │ - Alert routing │ │
│ │ - Incident response │ │
│ │ - Analytics │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ Opsgenie: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ - Alert management │ │
│ │ - On-call schedules │ │
│ │ - Integration with many tools │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ VictorOps: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ - Incident lifecycle management │ │
│ │ - On-call rotation │ │
│ │ - ChatOps integration │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ FireHydrant: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ - Incident management platform │ │
│ │ - Post-mortem automation │ │
│ │ - Status pages │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
# status-page-template.md
# Status Page Components
## Components
- API Service: Operational
- Web Application: Operational
- Mobile App: Operational
- Payment Processing: Operational
- Database: Operational
## Current Status
🟢 All Systems Operational
## Past Incidents
### [Date] - Scheduled Maintenance
**Status:** Completed
**Duration:** 30 minutes
**Description:** Database upgrades completed successfully.
### [Date] - Payment Processing Issue
**Status:** Resolved
**Duration:** 2 hours 15 minutes
**Impact:** Some users experienced payment failures
**Root Cause:** Database connection pool exhaustion
**Resolution:** Applied fix and monitoring improvements
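A status page like the one above is typically driven by a machine-readable payload. As a sketch, the component list can be serialized to JSON; the field names here are assumptions, not a specific status-page provider's API:

```python
import json

# Component states from the status page template above
components = {
    "API Service": "operational",
    "Web Application": "operational",
    "Mobile App": "operational",
    "Payment Processing": "operational",
    "Database": "operational",
}

# Overall status is "all systems operational" only if every component is
overall = ("all_systems_operational"
           if all(s == "operational" for s in components.values())
           else "degraded")

payload = json.dumps({"status": overall, "components": components}, indent=2)
print(payload)
```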
# incident-response-automation.yaml

```yaml
# Illustrative pipeline: the `alert` trigger and the classifier/status-page
# actions are placeholders for your own integrations, not published actions.
name: Auto-Response Pipeline
on:
  alert:
    types: [triggered]
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - name: Classify Alert
        id: classify
        uses: actions/alert-classifier@v1
        with:
          alert: ${{ toJson(github.event.alert) }}
      - name: Create Slack Channel
        if: steps.classify.outputs.severity == 'high'
        uses: peter-evans/create-or-update-comment@v1
        with:
          channel-id: ${{ secrets.INCIDENT_CHANNEL }}
          message: "New incident triggered: ${{ steps.classify.outputs.title }}"
      - name: Page On-Call
        if: steps.classify.outputs.severity == 'critical'
        uses: pagerduty/pagerduty-create-incident@v1
        with:
          service-id: ${{ secrets.PAGERDUTY_SERVICE_ID }}
          title: ${{ steps.classify.outputs.title }}
          urgency: high
      - name: Update Status Page
        if: steps.classify.outputs.severity != 'low'
        uses: burnett01/statuspage-update@v1
        with:
          status: investigating
          message: ${{ steps.classify.outputs.title }}
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Metrics │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Response Metrics: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ MTTD - Mean Time To Detect │ │
│ │ - Time from issue start to alert trigger │ │
│ │ - Target: < 5 minutes │ │
│ │ │ │
│ │ MTTA - Mean Time To Acknowledge │ │
│ │ - Time from alert to engineer acknowledging │ │
│ │ - Target: < 5 minutes │ │
│ │ │ │
│ │ MTTR - Mean Time To Resolve │ │
│ │ - Time from alert to service restoration │ │
│ │ - Target: < 30 minutes for SEV1 │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ Quality Metrics: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ MTBF - Mean Time Between Failures │ │
│ │ - Average time between incidents │ │
│ │ │ │
│ │ False Positive Rate │ │
│ │ - Percentage of alerts that aren't real issues │ │
│ │ │ │
│ │ Repeat Incident Rate │ │
│ │ - Incidents caused by same issue within 30 days │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
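The quality metrics above fall out of a simple alert/incident log. A sketch under an assumed schema (the `real_issue` flag and `cause` field are our own labels):

```python
from datetime import date

# False positive rate: share of alerts that weren't real issues
alerts = [
    {"id": 1, "real_issue": True},
    {"id": 2, "real_issue": False},
    {"id": 3, "real_issue": True},
    {"id": 4, "real_issue": False},
]
false_positive_rate = sum(not a["real_issue"] for a in alerts) / len(alerts)

# Repeat incident rate: same root cause recurring within 30 days
incidents = [
    {"cause": "pool-exhaustion", "date": date(2024, 1, 5)},
    {"cause": "pool-exhaustion", "date": date(2024, 1, 20)},  # repeat (15 days later)
    {"cause": "disk-full", "date": date(2024, 2, 1)},
]

def is_repeat(inc, earlier):
    """True if an earlier incident shares this root cause within 30 days."""
    return any(e["cause"] == inc["cause"] and (inc["date"] - e["date"]).days <= 30
               for e in earlier)

repeats = sum(is_repeat(inc, incidents[:i]) for i, inc in enumerate(incidents))
repeat_rate = repeats / len(incidents)
print(false_positive_rate, repeat_rate)  # 0.5 0.3333333333333333
```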
┌─────────────────────────────────────────────────────────────────────────────┐
│ On-Call Best Practices │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Clear Escalation Paths │
│ - Define who to contact at each level │
│ - Document in runbooks │
│ │
│ 2. Reasonable Rotation │
│ - Rotate on-call weekly │
│ - Avoid handovers on Mondays/Fridays │
│ - Allow for time off after on-call │
│ │
│ 3. Alert Quality │
│ - Tune alerts to reduce noise │
│ - Use composite alerts │
│ - Set appropriate thresholds │
│ │
│ 4. Tools and Access │
│ - Ensure VPN/access from anywhere │
│ - Mobile-friendly dashboards │
│ - Quick access to runbooks │
│ │
│ 5. Post-On-Call Review │
│ - Review any incidents handled │
│ - Update documentation │
│ - Rest and recover │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
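The rotation guidance above (weekly shifts, no Monday/Friday handovers) can be sketched as a schedule generator that hands over midweek. Engineer names here are placeholders:

```python
from datetime import date, timedelta

def weekly_rotation(engineers, first_wednesday, weeks):
    """Yield (engineer, shift_start) pairs, one per week, handing over on Wednesdays."""
    assert first_wednesday.weekday() == 2, "start the rotation on a Wednesday"
    for w in range(weeks):
        yield engineers[w % len(engineers)], first_wednesday + timedelta(weeks=w)

# Jan 3, 2024 is a Wednesday; three engineers rotate over four weeks
schedule = list(weekly_rotation(["alice", "bob", "carol"], date(2024, 1, 3), 4))
print(schedule[0])  # ('alice', datetime.date(2024, 1, 3))
print(schedule[3])  # ('alice', datetime.date(2024, 1, 24))
```

A real rotation also needs overrides for time off, which is where tools like PagerDuty or Opsgenie earn their keep.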

In this chapter, you learned:

  • Incident Lifecycle: Detection, response, resolution, learning
  • Severity Levels: SEV1-4 classification
  • Response Process: Triage, communication, fix, post-mortem
  • Runbooks: Documentation and automation
  • Post-Mortems: Template and best practices
  • Tools: PagerDuty, Opsgenie, status pages
  • Metrics: MTTD, MTTR, MTBF
  • On-Call: Best practices for on-call engineers

You have completed the DevOps Tools Complete Guide. Here’s what was covered:

  • Docker fundamentals and architecture
  • Images, containers, networking
  • Dockerfiles, Docker Compose
  • Best practices
  • Kubernetes architecture
  • Pods, Deployments, Services
  • ConfigMaps, Secrets, Storage
  • Advanced topics (Ingress, HPA, Helm)
  • Infrastructure as Code concepts
  • Terraform basics, variables, state
  • Modules and best practices
  • CloudFormation
  • Configuration management
  • Playbooks, roles, inventory
  • Advanced features
  • CI/CD pipeline concepts
  • Jenkins, GitHub Actions, GitLab CI
  • GitOps, Platform Engineering
  • SRE, Observability
  • Chaos Engineering, Service Mesh
  • Container Security, Incident Management