Incident Management: A Complete Guide
This guide walks through the incident management process from detection to resolution, covering severity levels, roles, communication, and escalation.
Severity Levels
- P1 (Critical) — service unavailable, impact on revenue/security. Response: 5 min
- P2 (High) — degraded performance, partial outage. Response: 15 min
- P3 (Medium) — minor feature not working. Response: 1 hour
- P4 (Low) — cosmetic issue. Response: next business day
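The severity list above can be expressed as a small lookup table, which makes response targets easy to check in tooling. The following is a minimal sketch, not a reference implementation: the names `Severity`, `RESPONSE_TARGETS`, and `response_deadline` are hypothetical, the detection timestamp is an arbitrary example, and "next business day" for P4 is left as `None` because it is not a fixed duration.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Severity(Enum):
    P1 = "Critical"
    P2 = "High"
    P3 = "Medium"
    P4 = "Low"


# Response targets taken from the severity list. P4 ("next business day")
# depends on a calendar, so this sketch does not model it as a duration.
RESPONSE_TARGETS = {
    Severity.P1: timedelta(minutes=5),
    Severity.P2: timedelta(minutes=15),
    Severity.P3: timedelta(hours=1),
    Severity.P4: None,
}


def response_deadline(severity, detected_at):
    """Return the time by which a response is expected, or None for P4."""
    target = RESPONSE_TARGETS[severity]
    if target is None:
        return None
    return detected_at + target


# Arbitrary example timestamp, used only for illustration.
detected = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
print(response_deadline(Severity.P1, detected))
```

Keeping the targets in one table means an alerting or paging script can read them from a single place instead of hard-coding minutes in several spots.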
Incident Roles
- Incident Commander (IC) — coordinates response, decides on escalation
- Technical Lead — leads technical investigation
- Communications Lead — informs stakeholders, status page
- Scribe — documents timeline and decisions
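One way to make the roles above visible during an incident is to record who holds each one and flag any that are still open. This is a rough sketch under assumptions: the `Roster` class and its methods are hypothetical, the people's names are placeholders, and nothing here says whether one person may hold several roles.

```python
from dataclasses import dataclass, field

# Role names as listed in this section.
ROLES = ("Incident Commander", "Technical Lead", "Communications Lead", "Scribe")


@dataclass
class Roster:
    # Maps a role name to the person currently holding it.
    assignments: dict = field(default_factory=dict)

    def assign(self, role, person):
        """Record that `person` holds `role`; reject roles not in ROLES."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.assignments[role] = person

    def unfilled(self):
        """Return the roles that nobody has been assigned to yet."""
        return [role for role in ROLES if role not in self.assignments]


roster = Roster()
roster.assign("Incident Commander", "alice")  # placeholder names
roster.assign("Scribe", "bob")
print(roster.unfilled())  # ['Technical Lead', 'Communications Lead']
```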
Response Process
- Detect — alert or report from a user
- Triage — determine severity and IC
- Investigate — diagnostics, identify root cause
- Mitigate — restore the service (rollback, restart, failover)
- Resolve — permanent fix
- Postmortem — within 48h, blameless
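The stages above can be modelled as a simple ordered state machine that also records a timeline, which is roughly what the Scribe maintains by hand. This is a simplified sketch: the `Incident` class and `postmortem_due` helper are hypothetical, the strict linear order is an assumption (in practice investigation and mitigation often overlap), and the 48-hour postmortem window is assumed to start at resolution.

```python
from datetime import datetime, timedelta

# Stages in the order given in this section.
STAGES = ["Detect", "Triage", "Investigate", "Mitigate", "Resolve", "Postmortem"]


class Incident:
    def __init__(self, title):
        self.title = title
        self.stage = STAGES[0]
        self.timeline = [STAGES[0]]  # stages visited so far, in order

    def advance(self):
        """Move to the next stage and record it in the timeline."""
        index = STAGES.index(self.stage)
        if index == len(STAGES) - 1:
            raise ValueError("incident is already at the final stage")
        self.stage = STAGES[index + 1]
        self.timeline.append(self.stage)
        return self.stage


def postmortem_due(resolved_at):
    """Assumption: the 48h postmortem window is counted from resolution."""
    return resolved_at + timedelta(hours=48)


incident = Incident("Increased error rate on API Gateway")
incident.advance()  # Detect -> Triage
print(incident.stage)
```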
Communication
# Status page update template
[Investigating] Increased error rate on API Gateway.
Affected services: API, Checkout.
The team is working on identifying the cause.
[Identified] Cause: high memory usage after deployment v2.3.1.
Mitigation: rollback to v2.3.0 in progress.
[Monitoring] Rollback complete. Error rate is decreasing.
Services are gradually recovering.
[Resolved] Incident resolved. Services fully operational.
Postmortem will be published within 48h.
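Status updates in the shape of the template above can be generated with a small helper so that every update carries a valid state label. This is an illustrative sketch only: `format_update` and `STATES` are hypothetical names, and the helper only formats text; it does not publish to any real status page service.

```python
# State labels used in the template, in the order they normally appear.
STATES = ("Investigating", "Identified", "Monitoring", "Resolved")


def format_update(state, lines):
    """Prefix the first line with the state label and join the rest below it."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state}")
    if not lines:
        raise ValueError("an update needs at least one line of text")
    first, *rest = lines
    return "\n".join([f"[{state}] {first}", *rest])


print(format_update("Investigating", [
    "Increased error rate on API Gateway.",
    "Affected services: API, Checkout.",
    "The team is working on identifying the cause.",
]))
```

Restricting the label to a fixed set keeps updates consistent across incidents, which helps readers of the status page recognise how far along the response is.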
Summary
Effective incident management requires clear roles, well-defined severity levels, and an agreed communication process. Rehearse the process regularly so that the team can follow it under pressure.