
Alerting That Makes Sense

11. 09. 2023 1 min read intermediate

Every alert should be actionable. If not, it’s noise.

Rule #1: Alert on Symptoms, Not Causes

An alert on “CPU > 90%” is noise: high CPU is a possible cause, and users may never notice it. An alert on “5xx error rate > 1%” reports a symptom that users actually experience.
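As a minimal sketch of a symptom-based check, the snippet below computes the 5xx error rate over a window of responses and compares it with the 1% threshold from the example above. The function names and the idea of passing in a plain list of status codes are assumptions for illustration; a real setup would usually read these numbers from a metrics system.

```python
ERROR_RATE_THRESHOLD = 0.01  # 1%, the threshold used in the example above


def error_rate_5xx(status_codes):
    """Return the fraction of responses in the window with a 5xx status."""
    if not status_codes:
        return 0.0  # no traffic in the window, so nothing to report
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def should_alert(status_codes, threshold=ERROR_RATE_THRESHOLD):
    """True when the user-facing symptom (5xx rate) exceeds the threshold."""
    return error_rate_5xx(status_codes) > threshold


# Hypothetical window: 98 successful responses and 2 server errors (2% errors).
window = [200] * 98 + [500, 503]
print(should_alert(window))
```

Note that CPU usage does not appear anywhere in the check: the alert depends only on what users see.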

Severity Levels

  • Critical — users are affected NOW → wake on-call
  • Warning — will be a problem soon → fix during business hours
  • Info — FYI → just log/dashboard
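The three levels above can be sketched as a routing table that maps each severity to exactly one action. The action names (`page_on_call`, `create_ticket`, `log_only`) are hypothetical placeholders, not the API of any particular paging tool.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # users are affected now
    WARNING = "warning"    # will be a problem soon
    INFO = "info"          # FYI only


# Hypothetical action names; substitute whatever your tooling provides.
ROUTES = {
    Severity.CRITICAL: "page_on_call",   # wake the on-call engineer
    Severity.WARNING: "create_ticket",   # handle during business hours
    Severity.INFO: "log_only",           # log or dashboard, no notification
}


def route(severity):
    """Return the action for a severity; unknown values raise KeyError."""
    return ROUTES[severity]


print(route(Severity.WARNING))
```

Keeping the mapping in one place makes it easy to verify that only Critical alerts can page someone.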

What to Monitor

  • Error rate (5xx)
  • Latency (P95, P99)
  • Saturation (CPU, memory, disk)
  • Queue depth
  • Certificate expiry
  • Disk space
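For the latency item, here is a rough sketch of how P95 and P99 could be computed from raw samples using the nearest-rank method. This is one of several percentile definitions, and monitoring systems often estimate percentiles from histograms instead; the helper name and sample data are assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 95), percentile(latencies_ms, 99))
```

P95 and P99 expose the slow tail that an average would hide, which is why they are the values listed above.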

Anti-patterns

  • Overly sensitive thresholds → alert fatigue
  • Alerting on things that self-heal
  • No runbook → nobody knows what to do
  • Duplicate alerts
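One common way to avoid alerting on things that self-heal is to fire only when a condition has held continuously for some time. The sketch below assumes the caller evaluates the condition periodically and passes in a timestamp in seconds; the class name and the 300-second hold are illustrative choices, not a prescribed value.

```python
class SustainedCondition:
    """Fires only after a condition has been breached continuously
    for at least hold_seconds."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self._breach_started = None  # timestamp of the current breach, if any

    def update(self, breached, now):
        """Record one evaluation; return True if the alert should fire."""
        if not breached:
            self._breach_started = None  # condition recovered: reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.hold_seconds


check = SustainedCondition(hold_seconds=300)
print(check.update(True, now=0))    # breach just started, no alert yet
print(check.update(False, now=60))  # recovered on its own, timer resets
```

A brief spike that recovers by itself never reaches the hold time, so it produces no alert.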

Runbook Template

Alert: HighErrorRate

Severity: Critical
Meaning: 5xx error rate > 1% for 5 minutes
Impact: Users see errors
Steps:
1. Check deployment history
2. Look at logs
3. Rollback if recent deploy
4. Escalate to #oncall
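If runbooks are kept as structured data, a small check can flag alerts that have none. This is a sketch under that assumption: the `Runbook` fields mirror the template above, while `alerts_missing_runbook` and the second alert name (`QueueBacklog`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    alert: str
    severity: str
    meaning: str
    impact: str
    steps: list[str] = field(default_factory=list)


def alerts_missing_runbook(alert_names, runbooks):
    """Return alert names with no runbook, or with a runbook lacking steps."""
    documented = {rb.alert for rb in runbooks if rb.steps}
    return [name for name in alert_names if name not in documented]


high_error_rate = Runbook(
    alert="HighErrorRate",
    severity="Critical",
    meaning="5xx error rate > 1% for 5 minutes",
    impact="Users see errors",
    steps=[
        "Check deployment history",
        "Look at logs",
        "Rollback if recent deploy",
        "Escalate to #oncall",
    ],
)

# "QueueBacklog" is a made-up alert that has no runbook yet.
print(alerts_missing_runbook(["HighErrorRate", "QueueBacklog"], [high_error_rate]))
```

A check like this could run in CI so that a new alert cannot be added without its runbook.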

Summary

Fewer alerts = more attention. Every alert must have a runbook and a clear action.


CORE SYSTEMS team
