Alerting is the bridge between observability and action. The purpose of an alert is to notify the right person at the right time about a condition that requires human intervention. Done well, alerting catches problems early and reduces incident impact. Done poorly, it leads to alert fatigue, missed issues, and burned-out engineers.
Google's SRE book established foundational principles for alerting: every page should be urgent, actionable, and require human intelligence to resolve; anything that only merits a robotic response should be automated away rather than paged on.
Rule of thumb: If an alert fires and the on-call engineer says "I can ignore this," the alert should be removed or changed.
| Component | Role |
|---|---|
| Alert rules | Conditions evaluated against metrics (e.g., in Prometheus) |
| Alertmanager | Routes, groups, deduplicates, and silences alerts |
| Notification channels | Where alerts are delivered (PagerDuty, Slack, email, OpsGenie) |
| Runbooks | Step-by-step guides for responding to specific alerts |
| Escalation policies | What happens if the primary responder does not acknowledge |
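These pieces are wired together in the Prometheus server configuration: rule files are loaded via `rule_files`, and firing alerts are pushed to Alertmanager via the `alerting` block. A minimal sketch, assuming the rules below are saved under `alerts/` and Alertmanager is reachable at `alertmanager:9093` (both placeholders):

```yaml
# prometheus.yml (fragment)
rule_files:
  - 'alerts/*.yml'                           # alert rule files, e.g. the api-alerts group below

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']     # assumed Alertmanager address
```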
Alert rules in Prometheus are defined in YAML and evaluated at regular intervals:
```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
          > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.service }}"
          description: "95th percentile latency is {{ $value }}s."
```
| Field | Purpose |
|---|---|
| `expr` | The PromQL condition to evaluate |
| `for` | How long the condition must remain true before the alert fires (avoids flapping) |
| `labels` | Metadata for routing (severity, team, service) |
| `annotations` | Human-readable context (summary, description, runbook link) |
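Rules are worth validating before they reach production. Prometheus ships with `promtool`, which can both syntax-check a rules file (`promtool check rules api-alerts.yml`) and unit-test it against synthetic series. Below is a minimal test sketch, assuming the rules above are saved as `api-alerts.yml`; the test filename, the `checkout` service, and the sample values are placeholders:

```yaml
# api-alerts-test.yml — run with: promtool test rules api-alerts-test.yml
rule_files:
  - api-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 50% of checkout traffic returns 5xx — well above the 5% threshold
      - series: 'http_requests_total{service="checkout", status="500"}'
        values: '0+10x15'
      - series: 'http_requests_total{service="checkout", status="200"}'
        values: '0+10x15'
    alert_rule_test:
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              service: checkout
```

By 15 minutes the error ratio has held at 50% for well over the 5-minute `for` window, so the test expects `HighErrorRate` to be firing with the labels shown.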
Alertmanager routes alerts to the correct receivers based on their labels:
```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warning'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-warning'
  - name: 'default-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'service']
```
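Notification templates can surface the annotations defined on the alert rules, so responders see the runbook link directly in the message instead of hunting for it. Here is a sketch of an enriched Slack receiver, assuming the `runbook` annotation from the rules above; the channel and webhook URL are placeholders:

```yaml
receivers:
  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-warning'
        send_resolved: true                       # also notify when the alert clears
        title: '{{ .CommonAnnotations.summary }}' # summary annotation as the message title
        text: >-
          {{ range .Alerts }}{{ .Annotations.description }}
          Runbook: {{ .Annotations.runbook }}
          {{ end }}
```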