Alerting is the bridge between observability and action. The purpose of an alert is to notify the right person at the right time about a condition that requires human intervention. Done well, alerting catches problems early and reduces incident impact. Done poorly, it leads to alert fatigue, missed issues, and burned-out engineers.
Google's SRE book established foundational principles for alerting: every page should be urgent, actionable, and require human intelligence to resolve; anything that only merits a robotic response should be automated away rather than paged on.
Rule of thumb: If an alert fires and the on-call engineer says "I can ignore this," the alert should be removed or changed.
| Component | Role |
|---|---|
| Alert rules | Conditions evaluated against metrics (e.g., in Prometheus) |
| Alertmanager | Routes, groups, deduplicates, and silences alerts |
| Notification channels | Where alerts are delivered (PagerDuty, Slack, email, OpsGenie) |
| Runbooks | Step-by-step guides for responding to specific alerts |
| Escalation policies | What happens if the primary responder does not acknowledge |
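These pieces are wired together in the Prometheus server configuration: rule files are loaded via `rule_files`, and firing alerts are pushed to Alertmanager via the `alerting` block. A minimal sketch, assuming the rules below are saved under `alerts/` and Alertmanager is reachable at `alertmanager:9093` (both placeholders):

```yaml
# prometheus.yml (fragment)
rule_files:
  - 'alerts/*.yml'                           # alert rule files, e.g. the api-alerts group below

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']     # assumed Alertmanager address
```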
Alert rules in Prometheus are defined in YAML and evaluated at regular intervals:
```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
          > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.service }}"
          description: "95th percentile latency is {{ $value }}s."
```
| Field | Purpose |
|---|---|
| `expr` | The PromQL condition to evaluate |
| `for` | How long the condition must remain true before the alert fires (avoids flapping) |
| `labels` | Metadata for routing (severity, team, service) |
| `annotations` | Human-readable context (summary, description, runbook link) |
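Rules are worth validating before they reach production. Prometheus ships with `promtool`, which can both syntax-check a rules file (`promtool check rules api-alerts.yml`) and unit-test it against synthetic series. Below is a minimal test sketch, assuming the rules above are saved as `api-alerts.yml`; the test filename, the `checkout` service, and the sample values are placeholders:

```yaml
# api-alerts-test.yml — run with: promtool test rules api-alerts-test.yml
rule_files:
  - api-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 50% of checkout traffic returns 5xx — well above the 5% threshold
      - series: 'http_requests_total{service="checkout", status="500"}'
        values: '0+10x15'
      - series: 'http_requests_total{service="checkout", status="200"}'
        values: '0+10x15'
    alert_rule_test:
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              service: checkout
```

By 15 minutes the error ratio has held at 50% for well over the 5-minute `for` window, so the test expects `HighErrorRate` to be firing with the labels shown.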
Alertmanager routes alerts to the correct receivers based on their labels:
```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warning'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-warning'
  - name: 'default-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'service']
```
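Notification templates can surface the annotations defined on the alert rules, so responders see the runbook link directly in the message instead of hunting for it. Here is a sketch of an enriched Slack receiver, assuming the `runbook` annotation from the rules above; the channel and webhook URL are placeholders:

```yaml
receivers:
  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-warning'
        send_resolved: true                       # also notify when the alert clears
        title: '{{ .CommonAnnotations.summary }}' # summary annotation as the message title
        text: >-
          {{ range .Alerts }}{{ .Annotations.description }}
          Runbook: {{ .Annotations.runbook }}
          {{ end }}
```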