CloudWatch Alarms and Actions

Metrics alone give you visibility; alarms add the ability to detect when something goes wrong and respond automatically. A CloudWatch Alarm watches a single metric (or a metric math expression) over a period of time and changes state when the value crosses a threshold you define.

Alarm States

Every CloudWatch Alarm is always in exactly one of three states:

State	Meaning
OK	The metric is within the acceptable range
ALARM	The metric has breached the threshold
INSUFFICIENT_DATA	Not enough data points to evaluate the alarm

When you first create an alarm, it starts in INSUFFICIENT_DATA until enough data points arrive to make a determination.

Creating an Alarm

An alarm is defined by five key components:

Metric: the metric name, namespace, and dimensions to watch
Statistic: how to aggregate data points (Average, Sum, Maximum, p99, etc.)
Period: the length in seconds of each interval over which the statistic is computed (e.g. 300 = 5 minutes)
Evaluation periods: how many consecutive periods must breach the threshold
Threshold: the value and comparison operator (GreaterThanThreshold, LessThanOrEqualToThreshold, etc.)

Example: High CPU Alarm

Suppose you want to be alerted when an EC2 instance's CPU stays above 80 % for 10 minutes. The configuration would be:

Parameter	Value
Namespace	AWS/EC2
Metric name	CPUUtilization
Dimension	InstanceId = i-0abcdef1234567890
Statistic	Average
Period	300 seconds (5 minutes)
Evaluation periods	2
Threshold	> 80

The alarm enters the ALARM state only after two consecutive 5-minute periods average above 80 %. This prevents a brief spike from triggering a false alert.
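As a rough sketch, the table above maps onto the keyword arguments that boto3's CloudWatch client accepts for `put_metric_alarm`. The alarm name is a made-up placeholder, and `breaches_consecutively` is a hypothetical helper that only imitates the "N consecutive periods" rule; it is not how CloudWatch itself is implemented.

```python
# Keyword arguments mirroring the table above. With boto3 installed and
# credentials configured, they could be passed as
# boto3.client("cloudwatch").put_metric_alarm(**high_cpu_alarm).
high_cpu_alarm = {
    "AlarmName": "high-cpu-example",  # hypothetical name
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0abcdef1234567890"}],
    "Statistic": "Average",
    "Period": 300,            # seconds per data point
    "EvaluationPeriods": 2,   # 2 x 300 s = 10 minutes
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
}


def breaches_consecutively(averages, threshold, evaluation_periods):
    """Return True if the most recent `evaluation_periods` values all exceed the threshold."""
    recent = averages[-evaluation_periods:]
    return len(recent) == evaluation_periods and all(v > threshold for v in recent)


# A single spike does not trip the alarm; two breaching periods in a row do.
print(breaches_consecutively([40, 95, 60], 80.0, 2))  # False
print(breaches_consecutively([40, 85, 91], 80.0, 2))  # True
```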

Datapoints to Alarm (M of N)

For noisier metrics, you can use the "M of N" evaluation model. Instead of requiring every evaluation period to breach, you can say "3 out of 5 periods must breach." This further reduces false positives:

Evaluation periods: 5
Datapoints to alarm: 3

If any 3 of the most recent 5 periods exceed the threshold, the alarm fires. The breaching periods do not need to be consecutive.
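The M of N rule can be illustrated with a small simulation. This is a simplified sketch: `m_of_n_breach` is a hypothetical helper, it assumes a greater-than comparison, and it ignores how CloudWatch treats missing data points. In the API, the two settings correspond to the `EvaluationPeriods` and `DatapointsToAlarm` parameters of `put_metric_alarm`.

```python
def m_of_n_breach(values, threshold, evaluation_periods, datapoints_to_alarm):
    """Return True if at least M of the most recent N values exceed the threshold."""
    window = values[-evaluation_periods:]
    breaching = sum(1 for v in window if v > threshold)
    return breaching >= datapoints_to_alarm


# Five periods of a noisy metric; three of them (85, 90, 88) are above 80.
noisy = [85, 70, 90, 60, 88]

print(m_of_n_breach(noisy, 80, evaluation_periods=5, datapoints_to_alarm=3))  # True
print(m_of_n_breach(noisy, 80, evaluation_periods=5, datapoints_to_alarm=4))  # False
```

Note that the breaching values are not adjacent, which is exactly the case a strict "all periods must breach" rule would miss.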

Alarm Actions

The real power of alarms is in actions: what happens when the state changes. Actions are attached to the state being entered, so each of the three states can have its own set. The most common transitions are:

Transition	Typical Use
OK → ALARM	Notify on-call, scale out, trigger remediation
ALARM → OK	Send "all clear" notification
Any → INSUFFICIENT_DATA	Investigate missing data
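In the `put_metric_alarm` API, these actions are supplied as three lists of ARNs: `AlarmActions`, `OKActions`, and `InsufficientDataActions`. The sketch below shows how a state change selects one of those lists. The topic ARNs and the `actions_for_transition` helper are hypothetical and exist only for illustration.

```python
# Hypothetical SNS topic ARNs, for illustration only.
PAGE_TOPIC = "arn:aws:sns:us-east-1:123456789012:on-call-page"
INFO_TOPIC = "arn:aws:sns:us-east-1:123456789012:ops-info"

# The three action lists accepted by put_metric_alarm.
alarm_action_params = {
    "AlarmActions": [PAGE_TOPIC],
    "OKActions": [INFO_TOPIC],
    "InsufficientDataActions": [INFO_TOPIC],
}

# Which parameter applies when the alarm enters a given state.
STATE_TO_PARAM = {
    "ALARM": "AlarmActions",
    "OK": "OKActions",
    "INSUFFICIENT_DATA": "InsufficientDataActions",
}


def actions_for_transition(old_state, new_state, params):
    """Return the action ARNs that would run for a state change (none if the state is unchanged)."""
    if old_state == new_state:
        return []
    return params.get(STATE_TO_PARAM[new_state], [])


print(actions_for_transition("OK", "ALARM", alarm_action_params))
print(actions_for_transition("ALARM", "ALARM", alarm_action_params))
```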

Common Action Types

1. Amazon SNS Notification

The most common action. The alarm publishes a message to an SNS topic, which can fan out to:

Email subscribers
SMS subscribers
Lambda functions for custom logic
HTTPS endpoints (e.g. PagerDuty or Opsgenie integrations; Slack usually needs a Lambda function or AWS Chatbot in between to reformat the message)
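For the Lambda case, the alarm details arrive as a JSON string inside the SNS record. The following is a minimal sketch of pulling out the fields most handlers need. `summarize_alarm` is a hypothetical helper, and the sample event is heavily trimmed: a real notification carries more fields (such as `NewStateReason`, `StateChangeTime`, and `Trigger`).

```python
import json


def summarize_alarm(event):
    """Extract a one-line summary from each SNS-delivered CloudWatch alarm notification."""
    summaries = []
    for record in event["Records"]:
        # The alarm payload is a JSON document serialized into the SNS Message field.
        message = json.loads(record["Sns"]["Message"])
        summaries.append(
            f'{message["AlarmName"]}: {message["OldStateValue"]} -> {message["NewStateValue"]}'
        )
    return summaries


# Trimmed, hand-written sample event (hypothetical alarm name).
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "high-cpu-example",
                        "OldStateValue": "OK",
                        "NewStateValue": "ALARM",
                    }
                )
            }
        }
    ]
}

print(summarize_alarm(sample_event))  # ['high-cpu-example: OK -> ALARM']
```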

2. Auto Scaling Action

Trigger a scale-out or scale-in policy. For example:

ALARM: add 2 instances to the Auto Scaling Group
OK: remove the extra instances

This is the foundation of reactive auto scaling on AWS.
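One way to express the "add 2 instances" step is a simple scaling policy. The dictionary below uses parameter names from the Auto Scaling `put_scaling_policy` API; the group and policy names are placeholders. The policy ARN returned by that call is what would go into the alarm's `AlarmActions`. The `apply_adjustment` helper is a hypothetical illustration of how the group's minimum and maximum size bound the result.

```python
# Keyword arguments in the shape accepted by the Auto Scaling
# put_scaling_policy API (names below are hypothetical placeholders).
scale_out_policy = {
    "AutoScalingGroupName": "web-asg",
    "PolicyName": "scale-out-on-high-cpu",
    "PolicyType": "SimpleScaling",
    "AdjustmentType": "ChangeInCapacity",
    "ScalingAdjustment": 2,   # add 2 instances
    "Cooldown": 300,          # seconds to wait before another scaling activity
}


def apply_adjustment(desired, adjustment, min_size, max_size):
    """Return the new desired capacity, kept within the group's size limits."""
    return max(min_size, min(max_size, desired + adjustment))


# Scaling out from 4 is capped by a maximum size of 5.
print(apply_adjustment(4, scale_out_policy["ScalingAdjustment"], min_size=2, max_size=5))  # 5
# Scaling in by 2 from 3 is held at the minimum size of 2.
print(apply_adjustment(3, -2, min_size=2, max_size=5))  # 2
```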

3. EC2 Action

Directly act on an EC2 instance:

Stop: stop a development instance that is idle
Terminate: clean up a stuck instance
Reboot: restart an unresponsive instance
Recover: move the instance to new hardware, keeping the same instance ID, IP addresses, and metadata
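EC2 actions are attached to an alarm as special ARNs of the form `arn:aws:automate:<region>:ec2:<action>`. A small sketch of building them follows; `ec2_action_arn` is a hypothetical helper, and the alarm is assumed to watch a per-instance metric with an `InstanceId` dimension so that CloudWatch knows which instance to act on.

```python
# The four EC2 actions described above.
EC2_ACTIONS = ("stop", "terminate", "reboot", "recover")


def ec2_action_arn(region, action):
    """Build the ARN that tells a CloudWatch alarm to perform an EC2 action."""
    if action not in EC2_ACTIONS:
        raise ValueError(f"unsupported EC2 action: {action!r}")
    return f"arn:aws:automate:{region}:ec2:{action}"


# Example: stop an idle development instance when the alarm enters ALARM.
idle_alarm_actions = {"AlarmActions": [ec2_action_arn("us-east-1", "stop")]}
print(idle_alarm_actions["AlarmActions"][0])  # arn:aws:automate:us-east-1:ec2:stop
```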
