Monitoring Best Practices

This final lesson brings together everything covered in the course into a comprehensive set of best practices for monitoring and observability on Google Cloud Platform. Effective monitoring is not just about deploying tools — it requires a thoughtful strategy that aligns with your organisation's reliability goals, operational maturity, and team structure.

The Four Golden Signals

Google's Site Reliability Engineering (SRE) book defines four key metrics that every user-facing service should monitor. Together, these "golden signals" give a compact view of service health:

Signal	Description	GCP Metric Example
Latency	How long it takes to serve a request	`run.googleapis.com/request_latencies`
Traffic	How much demand is being placed on the system	`run.googleapis.com/request_count`
Errors	The rate of failed requests	Request count filtered by `response_code_class != "2xx"`
Saturation	How full the most constrained resource is	`compute.googleapis.com/instance/cpu/utilization`
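The metric names in the table can be turned into Cloud Monitoring filter strings. The following is a minimal sketch rather than a definitive setup: the helper `build_filter` is hypothetical, and the monitored resource types `cloud_run_revision` and `gce_instance` are assumptions not stated in the table, so check them against the resources you actually run.

```python
def build_filter(metric_type, resource_type, label_filters=()):
    """Build a Cloud Monitoring filter string for one metric type.

    label_filters is a sequence of (label, operator, value) tuples that are
    applied to metric labels. This helper is illustrative only.
    """
    parts = [
        f'metric.type = "{metric_type}"',
        f'resource.type = "{resource_type}"',
    ]
    for label, operator, value in label_filters:
        parts.append(f'metric.labels.{label} {operator} "{value}"')
    return " AND ".join(parts)


# One filter per golden signal, using the metric names from the table.
# Resource types are assumptions for this sketch.
GOLDEN_SIGNAL_FILTERS = {
    "latency": build_filter(
        "run.googleapis.com/request_latencies", "cloud_run_revision"
    ),
    "traffic": build_filter(
        "run.googleapis.com/request_count", "cloud_run_revision"
    ),
    "errors": build_filter(
        "run.googleapis.com/request_count",
        "cloud_run_revision",
        [("response_code_class", "!=", "2xx")],
    ),
    "saturation": build_filter(
        "compute.googleapis.com/instance/cpu/utilization", "gce_instance"
    ),
}

if __name__ == "__main__":
    for signal, filter_string in GOLDEN_SIGNAL_FILTERS.items():
        print(f"{signal}: {filter_string}")
```

Note that the errors filter, as written in the table, also counts 3xx and 4xx responses. Some teams narrow it to `response_code_class = "5xx"` so that redirects and client mistakes do not count against the service.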

Why These Four?

Latency captures user experience — high latency degrades usability even when errors are zero
Traffic provides context — a latency spike during a 10x traffic surge has a different root cause than one during normal traffic
Errors indicate functional failures — the system is not doing what users expect
Saturation predicts future problems — a service at 95% CPU may be one spike away from failure
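To make the four signals concrete, here is a small sketch that derives them from one window of request records. The `Request` record, the function names, and the choice to treat only HTTP 5xx responses as errors are all assumptions made for illustration. In practice these values would come from Cloud Monitoring rather than be computed by hand.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer 1..100) of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarise one observation window as the four golden signals."""
    total = len(requests)
    # Assumption: only 5xx responses count as failures.
    failed = sum(1 for r in requests if r.status >= 500)
    latencies = [r.latency_ms for r in requests]
    return {
        "latency_p95_ms": percentile(latencies, 95) if latencies else 0.0,
        "traffic_rps": total / window_seconds,
        "error_rate": failed / total if total else 0.0,
        "saturation": cpu_utilization,  # e.g. 0.62 means 62% CPU
    }


if __name__ == "__main__":
    window = [Request(100.0, 200)] * 18 + [Request(900.0, 500), Request(1200.0, 503)]
    print(golden_signals(window, window_seconds=10, cpu_utilization=0.62))
```

Reading the four values together is the point: a high `latency_p95_ms` means something different when `traffic_rps` has jumped than when traffic is flat.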

Observability Layers

A mature monitoring strategy covers five distinct layers:

Layer 1: Infrastructure

What to Monitor	Key Metrics
Compute Engine instances	CPU, memory, disk I/O, network throughput
GKE clusters	Node readiness, pod restarts, container resource usage
Cloud SQL instances	CPU, memory, connections, replication lag, disk usage
Networking	VPC flow logs, load balancer latency, Cloud NAT connections

Layer 2: Platform

What to Monitor	Key Metrics
Cloud Run services	Request latency, error rate, instance count, cold starts
Cloud Functions	Execution time, invocation count, error count, active instances
Pub/Sub	Message backlog, publish latency, oldest unacknowledged message age
Cloud Storage	Request count, error count, bucket size

Layer 3: Application

What to Monitor	Key Metrics
Business logic	Custom metrics (orders processed, payments completed, user sign-ups)
Dependencies	External API call latency and error rates
Cache performance	Hit rate, miss rate, eviction rate
Queue depth	Messages pending processing

Layer 4: User Experience

What to Monitor	Key Metrics
Uptime checks	External availability from global probe locations
Synthetic monitoring	End-to-end user journey completion time
SLO compliance	Error budget burn rate
Real user monitoring	Client-side latency, error rates, page load times
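Error budget burn rate, listed above under SLO compliance, is commonly defined as the observed error rate divided by the error rate the SLO allows. The sketch below works through that arithmetic. The function names are hypothetical and the event counts are made-up inputs, not measurements.

```python
def error_budget(slo_target):
    """Fraction of events allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(bad_events, total_events, slo_target):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget(slo_target)


def budget_remaining(bad_events, total_events, slo_target):
    """Fraction of the error budget still unspent for the events seen so far."""
    allowed_bad = error_budget(slo_target) * total_events
    if allowed_bad == 0:
        return 0.0
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # Illustrative numbers: 50 failures in 10,000 requests against a 99.9% SLO.
    print(round(burn_rate(50, 10_000, 0.999), 2))
    print(round(budget_remaining(5, 10_000, 0.999), 2))
```

A burn rate above 1.0 means the budget will run out before the compliance period ends if the current error rate continues. A negative remaining budget means the SLO has already been missed for the events counted.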

Layer 5: Security and Compliance

What to Monitor	Key Metrics
Audit logs	IAM policy changes, resource creation/deletion
Security Command Center	Vulnerability findings, misconfigurations
VPC Service Controls	Perimeter violations
Binary Authorization	Container image policy violations

Dashboard Strategy

Dashboard Hierarchy

Level	Dashboard	Audience	Refresh Rate
L0	Executive overview	Leadership	Hourly
L1	Service health	Service owners, on-call	Real-time
L2	Component detail	Engineers	Real-time
L3	Debugging	Developers during incidents	Real-time

Dashboard Design Rules

Answer one question — each dashboard should answer a specific question ("Is the Orders service healthy?")
Start with user impact — show latency and error rate before CPU and memory
Add context — include text widgets explaining what each section shows
Use consistent scales: align all charts to the same time range, and use the same units and axis ranges for metrics that are meant to be compared
Limit widgets — 10-15 widgets maximum per dashboard
Include SLO charts — show SLO compliance and error budget remaining
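The design rules above can be checked mechanically before a dashboard is published. The sketch below assumes a simplified, hypothetical dashboard description (a dict with a `question` and a list of `widgets`, each with a `kind` and, for charts, a `time_range`). It is not the Cloud Monitoring dashboard JSON schema.

```python
MAX_WIDGETS = 15
USER_IMPACT_KINDS = {"latency", "error_rate"}


def check_dashboard(dashboard):
    """Return a list of design-rule violations for a dashboard description."""
    widgets = dashboard.get("widgets", [])
    charts = [w for w in widgets if w["kind"] != "text"]
    problems = []

    if not dashboard.get("question"):
        problems.append("no question stated")
    if charts and charts[0]["kind"] not in USER_IMPACT_KINDS:
        problems.append("first chart is not a user-impact signal")
    if not any(w["kind"] == "text" for w in widgets):
        problems.append("no text widget for context")
    if len({c["time_range"] for c in charts}) > 1:
        problems.append("charts use different time ranges")
    if len(widgets) > MAX_WIDGETS:
        problems.append(f"more than {MAX_WIDGETS} widgets")
    if not any(w["kind"] == "slo" for w in widgets):
        problems.append("no SLO chart")
    return problems


if __name__ == "__main__":
    orders = {
        "question": "Is the Orders service healthy?",
        "widgets": [
            {"kind": "text"},
            {"kind": "latency", "time_range": "1h"},
            {"kind": "error_rate", "time_range": "1h"},
            {"kind": "slo", "time_range": "1h"},
            {"kind": "cpu", "time_range": "1h"},
        ],
    }
    print(check_dashboard(orders))
```

A check like this could run in code review when dashboards are stored as configuration, so that rule violations surface before an incident rather than during one.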
