The Reliability pillar of the GCP Architecture Framework focuses on building workloads that resist failure, recover quickly when failures occur, and meet availability targets consistently. Reliability is not about preventing all failures — it is about designing systems that tolerate and gracefully handle the inevitable failures that occur in distributed systems.
| Principle | Description |
|---|---|
| Design for failure | Assume every component can and will fail — design accordingly |
| Eliminate single points of failure | Ensure no single component's failure can bring down the entire system |
| Use managed services | Let Google handle the reliability engineering for commodity infrastructure |
| Test reliability | Regularly test failure scenarios — do not wait for production incidents |
| Define clear SLOs | Set explicit reliability targets and measure against them |
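The "Define clear SLOs" principle can be made concrete with a small calculation. The sketch below is illustrative only: the `SloReport` class, its field names, and the request counts are hypothetical, and it assumes a simple request-based SLI (successful requests divided by total requests) measured over a single window.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    """Hypothetical helper: compares measured request success against an SLO."""

    target: float          # SLO as a fraction, e.g. 0.999 for 99.9%
    total_requests: int    # requests served in the measurement window
    failed_requests: int   # requests that counted as failures

    @property
    def sli(self) -> float:
        """Measured success ratio over the window."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def error_budget(self) -> float:
        """Number of requests allowed to fail while still meeting the target."""
        return (1 - self.target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget left; negative means the SLO is breached."""
        if self.error_budget == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / self.error_budget

    @property
    def meets_slo(self) -> bool:
        return self.failed_requests <= self.error_budget


if __name__ == "__main__":
    report = SloReport(target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"SLI: {report.sli:.4%}, budget remaining: {report.budget_remaining:.0%}")
```

Tracking the remaining budget, rather than only a pass/fail flag, gives an early signal when a service is burning through its allowance faster than expected.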
Understanding availability targets helps you make informed architecture decisions:
| Availability | Monthly Downtime | Architecture Required |
|---|---|---|
| 99% | ~7.3 hours | Single zone |
| 99.9% | ~43 minutes | Multi-zone |
| 99.95% | ~22 minutes | Multi-zone with redundancy |
| 99.99% | ~4.3 minutes | Multi-region |
| 99.999% | ~26 seconds | Multi-region active-active |
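The downtime figures in the table follow directly from the availability percentage. As a rough sketch, assuming a 30-day month (the function names here are illustrative, not part of any Google Cloud tooling):

```python
SECONDS_PER_DAY = 86_400


def allowed_downtime_seconds(availability_percent: float, days: float = 30) -> float:
    """Maximum downtime, in seconds, that still meets the availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * SECONDS_PER_DAY


def format_duration(seconds: float) -> str:
    """Render a duration using the largest sensible unit."""
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    return f"{seconds:.0f} seconds"


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99, 99.999):
        print(f"{target}% -> {format_duration(allowed_downtime_seconds(target))}")
```

The exact values shift slightly with the assumed month length (a 30-day month gives about 7.2 hours at 99%, while an average calendar month gives about 7.3 hours), which is why the table lists approximate figures.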
Not every workload needs 99.99% availability. Over-engineering reliability wastes money and engineering effort. Choose your target based on:
Deploy across multiple zones within a region to survive zone-level failures:
```bash
# Create a regional GKE cluster (spans 3 zones automatically)
gcloud container clusters create production-cluster \
  --region=europe-west2 \
  --num-nodes=2 \
  --enable-autorepair \
  --enable-autoupgrade

# Create a regional managed instance group
gcloud compute instance-groups managed create web-mig \
  --region=europe-west2 \
  --template=web-template \
  --size=6 \
  --zones=europe-west2-a,europe-west2-b,europe-west2-c
```
Many GCP services replicate across zones when deployed in a regional or high-availability configuration:
| Service | Regional Behaviour |
|---|---|
| Cloud SQL (HA) | Synchronous replication to a standby in another zone |
| Cloud Spanner | Multi-zone replication within a regional configuration |
| GKE Regional | Control plane and nodes distributed across 3 zones |
| Cloud Storage | Data replicated across zones within the region |
| Memorystore | Standard tier provides cross-zone replication |
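To see why spreading replicas across zones raises availability, a back-of-the-envelope model helps. The sketch below assumes zone failures are independent and that any single surviving zone can serve the full load. Both are simplifying assumptions, so real-world results will be lower. The function names are hypothetical.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` independent replicas is up."""
    if zones < 1:
        raise ValueError("zones must be at least 1")
    if not 0 <= zone_availability <= 1:
        raise ValueError("zone_availability must be a fraction between 0 and 1")
    # The system is down only when every zone is down at the same time.
    return 1 - (1 - zone_availability) ** zones


def instances_after_zone_loss(total_instances: int, zones: int) -> int:
    """Instances left if one zone fails, assuming an even spread across zones."""
    if zones < 1 or total_instances % zones != 0:
        raise ValueError("total_instances must divide evenly across zones")
    return total_instances - total_instances // zones


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} zone(s): {combined_availability(0.99, n):.6f}")
    # A 6-instance regional group over 3 zones keeps 4 instances after one zone fails.
    print(instances_after_zone_loss(6, 3))
```

The second function highlights a sizing consequence: a regional group should carry enough headroom that the instances remaining after a zone failure can absorb the full load.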
For the highest availability, distribute workloads across multiple regions:
```bash
# Create a global HTTP(S) load balancer with backends in multiple regions
gcloud compute backend-services create web-backend \
  --global \
  --protocol=HTTP \
  --health-checks=web-health-check \
  --load-balancing-scheme=EXTERNAL_MANAGED

# Add backends from multiple regions
gcloud compute backend-services add-backend web-backend \
  --global \
  --instance-group=web-mig-europe \
  --instance-group-region=europe-west2

gcloud compute backend-services add-backend web-backend \
  --global \
  --instance-group=web-mig-us \
  --instance-group-region=us-central1
```
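The failover behaviour this setup aims for can be pictured with a toy model: traffic goes to the preferred region while its health check passes and falls back to the next region otherwise. This is a conceptual sketch only, not how Google's load balancer is implemented. The `pick_backend` function and the health map are hypothetical.

```python
from typing import Mapping, Optional, Sequence


def pick_backend(preference: Sequence[str], healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first region in preference order whose backend is healthy."""
    for region in preference:
        # Regions missing from the health map are treated as unhealthy.
        if healthy.get(region, False):
            return region
    return None  # no healthy backend anywhere


if __name__ == "__main__":
    # A client near Europe prefers europe-west2, then us-central1.
    order = ["europe-west2", "us-central1"]
    print(pick_backend(order, {"europe-west2": True, "us-central1": True}))
    print(pick_backend(order, {"europe-west2": False, "us-central1": True}))
```

In this model the second region only helps if it has enough capacity to take the redirected traffic, so each region needs to be sized with failover in mind.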