Site Reliability Engineering (SRE) is a discipline that applies software engineering practices to operations problems. Originating at Google, SRE provides a framework for running reliable, scalable services by defining service level objectives, managing error budgets, and automating operational work. Google Cloud provides native tooling that aligns directly with SRE principles, making it the natural platform for adopting SRE practices.
SRE was created at Google in 2003 to address a fundamental question: how do you operate large-scale, reliable services without an ever-growing operations team? The answer is to treat operations as a software engineering problem — automate repetitive tasks, define reliability targets precisely, and make data-driven decisions about where to invest effort.
| Principle | Description |
|---|---|
| Embracing risk | 100% reliability is neither possible nor desirable — define an acceptable level of unreliability |
| Service level objectives (SLOs) | Quantitative reliability targets that define "good enough" |
| Error budgets | The allowed amount of unreliability — the gap between 100% and your SLO |
| Eliminating toil | Automate repetitive, manual operational work |
| Monitoring | Measure what matters to users, not just what is easy to measure |
| Release engineering | Make deployments safe, fast, and reversible |
| Simplicity | Complex systems fail in complex ways — simplify wherever possible |
The following concepts — SLIs, SLOs, SLAs, and error budgets — form the foundation of SRE reliability management:
An SLI is a quantitative measure of a specific aspect of the service's behaviour that matters to users:
| SLI Category | Description | Example |
|---|---|---|
| Availability | The proportion of requests that succeed | 99.95% of HTTP requests return 2xx or 3xx |
| Latency | The proportion of requests served within a threshold | 95% of requests complete in under 200ms |
| Throughput | The rate of successful operations | 99.9% of queue messages processed within 30 seconds |
| Error rate | The proportion of requests that fail | Less than 0.1% of API calls return 5xx errors |
| Correctness | The proportion of operations that produce correct results | 99.99% of transactions are accurately recorded |
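Each SLI in the table reduces to a ratio of good events to total events. A minimal sketch in Python, using invented counter values for illustration:

```python
# Illustrative SLI calculations from raw request counters.
# All counter values below are made up for the example.

def availability_sli(successful: int, total: int) -> float:
    """Proportion of requests that succeeded (2xx/3xx)."""
    return successful / total

def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Proportion of requests served within the latency threshold."""
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)

requests_total = 1_000_000
requests_ok = 999_520
print(f"Availability SLI: {availability_sli(requests_ok, requests_total):.4%}")
# → Availability SLI: 99.9520%

latencies = [120, 180, 95, 240, 160, 510, 130, 175]
print(f"Latency SLI (<200ms): {latency_sli(latencies, 200):.1%}")
# → Latency SLI (<200ms): 75.0%
```

In production these counters would come from your monitoring system rather than in-process variables, but the arithmetic is the same.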
An SLO is a target value for an SLI over a defined time window:
```
SLO: 99.9% of HTTP requests to the Orders API
will return a successful response (2xx/3xx)
within 500ms, measured over a rolling 30-day window.
```
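Checking compliance against an SLO like this is a single comparison of the measured good-event ratio to the target. A small sketch, with invented request counts:

```python
# Hedged sketch: checking the example Orders API SLO against measured data.
# An event counts as "good" if it returned 2xx/3xx within 500ms.
# The request counts below are invented for illustration.

SLO_TARGET = 0.999  # 99.9% over a rolling 30-day window

def slo_met(good_events: int, total_events: int,
            target: float = SLO_TARGET) -> bool:
    """True if the measured good-event ratio meets or exceeds the target."""
    return good_events / total_events >= target

print(slo_met(998_700, 1_000_000))  # 0.9987 < 0.999 → False
print(slo_met(999_100, 1_000_000))  # 0.9991 ≥ 0.999 → True
```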
An SLA is a contractual commitment to customers, typically with financial penalties for breach. SLAs should always be less strict than internal SLOs to provide a buffer:
```
Internal SLO: 99.95% availability
External SLA: 99.9% availability (with service credits if breached)
Buffer: 0.05% — room for error without contractual consequences
```
The error budget is the allowed amount of unreliability — the inverse of the SLO:
| SLO | Error Budget (30 days) | Allowed Downtime |
|---|---|---|
| 99.9% | 0.1% | ~43 minutes |
| 99.95% | 0.05% | ~22 minutes |
| 99.99% | 0.01% | ~4.3 minutes |
| 99.999% | 0.001% | ~26 seconds |
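The downtime figures in the table follow directly from the budget: multiply the unreliability fraction by the window length. A short sketch reproducing the table:

```python
# Reproducing the error-budget table: allowed downtime over a 30-day window.

WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in 30 days

def allowed_downtime_minutes(slo: float) -> float:
    """Error budget expressed as downtime: (1 - SLO) × window length."""
    return (1 - slo) * WINDOW_MINUTES

for slo in (0.999, 0.9995, 0.9999, 0.99999):
    print(f"{slo:.3%}: {allowed_downtime_minutes(slo):.2f} min")
# 99.900%: 43.20 min
# 99.950%: 21.60 min
# 99.990%: 4.32 min
# 99.999%: 0.43 min  (~26 seconds)
```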
An error budget policy defines what happens when the budget is consumed:
| Budget Remaining | Action |
|---|---|
| > 50% | Normal operations — deploy new features, run experiments |
| 25-50% | Caution — increase testing, reduce deployment frequency |
| < 25% | Slow down — focus on reliability work, halt risky changes |
| Exhausted | Freeze — stop feature deployments, focus entirely on reliability |
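A policy like the one above is often encoded directly in deployment tooling. A sketch of the table as code; the thresholds and action names come from this example policy, not a standard:

```python
# Error-budget policy as code, mirroring the example policy table.
# Thresholds and action names are illustrative, not a standard.

def budget_action(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining to a policy action."""
    if budget_remaining <= 0:
        return "freeze"      # stop feature deployments, reliability only
    if budget_remaining < 0.25:
        return "slow down"   # focus on reliability work, halt risky changes
    if budget_remaining <= 0.50:
        return "caution"     # increase testing, reduce deployment frequency
    return "normal"          # deploy new features, run experiments

print(budget_action(0.8))   # → normal
print(budget_action(0.3))   # → caution
print(budget_action(0.1))   # → slow down
print(budget_action(0.0))   # → freeze
```

A deployment pipeline could call such a function as a gate before each release.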
Google Cloud provides native SLO monitoring through the Service Monitoring feature:
```shell
# List monitored services (a Cloud Run service is detected automatically)
gcloud monitoring services list
```
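Once a service exists, SLOs are defined against it as Cloud Monitoring resources. As a rough sketch, a request-based availability SLO like the Orders API example might look like the following JSON body for the Monitoring API; the metric filters are placeholders and would need to match your service's actual metrics:

```json
{
  "displayName": "99.9% availability over 30 days",
  "goal": 0.999,
  "rollingPeriod": "2592000s",
  "serviceLevelIndicator": {
    "requestBased": {
      "goodTotalRatio": {
        "goodServiceFilter": "FILTER_FOR_2XX_3XX_REQUESTS",
        "totalServiceFilter": "FILTER_FOR_ALL_REQUESTS"
      }
    }
  }
}
```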