Reliability is the ability of a system to function correctly even when things go wrong. Observability is the ability to understand the internal state of a system from its external outputs. Together, they let you build systems that work and detect when they do not.

| Term | Meaning | Example |
|---|---|---|
| SLI (Service Level Indicator) | What we measure | request latency, error rate, throughput |
| SLO (Service Level Objective) | What we aim for | 99.9% of requests complete in < 200ms |
| SLA (Service Level Agreement) | What we promise (with consequences) | 99.9% uptime or customer gets credits |

| SLI | How to Measure | Typical SLO |
|---|---|---|
| Availability | Successful requests / total requests | 99.9% - 99.99% |
| Latency | Time to respond (p50, p95, p99) | p99 < 200ms |
| Throughput | Requests per second | > 10,000 RPS |
| Error rate | Failed requests / total requests | < 0.1% |
| Freshness | Time since last successful data update | < 1 minute |
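
The ratios and percentiles in the table above can be computed directly from request records. The following is a minimal sketch, assuming each request is logged with a latency and a success flag; the `Request` type and the function names are hypothetical, and the percentile uses the simple nearest-rank method rather than whatever a particular monitoring system implements.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record: latency in milliseconds and a success flag."""
    latency_ms: float
    ok: bool


def availability(requests):
    """Successful requests / total requests."""
    return sum(1 for r in requests if r.ok) / len(requests)


def error_rate(requests):
    """Failed requests / total requests."""
    return sum(1 for r in requests if not r.ok) / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 100 requests with latencies 1..100 ms; only the slowest one fails.
requests = [Request(latency_ms=float(i), ok=(i != 100)) for i in range(1, 101)]
latencies = [r.latency_ms for r in requests]

print(f"availability: {availability(requests):.2%}")
print(f"error rate:   {error_rate(requests):.2%}")
print(f"p50 latency:  {percentile(latencies, 50)} ms")
print(f"p99 latency:  {percentile(latencies, 99)} ms")
```

Comparing these values against the SLO targets (for example, p99 < 200ms) gives a pass/fail signal for each indicator.
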

| Availability | Downtime per Year | Downtime per Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.77 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
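
The downtime figures above follow from multiplying the allowed failure fraction by the length of the period. A short sketch of that arithmetic, assuming a 365.25-day year and a month of one twelfth of a year (the figures in the table are consistent with that convention); the function name is illustrative.

```python
YEAR_SECONDS = 365.25 * 24 * 3600   # assumed average year length
MONTH_SECONDS = YEAR_SECONDS / 12   # assumed average month length


def allowed_downtime_seconds(availability_pct, period_seconds):
    """Downtime permitted in a period for a given availability target (in percent)."""
    return (100 - availability_pct) / 100 * period_seconds


for target in (99, 99.9, 99.95, 99.99, 99.999):
    per_year_hours = allowed_downtime_seconds(target, YEAR_SECONDS) / 3600
    per_month_minutes = allowed_downtime_seconds(target, MONTH_SECONDS) / 60
    print(f"{target}%: {per_year_hours:.2f} h/year, {per_month_minutes:.2f} min/month")
```
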
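An error budget is the maximum amount of unreliability you can tolerate, derived from your SLO: it is 100% minus the SLO target. For example, a 99.9% availability SLO leaves a 0.1% error budget, which is about 43.8 minutes of downtime per month.

| Budget Status | Action |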
|---|---|
| Plenty remaining | Ship features, run experiments |
| Getting low | Slow down changes, focus on stability |
| Exhausted | Feature freeze, focus entirely on reliability |
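
As a rough illustration of the policy above, the remaining budget can be tracked as a fraction of the failures the SLO allows. This is a sketch under stated assumptions: the function names are hypothetical, and the 25% cutoff for "getting low" is an arbitrary choice for the example, not a value from the lesson.

```python
def error_budget_remaining(slo_pct, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = (100 - slo_pct) / 100 * total_requests
    return 1 - failed_requests / allowed_failures


def budget_action(remaining, low_watermark=0.25):
    """Map the remaining budget fraction to an action. low_watermark is an assumed cutoff."""
    if remaining <= 0:
        return "Feature freeze, focus entirely on reliability"
    if remaining < low_watermark:
        return "Slow down changes, focus on stability"
    return "Ship features, run experiments"


# A 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
for failed in (400, 900, 1200):
    remaining = error_budget_remaining(99.9, 1_000_000, failed)
    print(f"{failed} failures: {remaining:.0%} of budget left -> {budget_action(remaining)}")
```
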
A circuit breaker prevents cascading failures by stopping calls to a failing service.
```mermaid
graph LR
    Closed["CLOSED (normal)"] -->|"Failures exceed threshold"| Open["OPEN (failing)"]
    Open -->|"After timeout"| HalfOpen["HALF-OPEN (testing)"]
    HalfOpen -->|"Success"| Closed
    HalfOpen -->|"Failure"| Open
```

| Parameter | Description | Typical Value |
|---|---|---|
| Failure threshold | Failures before opening | 5 failures |
| Timeout duration | How long to stay open | 30 seconds |
| Success threshold | Successes in half-open to close | 3 successes |
| Window size | Time window for counting failures | 60 seconds |
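
The state machine and parameters above can be sketched as a small class. This is an illustrative, single-threaded sketch rather than a production implementation: the class and method names are hypothetical, the defaults are the typical values from the table, the breaker is assumed to open when the failure count reaches the threshold, and the `clock` argument exists only so the behaviour can be tested without waiting.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_timeout=30.0,
                 success_threshold=3, window=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self.window = window
        self._clock = clock
        self.state = "CLOSED"
        self._failures = deque()   # timestamps of recent failures
        self._successes = 0        # consecutive successes while half-open
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        now = self._clock()
        if self.state == "OPEN":
            if now - self._opened_at < self.open_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: let trial requests through.
            self.state = "HALF_OPEN"
            self._successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure(self._clock())
            raise
        self._on_success()
        return result

    def _on_failure(self, now):
        if self.state == "HALF_OPEN":
            self._trip(now)        # any failure while testing reopens the circuit
            return
        self._failures.append(now)
        # Drop failures that fall outside the counting window.
        while self._failures and now - self._failures[0] > self.window:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "CLOSED"
                self._failures.clear()

    def _trip(self, now):
        self.state = "OPEN"
        self._opened_at = now
        self._failures.clear()
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a placeholder for the real client call, and treat `CircuitOpenError` as a signal to return a fallback response immediately.
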
When a request fails, retry with increasing delays to avoid overwhelming the failing service.
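
One common way to get increasing delays is exponential backoff, where the wait doubles after each failed attempt up to a cap. The sketch below assumes that scheme; the function name, the default values, and the optional `jitter` parameter (a small random addition that keeps many clients from retrying in lockstep) are illustrative choices, not details from the lesson.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.1, max_delay=10.0,
                       jitter=0.1, sleep=time.sleep):
    """Call func until it succeeds or max_attempts is reached.

    The delay before retry n (0-based) is base_delay * 2**n, capped at max_delay,
    plus a random extra of up to `jitter` times that delay.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, jitter * delay))


# Example: a flaky operation that fails twice and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(retry_with_backoff(flaky, base_delay=0.01))
```

In practice, only errors that are likely to be transient (timeouts, connection resets) should be retried, and the retried operation should be safe to repeat.
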