Reliability is the ability of a system to function correctly even when things go wrong. Observability is the ability to understand the internal state of a system from its external outputs. Together, they let you build systems that work and tell you when they do not.
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  SLI (Service Level Indicator)                            │
│    "What we measure"                                      │
│    e.g. request latency, error rate, throughput           │
│                                                           │
│  SLO (Service Level Objective)                            │
│    "What we aim for"                                      │
│    e.g. 99.9% of requests complete in < 200ms             │
│                                                           │
│  SLA (Service Level Agreement)                            │
│    "What we promise (with consequences)"                  │
│    e.g. 99.9% uptime or customer gets credits             │
│                                                           │
└───────────────────────────────────────────────────────────┘
Common SLIs, how to measure them, and the targets they typically carry:

| SLI | How to Measure | Typical SLO |
|---|---|---|
| Availability | Successful requests / total requests | 99.9% - 99.99% |
| Latency | Time to respond (p50, p95, p99) | p99 < 200ms |
| Throughput | Requests per second | > 10,000 RPS |
| Error rate | Failed requests / total requests | < 0.1% |
| Freshness | Time since last successful data update | < 1 minute |
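These SLIs are just ratios and percentiles over a window of requests. A minimal sketch in Python, assuming requests are available as in-memory records (the `Request` record and its fields are illustrative, not from any particular monitoring stack):

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # end-to-end latency

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def compute_slis(requests: list[Request]) -> dict[str, float]:
    total = len(requests)
    ok = sum(1 for r in requests if r.status < 500)
    latencies = [r.duration_ms for r in requests]
    return {
        "availability": ok / total,           # successful / total
        "error_rate": (total - ok) / total,   # failed / total
        "p99_latency_ms": percentile(latencies, 99),
    }

# 995 fast successes plus 5 slow server errors:
sample = [Request(200, 50.0)] * 995 + [Request(500, 900.0)] * 5
print(compute_slis(sample))   # availability 0.995, error_rate 0.005
```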
What an availability target means in allowed downtime:

| Availability | Downtime per Year | Downtime per Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.77 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
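These figures follow directly from the definition: allowed downtime = (1 − availability) × period length. A quick sketch that reproduces the table, assuming a 365.25-day year and a 30.44-day average month (which is what the figures above imply):

```python
# Reproduces the downtime table from the availability target alone.
def allowed_downtime_s(availability: float, period_days: float) -> float:
    return (1 - availability) * period_days * 24 * 3600

for target in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    year_h = allowed_downtime_s(target, 365.25) / 3600     # hours per year
    month_min = allowed_downtime_s(target, 30.44) / 60     # minutes per month
    print(f"{target:.3%}  {year_h:8.2f} h/year  {month_min:8.1f} min/month")
```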
An error budget is the maximum amount of unreliability you can tolerate, derived from your SLO.
SLO = 99.9% availability
Error Budget = 100% - 99.9% = 0.1%
In a 30-day month (43,200 minutes):
Error Budget = 43,200 × 0.001 = 43.2 minutes of downtime
┌─────────────────────────────────────────┐
│         Error Budget This Month         │
│                                         │
│  Budget:    43.2 min  [██████████░░░░░] │
│  Used:      28.5 min  (66% consumed)    │
│  Remaining: 14.7 min                    │
│                                         │
│  Status: CAUTION — slow down changes    │
└─────────────────────────────────────────┘
| Budget Status | Action |
|---|---|
| Plenty remaining | Ship features, run experiments |
| Getting low | Slow down changes, focus on stability |
| Exhausted | Feature freeze, focus entirely on reliability |
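In code, budget accounting is one line of arithmetic plus a policy decision. A minimal sketch using the 30-day numbers above (the 50% CAUTION threshold is an assumed example policy, not a standard):

```python
# Error budget accounting for a 30-day window.
def error_budget_min(slo: float, period_min: float = 30 * 24 * 60) -> float:
    return (1 - slo) * period_min

def budget_status(used_min: float, budget_min: float) -> str:
    consumed = used_min / budget_min
    if consumed >= 1.0:
        return "EXHAUSTED: feature freeze, focus on reliability"
    if consumed >= 0.5:               # assumed policy threshold
        return "CAUTION: slow down changes"
    return "OK: ship features"

budget = error_budget_min(0.999)      # 43.2 minutes
print(budget_status(28.5, budget))    # 66% consumed -> CAUTION
```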
A circuit breaker prevents cascading failures by stopping calls to a failing service.
States:
┌──────────┐   Failures exceed    ┌──────────┐
│  CLOSED  │────── threshold ────▶│   OPEN   │
│ (normal) │                      │ (failing)│
└──────────┘                      └────┬─────┘
      ▲                                │
      │ Success                  After timeout
      │                                │
      │                          ┌─────▼─────┐
      └──────────────────────────│ HALF-OPEN │
                                 │ (testing) │
                                 └─────┬─────┘
                                       │ Failure
                                       ▼
                                  Back to OPEN
| Parameter | Description | Typical Value |
|---|---|---|
| Failure threshold | Failures before opening | 5 failures |
| Timeout duration | How long to stay open | 30 seconds |
| Success threshold | Successes in half-open to close | 3 successes |
| Window size | Time window for counting failures | 60 seconds |
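A minimal single-threaded sketch wiring those parameters together. To keep it short, failures are counted consecutively rather than over the sliding window from the table, and the `CircuitBreaker` name and API are illustrative:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout_s=30.0, success_threshold=3):
        self.failure_threshold = failure_threshold  # failures before opening
        self.timeout_s = timeout_s                  # how long to stay open
        self.success_threshold = success_threshold  # successes to close again
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.timeout_s:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF_OPEN"   # timeout elapsed: let a probe through
            self.successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"        # trip (or re-trip) the breaker
            self.opened_at = time.monotonic()
            self.failures = 0

    def _record_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"  # service has recovered
        self.failures = 0              # any success resets the failure count
```

Wrapping a flaky dependency then looks like `breaker.call(fetch_user, user_id)` (where `fetch_user` stands in for any remote call): while the breaker is open, callers get an immediate error instead of stacking up timeouts.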
When a request fails, retry with increasing delays to avoid overwhelming the failing service.
Attempt 1: Wait 0s → Request
Attempt 2: Wait 1s → Request
Attempt 3: Wait 2s → Request
Attempt 4: Wait 4s → Request
Attempt 5: Wait 8s → Request (give up after max retries)
With jitter (randomisation):
Attempt 2: Wait 1s + random(0, 500ms) → Request
Attempt 3: Wait 2s + random(0, 1s) → Request
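Jitter matters because without it, clients that failed at the same moment all retry in lockstep, hitting the recovering service in synchronized waves. A sketch of the schedule above in code, assuming a 1-second base delay that doubles per attempt, with jitter drawn from half the current delay as in the example:

```python
import random
import time

def retry_with_backoff(fn, max_retries=5, base_delay_s=1.0):
    """Call fn(), retrying on any exception with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise                              # give up after max retries
            delay = base_delay_s * 2 ** attempt    # 1s, 2s, 4s, 8s, ...
            delay += random.uniform(0, delay / 2)  # jitter: random(0, delay/2)
            time.sleep(delay)
```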