Reliability is the ability of a system to function correctly even when things go wrong. Observability is the ability to understand the internal state of a system from its external outputs. Together, they let you build systems that work and tell you when they do not.
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  SLI (Service Level Indicator)                            │
│    "What we measure"                                      │
│    e.g. request latency, error rate, throughput           │
│                                                           │
│  SLO (Service Level Objective)                            │
│    "What we aim for"                                      │
│    e.g. 99.9% of requests complete in < 200ms             │
│                                                           │
│  SLA (Service Level Agreement)                            │
│    "What we promise (with consequences)"                  │
│    e.g. 99.9% uptime or customer gets credits             │
│                                                           │
└───────────────────────────────────────────────────────────┘
Common SLIs, how to measure them, and the targets they typically carry:

| SLI | How to Measure | Typical SLO |
|---|---|---|
| Availability | Successful requests / total requests | 99.9% - 99.99% |
| Latency | Time to respond (p50, p95, p99) | p99 < 200ms |
| Throughput | Requests per second | > 10,000 RPS |
| Error rate | Failed requests / total requests | < 0.1% |
| Freshness | Time since last successful data update | < 1 minute |
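These SLIs are just ratios and percentiles over a window of requests. A minimal sketch in Python, assuming requests are available as in-memory records (the `Request` record and its fields are illustrative, not from any particular monitoring stack):

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # end-to-end latency

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def compute_slis(requests: list[Request]) -> dict[str, float]:
    total = len(requests)
    ok = sum(1 for r in requests if r.status < 500)
    latencies = [r.duration_ms for r in requests]
    return {
        "availability": ok / total,           # successful / total
        "error_rate": (total - ok) / total,   # failed / total
        "p99_latency_ms": percentile(latencies, 99),
    }

# 995 fast successes plus 5 slow server errors:
sample = [Request(200, 50.0)] * 995 + [Request(500, 900.0)] * 5
print(compute_slis(sample))   # availability 0.995, error_rate 0.005
```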
What an availability target means in allowed downtime:

| Availability | Downtime per Year | Downtime per Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.77 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
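These figures follow directly from the definition: allowed downtime = (1 − availability) × period length. A quick sketch that reproduces the table, assuming a 365.25-day year and a 30.44-day average month (which is what the figures above imply):

```python
# Reproduces the downtime table from the availability target alone.
def allowed_downtime_s(availability: float, period_days: float) -> float:
    return (1 - availability) * period_days * 24 * 3600

for target in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    year_h = allowed_downtime_s(target, 365.25) / 3600     # hours per year
    month_min = allowed_downtime_s(target, 30.44) / 60     # minutes per month
    print(f"{target:.3%}  {year_h:8.2f} h/year  {month_min:8.1f} min/month")
```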
An error budget is the maximum amount of unreliability you can tolerate, derived from your SLO.
SLO = 99.9% availability
Error Budget = 100% - 99.9% = 0.1%
In a 30-day month (43,200 minutes):
Error Budget = 43,200 × 0.001 = 43.2 minutes of downtime
┌─────────────────────────────────────────┐
│         Error Budget This Month         │
│                                         │
│  Budget:    43.2 min  [██████████░░░░░] │
│  Used:      28.5 min  (66% consumed)    │
│  Remaining: 14.7 min                    │
│                                         │
│  Status: CAUTION — slow down changes    │
└─────────────────────────────────────────┘
| Budget Status | Action |
|---|---|
| Plenty remaining | Ship features, run experiments |
| Getting low | Slow down changes, focus on stability |
| Exhausted | Feature freeze, focus entirely on reliability |
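In code, budget accounting is one line of arithmetic plus a policy decision. A minimal sketch using the 30-day numbers above (the 50% CAUTION threshold is an assumed example policy, not a standard):

```python
# Error budget accounting for a 30-day window.
def error_budget_min(slo: float, period_min: float = 30 * 24 * 60) -> float:
    return (1 - slo) * period_min

def budget_status(used_min: float, budget_min: float) -> str:
    consumed = used_min / budget_min
    if consumed >= 1.0:
        return "EXHAUSTED: feature freeze, focus on reliability"
    if consumed >= 0.5:               # assumed policy threshold
        return "CAUTION: slow down changes"
    return "OK: ship features"

budget = error_budget_min(0.999)      # 43.2 minutes
print(budget_status(28.5, budget))    # 66% consumed -> CAUTION
```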
A circuit breaker prevents cascading failures by stopping calls to a failing service.
States:
┌──────────┐   Failures exceed    ┌──────────┐
│  CLOSED  │────── threshold ────▶│   OPEN   │
│ (normal) │                      │ (failing)│
└──────────┘                      └────┬─────┘
      ▲                                │
      │ Success                  After timeout
      │                                │
      │                          ┌─────▼─────┐
      └──────────────────────────│ HALF-OPEN │
                                 │ (testing) │
                                 └─────┬─────┘
                                       │ Failure
                                       ▼
                                  Back to OPEN
| Parameter | Description | Typical Value |
|---|---|---|
| Failure threshold | Failures before opening | 5 failures |
| Timeout duration | How long to stay open | 30 seconds |
| Success threshold | Successes in half-open to close | 3 successes |
| Window size | Time window for counting failures | 60 seconds |
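A minimal single-threaded sketch wiring those parameters together. To keep it short, failures are counted consecutively rather than over the sliding window from the table, and the `CircuitBreaker` name and API are illustrative:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout_s=30.0, success_threshold=3):
        self.failure_threshold = failure_threshold  # failures before opening
        self.timeout_s = timeout_s                  # how long to stay open
        self.success_threshold = success_threshold  # successes to close again
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.timeout_s:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF_OPEN"   # timeout elapsed: let a probe through
            self.successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"        # trip (or re-trip) the breaker
            self.opened_at = time.monotonic()
            self.failures = 0

    def _record_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"  # service has recovered
        self.failures = 0              # any success resets the failure count
```

Wrapping a flaky dependency then looks like `breaker.call(fetch_user, user_id)` (where `fetch_user` stands in for any remote call): while the breaker is open, callers get an immediate error instead of stacking up timeouts.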
When a request fails, retry with increasing delays to avoid overwhelming the failing service.
Attempt 1: Wait 0s → Request
Attempt 2: Wait 1s → Request
Attempt 3: Wait 2s → Request
Attempt 4: Wait 4s → Request
Attempt 5: Wait 8s → Request (give up after max retries)
With jitter (randomisation):
Attempt 2: Wait 1s + random(0, 500ms) → Request
Attempt 3: Wait 2s + random(0, 1s) → Request
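Jitter matters because without it, clients that failed at the same moment all retry in lockstep, hitting the recovering service in synchronized waves. A sketch of the schedule above in code, assuming a 1-second base delay that doubles per attempt, with jitter drawn from half the current delay as in the example:

```python
import random
import time

def retry_with_backoff(fn, max_retries=5, base_delay_s=1.0):
    """Call fn(), retrying on any exception with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise                              # give up after max retries
            delay = base_delay_s * 2 ** attempt    # 1s, 2s, 4s, 8s, ...
            delay += random.uniform(0, delay / 2)  # jitter: random(0, delay/2)
            time.sleep(delay)
```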