What is Observability
Observability is the ability to understand the internal state of a system by examining its external outputs — metrics, logs, and traces. It originates from control theory and has become a foundational discipline in modern software engineering, enabling teams to diagnose problems, understand behaviour, and improve reliability.
Monitoring vs Observability
These terms are often used interchangeably, but they represent different approaches:
| Aspect | Monitoring | Observability |
|---|---|---|
| Approach | Predefined checks and thresholds | Exploratory, ask arbitrary questions |
| Focus | Known failure modes | Unknown unknowns |
| Signals | Dashboards and alerts | Metrics, logs, traces — correlated |
| Question | "Is the system up?" | "Why is it behaving this way?" |
Monitoring tells you when something is wrong. Observability helps you understand why.
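To make the contrast concrete, here is a minimal Python sketch. The event fields (`region`, `version`, `status`), the 5% threshold, and the function names are all illustrative assumptions rather than part of any particular tool. The first function is a predefined check of the kind monitoring relies on; the second slices the same events by whichever field you pick at query time.

```python
from collections import defaultdict

# Hypothetical request events; the field names are assumptions for illustration.
EVENTS = [
    {"region": "eu", "version": "v1", "status": 200},
    {"region": "eu", "version": "v2", "status": 500},
    {"region": "us", "version": "v1", "status": 200},
    {"region": "us", "version": "v2", "status": 500},
    {"region": "us", "version": "v1", "status": 200},
]


def error_rate(events):
    """Fraction of events with a 5xx status."""
    if not events:
        return 0.0
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events)


def monitoring_check(events, threshold=0.05):
    """Monitoring: a predefined check against a fixed threshold."""
    return error_rate(events) > threshold


def error_rate_by(events, field):
    """Observability: slice the same data by any field chosen at query time."""
    groups = defaultdict(list)
    for event in events:
        groups[event[field]].append(event)
    return {key: error_rate(group) for key, group in groups.items()}


if __name__ == "__main__":
    print("alert firing:", monitoring_check(EVENTS))
    print("by region:", error_rate_by(EVENTS, "region"))
    print("by version:", error_rate_by(EVENTS, "version"))
```

With this sample data the alert only reports that the error rate is too high. The breakdown by `version` shows that every failure comes from `v2`, which is the kind of question nobody had to anticipate in advance.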
The Three Pillars
Observability is commonly described in terms of three pillars:
1. Metrics
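Numeric measurements collected over time, such as CPU usage, request latency, and error rates. Metrics are cheap to store because each series is a compact sequence of numbers whose cost grows with the number of series rather than with request volume, which makes them well suited to alerting.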
http_requests_total{method="GET", status="200"} 14523
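As a rough sketch of where a line like this comes from, the Python class below keeps one counter value per unique label combination and renders it in a simplified Prometheus-style text form. `LabelledCounter` is a hypothetical name for this lesson, not the Prometheus client library, and a real service would normally use an established client instead.

```python
from collections import Counter


class LabelledCounter:
    """A toy counter that tracks one value per unique label combination."""

    def __init__(self, name):
        self.name = name
        self._values = Counter()

    def inc(self, **labels):
        # Sort labels so the same combination always maps to the same key.
        self._values[tuple(sorted(labels.items()))] += 1

    def render(self):
        """Render every sample as name{label="value",...} count."""
        lines = []
        for key, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{label_text}}} {value}")
        return "\n".join(lines)


if __name__ == "__main__":
    requests = LabelledCounter("http_requests_total")
    requests.inc(method="GET", status="200")
    requests.inc(method="GET", status="200")
    requests.inc(method="POST", status="500")
    print(requests.render())
```

Each distinct label combination becomes its own time series, which is why label choices drive the cost of metrics.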
2. Logs
Timestamped, immutable records of discrete events — application errors, audit trails, debug output.
{
  "timestamp": "2025-03-15T10:23:45Z",
  "level": "ERROR",
  "service": "payment-api",
  "message": "Failed to charge card",
  "trace_id": "abc123def456"
}
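A structured log line like this can be produced with nothing more than the standard library. The sketch below is one possible approach; `log_event` and its parameters are hypothetical names chosen for this lesson. The `trace_id` field is what lets a log line be correlated with a trace later.

```python
import json
from datetime import datetime, timezone


def log_event(level, service, message, trace_id=None, now=None):
    """Build one JSON log line; `now` can be injected to make tests deterministic."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "service": service,
        "message": message,
    }
    # Only attach a trace ID when the caller has one to correlate with.
    if trace_id is not None:
        record["trace_id"] = trace_id
    return json.dumps(record)


if __name__ == "__main__":
    print(log_event("ERROR", "payment-api", "Failed to charge card",
                    trace_id="abc123def456"))
```

Emitting one JSON object per line keeps logs machine-parseable, so a log backend can filter on fields such as `service` or `trace_id` without fragile text matching.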
3. Traces
Records of a request's journey through a distributed system, showing timing and dependencies across services.
[Frontend] → [API Gateway] → [Payment Service] → [Database]
   12ms           3ms              45ms             8ms
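Under the hood, a trace is a set of spans linked by parent IDs. The sketch below models the diagram above in plain Python. The `Span` fields and span IDs are hypothetical, and it assumes each timing in the diagram is the time spent inside that service itself and that the request follows a single linear chain of calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    self_time_ms: int  # assumed: time spent in this service itself


TRACE = [
    Span("Frontend", "a", None, 12),
    Span("API Gateway", "b", "a", 3),
    Span("Payment Service", "c", "b", 45),
    Span("Database", "d", "c", 8),
]


def call_path(spans):
    """Follow parent links from the root span down a linear chain of calls."""
    child_of = {s.parent_id: s for s in spans}  # assumes one child per parent
    path, current = [], child_of.get(None)
    while current is not None:
        path.append(current.name)
        current = child_of.get(current.span_id)
    return path


def slowest(spans):
    """Return the span that spent the most time in its own service."""
    return max(spans, key=lambda s: s.self_time_ms)


if __name__ == "__main__":
    print(" -> ".join(call_path(TRACE)))
    print("total:", sum(s.self_time_ms for s in TRACE), "ms")
    print("slowest:", slowest(TRACE).name)
```

Even in this toy form, the trace answers the question a dashboard cannot: most of the 68 ms is spent in the Payment Service, not in the database behind it.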
Beyond the Three Pillars
Modern observability extends beyond metrics, logs, and traces:
- Profiling — continuous profiling of CPU, memory, and goroutines in production
- Events — deployment markers, feature flag changes, incident annotations
- Real User Monitoring (RUM) — client-side performance data from browsers and mobile apps
- Synthetic monitoring — automated checks that simulate user journeys
Why Observability Matters
In modern distributed systems (microservices, Kubernetes, serverless), failures are:
- Inevitable — hardware fails, networks partition, code has bugs
- Complex — a single request may traverse dozens of services
- Emergent — novel failure modes arise from service interactions
Without observability, you are flying blind. With it, you can:
- Detect issues early, often before users are impacted
- Diagnose root causes quickly during incidents
- Understand system behaviour under load
- Optimise performance and resource usage
- Validate changes after deployments
Tip: Observability is not a product you buy — it is a property of your system that you build into your architecture from the start.
Key Terminology
| Term | Definition |
|---|---|
| Telemetry | Data emitted by a system about its behaviour (metrics, logs, traces) |
| Instrumentation | The code that generates telemetry data |
| Cardinality | The number of unique values for a label or tag; high cardinality multiplies the number of time series and increases storage and query costs |
| Dimensionality | The number of labels or tags attached to a data point |
| SLI | Service Level Indicator — a quantitative measure of service behaviour |
| SLO | Service Level Objective — a target value for an SLI |
| MTTD | Mean Time to Detect — how long it takes to notice a problem |
| MTTR | Mean Time to Resolve (how long it takes to fix a problem); the R is also expanded as Repair, Recovery, or Respond, so confirm which is meant |
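A few of these terms are easier to grasp with numbers. The following Python sketch is illustrative only: the function names, label sets, and request counts are assumptions made up for this lesson. It shows how cardinality multiplies across labels, and how an SLI (a measurement) differs from an SLO (the target it is compared against).

```python
from math import prod


def series_count(label_values):
    """Worst-case number of time series: product of unique values per label."""
    return prod(len(values) for values in label_values.values())


def availability_sli(good_requests, total_requests):
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def meets_slo(sli, target):
    """SLO: the target value the SLI is compared against."""
    return sli >= target


if __name__ == "__main__":
    labels = {"method": ["GET", "POST"], "status": ["200", "404", "500"]}
    print("dimensionality:", len(labels))
    print("series:", series_count(labels))

    # Adding one high-cardinality label multiplies the series count.
    labels["user_id"] = [str(i) for i in range(10_000)]
    print("series with user_id:", series_count(labels))

    sli = availability_sli(good_requests=9_990, total_requests=10_000)
    print("SLI:", sli, "meets 99% SLO:", meets_slo(sli, 0.99))
```

Two labels with 2 and 3 values give at most 6 series. Adding a `user_id` label with 10,000 values raises that ceiling to 60,000, which is why unbounded values such as user IDs are usually kept out of metric labels.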
The Observability Landscape
The observability ecosystem includes many tools and standards:
| Category | Examples |
|---|---|
| Metrics | Prometheus, Datadog, Grafana Mimir, InfluxDB |
| Logging | Elasticsearch (ELK), Loki, Splunk, Fluentd |
| Tracing | Jaeger, Zipkin, Tempo, Honeycomb |
| All-in-one | Datadog, New Relic, Dynatrace, Grafana Cloud |
| Standards | OpenTelemetry, StatsD, Prometheus exposition format |
OpenTelemetry (OTel) is a widely adopted, vendor-neutral standard for instrumentation. It provides APIs, SDKs, and the OpenTelemetry Collector for all three pillars: metrics, logs, and traces.
Summary
Observability is the practice of understanding system behaviour through telemetry data: metrics, logs, and traces. Unlike traditional monitoring, which checks for known issues, observability lets you explore unknown problems in complex distributed systems. In the following lessons, we will examine each pillar in depth, learn the tools of the trade, and build a complete observability stack.