Observability in Kubernetes

Kubernetes introduces unique observability challenges and opportunities. Pods are ephemeral, workloads are dynamic, and the platform itself generates rich telemetry. This lesson covers how to collect metrics, logs, and traces in Kubernetes, and how to monitor the cluster itself.

Kubernetes Observability Challenges

Challenge	Description
Ephemeral pods	Pods are created and destroyed constantly, so statically configured scrape targets quickly go stale
Dynamic discovery	Services scale up and down, so targets must be discovered automatically
Multi-layer	You need to monitor the application, the pod, the node, and the cluster
Distributed	Microservices on Kubernetes generate complex request flows
Volume	A large cluster generates enormous amounts of telemetry data

Metrics in Kubernetes

Built-in Metrics Sources

Source	Metrics	Access
kube-state-metrics	Kubernetes object state (deployments, pods, nodes)	Deployed as an add-on; scrape `:8080/metrics`
cAdvisor	Container resource usage (CPU, memory, network)	Built into the kubelet; scrape `:10250/metrics/cadvisor` (HTTPS, authenticated)
Kubelet	Node-level and pod-level metrics	Scrape `:10250/metrics` (HTTPS, authenticated)
API Server	Control plane metrics (request latency, etcd health)	Scrape `:6443/metrics` (HTTPS, authenticated)
Node Exporter	Host-level OS metrics	Deploy as a DaemonSet; scrape `:9100/metrics`
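
To make the "Access" column concrete, here is a minimal sketch of a plain Prometheus scrape job for the cAdvisor endpoint, using Kubernetes service discovery so that nodes are found automatically. It assumes Prometheus runs inside the cluster with a service account that is allowed to read node metrics; the job name is illustrative. The kube-prometheus-stack chart described below configures an equivalent job for you, so this is only needed for hand-written configurations.

```yaml
# prometheus.yml fragment (sketch, assuming in-cluster Prometheus)
scrape_configs:
  - job_name: kubernetes-cadvisor        # illustrative name
    scheme: https                        # the kubelet serves metrics over TLS
    metrics_path: /metrics/cadvisor
    kubernetes_sd_configs:
      - role: node                       # one target per node, on the kubelet port
    tls_config:
      # Cluster CA mounted into every pod by the service account
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    authorization:
      # Service account token; older Prometheus versions use bearer_token_file instead
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
```

Because targets come from the `node` role, nodes that join or leave the cluster are picked up without editing the configuration.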

Prometheus on Kubernetes

A common way to run Prometheus on Kubernetes is with the kube-prometheus-stack Helm chart:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

This installs:

Prometheus (with service discovery)
Alertmanager
Grafana (with pre-built dashboards)
kube-state-metrics
Node Exporter
Default alerting rules
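
The chart's defaults can be adjusted with a values file. The fragment below is a small sketch rather than a recommended configuration: the file name, retention period, and password are placeholders, and the keys should be checked against the `values.yaml` of the chart version you install.

```yaml
# values.yaml (hypothetical file name)
prometheus:
  prometheusSpec:
    retention: 15d                                   # how long Prometheus keeps samples
    # By default the chart only selects ServiceMonitors/PodMonitors labelled with
    # the Helm release name. Setting these to false selects them regardless of label.
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
grafana:
  adminPassword: change-me                           # placeholder, use a secret in practice
```

It could be applied with `helm upgrade --install monitoring prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml`.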

ServiceMonitor and PodMonitor

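The Prometheus Operator uses custom resources to configure scraping. A ServiceMonitor selects Services by label and scrapes the pods behind them; a PodMonitor selects pods directly, which is useful when a workload has no Service. In the ServiceMonitor below, `port` refers to a named port on the Service, and the `release: monitoring` label matches the Helm release name, which is how the chart's Prometheus instance finds the resource by default: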

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
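
For the ServiceMonitor above to find anything, a Service labelled `app: my-app` with a port named `metrics` has to exist. The sketch below shows such a Service, followed by a PodMonitor that scrapes the same pods without going through a Service. The port number 8080 and the resource name `my-app-pods` are assumptions for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app              # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app              # pods backing this Service
  ports:
    - name: metrics          # referenced by `port: metrics` in the ServiceMonitor
      port: 8080             # assumed metrics port
      targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app-pods          # hypothetical name
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: my-app            # selects pods directly by label
  podMetricsEndpoints:
    - port: metrics          # name of the container port in the pod spec
      interval: 15s
      path: /metrics
```

In practice you would normally use one or the other for a given workload, since both would otherwise scrape the same pods twice.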

Key Kubernetes Metrics to Monitor

Category	Metrics
Cluster	Node count, node readiness, API server latency
Nodes	CPU utilisation, memory usage, disk pressure, network I/O
Pods	CPU/memory requests vs usage, restart count, OOMKilled events
Deployments	Replica count vs desired, rollout status
Containers	CPU throttling, memory limits, container restarts

# Pods in CrashLoopBackOff
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0

# CPU throttling
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0

# Memory close to limits
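# (containers without a limit report 0, so they are filtered out to avoid dividing by zero)
container_memory_working_set_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.9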

# Pod restarts
increase(kube_pod_container_status_restarts_total[1h]) > 3
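
With the Prometheus Operator, queries like these are usually turned into alerts through a PrometheusRule resource. The following is a minimal sketch that wraps two of the queries above; the alert names, `for` duration, severity labels, and annotation text are illustrative choices, and the chart's default rules already cover similar conditions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-pod-health    # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring      # lets the chart's Prometheus select this rule by default
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
          for: 5m            # only fire if the condition persists
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff"
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in the last hour"
```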
