Kubernetes introduces unique observability challenges and opportunities. Pods are ephemeral, workloads are dynamic, and the platform itself generates rich telemetry. This lesson covers how to collect metrics, logs, and traces in Kubernetes, and how to monitor the cluster itself.
| Challenge | Description |
|---|---|
| Ephemeral pods | Pods are created and destroyed constantly, so statically configured scrape targets quickly go stale |
| Dynamic discovery | Services scale up and down — targets must be discovered automatically |
| Multi-layer | You need to monitor the application, the pod, the node, and the cluster |
| Distributed | Microservices on Kubernetes generate complex request flows |
| Volume | A large cluster generates enormous amounts of telemetry data |
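
Prometheus addresses the ephemeral-pod and dynamic-discovery problems with its built-in Kubernetes service discovery. The fragment below is a minimal sketch of a plain `prometheus.yml` scrape job that discovers pods through the Kubernetes API. The job name is hypothetical, and the `prometheus.io/scrape` annotation is a widely used convention assumed here for illustration, not something Prometheus or Kubernetes enforces.

```yaml
scrape_configs:
  - job_name: kubernetes-pods          # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                      # discover every pod via the Kubernetes API
    relabel_configs:
      # Keep only pods that opt in with the (conventional) annotation
      # prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy discovery metadata onto the scraped series so that
      # metrics can be grouped by namespace and pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

As pods come and go, Prometheus updates its target list from the API server, so no target is ever listed by hand.
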
| Source | Metrics | Access |
|---|---|---|
| kube-state-metrics | Kubernetes object state (deployments, pods, nodes) | Scrape :8080/metrics |
| cAdvisor | Container resource usage (CPU, memory, network) | Built into Kubelet, scrape :10250/metrics/cadvisor |
| Kubelet | Node-level and pod-level metrics | Scrape :10250/metrics |
| API Server | Control plane metrics (request latency, etcd health) | Scrape :6443/metrics |
| Node Exporter | Host-level OS metrics | Deploy as DaemonSet, scrape :9100/metrics |
The standard way to run Prometheus on Kubernetes is with the kube-prometheus-stack Helm chart:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
```
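This installs the Prometheus Operator, Prometheus, Alertmanager, Grafana, kube-state-metrics, and Node Exporter, along with a default set of dashboards and alerting rules.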
The Prometheus Operator uses custom resources to configure scraping:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
```
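
The `port: metrics` entry in a ServiceMonitor refers to a port *name* on a Service, not a port number, and the `selector` matches Service labels rather than pod labels. As a minimal sketch, a Service that the ServiceMonitor above would pick up could look like this. The port number 8080 is an illustrative assumption and does not come from the lesson.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app            # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: my-app            # pods backing this Service
  ports:
    - name: metrics        # referenced by "port: metrics" in the ServiceMonitor
      port: 8080           # assumed port, adjust to where the app exposes /metrics
      targetPort: 8080
```

With the chart's default values, Prometheus only selects ServiceMonitors labelled with the Helm release name, which is why the ServiceMonitor carries `release: monitoring`. A ServiceMonitor also looks for Services in its own namespace unless `namespaceSelector` is set.
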
| Category | Metrics |
|---|---|
| Cluster | Node count, node readiness, API server latency |
| Nodes | CPU utilisation, memory usage, disk pressure, network I/O |
| Pods | CPU/memory requests vs usage, restart count, OOMKilled events |
| Deployments | Replica count vs desired, rollout status |
| Containers | CPU throttling, memory limits, container restarts |
```promql
# Pods in CrashLoopBackOff
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0

# CPU throttling
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0

# Memory close to limits (containers with no limit report 0, so exclude them)
container_memory_working_set_bytes{container!=""}
  / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.9

# Pod restarts
increase(kube_pod_container_status_restarts_total[1h]) > 3
```
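
With the Prometheus Operator, queries like these are typically turned into alerts through a `PrometheusRule` custom resource. The following is a minimal sketch that wraps the CrashLoopBackOff query. The rule name, group name, `for` duration, and severity label are illustrative assumptions, and the `release: monitoring` label assumes the Helm release name used earlier.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-health-rules             # hypothetical name
  labels:
    release: monitoring              # lets the chart's Prometheus select this rule
spec:
  groups:
    - name: pod-health               # hypothetical group name
      rules:
        - alert: PodCrashLooping
          expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
          for: 5m                    # assumed: fire only if the condition persists
          labels:
            severity: warning        # assumed severity level
          annotations:
            summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff"
```

The `for` clause keeps a single brief restart from paging anyone. The right duration depends on how quickly the workload is expected to recover.
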