Observability & Monitoring

You cannot manage what you cannot see. Observability in Kubernetes covers three pillars: metrics, logs, and traces. This lesson covers the Prometheus/Grafana monitoring stack, logging with Fluentd and Loki, distributed tracing with Jaeger, and alerting strategies.

The Three Pillars of Observability

Pillar	What It Answers	Tools
Metrics	How is the system performing?	Prometheus, Grafana, metrics-server
Logs	What happened and why?	Fluentd, Loki, Elasticsearch
Traces	How does a request flow through services?	Jaeger, Zipkin, Tempo

graph TD
  GRAF["Grafana (Unified Dashboard)"]
  GRAF --> PROM["Prometheus (Metrics)"]
  GRAF --> LOKI["Loki (Logs)"]
  GRAF --> JAEGER["Jaeger (Traces)"]
  PROM --> EXP["Exporters /metrics"]
  LOKI --> FLU["Fluentd / Promtail"]
  JAEGER --> OTEL["OpenTelemetry SDK"]

Prometheus — Metrics Collection

Prometheus is the de facto standard for collecting metrics in Kubernetes.

Installing with kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f monitoring-values.yaml

How Prometheus Works

Scraping — Prometheus pulls metrics from /metrics endpoints at regular intervals
Storage — Time-series data is stored locally on the Prometheus server
Querying — PromQL queries data for dashboards and alerts
Alerting — Alertmanager routes alerts to Slack, PagerDuty, email, etc.
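
To make the scraping step concrete, here is a minimal Python sketch of the plain-text payload a /metrics endpoint returns when Prometheus pulls from it. The helper name render_counter and the sample values are illustrative assumptions, not part of any client library; a real service would normally use an official Prometheus client instead.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    Assumption: label values need no escaping (no quotes, backslashes, newlines).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sort labels so the output is stable between scrapes.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data for illustration only.
payload = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "GET", "status": "500"}, 3),
    ],
)
print(payload)
```

Each scrape stores one timestamped sample per line of this payload, which is why counters such as http_requests_total are queried with rate() rather than read directly.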

PromQL Examples

# CPU usage per pod, in cores (container!="" skips the pod-level aggregate series)
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m]))

# Memory usage as a percentage of the pod's memory limit
sum by (pod) (container_memory_working_set_bytes{namespace="production", container!=""})
/ sum by (pod) (kube_pod_container_resource_limits{namespace="production", resource="memory"}) * 100

# Request rate per service
rate(http_requests_total{namespace="production"}[5m])

# 99th percentile latency, aggregated across all series
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

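# Error rate as a percentage of all requests
# (sum both sides first; otherwise the division only matches 5xx series against themselves)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100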
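
The histogram_quantile function can look opaque, so here is a simplified Python sketch of the idea behind it: find the bucket that contains the requested rank, then interpolate linearly inside that bucket. This is an approximation for illustration, not Prometheus's actual implementation, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs, as exposed via the
    `le` label; the list must include the (math.inf, total_count) bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    # Assume observations are spread evenly inside the bucket and interpolate.
    lower, prev_count = buckets[index - 1] if index > 0 else (0.0, 0)
    if count == prev_count:
        return lower
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical latency buckets in seconds: 50 requests took <= 0.1s, 90 took <= 0.5s, ...
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.75, latency_buckets))
```

Because the result is interpolated, its accuracy depends on how well the bucket boundaries bracket the latencies you care about.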

ServiceMonitor — Configuring Scrape Targets

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
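metadata:
  name: web-api-monitor
  namespace: monitoring
  labels:
    # Assumes the chart's default selector, which only picks up
    # ServiceMonitors labelled with the Helm release name.
    release: monitoring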
spec:
  selector:
    matchLabels:
      app: web-api
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
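
The selection rules in this manifest can be sketched in a few lines of Python: a Service becomes a scrape target only if it lives in a listed namespace, carries every label in matchLabels, and exposes a port with the given name. The function and the dictionary shape are hypothetical and only model the matching logic, not the operator's real code.

```python
def would_be_scraped(service, match_labels, match_namespaces, port_name):
    """Return True if a ServiceMonitor with these settings would target the Service."""
    in_namespace = service["namespace"] in match_namespaces
    # matchLabels is a subset match: extra labels on the Service are fine.
    labels_match = all(
        service["labels"].get(key) == value for key, value in match_labels.items()
    )
    has_port = port_name in service["ports"]
    return in_namespace and labels_match and has_port


# Hypothetical Service, reduced to the fields the ServiceMonitor looks at.
web_api = {
    "namespace": "production",
    "labels": {"app": "web-api", "tier": "backend"},
    "ports": ["http", "metrics"],
}

print(would_be_scraped(web_api, {"app": "web-api"}, ["production"], "metrics"))
```

If a Service does not appear as a target, a missing label, the wrong namespace, or an unnamed port are the first things to check.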

Grafana — Dashboards

Grafana visualises metrics from Prometheus (and other data sources).

Key Dashboards for Kubernetes

Dashboard	What It Shows
Kubernetes Cluster Overview	Node CPU, memory, pod counts
Namespace Resources	Per-namespace resource usage
Pod Details	Individual pod metrics
Node Exporter	Host-level metrics (disk, network)
CoreDNS	DNS query rates, latency, errors

Custom Dashboard JSON (Example Panel)

{
  "title": "Request Rate by Service",
  "type": "timeseries",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{namespace=\"production\"}[5m])) by (service)",
      "legendFormat": "{{ service }}"
    }
  ]
}

metrics-server

metrics-server serves current CPU and memory usage for pods and nodes through the Kubernetes Metrics API. The Horizontal Pod Autoscaler (HPA) and kubectl top both read from it.

# Install metrics-server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# View node metrics
kubectl top nodes

# View pod metrics
kubectl top pods -n production --sort-by=memory
