Running production workloads on GKE requires attention to cluster design, resource management, cost optimisation, monitoring, and operational practices. This lesson consolidates best practices for building reliable, secure, and cost-effective GKE deployments.
Regional clusters replicate the control plane across three zones, providing high availability. Zonal clusters have a single control plane that becomes unavailable during upgrades or zone outages. Prefer regional clusters for production workloads.
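As a sketch (the cluster name and region are placeholders), a regional Standard cluster can be created like this:

```shell
# Regional cluster: control plane replicas and nodes are spread
# across the region's zones.
gcloud container clusters create prod-cluster \
  --region=europe-west1 \
  --num-nodes=1   # one node per zone, so three nodes in total
```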
For most new workloads, start with Autopilot. It eliminates node management, enforces security best practices, and bills per pod. Only choose Standard mode when you need custom DaemonSets, privileged containers, or full node control.
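An Autopilot cluster is created with a separate command (names are illustrative):

```shell
# Autopilot: Google manages nodes and node pools; billing is per pod.
gcloud container clusters create-auto prod-autopilot \
  --region=europe-west1
```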
Private clusters prevent nodes from having public IP addresses, reducing the attack surface. Use authorised networks to control which CIDR ranges can access the control plane.
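A minimal sketch of a private cluster with authorised networks (the CIDR ranges and cluster name are illustrative, not a recommendation):

```shell
# Nodes get no public IPs; control-plane access is limited to one CIDR.
gcloud container clusters create private-cluster \
  --region=europe-west1 \
  --enable-private-nodes \
  --enable-ip-alias \
  --master-ipv4-cidr=172.16.0.0/28 \
  --enable-master-authorized-networks \
  --master-authorized-networks=203.0.113.0/24
```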
Enable release channels (Regular for most workloads, Stable for risk-averse environments) to receive automatic Kubernetes version upgrades with Google-tested patches.
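For an existing cluster, the release channel can be set with an update (cluster name and region are placeholders):

```shell
# Enroll the cluster in the Regular channel for Google-tested upgrades.
gcloud container clusters update prod-cluster \
  --region=europe-west1 \
  --release-channel=regular
```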
Every container should define CPU and memory requests (guaranteed minimum) and memory limits (hard ceiling). Without requests, the scheduler cannot make informed placement decisions. Without memory limits, a container can consume all node memory and cause OOM kills for other pods.
```yaml
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    memory: 512Mi
```
Configure HPA for workloads with variable traffic. Set target CPU utilisation to 60-80% to leave headroom for traffic spikes.
```shell
kubectl autoscale deployment web-api \
  --cpu-percent=70 \
  --min=2 \
  --max=20
```
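The same autoscaler can be expressed declaratively as an autoscaling/v2 manifest, which is easier to keep in version control (the name matches the example above):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```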
VPA analyses pod resource usage over time and recommends (or automatically adjusts) resource requests. Run VPA in recommendation mode first to understand your workloads before enabling auto mode.
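Recommendation mode is set with `updateMode: "Off"`; a sketch targeting the same hypothetical Deployment:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Off"   # recommendations only; no pods are evicted
```

Inspect the recommendations with `kubectl describe vpa web-api-vpa` before switching to auto mode.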
Pod Disruption Budgets (PDBs) ensure that a minimum number of pods remain available during voluntary disruptions (node upgrades, cluster autoscaler scale-down).
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-api
```
Spot VM node pools offer up to 91% discount. Use them for batch processing, CI/CD, dev/test, and any workload that can tolerate interruptions.
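A sketch of adding a Spot node pool to an existing Standard cluster (names and limits are illustrative):

```shell
# Spot nodes can be reclaimed at any time; scale the pool to zero
# when no interruptible work is running.
gcloud container node-pools create spot-pool \
  --cluster=prod-cluster \
  --region=europe-west1 \
  --spot \
  --enable-autoscaling --min-nodes=0 --max-nodes=10
```

Workloads can target the pool with a nodeSelector on the `cloud.google.com/gke-spot: "true"` label that GKE applies to Spot nodes.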
Monitor actual resource utilisation with Cloud Monitoring. If nodes are consistently under-utilised, reduce the machine type size. If pods are often pending due to insufficient resources, increase size or enable cluster autoscaler.
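For a quick snapshot alongside Cloud Monitoring dashboards:

```shell
# Current node and pod utilisation (GKE provides metrics-server
# by default).
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory
```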
Node auto-provisioning (NAP) automatically creates and deletes node pools based on pod requirements. This eliminates the need to pre-define node pools for every workload type and ensures optimal machine types are used.
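NAP requires cluster-wide resource ceilings; a sketch with illustrative limits:

```shell
# Enable node auto-provisioning with overall CPU and memory caps.
gcloud container clusters update prod-cluster \
  --region=europe-west1 \
  --enable-autoprovisioning \
  --min-cpu=0 --max-cpu=100 \
  --min-memory=0 --max-memory=400
```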
For stable baseline workloads, purchase Committed Use Discounts (CUDs) for 1 or 3 years to save up to 57% on Compute Engine resources.
Regularly audit and delete unused resources: idle dev/test clusters, orphaned persistent disks, unattached load balancers and forwarding rules, and empty or over-provisioned node pools.
GKE integrates with Cloud Monitoring and Cloud Logging by default. Enable the managed collection of system and workload metrics.
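Collection can be configured on an existing cluster; a sketch (the component lists shown are one reasonable choice, not the only one):

```shell
# System metrics plus managed Prometheus for workload metrics;
# system and workload logs.
gcloud container clusters update prod-cluster \
  --region=europe-west1 \
  --monitoring=SYSTEM \
  --enable-managed-prometheus \
  --logging=SYSTEM,WORKLOAD
```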