The Performance Optimisation pillar of the Google Cloud Architecture Framework focuses on ensuring your workloads meet their performance requirements while using resources efficiently. It covers compute selection, scaling strategies, caching, database optimisation, and network performance. A well-optimised workload delivers fast response times, handles traffic spikes gracefully, and avoids wasting resources on over-provisioned infrastructure.
| Principle | Description |
|---|---|
| Measure first | Establish baselines and set performance targets before optimising |
| Choose the right compute | Match the compute service to your workload characteristics |
| Scale automatically | Use autoscaling to match capacity to demand |
| Cache aggressively | Reduce latency and backend load with strategic caching |
| Optimise data access | Choose the right database, design efficient schemas, and use connection pooling |
| Minimise distance | Place compute close to users and data close to compute |
The single most impactful performance decision is choosing the correct compute platform:
| Compute Service | Best For | Scaling Model |
|---|---|---|
| Cloud Run | Stateless HTTP services, APIs, event-driven processing | Scale to zero, per-request autoscaling |
| GKE | Complex microservices, stateful applications, GPU workloads | Horizontal pod autoscaling, node autoscaling |
| Cloud Functions | Event handlers, lightweight processing, glue logic | Per-invocation scaling, scale to zero |
| Compute Engine | Custom OS, legacy applications, HPC, specialised hardware | Managed instance groups with autoscaling |
| App Engine | Simple web applications, rapid prototyping | Automatic scaling based on traffic |
For Compute Engine and GKE, choosing the right machine type is critical:
| Series | Optimised For |
|---|---|
| E2 | General-purpose, cost-effective workloads |
| N2/N2D | Balanced performance for most workloads |
| C2/C2D | Compute-intensive workloads (high single-thread performance) |
| M2/M3 | Memory-intensive workloads (SAP HANA, in-memory databases) |
| A2/G2 | GPU workloads (ML training, rendering, HPC) |
| T2D/T2A | ARM-based, cost-efficient for cloud-native workloads |
```bash
# Use the recommender to identify over-provisioned VMs
gcloud recommender recommendations list \
    --project=my-project \
    --location=europe-west2-a \
    --recommender=google.compute.instance.MachineTypeRecommender
```
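Once the table (or the recommender) points you at a series, provisioning is a one-line change. A minimal sketch, assuming a compute-intensive batch worker; the instance name, project, and zone are placeholders:

```bash
# Hypothetical example: provision a C2 instance for a compute-bound workload
gcloud compute instances create batch-worker-1 \
    --project=my-project \
    --zone=europe-west2-a \
    --machine-type=c2-standard-8
```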
GKE provides three autoscaling mechanisms, operating at the pod and node levels:
| Level | Mechanism | What It Scales |
|---|---|---|
| Pod | Horizontal Pod Autoscaler (HPA) | Number of pod replicas based on CPU, memory, or custom metrics |
| Pod | Vertical Pod Autoscaler (VPA) | Pod resource requests and limits based on actual usage |
| Node | Cluster Autoscaler | Number of nodes to accommodate pending pods |
```yaml
# Horizontal Pod Autoscaler configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
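The HPA above scales replica counts; the Vertical Pod Autoscaler complements it by right-sizing the pods themselves. A minimal sketch in recommendation-only mode, assuming vertical pod autoscaling is enabled on the cluster and reusing the order-service Deployment from the HPA example:

```yaml
# Vertical Pod Autoscaler in recommendation-only mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Off"  # emit recommendations without evicting pods
```

Running VPA in Auto mode alongside an HPA that targets CPU is generally discouraged, since both would react to the same signal; recommendation-only mode sidesteps that conflict while you gather sizing data.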
```bash
# Configure Cloud Run autoscaling
gcloud run deploy my-service \
    --min-instances=1 \
    --max-instances=100 \
    --concurrency=80 \
    --cpu=2 \
    --memory=1Gi
```
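With these settings, each instance serves up to 80 concurrent requests, so the service tops out at roughly 8,000 concurrent requests at --max-instances=100, while --min-instances=1 keeps one warm instance to avoid cold starts. As a rule of thumb, lower --concurrency for CPU-heavy request handlers and raise it for I/O-bound ones.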
Caching reduces latency and backend load by serving frequently accessed data from faster storage.
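One common pattern is a managed Redis cache in front of the database. A minimal sketch using Memorystore for Redis; the instance name and region are placeholders:

```bash
# Hypothetical sketch: provision a small Memorystore for Redis cache tier
gcloud redis instances create my-cache \
    --region=europe-west2 \
    --size=1 \
    --tier=basic
```

The basic tier is a single standalone node; choose the standard tier when the cache itself needs high availability with a replica.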