Operational Excellence

The Operational Excellence pillar of the Google Cloud Architecture Framework focuses on your ability to run, monitor, and continuously improve your cloud workloads. It covers the processes, culture, and tooling that let teams operate services reliably and efficiently. A workload with strong operational excellence is observable, automated, and backed by a team that learns from every change and incident.

Design Principles

Principle	Description
Automate everything	Manual processes are error-prone and do not scale — automate deployments, testing, and remediation
Monitor and observe	You cannot improve what you cannot measure — instrument every layer of your stack
Learn from failure	Use incidents as opportunities to improve systems and processes through blameless post-mortems
Manage change safely	Deploy frequently in small batches with automated rollback capabilities
Codify operations	Treat operational runbooks, configurations, and policies as code

Infrastructure as Code (IaC)

All infrastructure on GCP should be defined and managed as code. This ensures consistency, repeatability, and auditability.

Terraform on GCP

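# Define a regional GKE cluster with Terraform
resource "google_container_cluster" "primary" {
  name     = "production-cluster"
  location = "europe-west2"

  # Delete the default node pool after creation so node pools are managed
  # as separate resources. The API still requires an initial node count.
  remove_default_node_pool = true
  initial_node_count       = 1

  # Enable Workload Identity so pods can authenticate as IAM service accounts
  workload_identity_config {
    workload_pool = "my-project.svc.id.goog"
  }
}

resource "google_container_node_pool" "primary_nodes" {
  name     = "primary-node-pool"
  location = "europe-west2"
  cluster  = google_container_cluster.primary.name

  # In a regional cluster, node_count applies per zone:
  # this creates 3 nodes in each zone of europe-west2, not 3 in total.
  node_count = 3

  node_config {
    machine_type = "e2-standard-4"
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }
}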

IaC Best Practices

Practice	Description
Version control	Store all IaC in Git with branch protection and code review
Remote state	Use a GCS backend for Terraform state; it supports state locking natively, and object versioning on the bucket allows recovery of earlier state
Modular design	Create reusable modules for common patterns (VPC, GKE, Cloud SQL)
Plan before apply	Always run `terraform plan` and review changes before applying
Drift detection	Regularly compare actual state against desired state
Secret management	Never store secrets in IaC files — use Secret Manager references
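
As an illustration of the remote state and secret management practices above, here is a minimal sketch. The bucket name `my-project-tfstate`, the prefix, and the secret ID `db-password` are hypothetical placeholders, and the bucket is assumed to exist already.

```hcl
# Remote state in a GCS bucket; the gcs backend handles state locking itself.
# Bucket name and prefix are placeholders, not values from this guide.
terraform {
  backend "gcs" {
    bucket = "my-project-tfstate"
    prefix = "gke/production"
  }
}

# Read a secret from Secret Manager at plan/apply time instead of hardcoding it.
# "db-password" is a hypothetical secret ID; omitting `version` reads the latest.
data "google_secret_manager_secret_version" "db_password" {
  secret = "db-password"
}

# Reference the value where a resource needs it, for example:
#   password = data.google_secret_manager_secret_version.db_password.secret_data
```

Values read through a data source are still written to the Terraform state, so access to the state bucket should be restricted accordingly. For drift detection, one option is to run `terraform plan -detailed-exitcode` on a schedule: an exit code of 2 means the plan found differences between the configuration and the real infrastructure.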

CI/CD on GCP

Continuous integration and continuous delivery (CI/CD) automate the build, test, and deployment process.

Cloud Build

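Cloud Build is GCP's native CI/CD service. The example below runs the tests, builds a container image, pushes it to Artifact Registry, and deploys it to Cloud Run: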

# cloudbuild.yaml
steps:
  # Install dependencies (requires package-lock.json); /workspace is
  # shared between steps, so node_modules is available to the test step
  - name: 'node:18'
    entrypoint: 'npm'
    args: ['ci']

  # Run tests
  - name: 'node:18'
    entrypoint: 'npm'
    args: ['test']

  # Build container image
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'europe-west2-docker.pkg.dev/my-project/my-repo/my-app:latest', '.']

  # Push to Artifact Registry
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'europe-west2-docker.pkg.dev/my-project/my-repo/my-app:latest']

  # Deploy to Cloud Run
  - name: 'gcr.io/cloud-builders/gcloud'
    args: ['run', 'deploy', 'my-app',
           '--image', 'europe-west2-docker.pkg.dev/my-project/my-repo/my-app:latest',
           '--region', 'europe-west2',
           '--platform', 'managed']
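
To run this pipeline automatically on each push rather than by hand, a build trigger can itself be defined in Terraform, in keeping with the IaC section. This is a sketch under assumptions: the GitHub owner `my-org`, the repository `my-app`, and the branch `main` are hypothetical, and the repository is assumed to be already connected to Cloud Build.

```hcl
# Run cloudbuild.yaml on every push to the main branch.
# Owner, repository, and branch are hypothetical placeholders.
resource "google_cloudbuild_trigger" "deploy_on_main" {
  name     = "deploy-on-main"
  filename = "cloudbuild.yaml"

  github {
    owner = "my-org"
    name  = "my-app"

    push {
      branch = "^main$" # regular expression matched against the branch name
    }
  }
}
```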

Deployment Strategies

Strategy	Description	GCP Implementation
Rolling update	Replace instances gradually	GKE rolling deployments, MIG rolling updates
Blue-green	Run two environments, switch traffic	Cloud Run traffic splitting, GKE with multiple deployments
Canary	Route a small percentage of traffic to the new version	Cloud Run traffic splitting, Istio on GKE
Feature flags	Enable features for specific users without deploying	Firebase Remote Config, custom feature flag service
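
As one possible way to express the canary row with Cloud Run traffic splitting, the sketch below keeps 90% of traffic on the current revision and sends 10% to the new one. The revision names `my-app-v1` and `my-app-v2` and the `:v2` image tag are hypothetical, and `my-app-v1` is assumed to have been deployed already.

```hcl
# Canary rollout on Cloud Run: 10% of requests go to the new revision.
# Revision names and the image tag are hypothetical placeholders.
resource "google_cloud_run_v2_service" "app" {
  name     = "my-app"
  location = "europe-west2"

  template {
    revision = "my-app-v2" # name given to the new (canary) revision

    containers {
      image = "europe-west2-docker.pkg.dev/my-project/my-repo/my-app:v2"
    }
  }

  # Existing stable revision keeps most of the traffic
  traffic {
    type     = "TRAFFIC_TARGET_ALLOCATION_TYPE_REVISION"
    revision = "my-app-v1"
    percent  = 90
  }

  # New revision receives the canary share
  traffic {
    type     = "TRAFFIC_TARGET_ALLOCATION_TYPE_REVISION"
    revision = "my-app-v2"
    percent  = 10
  }
}
```

Raising the second `percent` in steps (with the first lowered to match, so the two sum to 100) promotes the canary. Setting the values to 0 and 100 in a single change gives a blue-green style switch instead.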

Monitoring and Observability

Operational excellence requires comprehensive observability across metrics, logs, and traces.
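
Alerting can be codified alongside the rest of the infrastructure. The following is a minimal sketch of a Cloud Monitoring alert policy for the Cloud Run service deployed earlier; the threshold of 1 request per second, the 5-minute duration, and the display names are illustrative assumptions, and no notification channel is attached.

```hcl
# Alert when the rate of 5xx responses from Cloud Run stays high.
# Threshold and duration are illustrative values, not recommendations.
resource "google_monitoring_alert_policy" "cloud_run_5xx" {
  display_name = "Cloud Run 5xx rate"
  combiner     = "OR"

  conditions {
    display_name = "5xx responses above threshold"

    condition_threshold {
      filter          = "resource.type = \"cloud_run_revision\" AND metric.type = \"run.googleapis.com/request_count\" AND metric.labels.response_code_class = \"5xx\""
      comparison      = "COMPARISON_GT"
      threshold_value = 1      # requests per second, evaluated per time series
      duration        = "300s" # condition must hold for 5 minutes

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE" # convert request counts to a per-second rate
      }
    }
  }

  # Add notification_channels here to route the alert to email, chat, or paging.
}
```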
