Monitoring and Best Practices

A CI/CD pipeline does not end at deployment. Monitoring production, tracking pipeline health, and fostering a CI/CD culture are essential for long-term success. This lesson covers observability, pipeline metrics, and best practices for mature CI/CD adoption.

Post-Deployment Monitoring

Deploying code is only half the job. You need to know if the deployment is working correctly in production.

The Four Golden Signals

Google's SRE team defines four key metrics for monitoring any system:

Signal	What It Measures	Example
Latency	How long requests take	p50: 50ms, p99: 200ms
Traffic	How much demand is on the system	1,000 requests/sec
Errors	Rate of failed requests	0.1% error rate
Saturation	How full the system is	CPU at 70%, memory at 85%
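
To make the first three signals concrete, here is a minimal sketch that derives latency percentiles, traffic, and error rate from a batch of request records. The `Request` record, the `golden_signals` function, and the sample data are all hypothetical, and counting only 5xx responses as errors is an assumption. Saturation is left out because it comes from resource metrics (CPU, memory) rather than from request logs.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    status: int         # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds):
    """Summarise one observation window of requests."""
    durations = [r.duration_ms for r in requests]
    failed = sum(1 for r in requests if r.status >= 500)  # assumption: only 5xx counts as an error
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p99_ms": percentile(durations, 99),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
    }


# 100 requests observed over a 10-second window (made-up sample data)
sample = [Request(50.0, 200) for _ in range(98)]
sample += [Request(200.0, 200), Request(900.0, 500)]
signals = golden_signals(sample, window_seconds=10)
print(signals)
```

In practice a monitoring tool computes these continuously. The sketch only shows what each number means.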

Monitoring Stack

Tool	Purpose
Prometheus	Metrics collection and alerting
Grafana	Dashboards and visualisation
Datadog	All-in-one monitoring SaaS
New Relic	Application performance monitoring
PagerDuty / Opsgenie	Incident alerting and on-call management
Sentry	Error tracking and crash reporting
OpenTelemetry	Vendor-neutral telemetry collection

Deployment Verification

After every deployment, verify that the new version is healthy:

Automated Smoke Tests

# Run smoke tests after deployment
deploy:
  steps:
    - name: Deploy to production
      run: ./deploy.sh

    - name: Smoke test
      run: |
        # Wait for deployment to stabilise
        sleep 30

        # Check health endpoint (give up after 10 seconds rather than hanging)
        STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 https://myapp.example.com/health)
        if [ "$STATUS" != "200" ]; then
          echo "Health check failed with status $STATUS"
          exit 1
        fi

        # Check critical endpoint
        STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 https://myapp.example.com/api/status)
        if [ "$STATUS" != "200" ]; then
          echo "API status check failed with status $STATUS"
          exit 1
        fi

        echo "All smoke tests passed"
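
A fixed `sleep 30` can be too short for a slow rollout or wastefully long for a fast one. One alternative is to poll the health endpoint until it responds, as in the sketch below. The function names, the retry count, and the delay are assumptions rather than part of the pipeline above. The `probe` and `sleep` parameters exist only so the loop can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, and so on


def wait_until_healthy(url, attempts=10, delay=3.0, probe=http_status, sleep=time.sleep):
    """Poll url until it returns 200; give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        status = probe(url)
        if status == 200:
            return True
        print(f"attempt {attempt}/{attempts}: got status {status}")
        if attempt < attempts:
            sleep(delay)
    return False
```

A pipeline step could call `wait_until_healthy("https://myapp.example.com/health")` and exit with a non-zero code when it returns `False`, which would mark the step as failed.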

Automatic Rollback

If post-deployment checks fail, automatically roll back:

    - name: Rollback on failure
      if: failure()
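      run: |
        echo "Deployment verification failed, rolling back"
        kubectl rollout undo deployment/myapp
        # Wait for the previous revision to finish rolling out
        kubectl rollout status deployment/myapp --timeout=120s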

Pipeline Metrics

Track the health of your CI/CD pipeline itself:

Key Metrics

Metric	What It Measures	Target
Build duration	Time from push to artefact	< 5 minutes
Deploy frequency	How often you deploy	Multiple times per day
Lead time for changes	Commit to production	< 1 hour
Change failure rate	Percentage of deploys that cause incidents	< 5%
Mean time to recover (MTTR)	Average time to restore service after a production failure	< 1 hour
Pipeline success rate	Percentage of passing builds	> 95%
Flaky test rate	Percentage of tests that pass or fail intermittently	< 1%
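
As an illustration of how the last two rows might be computed, the sketch below works from raw build and test results. The input shapes and function names are hypothetical. The definition of flaky used here (a test that both passed and failed on the same commit) is one common convention, not the only one.

```python
from collections import defaultdict


def pipeline_success_rate(build_results):
    """build_results: list of booleans, True for a passing build."""
    return sum(build_results) / len(build_results)


def flaky_test_rate(test_runs):
    """test_runs: iterable of (test_name, commit_sha, passed) tuples.

    Assumption: a test is flaky if it both passed and failed on the same commit.
    """
    outcomes = defaultdict(set)
    all_tests = set()
    for name, commit, passed in test_runs:
        outcomes[(name, commit)].add(passed)
        all_tests.add(name)
    flaky = {name for (name, _), seen in outcomes.items() if len(seen) == 2}
    return len(flaky) / len(all_tests)


builds = [True] * 19 + [False]
runs = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),     # same commit, different result: flaky
    ("test_checkout", "abc123", True),
    ("test_checkout", "def456", False),  # different commit: treated as a real change
    ("test_search", "abc123", True),
    ("test_search", "def456", True),
]
print(f"success rate: {pipeline_success_rate(builds):.0%}")
print(f"flaky rate: {flaky_test_rate(runs):.0%}")
```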

DORA Metrics

The DORA (DevOps Research and Assessment) team identified four key metrics that correlate with high-performing teams:

DORA Metric	Elite	High	Medium	Low
Deploy frequency	On-demand (multiple/day)	Weekly to monthly	Monthly to every 6 months	Less than once every 6 months
Lead time for changes	< 1 hour	1 day-1 week	1-6 months	6 months+
Change failure rate	0-15%	16-30%	16-30%	46-60%
Time to restore	< 1 hour	< 1 day	1 day-1 week	6 months+
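
The four metrics can be derived from a simple deployment history. The following is a minimal sketch, assuming a hypothetical `Deployment` record that stores commit, deploy, and restore timestamps. Using the median rather than the mean for lead time and time to restore is a choice made here for robustness to outliers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause an incident?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments, period_days):
    """Compute the four DORA metrics over a period of `period_days` days."""
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Made-up history: four deployments over two days, one of which failed
start = datetime(2024, 1, 1, 9, 0)
history = [
    Deployment(start, start + timedelta(minutes=30)),
    Deployment(start, start + timedelta(minutes=45)),
    Deployment(start, start + timedelta(minutes=60), failed=True,
               restored_at=start + timedelta(minutes=100)),
    Deployment(start, start + timedelta(minutes=90)),
]
metrics = dora_metrics(history, period_days=2)
for name, value in metrics.items():
    print(f"{name}: {value}")
```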

CI/CD Best Practices

1. Keep the Build Fast

Target: < 5 minutes for the feedback loop

Strategies:
├── Cache dependencies (node_modules, pip, Maven)
├── Run tests in parallel
├── Use incremental builds
├── Split large test suites
└── Use faster runners (larger machines)
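
Splitting a test suite works best when each parallel runner gets a similar amount of work. The sketch below shows one possible approach: a greedy split based on recorded test durations. The test names and timings are invented, and real CI systems often provide their own test-splitting feature.

```python
import heapq


def split_tests(durations, runners):
    """Greedy split: assign the slowest remaining test to the least-loaded runner."""
    heap = [(0.0, index) for index in range(runners)]  # (total seconds, runner index)
    heapq.heapify(heap)
    buckets = [[] for _ in range(runners)]
    for name, seconds in sorted(durations.items(), key=lambda item: -item[1]):
        load, index = heapq.heappop(heap)
        buckets[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    return buckets


# Hypothetical timings in seconds from a previous run
timings = {"test_api": 120, "test_ui": 90, "test_db": 60, "test_auth": 45, "test_utils": 15}
buckets = split_tests(timings, runners=2)
for index, bucket in enumerate(buckets):
    print(index, bucket, sum(timings[name] for name in bucket))
```

With these timings each runner ends up with 165 seconds of tests, compared with 330 seconds when everything runs on a single runner.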

2. Fail Fast

Order your pipeline stages from fastest to slowest, so that quick checks can fail the build before the slow stages start.
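
As a rough sketch of that ordering, the runner below sorts stages by estimated duration and stops at the first failure. The stage names, durations, and the `run_fail_fast` helper are hypothetical. Real stages often have dependencies (for example, a build before tests) that would constrain the order.

```python
def run_fail_fast(stages):
    """stages: list of (name, estimated_seconds, check), where check() returns True on success.

    Runs the cheapest stages first and stops at the first failure.
    """
    executed = []
    for name, _, check in sorted(stages, key=lambda stage: stage[1]):
        executed.append(name)
        if not check():
            return False, executed
    return True, executed


stages = [
    ("e2e tests", 600, lambda: True),
    ("unit tests", 60, lambda: False),  # simulated failure
    ("lint", 10, lambda: True),
    ("integration tests", 240, lambda: True),
]
ok, executed = run_fail_fast(stages)
print(ok, executed)
```

Here the failure surfaces after about 70 seconds of work, and the integration and end-to-end stages never run.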
