Building an AI application is only half the battle. Deploying it reliably, monitoring its behaviour, managing costs, and ensuring security require a distinct set of practices. This lesson covers the key patterns for production AI deployments.
```
┌──────────┐      ┌──────────────┐      ┌──────────────┐
│  Client  │─────▶│ API Gateway  │─────▶│  AI Service  │
│  (Web/   │      │ (Auth,       │      │  (Your App)  │
│  Mobile) │      │  Rate Limit) │      │              │
└──────────┘      └──────────────┘      └──────┬───────┘
                                               │
                          ┌────────────────────┼─────────────────┐
                          │                    │                 │
                          ▼                    ▼                 ▼
                   ┌────────────┐       ┌────────────┐      ┌──────────┐
                   │  LLM API   │       │ Vector DB  │      │  Cache   │
                   │ (Primary)  │       │            │      │ (Redis)  │
                   └──────┬─────┘       └────────────┘      └──────────┘
                          │
                          ▼
                   ┌────────────┐
                   │  LLM API   │
                   │ (Fallback) │
                   └────────────┘
```
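Note the fallback path in the diagram: if the primary LLM API errors or times out, the request is retried against a second provider. A minimal sketch of that failover (`call_primary` and `call_fallback` are hypothetical wrappers around your two provider SDKs):

```python
import asyncio
import logging

logger = logging.getLogger("ai_service")

async def call_primary(prompt: str) -> str:
    # Placeholder: wrap your primary provider's SDK call here.
    raise NotImplementedError

async def call_fallback(prompt: str) -> str:
    # Placeholder: wrap your fallback provider's SDK call here.
    raise NotImplementedError

async def generate_with_failover(prompt: str, timeout_s: float = 15.0) -> str:
    try:
        # Bound the primary call so a hung request cannot stall the user.
        return await asyncio.wait_for(call_primary(prompt), timeout=timeout_s)
    except Exception as exc:
        logger.warning("Primary LLM failed (%s); using fallback", exc)
        return await asyncio.wait_for(call_fallback(prompt), timeout=timeout_s)
```

In practice you would also distinguish retryable failures (timeouts, 429s, 5xx) from client errors that a fallback cannot fix.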
An API gateway sits between clients and your AI service, handling cross-cutting concerns.
| Concern | Implementation |
|---|---|
| Authentication | API keys, JWT tokens, OAuth |
| Rate limiting | Per-user or per-tier request limits |
| Request validation | Schema validation before reaching your service |
| Logging | Log all requests and responses for debugging |
| Caching | Cache identical requests to reduce cost |
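Of these, caching deserves a concrete look: identical requests are common (FAQ-style questions, user retries), and serving them from a cache avoids a paid model call entirely. A minimal sketch using Redis via the redis-py client (the key scheme and one-hour TTL are illustrative choices):

```python
import hashlib
import json

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_key(model: str, message: str) -> str:
    # Hash the full request so keys stay short and uniform.
    payload = json.dumps({"model": model, "message": message}, sort_keys=True)
    return "chat:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_response(model: str, message: str) -> str | None:
    return r.get(cache_key(model, message))

def store_response(model: str, message: str, response: str, ttl_s: int = 3600) -> None:
    # Expire entries so stale answers age out after an hour.
    r.set(cache_key(model, message), response, ex=ttl_s)
```

With the gateway handling these concerns, the AI service itself can stay thin. A minimal FastAPI endpoint: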
```python
from fastapi import FastAPI, Depends
from fastapi.security import HTTPBearer
from pydantic import BaseModel
import time

app = FastAPI()
security = HTTPBearer()  # rejects requests without a Bearer token

class ChatRequest(BaseModel):
    message: str
    session_id: str

class ChatResponse(BaseModel):
    response: str
    tokens_used: int
    latency_ms: float

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest, token=Depends(security)):
    start = time.time()
    # Validate and process; generate_response and count_tokens are
    # your application's own inference and token-counting helpers.
    response_text = await generate_response(request.message, request.session_id)
    latency = (time.time() - start) * 1000
    return ChatResponse(
        response=response_text,
        tokens_used=count_tokens(response_text),
        latency_ms=round(latency, 2),
    )
```
Once the service is live, track these metrics:

| Metric | Why It Matters |
|---|---|
| Latency (p50, p95, p99) | User experience, SLA compliance |
| Error rate | Service reliability |
| Token usage | Cost tracking |
| Request volume | Capacity planning |
| Model response quality | Catch degradation early |
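Averages hide tail latency, which is why the table lists p50/p95/p99: a handful of very slow requests can dominate user experience while the mean looks healthy. A minimal in-process percentile tracker (a sketch; production systems usually delegate this to a metrics stack such as Prometheus):

```python
class LatencyTracker:
    """Keeps a rolling window of latencies and reports percentiles."""

    def __init__(self, window: int = 1000):
        self.window = window
        self.samples: list[float] = []

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)
        # Keep only the most recent `window` samples.
        if len(self.samples) > self.window:
            self.samples = self.samples[-self.window:]

    def percentile(self, p: float) -> float:
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        # Nearest-rank percentile over the current window.
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]

tracker = LatencyTracker()
for ms in (120.0, 135.0, 110.0, 980.0, 140.0):  # example latencies
    tracker.record(ms)
print(tracker.percentile(50), tracker.percentile(95))  # 135.0 980.0
```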
Structured, per-request logs make most of these metrics derivable after the fact:

```python
import json
import logging
import time

logger = logging.getLogger("ai_service")

def log_request(request_id, input_text, output_text, model,
                input_tokens, output_tokens, latency_ms, error=None):
    logger.info(json.dumps({
        "request_id": request_id,
        "model": model,
        # Truncate raw text so individual log lines stay a manageable size.
        "input": input_text[:500],
        "output": output_text[:500],
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "error": str(error) if error else None,
        "timestamp": time.time(),
    }))
```
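Emitting one JSON object per request keeps the logs machine-queryable in any JSON-aware log store. A quick illustration (all values here are made up):

```python
logging.basicConfig(level=logging.INFO)  # emit log lines to stderr for a quick test

log_request(
    request_id="req_123",
    input_text="What is your refund policy?",
    output_text="Refunds are available within 30 days...",
    model="gpt-4o-mini",
    input_tokens=812,
    output_tokens=64,
    latency_ms=642.5,
)
```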
Set up alerts for, at a minimum: sustained error-rate spikes, p95 latency breaching your SLA, abnormal request volume, and daily spend crossing budget thresholds.
Track costs in real time:
```python
class CostDashboard:
    def __init__(self, daily_budget: float):
        self.daily_budget = daily_budget
        # Note: resetting these counters at midnight is left to the caller.
        self.daily_cost = 0.0
        self.requests_today = 0

    def record(self, input_tokens: int, output_tokens: int, model: str):
        cost = self._calculate_cost(input_tokens, output_tokens, model)
        self.daily_cost += cost
        self.requests_today += 1
        # Check the harder limit first so only one alert fires per request.
        if self.daily_cost > self.daily_budget:
            self._alert(f"BUDGET EXCEEDED: ${self.daily_cost:.2f}")
        elif self.daily_cost > self.daily_budget * 0.8:
            self._alert(f"80% of daily budget used: ${self.daily_cost:.2f}")

    def _calculate_cost(self, input_tokens, output_tokens, model):
        # USD per 1M tokens (input, output); verify against current pricing.
        prices = {
            "gpt-4o-mini": (0.15, 0.60),
            "gpt-4o": (2.50, 10.00),
        }
        input_price, output_price = prices.get(model, (1.0, 3.0))
        return (input_tokens / 1e6 * input_price) + (output_tokens / 1e6 * output_price)

    def _alert(self, message):
        logger.warning(f"COST ALERT: {message}")
```
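Hooking the dashboard in is one call per completed request, using the token counts the provider returns. For example, with an illustrative $50 daily budget:

```python
dashboard = CostDashboard(daily_budget=50.0)

# Record the actual usage reported by the API response for each request.
dashboard.record(input_tokens=1200, output_tokens=350, model="gpt-4o-mini")
print(f"${dashboard.daily_cost:.4f} spent across {dashboard.requests_today} request(s)")
```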