Building an AI application is only half the battle. Deploying it reliably, monitoring its behaviour, managing costs, and ensuring security require a distinct set of practices. This lesson covers the key patterns for production AI deployments.
```
┌──────────┐      ┌──────────────┐      ┌──────────────┐
│  Client  │─────▶│ API Gateway  │─────▶│  AI Service  │
│  (Web/   │      │ (Auth,       │      │  (Your App)  │
│  Mobile) │      │  Rate Limit) │      │              │
└──────────┘      └──────────────┘      └──────┬───────┘
                                               │
                          ┌────────────────────┼─────────────────┐
                          │                    │                 │
                          ▼                    ▼                 ▼
                   ┌────────────┐       ┌────────────┐      ┌──────────┐
                   │  LLM API   │       │ Vector DB  │      │  Cache   │
                   │ (Primary)  │       │            │      │ (Redis)  │
                   └──────┬─────┘       └────────────┘      └──────────┘
                          │
                          ▼
                   ┌────────────┐
                   │  LLM API   │
                   │ (Fallback) │
                   └────────────┘
```
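Note the fallback path in the diagram: if the primary LLM API errors or times out, the request is retried against a second provider. A minimal sketch of that failover (`call_primary` and `call_fallback` are hypothetical wrappers around your two provider SDKs):

```python
import asyncio
import logging

logger = logging.getLogger("ai_service")

async def call_primary(prompt: str) -> str:
    # Placeholder: wrap your primary provider's SDK call here.
    raise NotImplementedError

async def call_fallback(prompt: str) -> str:
    # Placeholder: wrap your fallback provider's SDK call here.
    raise NotImplementedError

async def generate_with_failover(prompt: str, timeout_s: float = 15.0) -> str:
    try:
        # Bound the primary call so a hung request cannot stall the user.
        return await asyncio.wait_for(call_primary(prompt), timeout=timeout_s)
    except Exception as exc:
        logger.warning("Primary LLM failed (%s); using fallback", exc)
        return await asyncio.wait_for(call_fallback(prompt), timeout=timeout_s)
```

In practice you would also distinguish retryable failures (timeouts, 429s, 5xx) from client errors that a fallback cannot fix.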
An API gateway sits between clients and your AI service, handling cross-cutting concerns.
| Concern | Implementation |
|---|---|
| Authentication | API keys, JWT tokens, OAuth |
| Rate limiting | Per-user or per-tier request limits |
| Request validation | Schema validation before reaching your service |
| Logging | Log all requests and responses for debugging |
| Caching | Cache identical requests to reduce cost |
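Of these, caching deserves a concrete look: identical requests are common (FAQ-style questions, user retries), and serving them from a cache avoids a paid model call entirely. A minimal sketch using Redis via the redis-py client (the key scheme and one-hour TTL are illustrative choices):

```python
import hashlib
import json

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_key(model: str, message: str) -> str:
    # Hash the full request so keys stay short and uniform.
    payload = json.dumps({"model": model, "message": message}, sort_keys=True)
    return "chat:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_response(model: str, message: str) -> str | None:
    return r.get(cache_key(model, message))

def store_response(model: str, message: str, response: str, ttl_s: int = 3600) -> None:
    # Expire entries so stale answers age out after an hour.
    r.set(cache_key(model, message), response, ex=ttl_s)
```

With the gateway handling these concerns, the AI service itself can stay thin. A minimal FastAPI endpoint: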
```python
from fastapi import FastAPI, Depends
from fastapi.security import HTTPBearer
from pydantic import BaseModel
import time

app = FastAPI()
security = HTTPBearer()  # rejects requests without a Bearer token

class ChatRequest(BaseModel):
    message: str
    session_id: str

class ChatResponse(BaseModel):
    response: str
    tokens_used: int
    latency_ms: float

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest, token=Depends(security)):
    start = time.time()
    # Validate and process; generate_response and count_tokens are
    # your application's own inference and token-counting helpers.
    response_text = await generate_response(request.message, request.session_id)
    latency = (time.time() - start) * 1000
    return ChatResponse(
        response=response_text,
        tokens_used=count_tokens(response_text),
        latency_ms=round(latency, 2),
    )
```
Once the service is live, track these metrics:

| Metric | Why It Matters |
|---|---|
| Latency (p50, p95, p99) | User experience, SLA compliance |
| Error rate | Service reliability |
| Token usage | Cost tracking |
| Request volume | Capacity planning |
| Model response quality | Catch degradation early |
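Averages hide tail latency, which is why the table lists p50/p95/p99: a handful of very slow requests can dominate user experience while the mean looks healthy. A minimal in-process percentile tracker (a sketch; production systems usually delegate this to a metrics stack such as Prometheus):

```python
class LatencyTracker:
    """Keeps a rolling window of latencies and reports percentiles."""

    def __init__(self, window: int = 1000):
        self.window = window
        self.samples: list[float] = []

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)
        # Keep only the most recent `window` samples.
        if len(self.samples) > self.window:
            self.samples = self.samples[-self.window:]

    def percentile(self, p: float) -> float:
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        # Nearest-rank percentile over the current window.
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]

tracker = LatencyTracker()
for ms in (120.0, 135.0, 110.0, 980.0, 140.0):  # example latencies
    tracker.record(ms)
print(tracker.percentile(50), tracker.percentile(95))  # 135.0 980.0
```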
Structured, per-request logs make most of these metrics derivable after the fact:

```python
import json
import logging
import time

logger = logging.getLogger("ai_service")

def log_request(request_id, input_text, output_text, model,
                input_tokens, output_tokens, latency_ms, error=None):
    logger.info(json.dumps({
        "request_id": request_id,
        "model": model,
        # Truncate raw text so individual log lines stay a manageable size.
        "input": input_text[:500],
        "output": output_text[:500],
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "error": str(error) if error else None,
        "timestamp": time.time(),
    }))
```
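Emitting one JSON object per request keeps the logs machine-queryable in any JSON-aware log store. A quick illustration (all values here are made up):

```python
logging.basicConfig(level=logging.INFO)  # emit log lines to stderr for a quick test

log_request(
    request_id="req_123",
    input_text="What is your refund policy?",
    output_text="Refunds are available within 30 days...",
    model="gpt-4o-mini",
    input_tokens=812,
    output_tokens=64,
    latency_ms=642.5,
)
```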
Set up alerts for, at a minimum: sustained error-rate spikes, p95 latency breaching your SLA, abnormal request volume, and daily spend crossing budget thresholds.
Track costs in real time:
```python
class CostDashboard:
    def __init__(self, daily_budget: float):
        self.daily_budget = daily_budget
        # Note: resetting these counters at midnight is left to the caller.
        self.daily_cost = 0.0
        self.requests_today = 0

    def record(self, input_tokens: int, output_tokens: int, model: str):
        cost = self._calculate_cost(input_tokens, output_tokens, model)
        self.daily_cost += cost
        self.requests_today += 1
        # Check the harder limit first so only one alert fires per request.
        if self.daily_cost > self.daily_budget:
            self._alert(f"BUDGET EXCEEDED: ${self.daily_cost:.2f}")
        elif self.daily_cost > self.daily_budget * 0.8:
            self._alert(f"80% of daily budget used: ${self.daily_cost:.2f}")

    def _calculate_cost(self, input_tokens, output_tokens, model):
        # USD per 1M tokens (input, output); verify against current pricing.
        prices = {
            "gpt-4o-mini": (0.15, 0.60),
            "gpt-4o": (2.50, 10.00),
        }
        input_price, output_price = prices.get(model, (1.0, 3.0))
        return (input_tokens / 1e6 * input_price) + (output_tokens / 1e6 * output_price)

    def _alert(self, message):
        logger.warning(f"COST ALERT: {message}")
```
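Hooking the dashboard in is one call per completed request, using the token counts the provider returns. For example, with an illustrative $50 daily budget:

```python
dashboard = CostDashboard(daily_budget=50.0)

# Record the actual usage reported by the API response for each request.
dashboard.record(input_tokens=1200, output_tokens=350, model="gpt-4o-mini")
print(f"${dashboard.daily_cost:.4f} spent across {dashboard.requests_today} request(s)")
```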