Training a deep learning model is only half the challenge. To create real-world impact, models must be deployed — packaged, optimised, and served so that applications and users can make predictions in production.
| Stage | Description | Tools |
|---|---|---|
| Training | Train and validate the model | PyTorch, TensorFlow, GPUs |
| Export | Convert the model to a deployment-friendly format | ONNX, TorchScript, SavedModel |
| Optimise | Reduce model size and latency | Quantisation, pruning, distillation |
| Serve | Host the model behind an API or embed it | FastAPI, TorchServe, TF Serving |
| Monitor | Track performance, data drift, and errors | MLflow, Prometheus, Grafana |
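To make the Serve stage concrete, here is a minimal sketch of wrapping a model in a FastAPI endpoint. The endpoint path, request schema, and model file are illustrative; it assumes a TorchScript export like the one produced in the next section.

```python
# Minimal serving sketch (illustrative endpoint; assumes a TorchScript
# model file like the one exported in the next section)
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = torch.jit.load('model_traced.pt')
model.eval()

class PredictRequest(BaseModel):
    pixels: list[float]  # one flattened 3x224x224 image (illustrative schema)

@app.post('/predict')
def predict(req: PredictRequest):
    x = torch.tensor(req.pixels).reshape(1, 3, 224, 224)
    with torch.no_grad():
        logits = model(x)
    return {'class_id': logits.argmax(dim=1).item()}
```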
TorchScript converts a PyTorch model into a serialised program that can be loaded and run without a Python interpreter, for example from a C++ application via LibTorch.
```python
import torch
import torchvision

# Any trained model works here; a torchvision ResNet-18 serves as the example
model = torchvision.models.resnet18(weights=None)
model.eval()

# Method 1: Tracing (records the ops executed for one example input;
# suitable for models without data-dependent control flow)
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')

# Method 2: Scripting (compiles the model source, so data-dependent
# control flow like if/else is preserved)
scripted_model = torch.jit.script(model)
scripted_model.save('model_scripted.pt')

# Load anywhere LibTorch runs (no Python needed)
loaded = torch.jit.load('model_traced.pt')
output = loaded(example_input)
```
ONNX (Open Neural Network Exchange) provides a framework-neutral format: a model exported from PyTorch can be loaded by runtimes such as ONNX Runtime, TensorRT, or OpenVINO.
```python
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)  # example model, as above
model.eval()

# Export with a dummy input; dynamic_axes lets the batch size vary at runtime
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}},
)
```
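A quick way to sanity-check the export is to run it with ONNX Runtime (assuming the `onnxruntime` package is installed; the batch size of 4 below is arbitrary).

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('model.onnx')
input_name = session.get_inputs()[0].name

# The dynamic batch axis accepts any batch size; 4 is arbitrary here
batch = np.random.randn(4, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: batch})
print(outputs[0].shape)  # (4, 1000) for the ResNet-18 example
```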
Quantisation lowers the precision of model weights and activations, typically from 32-bit floating point to 8-bit integers, shrinking the model roughly 4x and speeding up inference.
| Type | Description | Speed-Up |
|---|---|---|
| Dynamic quantisation | Quantises weights ahead of time; activations quantised at runtime | 2–3x |
| Static quantisation | Quantises both weights and activations using calibration data | 3–4x |
| Quantisation-Aware Training (QAT) | Simulates quantisation during training for best accuracy | 3–4x |
```python
import os
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)  # example model, as above
model.eval()

# Dynamic quantisation (simplest approach): weights are converted to int8
# ahead of time, activations are quantised on the fly at runtime
quantised_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # layer types to quantise
    dtype=torch.qint8,
)

# Compare the on-disk sizes of the two state dicts
torch.save(model.state_dict(), 'original.pth')
torch.save(quantised_model.state_dict(), 'quantised.pth')
original_size = os.path.getsize('original.pth') / 1e6
quantised_size = os.path.getsize('quantised.pth') / 1e6
print(f"Original: {original_size:.1f} MB")
print(f"Quantised: {quantised_size:.1f} MB")
```
Pruning zeroes out low-importance weights; the resulting sparsity reduces model size and computation when paired with sparse storage or sparsity-aware kernels.
```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# fc1 stands in for any linear layer of a trained model (e.g. model.fc1)
fc1 = nn.Linear(128, 10)

# Prune the 30% of weights with the smallest L1 magnitude
prune.l1_unstructured(fc1, name='weight', amount=0.3)

# Count zeroed weights to verify the resulting sparsity
total = fc1.weight.nelement()
zeros = (fc1.weight == 0).sum().item()
print(f"Sparsity: {100 * zeros / total:.1f}%")
```
Knowledge distillation trains a smaller "student" model to mimic the predictions of a larger "teacher" model.
```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    # Soft targets: match the teacher's temperature-softened distribution
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction='batchmean',
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the true labels
    hard_loss = F.cross_entropy(student_logits, labels)
    # Blend the two; alpha weights the distillation (soft) term
    return alpha * soft_loss + (1 - alpha) * hard_loss
```
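A hypothetical training step wiring this loss in; `teacher`, `student`, `optimizer`, and `train_loader` are assumed to already exist.

```python
import torch

# Hypothetical distillation step: the teacher only provides targets,
# so it runs in eval mode with gradients disabled
teacher.eval()
for images, labels in train_loader:
    with torch.no_grad():
        teacher_logits = teacher(images)
    student_logits = student(images)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```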