Training a deep learning model is only half the challenge. To create real-world impact, models must be deployed — packaged, optimised, and served so that applications and users can make predictions in production.
| Stage | Description | Tools |
|---|---|---|
| Training | Train and validate the model | PyTorch, TensorFlow, GPUs |
| Export | Convert the model to a deployment-friendly format | ONNX, TorchScript, SavedModel |
| Optimise | Reduce model size and latency | Quantisation, pruning, distillation |
| Serve | Host the model behind an API or embed it | FastAPI, TorchServe, TF Serving |
| Monitor | Track performance, data drift, and errors | MLflow, Prometheus, Grafana |
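To make the Serve stage concrete, here is a minimal sketch of wrapping a model in a FastAPI endpoint. The endpoint path, request schema, and model file are illustrative; it assumes a TorchScript export like the one produced in the next section.

```python
# Minimal serving sketch (illustrative endpoint; assumes a TorchScript
# model file like the one exported in the next section)
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = torch.jit.load('model_traced.pt')
model.eval()

class PredictRequest(BaseModel):
    pixels: list[float]  # one flattened 3x224x224 image (illustrative schema)

@app.post('/predict')
def predict(req: PredictRequest):
    x = torch.tensor(req.pixels).reshape(1, 3, 224, 224)
    with torch.no_grad():
        logits = model(x)
    return {'class_id': logits.argmax(dim=1).item()}
```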
TorchScript converts a PyTorch model into a serialised program that can be loaded and run without a Python interpreter, for example from a C++ application via LibTorch.
```python
import torch
import torchvision

# Any trained model works here; a torchvision ResNet-18 serves as the example
model = torchvision.models.resnet18(weights=None)
model.eval()

# Method 1: Tracing (records the ops executed for one example input;
# suitable for models without data-dependent control flow)
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')

# Method 2: Scripting (compiles the model source, so data-dependent
# control flow like if/else is preserved)
scripted_model = torch.jit.script(model)
scripted_model.save('model_scripted.pt')

# Load anywhere LibTorch runs (no Python needed)
loaded = torch.jit.load('model_traced.pt')
output = loaded(example_input)
```
ONNX (Open Neural Network Exchange) provides a framework-neutral format: a model exported from PyTorch can be loaded by runtimes such as ONNX Runtime, TensorRT, or OpenVINO.
```python
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)  # example model, as above
model.eval()

# Export with a dummy input; dynamic_axes lets the batch size vary at runtime
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}},
)
```
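A quick way to sanity-check the export is to run it with ONNX Runtime (assuming the `onnxruntime` package is installed; the batch size of 4 below is arbitrary).

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('model.onnx')
input_name = session.get_inputs()[0].name

# The dynamic batch axis accepts any batch size; 4 is arbitrary here
batch = np.random.randn(4, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: batch})
print(outputs[0].shape)  # (4, 1000) for the ResNet-18 example
```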
Quantisation lowers the precision of model weights and activations, typically from 32-bit floating point to 8-bit integers, shrinking the model roughly 4x and speeding up inference.
| Type | Description | Speed-Up |
|---|---|---|
| Dynamic quantisation | Quantises weights ahead of time; activations quantised at runtime | 2–3x |
| Static quantisation | Quantises both weights and activations using calibration data | 3–4x |
| Quantisation-Aware Training (QAT) | Simulates quantisation during training for best accuracy | 3–4x |
```python
import os
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)  # example model, as above
model.eval()

# Dynamic quantisation (simplest approach): weights are converted to int8
# ahead of time, activations are quantised on the fly at runtime
quantised_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # layer types to quantise
    dtype=torch.qint8,
)

# Compare the on-disk sizes of the two state dicts
torch.save(model.state_dict(), 'original.pth')
torch.save(quantised_model.state_dict(), 'quantised.pth')
original_size = os.path.getsize('original.pth') / 1e6
quantised_size = os.path.getsize('quantised.pth') / 1e6
print(f"Original: {original_size:.1f} MB")
print(f"Quantised: {quantised_size:.1f} MB")
```
Pruning zeroes out low-importance weights; the resulting sparsity reduces model size and computation when paired with sparse storage or sparsity-aware kernels.
```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# fc1 stands in for any linear layer of a trained model (e.g. model.fc1)
fc1 = nn.Linear(128, 10)

# Prune the 30% of weights with the smallest L1 magnitude
prune.l1_unstructured(fc1, name='weight', amount=0.3)

# Count zeroed weights to verify the resulting sparsity
total = fc1.weight.nelement()
zeros = (fc1.weight == 0).sum().item()
print(f"Sparsity: {100 * zeros / total:.1f}%")
```
Knowledge distillation trains a smaller "student" model to mimic the predictions of a larger "teacher" model.
```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    # Soft targets: match the teacher's temperature-softened distribution
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction='batchmean',
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the true labels
    hard_loss = F.cross_entropy(student_logits, labels)
    # Blend the two; alpha weights the distillation (soft) term
    return alpha * soft_loss + (1 - alpha) * hard_loss
```
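A hypothetical training step wiring this loss in; `teacher`, `student`, `optimizer`, and `train_loader` are assumed to already exist.

```python
import torch

# Hypothetical distillation step: the teacher only provides targets,
# so it runs in eval mode with gradients disabled
teacher.eval()
for images, labels in train_loader:
    with torch.no_grad():
        teacher_logits = teacher(images)
    student_logits = student(images)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```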