Training deep neural networks is as much about preventing overfitting and navigating complex loss landscapes as it is about building the right architecture. This lesson covers the key regularisation techniques and optimisation strategies that make deep learning work in practice.
Deep neural networks have millions of parameters — often far more than the number of training examples. This gives them enormous capacity to memorise the training set rather than learn patterns that generalise. Regularisation is the family of techniques used to prevent this overfitting and improve generalisation.
| Symptom | Diagnosis | Solution |
|---|---|---|
| Training loss low, validation loss high | Overfitting | Apply regularisation |
| Both losses high | Underfitting | Increase model capacity or train longer |
| Both losses low and similar | Good generalisation | Model is well-regularised |
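The table translates directly into a quick end-of-epoch sanity check. A minimal sketch, where the loss values and thresholds are purely illustrative:

```python
# Illustrative diagnosis from train/validation loss (thresholds are arbitrary examples)
train_loss, val_loss = 0.12, 0.85   # hypothetical end-of-epoch values

if val_loss > 2 * train_loss:
    print("Overfitting: apply regularisation")
elif train_loss > 0.5 and val_loss > 0.5:
    print("Underfitting: increase capacity or train longer")
else:
    print("Good generalisation")
```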
Dropout (Srivastava et al., 2014) randomly sets a fraction of neurons to zero during each training step. This prevents neurons from co-adapting and forces the network to learn redundant representations.
```python
import torch.nn as nn

class RegularisedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.dropout1 = nn.Dropout(p=0.5)  # 50% dropout rate
        self.fc2 = nn.Linear(256, 128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc3 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout1(x)  # applied during training only
        x = self.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        return x
```
| Dropout Rate | Use Case |
|---|---|
| 0.2 – 0.3 | Input layer, convolutional layers |
| 0.5 | Hidden fully-connected layers (classic default) |
| 0.0 | Output layer (never apply dropout to the output) |
Tip: Remember to call `model.train()` during training (dropout active) and `model.eval()` during evaluation (dropout disabled).
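To see the two modes in action, here is a quick sketch. Note that PyTorch's dropout scales surviving activations by 1/(1-p) at train time, so no rescaling is needed at evaluation:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()    # training mode: roughly half the values are zeroed,
print(drop(x))  # survivors are scaled by 1/(1-p) = 2.0

drop.eval()     # evaluation mode: dropout is the identity
print(drop(x))  # input passes through unchanged
```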
Weight decay adds a penalty proportional to the sum of the squared weights (an L2 penalty) to the loss function, discouraging large weight values.
Total Loss = Data Loss + lambda * sum(w^2)
```python
import torch.optim as optim

# Weight decay is built into the optimiser
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# AdamW (decoupled weight decay) — preferred for Transformers
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
```
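For intuition, the penalty in the formula above can also be added to the loss by hand. A minimal sketch, where `lambda_l2` and `data_loss` are illustrative names (real code usually excludes biases and normalisation parameters from the penalty):

```python
# Manual L2 penalty, matching: Total Loss = Data Loss + lambda * sum(w^2)
lambda_l2 = 1e-4
# data_loss: the task loss computed elsewhere (illustrative)
l2_penalty = sum(w.pow(2).sum() for w in model.parameters())
loss = data_loss + lambda_l2 * l2_penalty
loss.backward()
```

For plain SGD this manual penalty and the optimiser's `weight_decay` coincide; under Adam's adaptive scaling they do not, which is exactly why AdamW decouples the decay term.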
Batch normalisation (Ioffe and Szegedy, 2015) normalises a layer's activations to zero mean and unit variance within each mini-batch. This enables faster, more stable training (the original paper attributes the effect to reducing internal covariate shift). Like dropout, it behaves differently at inference: `model.eval()` switches it to the running statistics accumulated during training.
```python
# For fully-connected layers
bn = nn.BatchNorm1d(num_features=256)

# For convolutional layers
bn2d = nn.BatchNorm2d(num_features=64)

# Typical usage: Conv → BatchNorm → ReLU
layer = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
```
Layer normalisation normalises across the features (rather than the batch). It is preferred for RNNs and Transformers because it does not depend on batch size.
```python
# Layer norm — normalises across features
ln = nn.LayerNorm(normalized_shape=256)
```
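To see the batch-independence concretely, a small sketch with a Transformer-style input (shapes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 10, 256)   # (batch, sequence length, features)
out = nn.LayerNorm(256)(x)     # each of the 32 x 10 token vectors is
                               # normalised over its own 256 features
```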
| Technique | Normalises Across | Best For |
|---|---|---|
| Batch Norm | Batch dimension | CNNs |
| Layer Norm | Feature dimension | Transformers, RNNs |
| Instance Norm | Spatial dimensions per sample | Style transfer |
| Group Norm | Groups of channels | Small batch sizes |
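The same comparison in code, applied to one convolutional feature map (shapes and group count are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64, 32, 32)   # (batch, channels, height, width)

nn.BatchNorm2d(64)(x)                            # stats over (N, H, W) per channel
nn.LayerNorm([64, 32, 32])(x)                    # stats over (C, H, W) per sample
nn.InstanceNorm2d(64)(x)                         # stats over (H, W) per sample and channel
nn.GroupNorm(num_groups=8, num_channels=64)(x)   # stats per group of 8 channels
```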
Early stopping monitors the validation loss during training and stops when it begins to increase (indicating overfitting).
```python
best_val_loss = float('inf')
patience = 10  # epochs to wait without improvement
counter = 0
for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    val_loss = validate()
    if val_loss < best_val_loss:
        best_val_loss, counter = val_loss, 0  # improvement: reset patience
    else:
        counter += 1
        if counter >= patience:
            break  # stop: no improvement for `patience` epochs
```
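Early stopping is usually paired with checkpointing, so training can roll back to the best epoch rather than the last one. A sketch extending the loop above (the file name is illustrative):

```python
import torch

# inside the improvement branch of the loop above:
torch.save(model.state_dict(), 'best_model.pt')  # checkpoint the best weights

# after the loop exits:
model.load_state_dict(torch.load('best_model.pt'))  # restore the best epoch
```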