Training deep neural networks is as much about preventing overfitting and navigating complex loss landscapes as it is about building the right architecture. This lesson covers the key regularisation techniques and optimisation strategies that make deep learning work in practice.
Deep neural networks have millions of parameters — often far more than the number of training examples. This gives them enormous capacity to memorise the training set rather than learn patterns that generalise. Regularisation is the family of techniques used to prevent this overfitting and improve generalisation.
| Symptom | Diagnosis | Solution |
|---|---|---|
| Training loss low, validation loss high | Overfitting | Apply regularisation |
| Both losses high | Underfitting | Increase model capacity or train longer |
| Both losses low and similar | Good generalisation | Model is well-regularised |
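The table translates directly into a quick end-of-epoch sanity check. A minimal sketch, where the loss values and thresholds are purely illustrative:

```python
# Illustrative diagnosis from train/validation loss (thresholds are arbitrary examples)
train_loss, val_loss = 0.12, 0.85   # hypothetical end-of-epoch values

if val_loss > 2 * train_loss:
    print("Overfitting: apply regularisation")
elif train_loss > 0.5 and val_loss > 0.5:
    print("Underfitting: increase capacity or train longer")
else:
    print("Good generalisation")
```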
Dropout (Srivastava et al., 2014) randomly sets a fraction of neurons to zero during each training step. This prevents neurons from co-adapting and forces the network to learn redundant representations.
```python
import torch.nn as nn

class RegularisedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.dropout1 = nn.Dropout(p=0.5)  # 50% dropout rate
        self.fc2 = nn.Linear(256, 128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc3 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout1(x)  # applied during training only
        x = self.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        return x
```
| Dropout Rate | Use Case |
|---|---|
| 0.2 – 0.3 | Input layer, convolutional layers |
| 0.5 | Hidden fully-connected layers (classic default) |
| 0.0 | Output layer (never apply dropout to the output) |
Tip: Remember to call `model.train()` during training (dropout active) and `model.eval()` during evaluation (dropout disabled).
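To see the two modes in action, here is a quick sketch. Note that PyTorch's dropout scales surviving activations by 1/(1-p) at train time, so no rescaling is needed at evaluation:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()    # training mode: roughly half the values are zeroed,
print(drop(x))  # survivors are scaled by 1/(1-p) = 2.0

drop.eval()     # evaluation mode: dropout is the identity
print(drop(x))  # input passes through unchanged
```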
Weight decay adds a penalty proportional to the sum of the squared weights (an L2 penalty) to the loss function, discouraging large weight values.
Total Loss = Data Loss + lambda * sum(w^2)
```python
import torch.optim as optim

# Weight decay is built into the optimiser
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# AdamW (decoupled weight decay) — preferred for Transformers
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
```
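For intuition, the penalty in the formula above can also be added to the loss by hand. A minimal sketch, where `lambda_l2` and `data_loss` are illustrative names (real code usually excludes biases and normalisation parameters from the penalty):

```python
# Manual L2 penalty, matching: Total Loss = Data Loss + lambda * sum(w^2)
lambda_l2 = 1e-4
# data_loss: the task loss computed elsewhere (illustrative)
l2_penalty = sum(w.pow(2).sum() for w in model.parameters())
loss = data_loss + lambda_l2 * l2_penalty
loss.backward()
```

For plain SGD this manual penalty and the optimiser's `weight_decay` coincide; under Adam's adaptive scaling they do not, which is exactly why AdamW decouples the decay term.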
Batch normalisation (Ioffe and Szegedy, 2015) normalises a layer's activations to zero mean and unit variance within each mini-batch. This enables faster, more stable training (the original paper attributes the effect to reducing internal covariate shift). Like dropout, it behaves differently at inference: `model.eval()` switches it to the running statistics accumulated during training.
```python
# For fully-connected layers
bn = nn.BatchNorm1d(num_features=256)

# For convolutional layers
bn2d = nn.BatchNorm2d(num_features=64)

# Typical usage: Conv → BatchNorm → ReLU
layer = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
```
Layer normalisation normalises across the features (rather than the batch). It is preferred for RNNs and Transformers because it does not depend on batch size.
```python
# Layer norm — normalises across features
ln = nn.LayerNorm(normalized_shape=256)
```
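To see the batch-independence concretely, a small sketch with a Transformer-style input (shapes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 10, 256)   # (batch, sequence length, features)
out = nn.LayerNorm(256)(x)     # each of the 32 x 10 token vectors is
                               # normalised over its own 256 features
```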
| Technique | Normalises Across | Best For |
|---|---|---|
| Batch Norm | Batch dimension | CNNs |
| Layer Norm | Feature dimension | Transformers, RNNs |
| Instance Norm | Spatial dimensions per sample | Style transfer |
| Group Norm | Groups of channels | Small batch sizes |
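The same comparison in code, applied to one convolutional feature map (shapes and group count are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64, 32, 32)   # (batch, channels, height, width)

nn.BatchNorm2d(64)(x)                            # stats over (N, H, W) per channel
nn.LayerNorm([64, 32, 32])(x)                    # stats over (C, H, W) per sample
nn.InstanceNorm2d(64)(x)                         # stats over (H, W) per sample and channel
nn.GroupNorm(num_groups=8, num_channels=64)(x)   # stats per group of 8 channels
```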
Early stopping monitors the validation loss during training and stops when it begins to increase (indicating overfitting).
```python
best_val_loss = float('inf')
patience = 10  # epochs to wait without improvement
counter = 0
for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    val_loss = validate()
    if val_loss < best_val_loss:
        best_val_loss, counter = val_loss, 0  # improvement: reset patience
    else:
        counter += 1
        if counter >= patience:
            break  # stop: no improvement for `patience` epochs
```
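Early stopping is usually paired with checkpointing, so training can roll back to the best epoch rather than the last one. A sketch extending the loop above (the file name is illustrative):

```python
import torch

# inside the improvement branch of the loop above:
torch.save(model.state_dict(), 'best_model.pt')  # checkpoint the best weights

# after the loop exits:
model.load_state_dict(torch.load('best_model.pt'))  # restore the best epoch
```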