The Transformer architecture, introduced in the 2017 paper Attention Is All You Need by Vaswani et al., revolutionised NLP and machine learning. Transformers replaced recurrent networks with self-attention, enabling parallel processing, better long-range dependency modelling, and massive scalability.
RNNs and LSTMs process text sequentially — one token at a time. This creates two major problems:
| Problem | Description |
|---|---|
| Slow training | Cannot parallelise across sequence positions |
| Long-range dependencies | Information from early tokens fades as the sequence grows |
Transformers solve both problems using self-attention, which processes all tokens in parallel and directly connects every position to every other position.
The Transformer consists of an encoder stack and a decoder stack, each made of repeated layers.
Each encoder layer contains:

- Multi-head self-attention
- A position-wise feed-forward network
- Residual connections and layer normalisation around each sub-layer

Each decoder layer contains:

- Masked multi-head self-attention (each position can only attend to earlier positions)
- Encoder-decoder cross-attention over the encoder's output
- A position-wise feed-forward network
- Residual connections and layer normalisation around each sub-layer
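A minimal sketch of a single encoder layer in PyTorch helps make this structure concrete. It uses PyTorch's built-in nn.MultiheadAttention for brevity, and the sizes (d_model=512, num_heads=8, d_ff=2048) are illustrative defaults rather than values from this lesson:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Illustrative sketch of one Transformer encoder layer."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(                 # position-wise feed-forward network
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sub-layer 1: self-attention, wrapped in a residual connection and layer norm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: feed-forward network, wrapped in a residual connection and layer norm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```

A decoder layer follows the same pattern, with a masked self-attention sub-layer followed by a cross-attention sub-layer that takes its keys and values from the encoder output.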
Self-attention lets each token compute a weighted combination of every token in the sequence, including itself. Scaled dot-product attention is defined as:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
| Component | Analogy | Role |
|---|---|---|
| Query (Q) | "What am I looking for?" | The current token's search vector |
| Key (K) | "What do I contain?" | Each token's identifier |
| Value (V) | "What information do I provide?" | The actual content to aggregate |
| sqrt(d_k) | Scaling factor | Prevents dot products from growing too large |
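In PyTorch, the formula translates almost line for line into code: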
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions become -inf, so they receive zero weight after the softmax
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # Normalise the scores into attention weights, then aggregate the values
    weights = F.softmax(scores, dim=-1)
    output = torch.matmul(weights, V)
    return output, weights
```
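To see the shapes involved, here is a quick check with random tensors; the batch size, sequence length, and d_k below are arbitrary:

```python
# Toy input: batch of 2 sequences, 5 tokens each, d_k = 64
Q = torch.randn(2, 5, 64)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # torch.Size([2, 5, 64]): one contextualised vector per token
print(weights.shape)  # torch.Size([2, 5, 5]): attention from every token to every token
```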
Instead of performing a single attention computation, the Transformer uses multiple attention heads in parallel. Each head learns to attend to different aspects of the input.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(Q * W_Q_i, K * W_K_i, V * W_V_i)
| Aspect | Detail |
|---|---|
| Number of heads | Typically 8 or 12 |
| Head dimension | d_model / num_heads (e.g., 768 / 12 = 64) |
| What different heads learn | Syntactic relationships, semantic similarity, positional patterns, etc. |
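Building on scaled_dot_product_attention from above, the projections and head-splitting can be wrapped in a small module: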
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads          # dimension of each head
        self.W_q = nn.Linear(d_model, d_model)   # query projection
        self.W_k = nn.Linear(d_model, d_model)   # key projection
        self.W_v = nn.Linear(d_model, d_model)   # value projection
        self.W_o = nn.Linear(d_model, d_model)   # output projection W_O

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # Project, then split into heads: (batch, num_heads, seq_len, d_k)
        Q, K, V = [w(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
                   for w, x in ((self.W_q, Q), (self.W_k, K), (self.W_v, V))]
        out, _ = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate the heads back to (batch, seq_len, d_model) and apply W_O
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(out)
```
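As a quick sanity check, the sizes from the table above (d_model = 768, 12 heads, 64 dimensions per head) can be passed through the module; the batch size and sequence length here are arbitrary:

```python
mha = MultiHeadAttention(d_model=768, num_heads=12)  # 768 / 12 = 64 dims per head
x = torch.randn(2, 10, 768)                          # batch of 2 sequences, 10 tokens each
out = mha(x, x, x)                                   # self-attention: Q, K and V are all x
print(out.shape)  # torch.Size([2, 10, 768])
```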