The Transformer architecture, introduced in the 2017 paper Attention Is All You Need by Vaswani et al., revolutionised NLP and machine learning. Transformers replaced recurrent networks with self-attention, enabling parallel processing, better long-range dependency modelling, and massive scalability.
RNNs and LSTMs process text sequentially — one token at a time. This creates two major problems:
| Problem | Description |
|---|---|
| Slow training | Cannot parallelise across sequence positions |
| Long-range dependencies | Information from early tokens fades as the sequence grows |
Transformers solve both problems using self-attention, which processes all tokens in parallel and directly connects every position to every other position.
The Transformer consists of an encoder stack and a decoder stack, each made of repeated layers.
Each encoder layer contains:

- Multi-head self-attention
- A position-wise feed-forward network
- Residual connections and layer normalisation around each sub-layer

Each decoder layer contains:

- Masked multi-head self-attention (each position can only attend to earlier positions)
- Encoder-decoder cross-attention over the encoder's output
- A position-wise feed-forward network
- Residual connections and layer normalisation around each sub-layer
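A minimal sketch of a single encoder layer in PyTorch helps make this structure concrete. It uses PyTorch's built-in nn.MultiheadAttention for brevity, and the sizes (d_model=512, num_heads=8, d_ff=2048) are illustrative defaults rather than values from this lesson:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Illustrative sketch of one Transformer encoder layer."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(                 # position-wise feed-forward network
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sub-layer 1: self-attention, wrapped in a residual connection and layer norm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: feed-forward network, wrapped in a residual connection and layer norm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```

A decoder layer follows the same pattern, with a masked self-attention sub-layer followed by a cross-attention sub-layer that takes its keys and values from the encoder output.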
Self-attention lets each token compute a weighted combination of every token in the sequence, including itself. Scaled dot-product attention is defined as:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
| Component | Analogy | Role |
|---|---|---|
| Query (Q) | "What am I looking for?" | The current token's search vector |
| Key (K) | "What do I contain?" | Each token's identifier |
| Value (V) | "What information do I provide?" | The actual content to aggregate |
| sqrt(d_k) | Scaling factor | Prevents dot products from growing too large |
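In PyTorch, the formula translates almost line for line into code: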
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions become -inf, so they receive zero weight after the softmax
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # Normalise the scores into attention weights, then aggregate the values
    weights = F.softmax(scores, dim=-1)
    output = torch.matmul(weights, V)
    return output, weights
```
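To see the shapes involved, here is a quick check with random tensors; the batch size, sequence length, and d_k below are arbitrary:

```python
# Toy input: batch of 2 sequences, 5 tokens each, d_k = 64
Q = torch.randn(2, 5, 64)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # torch.Size([2, 5, 64]): one contextualised vector per token
print(weights.shape)  # torch.Size([2, 5, 5]): attention from every token to every token
```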
Instead of performing a single attention computation, the Transformer uses multiple attention heads in parallel. Each head learns to attend to different aspects of the input.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(Q * W_Q_i, K * W_K_i, V * W_V_i)
| Aspect | Detail |
|---|---|
| Number of heads | Typically 8 or 12 |
| Head dimension | d_model / num_heads (e.g., 768 / 12 = 64) |
| What different heads learn | Syntactic relationships, semantic similarity, positional patterns, etc. |
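Building on scaled_dot_product_attention from above, the projections and head-splitting can be wrapped in a small module: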
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads          # dimension of each head
        self.W_q = nn.Linear(d_model, d_model)   # query projection
        self.W_k = nn.Linear(d_model, d_model)   # key projection
        self.W_v = nn.Linear(d_model, d_model)   # value projection
        self.W_o = nn.Linear(d_model, d_model)   # output projection W_O

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # Project, then split into heads: (batch, num_heads, seq_len, d_k)
        Q, K, V = [w(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
                   for w, x in ((self.W_q, Q), (self.W_k, K), (self.W_v, V))]
        out, _ = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate the heads back to (batch, seq_len, d_model) and apply W_O
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(out)
```
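As a quick sanity check, the sizes from the table above (d_model = 768, 12 heads, 64 dimensions per head) can be passed through the module; the batch size and sequence length here are arbitrary:

```python
mha = MultiHeadAttention(d_model=768, num_heads=12)  # 768 / 12 = 64 dims per head
x = torch.randn(2, 10, 768)                          # batch of 2 sequences, 10 tokens each
out = mha(x, x, x)                                   # self-attention: Q, K and V are all x
print(out.shape)  # torch.Size([2, 10, 768])
```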