The Transformer architecture, introduced in the 2017 paper *Attention Is All You Need* by Vaswani et al., revolutionised NLP and machine learning. Transformers replaced recurrent networks with self-attention, enabling parallel processing, better long-range dependency modelling, and massive scalability.
RNNs and LSTMs process text sequentially — one token at a time. This creates two major problems:
| Problem | Description |
|---|---|
| Slow training | Cannot parallelise across sequence positions |
| Long-range dependencies | Information from early tokens fades as the sequence grows |
Transformers solve both problems using self-attention, which processes all tokens in parallel and directly connects every position to every other position.
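To make the "every position connects to every other position" idea concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The function name, matrix shapes, and random inputs are illustrative assumptions, not the lesson's own code; real implementations add multiple heads, masking, and learned parameters.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project into queries/keys/values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # every position scores every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                         # weighted mix of all positions

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)
```

Note that the whole computation is a handful of matrix multiplications over the full sequence at once, which is exactly why it parallelises where an RNN cannot.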
The Transformer consists of an encoder stack and a decoder stack, each built from N identical layers (N = 6 in the original paper).
Each encoder layer contains two sublayers:

- A multi-head self-attention mechanism
- A position-wise feed-forward network

Each sublayer is wrapped in a residual connection followed by layer normalisation.
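The two-sublayer structure of an encoder layer can be sketched as follows. This is a simplified NumPy illustration under assumed shapes, with a single attention head, a ReLU feed-forward network, and an unparameterised layer norm; the helper names are my own, not from the lesson.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position's features to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: applied independently to every token
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attn_fn, ffn_params):
    # Sublayer 1: self-attention, then residual connection + layer norm
    x = layer_norm(x + attn_fn(x))
    # Sublayer 2: feed-forward network, then residual connection + layer norm
    return layer_norm(x + feed_forward(x, *ffn_params))

rng = np.random.default_rng(1)
d_model, d_ff, seq = 8, 32, 4
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
ffn_params = (rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
              rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
y = encoder_layer(x, lambda h: attention(h, Wq, Wk, Wv), ffn_params)
print(y.shape)
```

The residual-plus-norm wrapping keeps the output the same shape as the input, which is what lets identical layers be stacked N deep.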