Large Language Models (LLMs) are the technology behind the most widely discussed AI systems today. These models can generate human-like text, answer questions, write code, translate languages, and reason about complex problems.
The foundation of modern LLMs is the Transformer, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. Transformers address two key weaknesses of RNNs: training that is slow because it must proceed token by token, and long-range dependencies that fade as sequences grow.
Self-attention allows every token to attend to every other token simultaneously:
Self-Attention Example:
Sentence: "The cat sat on the mat because it was tired"
When processing "it", attention scores might be:
Token:  The   cat   sat   on    the   mat   because  it    was   tired
Score:  0.05  0.45  0.05  0.02  0.03  0.10  0.05     0.05  0.05  0.15
              ^^^^                                                ^^^^
"it" attends most strongly to "cat" (0.45) and "tired" (0.15)
Self-attention uses three projections: Query (what am I looking for?), Key (what do I contain?), Value (what information do I provide?). Multi-head attention runs several such computations in parallel.
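To make the Query/Key/Value mechanics concrete, here is a minimal NumPy sketch of scaled dot-product attention. The token embeddings and projection matrices are random placeholders, not values from any real model; the formula itself is softmax(QKᵀ/√d_k)·V from the Transformer paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how much each query matches each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example: 3 tokens with embedding dimension 4 (hypothetical numbers)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                              # token embeddings
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3)) # learned projections
out, w = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
```

Each row of `w` is one token's attention distribution over all tokens, like the "it" example above. Multi-head attention simply runs several such computations with different projection matrices and concatenates the results.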
Transformer Block:
+-----------------------------+
| Multi-Head Attention |
| + Residual |
| + Layer Norm |
+-----------------------------+
| Feed-Forward Network |
| + Residual |
| + Layer Norm |
+-----------------------------+
(repeated N times)
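The block diagram above can be sketched in a few lines. This is a post-norm layout (sublayer, then residual add, then LayerNorm), matching the diagram; the attention function is stubbed out as identity and the feed-forward weights are random placeholders, so this shows only the wiring, not a trained model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attn, ffn):
    # Sublayer -> residual connection -> LayerNorm, twice, as in the diagram
    x = layer_norm(x + attn(x))
    x = layer_norm(x + ffn(x))
    return x

# Toy wiring: 5 tokens, model dimension 8 (placeholder values)
d = 8
rng = np.random.default_rng(1)
W1 = rng.normal(size=(d, 4 * d)) * 0.1
W2 = rng.normal(size=(4 * d, d)) * 0.1
ffn = lambda x: np.maximum(x @ W1, 0) @ W2  # 2-layer MLP with ReLU
attn = lambda x: x  # placeholder; a real block uses multi-head self-attention
x = rng.normal(size=(5, d))
y = transformer_block(x, attn, ffn)
```

Stacking N such blocks (GPT-style models use dozens) gives the full model; modern variants often move LayerNorm before each sublayer ("pre-norm") for training stability.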
Pretraining: the model is trained on a massive corpus (trillions of tokens) to predict the next token. This stage is extraordinarily expensive; training a frontier model can cost hundreds of millions of dollars in compute.
Fine-tuning: the pretrained model is further trained on curated datasets, either Supervised Fine-Tuning (SFT) on high-quality examples or domain-specific fine-tuning for particular fields.
RLHF Pipeline:
+----------+ +--------------+ +--------------+ +-----------+
| Prompt |--->| LLM generates|--->| Humans rank |--->| Train |
| | | multiple | | outputs | | reward |
| | | outputs | | | | model |
+----------+ +--------------+ +--------------+ +-----+-----+
|
+----------+ +--------------+ |
| Improved |<---| Fine-tune |<-----------------------------+
| LLM | | LLM with RL |
+----------+ +--------------+
Tip: Anthropic developed Constitutional AI, using AI-generated feedback to supplement human feedback for more scalable alignment.
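The "train reward model" step in the pipeline commonly uses a pairwise ranking loss: given a human-preferred output and a rejected one, the reward model is trained to score the preferred one higher. The sketch below shows this Bradley-Terry-style objective on scalar rewards; a real reward model would compute these scores with a neural network over the full prompt and response.

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model already scores the preferred output higher."""
    margin = r_chosen - r_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Hypothetical reward scores for one ranked pair of outputs
loss_good = preference_loss(2.0, 0.5)  # model agrees with the human ranking
loss_bad = preference_loss(0.5, 2.0)   # model disagrees -> large loss
```

Minimizing this loss over many ranked pairs teaches the reward model to imitate human preferences, which the RL step then optimizes against.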
LLMs generate text one token at a time via next-token prediction.
Temperature rescales the logits before sampling and so controls randomness: 0 is effectively greedy (always pick the most likely token), around 0.7 is a common balanced setting, and values above 1.0 produce increasingly random output. Top-p (nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability exceeds p.
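Both knobs can be implemented in a few lines. This sketch uses made-up logits over a tiny 4-token vocabulary; real models produce logits over tens of thousands of tokens, but the sampling logic is the same.

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=None):
    """Sample a token index using temperature scaling and nucleus (top-p) sampling."""
    if rng is None:
        rng = np.random.default_rng()
    # Temperature: divide logits before softmax; lower -> sharper distribution.
    # (Clamping avoids division by zero; temperature ~0 becomes greedy argmax.)
    z = logits / max(temperature, 1e-8)
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative probability
    # exceeds p, scanning from most to least likely
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    renormed = probs[keep] / probs[keep].sum()
    return keep[rng.choice(len(keep), p=renormed)]

# Hypothetical logits over a 4-token vocabulary
logits = np.array([2.0, 1.0, 0.1, -1.0])
token = sample_next_token(logits, temperature=0.7, top_p=0.9)
```

At temperature 0 the clamped division makes the top token's probability dominate, so the function degenerates to picking the argmax, matching the "deterministic" behavior described above.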