Pipelining

If you watch a single instruction make its way through a processor, it spends most of its life waiting. While the execute unit is busy with one instruction, the fetch hardware sits idle; while the next instruction is being fetched, the execute unit sits idle. Pipelining is the technique that puts all that idle hardware to work: it lets several instructions occupy different stages of the cycle at the same time, so that — once the pipeline is full — one instruction completes every clock cycle even though each individual instruction still takes several cycles from start to finish.

Pipelining is one of the most heavily examined topics in A-Level computer architecture, because it ties together the FDE cycle, the RISC/CISC distinction and the Modified Harvard cache split, and because it has a satisfyingly precise timing model you can calculate with. This lesson develops the timing diagram, derives the speed-up formula, and then tackles the part candidates find hardest: the three kinds of pipeline hazard (data, control and structural), with concrete instruction-level examples and the techniques used to overcome each.

Spec Mapping

This lesson maps to the AQA A-Level Computer Science (7517) specification, §4.7.3 Structure and role of the processor and its components:

The concept of pipelining as the overlapping of the fetch, decode and execute stages of consecutive instructions to increase processor throughput.
How pipelining improves performance, and the limits imposed by pipeline hazards (data, control/branch and structural hazards) and the resulting stalls.
The relationship between pipelining and processor design choices (fixed- vs variable-length instructions, RISC vs CISC, separate instruction/data access).

The Concept

Without pipelining, the processor completes one full Fetch-Decode-Execute cycle before starting the next instruction. The hardware for each stage is used one-third of the time:

Cycle	1	2	3	4	5	6	7	8	9
Instruction 1	F	D	E
Instruction 2				F	D	E
Instruction 3							F	D	E

Three instructions take 9 cycles — one instruction completes every 3 cycles.

With pipelining, as soon as instruction 1 leaves the Fetch stage, instruction 2 can enter it. The stages overlap:

Cycle	1	2	3	4	5
Instruction 1	F	D	E
Instruction 2		F	D	E
Instruction 3			F	D	E

Now three instructions take just 5 cycles. Once the pipeline is full (from cycle 3 onwards), one instruction completes on every clock cycle, even though each instruction still individually takes three cycles end-to-end. Pipelining does not make any single instruction faster — it raises throughput (instructions finished per unit time).

An everyday analogy

A laundry analogy makes the idea intuitive. Suppose washing one load involves three stages — wash (30 min), dry (30 min), fold (30 min) — and you have three loads.

Without pipelining you wash, dry and fold load 1 completely (90 min) before starting load 2: total 270 minutes.
With pipelining you start washing load 2 the moment load 1 moves to the dryer, and load 3 enters the washer as load 2 moves to the dryer. The washer, dryer and folder are all busy at once. Total: 150 minutes.

Crucially, each load still takes 90 minutes from start to finish — pipelining did not speed up any individual load. What improved is the rate at which loads complete once the system is full: one every 30 minutes. That is precisely the distinction between latency (time for one item, unchanged) and throughput (items per unit time, increased) that pipelining delivers in a processor.

flowchart LR
    F["Fetch"] --> D["Decode"] --> E["Execute"]
    note["At any instant: one instruction in F,<br/>another in D, another in E"]

How Pipelining Improves Throughput

	Without pipelining	With 3-stage pipeline
3 instructions take	9 cycles	5 cycles
Throughput (steady state)	1 instruction per 3 cycles	1 instruction per cycle

For a pipeline of n stages executing k instructions:

$\text{Cycles without pipelining} = n \times k$ $\text{Cycles with pipelining} = n + (k - 1)$

The first n cycles fill the pipeline; thereafter each new cycle retires one instruction.

To make the formula concrete, here is a full 5-instruction program through a 3-stage (F, D, E) pipeline:

Cycle	1	2	3	4	5	6	7
Instruction 1	F	D	E
Instruction 2		F	D	E
Instruction 3			F	D	E
Instruction 4				F	D	E
Instruction 5					F	D	E

This takes $n + (k-1) = 3 + (5-1) = 7$ cycles, versus $n \times k = 3 \times 5 = 15$ without pipelining. From cycle 3 onwards one instruction completes every cycle (instruction 1 finishes in cycle 3, instruction 2 in cycle 4, and so on). The first two cycles are the unavoidable fill cost during which no instruction has yet completed.

As k grows large the speed-up approaches n (the number of stages). For example, with a 3-stage pipeline and 100 instructions:

Without pipelining: $3 \times 100 = 300$ cycles.
With pipelining: $3 + (100 - 1) = 102$ cycles.
Speed-up: $300 / 102 \approx 2.94\times$ — close to the theoretical maximum of 3.

Exam Tip: You may be asked to calculate the cycles or speed-up for given n and k. Memorise the two formulae — $n \times k$ versus $n + (k-1)$ — and remember the speed-up tends towards n only for long instruction streams with no stalls.

The deeper the pipeline (more stages), the higher the potential speed-up — but the greater the penalty when a hazard forces the pipeline to be flushed, because more partially completed instructions are lost.

Modern Pipeline Stages

Real processors split the work into many more than three stages so each stage is shorter and the clock can run faster. A common five/six-stage RISC pipeline is:

Instruction Fetch (IF) — fetch the instruction from instruction cache.
Instruction Decode (ID) — decode the opcode and read source registers.
Execute (EX) — perform the ALU operation or compute an address.
Memory Access (MEM) — read or write data memory (load/store instructions only).
Write Back (WB) — write the result back into a register.

Some designs add a separate Operand Fetch stage. Historically, the Intel Pentium 4 (NetBurst) pushed this idea to extremes with pipelines of around 20–31 stages to chase very high clock speeds — at the cost of huge flush penalties on a mispredicted branch.

Why split into more stages at all?

The benefit of more stages is subtle: the clock speed is limited by the slowest single stage, because every stage must complete within one clock tick. By dividing the work into more, smaller stages, each stage does less, so it can complete faster, so the clock can tick faster. For example, splitting one 4-nanosecond stage into two 2-nanosecond stages could double the clock rate. This is why the drive for higher GHz in the late 1990s and early 2000s went hand-in-hand with ever-deeper pipelines.

The catch — as the worked calculation later shows — is that deeper pipelines lose more cycles per branch misprediction, so there is an optimum depth beyond which real-world performance falls. Stage balance matters too: if the stages are unequal, the clock is throttled by the slowest one and the faster stages sit partly idle, so designers work hard to divide the pipeline into stages of roughly equal duration.

Pipeline Hazards

Pipelining only delivers its ideal throughput if a fresh instruction can enter every cycle and proceed without interference. A hazard is any situation that prevents the next instruction from executing in the next cycle, forcing the processor to insert a stall (also called a bubble — effectively a wasted cycle). There are three types.

1. Data Hazards

A data hazard occurs when an instruction needs the result of an earlier instruction that has not yet been written back.

ADD R1, R2, R3    # R1 = R2 + R3
SUB R4, R1, R5    # R4 = R1 - R5   <-- needs the NEW R1, not yet written back

The SUB reaches its execute stage before the ADD has written R1 back, so without intervention it would read a stale value.

Cycle	1	2	3	4	5
`ADD R1,R2,R3`	IF	ID	EX	MEM	WB
`SUB R4,R1,R5`		IF	ID	EX(needs R1!)	…

Solutions:

Forwarding (bypassing): route the ALU result of ADD directly to the input of SUB as soon as it is computed, without waiting for write-back. This is the primary hardware fix and removes most data-hazard stalls.
Stalling (inserting NOPs): hold the dependent instruction for one or more cycles until the result is available — correct, but wastes cycles.
Compiler reordering: the compiler schedules independent instructions between the two dependent ones to fill the gap usefully.

What a stall actually looks like

To see the cost of a stall, take the same dependent pair on a 5-stage pipeline (IF, ID, EX, MEM, WB) without forwarding. The SUB cannot read R1 until the ADD has reached write-back (WB), so the processor must insert bubbles (shown as ░) to delay SUB:

Cycle	1	2	3	4	5	6	7
`ADD R1,R2,R3`	IF	ID	EX	MEM	WB
`SUB R4,R1,R5`		IF	░	░	ID	EX	MEM

The SUB is held in its decode/operand-read stage for two cycles until R1 is written back, wasting two cycles. Forwarding removes this entirely: the ALU result of ADD (available at the end of cycle 3) is fed straight into the SUB's execute input in cycle 4, so SUB proceeds with no bubble. This is exactly why forwarding hardware is built into real pipelines — it converts a common, expensive stall into no stall at all.

2. Control Hazards (Branch Hazards)

A control hazard occurs at a branch: the processor does not know which instruction to fetch next until the branch condition has been evaluated, yet the pipeline wants to fetch something now.

CMP R1, R2
BEQ skip          # branch if equal -- fetch the next line, or jump to 'skip'?
ADD R6, R6, #1    # only runs if the branch is NOT taken
skip:
SUB R7, R7, #1

If the processor guesses wrong and has already fetched and partly executed instructions from the wrong path, the pipeline must be flushed: every partially completed instruction on the wrong path is discarded and fetching restarts from the correct address. In a deep pipeline this throws away many cycles of work.

Solutions:

Branch prediction: the processor predicts whether the branch will be taken (using history of past behaviour) and fetches speculatively down the predicted path. Modern predictors are correct the large majority of the time, so the flush penalty is rarely paid.
Delayed branching: the instruction immediately after the branch (the delay slot) is always executed, giving the processor a cycle to resolve the branch; the compiler fills the slot with a useful, independent instruction.

3. Structural Hazards (Resource Conflicts)

A structural hazard occurs when two instructions need the same hardware resource in the same cycle — for example, one instruction fetching from memory while another performs a load/store, when there is only one memory port.

Solution: duplicate the resource. The classic example is the Modified Harvard split into separate instruction (L1-I) and data (L1-D) caches, so an instruction fetch and a data access can proceed in the same cycle without contention. Providing multiple ALUs likewise removes ALU structural hazards.

Summary of Hazards

Hazard type	Cause	Example	Solutions
Data	Instruction depends on a not-yet-written result of an earlier instruction	`ADD R1,R2,R3` then `SUB R4,R1,R5`	Forwarding, stalling, compiler reordering
Control	Branch — next address unknown until condition resolved	`BEQ skip`	Branch prediction, delayed branching, pipeline flush
Structural	Two instructions need the same hardware at once	Two instructions both need the single memory port	Duplicate hardware (split I/D caches, extra ALUs)

Exam Tip: For each hazard you should be able to (1) define it, (2) give a concrete instruction-level example, and (3) name at least one solution. The most common errors are confusing a data hazard (dependency on a result) with a structural hazard (contention for hardware), and forgetting that a control hazard specifically arises from branches.

Spotting hazards in a code fragment

Examiners may give you a short sequence and ask you to identify the hazard. Work through this example:

1   LDR R1, X        # load X into R1
2   ADD R2, R1, R3   # R2 = R1 + R3
3   STR R2, Y        # store R2 into Y
4   BEQ done         # branch if zero flag set
5   ADD R5, R5, #1
done:

Pipelining

Pipelining

Spec Mapping

The Concept

An everyday analogy

How Pipelining Improves Throughput

Modern Pipeline Stages

Why split into more stages at all?

Pipeline Hazards

1. Data Hazards

What a stall actually looks like

2. Control Hazards (Branch Hazards)

3. Structural Hazards (Resource Conflicts)

Summary of Hazards

Spotting hazards in a code fragment

More in Computer Science