You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
Pipelining keeps every part of the processor busy at once: while one instruction is being executed, the next is being decoded and a third fetched. This lesson explains how the stages overlap, derives the speed-up, and confronts the three kinds of hazard — together with the branch prediction and operand forwarding that tame them.
This lesson develops OCR H446 section 1.1.1 (pipelining). It explains instruction pipelining and how stages of consecutive instructions overlap, classifies data, control and structural hazards, introduces branch prediction and operand forwarding as mitigations, and includes a worked pipeline speed-up calculation. It builds directly on the FDE cycle and on RISC's fixed-length instructions from the processor-types lesson.
In a non-pipelined processor, each instruction completes every stage before the next begins, so most of the CPU sits idle at any moment — while the execute unit works, the fetch and decode units do nothing. Pipelining divides the processor into independent stages, each with its own dedicated hardware, and feeds a new instruction into the first stage every cycle. The stages of different instructions then run at the same time.
The Gantt-style chart below shows three instructions flowing through a 3-stage (Fetch, Decode, Execute) pipeline. Read each row as one instruction and each column as one clock cycle; notice how the stages stagger diagonally so that all three units are active in cycle 3:
gantt
title 3-stage pipeline: instructions overlapping across cycles
dateFormat X
axisFormat c%s
section Instr 1
Fetch :i1f, 0, 1
Decode :i1d, 1, 1
Execute :i1e, 2, 1
section Instr 2
Fetch :i2f, 1, 1
Decode :i2d, 2, 1
Execute :i2e, 3, 1
section Instr 3
Fetch :i3f, 2, 1
Decode :i3d, 3, 1
Execute :i3e, 4, 1
Without pipelining these three instructions would take 3×3=9 cycles. Pipelined, they finish in just 5 cycles. Once the pipeline is full (from cycle 3 onward), one instruction completes every cycle, even though each individual instruction still takes 3 cycles end-to-end. Pipelining does not make a single instruction faster — it raises throughput (instructions completed per unit time).
This latency-versus-throughput distinction is the single most-tested idea in the topic, so it is worth nailing down. Latency is how long one instruction takes from start to finish — here, 3 cycles, unchanged by pipelining. Throughput is how many instructions complete per unit time — in steady state, one per cycle, a threefold gain for the 3-stage pipeline. The classic analogy is a launderette: washing then drying one load takes the same total time however many machines you own, but with a washer and a dryer you can dry load 1 while washing load 2, so loads complete twice as often. The processor does the same with fetch, decode and execute hardware.
For a k-stage pipeline processing n instructions with no hazards:
cyclespipelined=k+(n−1)
(the first instruction takes k cycles to fill the pipeline, then each remaining n−1 instruction adds one cycle), compared with the non-pipelined:
cyclessequential=k×n.
The speed-up is their ratio:
S=k+(n−1)k×n.
As n→∞, S→k: the maximum speed-up a k-stage pipeline can ever deliver is k times. The deeper the pipeline, the higher the ceiling — but, as we will see, the costlier each flush.
Real processors use more stages. The classic RISC pipeline has five:
| Stage | Abbrev. | Action |
|---|---|---|
| Instruction Fetch | IF | Retrieve instruction from memory |
| Instruction Decode | ID | Decode and read source registers |
| Execute | EX | ALU performs the operation / computes an address |
| Memory Access | MEM | Read from or write to data memory (if needed) |
| Write Back | WB | Write the result back to a register |
gantt
title 5-stage pipeline (IF ID EX MEM WB) — steady state from cycle 5
dateFormat X
axisFormat c%s
section Instr 1
IF :a1, 0, 1
ID :a2, 1, 1
EX :a3, 2, 1
MEM :a4, 3, 1
WB :a5, 4, 1
section Instr 2
IF :b1, 1, 1
ID :b2, 2, 1
EX :b3, 3, 1
MEM :b4, 4, 1
WB :b5, 5, 1
section Instr 3
IF :c1, 2, 1
ID :c2, 3, 1
EX :c3, 4, 1
MEM :c4, 5, 1
WB :c5, 6, 1
Advantages of deeper pipelines: higher throughput (more instructions in flight) and the ability to run a higher clock speed, because each stage does less work per cycle and so can complete in a shorter time.
Disadvantages: a bigger flush penalty (more partially-done instructions are discarded on a misprediction), more pipeline registers (latches) between stages, and eventually diminishing returns as hazard-handling overhead grows.
It is tempting to think that doubling the stage count doubles performance, but two effects fight back. First, splitting the work into more stages does not split it cleanly — each extra pipeline register adds a fixed latch delay to every stage, so the cycle time falls more slowly than 1/k. Second, the misprediction penalty rises in lock-step with depth: a 20-stage pipeline that mispredicts a branch may discard a dozen or more in-flight instructions, where a 5-stage pipeline loses only two. We can see the tension by combining the two. Suppose a workload mispredicts a branch every 5 instructions, each misprediction flushing roughly half the pipeline (≈ k/2 instructions):
CPI≈1+51×2k=1+10k.
For k=5 this gives CPI=1.5; for k=20 it gives CPI=3.0. Even though the deeper pipeline runs at a higher clock, its instructions per cycle has collapsed, so on branch-heavy code the extra depth can make real performance worse. This is precisely why the early-2000s race to 30-plus stages was abandoned and modern designs settle around 10–20 stages paired with very accurate predictors.
A hazard is any situation that stops the next instruction proceeding in its scheduled cycle, forcing a stall — the insertion of a bubble (an idle cycle) that wastes throughput. There are three kinds, and the discriminating skill at A-Level is to match the right mitigation to each kind. We will work through each hazard with a concrete example, trace where the conflict actually bites in the pipeline, and then show the mitigation removing or reducing the penalty.
A data hazard arises when an instruction needs a result that an earlier, still-in-flight instruction has not yet produced.
ADD R1, R2, R3 ; R1 = R2 + R3
SUB R4, R1, R5 ; R4 = R1 - R5 ; needs R1 before ADD has written it back
Where the conflict bites. The SUB reads its source registers in its decode (ID) stage. In a 5-stage pipeline the SUB enters ID in cycle 3 — but the ADD only writes R1 back in its write-back (WB) stage, which is cycle 5. The SUB therefore reads R1 two cycles too early and would pick up the stale value. The table below traces the collision; the WB of ADD (cycle 5) lands after the ID of SUB (cycle 3):
| Cycle | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| ADD R1,… | IF | ID | EX | MEM | WB |
| SUB …,R1,… | IF | ID | EX | MEM |
The arrow we care about runs from ADD's WB (cycle 5) backwards to SUB's ID (cycle 3) — a backward arrow in time is the visual signature of a true data hazard (a Read-After-Write, or RAW, dependency).
Worked mitigation — operand forwarding. Notice ADD has actually computed R1 at the end of its EX stage (cycle 3), long before WB. Operand forwarding (bypassing) adds a hardware path that routes the ALU output straight from ADD's EX-stage latch into the ALU input of SUB's EX stage in cycle 4 — before the value is ever written to the register file. With this path SUB no longer needs the value at ID; it needs it at EX (cycle 4), and forwarding delivers it exactly then, so the pair runs with zero stalls.
flowchart LR
ADDEX["ADD: EX (cycle 3)<br/>computes R1"] -->|forward R1| SUBEX["SUB: EX (cycle 4)<br/>uses R1"]
RF["Register file<br/>(WB in cycle 5)"] -.->|too late| SUBEX
The one case forwarding cannot fully cover is the load-use hazard: a value loaded from memory by LDR R1, [R6] is not available until the end of the MEM stage, one cycle later than an ALU result, so a dependent instruction immediately after still suffers a single-cycle stall. Tracing it makes the asymmetry clear:
LDR R1, [R6] ; R1 loaded from memory — value ready only after MEM
ADD R4, R1, R5 ; needs R1 in its EX stage
| Cycle | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| LDR R1,[R6] | IF | ID | EX | MEM | WB |
| ADD …,R1,… | IF | ID | stall | EX |
The LDR produces R1 at the end of cycle 4 (its MEM stage), but the ADD wanted it for its EX stage in cycle 4 — one cycle too soon. Forwarding from MEM-to-EX can deliver R1 only in time for cycle 5, so exactly one bubble must be inserted. Compilers attack this by reordering an independent instruction into the gap (filling the stall cycle with useful work), which is one reason RISC's simple, uniform instructions make such scheduling easier.
| Technique | How it works |
|---|---|
| Stalling (bubbling) | Insert NOPs until the value is ready — simple but slow |
| Operand forwarding (bypassing) | Route the ALU result directly from the producing stage to the consuming stage's input, before write-back, eliminating most stalls |
| Compiler reordering | The compiler slots independent instructions between the dependent pair |
A control hazard (branch hazard) occurs at a branch: the processor does not yet know whether the branch is taken, so it does not know which instruction to fetch next.
CMP R1, R2 ; compare
BEQ target ; branch if equal
ADD R3, R4, R5 ; fetched speculatively — may need discarding
Where the conflict bites. The branch outcome is not known until BEQ reaches its EX stage (where the comparison flags are tested), in cycle 3. But the pipeline must fetch something in cycles 2 and 3 to stay full, so it fetches the instructions sequentially after the branch. If the branch turns out to be taken, those one-or-two sequentially-fetched instructions are on the wrong path:
| Cycle | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| BEQ target | IF | ID | EX (outcome known) | MEM |
| ADD (wrong path) | IF | ID → flushed | ||
| next (wrong path) | IF → flushed |
Worked mitigation — flush vs predict. When the branch resolves as taken in cycle 3, every instruction fetched on the wrong path is flushed (converted to bubbles) and fetching restarts from target in cycle 4. Here that costs 2 wasted cycles — the number of stages between fetch and branch resolution. The mitigation is branch prediction: instead of blindly fetching sequentially, the processor guesses the outcome and speculatively fetches down the predicted path. A correct guess means the right instructions were in the pipeline all along — zero penalty; only a misprediction incurs the 2-cycle flush. The deeper the pipeline, the more stages between fetch and resolution, so the heavier the misprediction penalty — which is why accurate prediction becomes critical in deep pipelines (covered in detail below).
| Technique | How it works |
|---|---|
| Pipeline flush | On a taken (or mispredicted) branch, discard the wrongly fetched instructions and refetch from the target — wasting those cycles |
| Branch prediction | Guess the outcome and speculatively fetch down the predicted path |
| Delayed branching | The compiler fills the slot after the branch with an instruction that is needed either way |
A structural hazard occurs when two instructions need the same hardware resource in the same cycle — for example, both wanting the single memory port (one to fetch an instruction, another to access data).
Where the conflict bites. Consider a single-memory (Von Neumann) design. In any given cycle one instruction may be in its MEM stage reading data, while a later instruction is in its IF stage trying to fetch — both touch memory at once:
| Cycle | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| LDR (data access) | IF | ID | EX | MEM |
| later instr (fetch) | IF |
In cycle 4 the LDR's MEM and the later instruction's IF both demand the single memory port — they collide.
Worked mitigation — split the resource. Provide two ports: separate instruction and data caches (the modified-Harvard arrangement). Now the IF reads the instruction cache while the MEM reads the data cache in the same cycle with no contention, removing the structural hazard entirely. Where duplicating hardware is too expensive, the alternative is simply to stall the fetching instruction one cycle until the port frees — trading a bubble for the hardware saving.
| Technique | How it works |
|---|---|
| Duplicate / split hardware | Provide separate units — e.g. split instruction and data caches (the modified-Harvard arrangement), so fetch and data access never collide |
| Stalling | Delay one instruction a cycle until the resource frees |
This is a direct synoptic link to architecture: the modified Harvard split L1 caches exist precisely to remove this structural hazard, which is why almost every modern CPU is internally modified-Harvard even when its programming model is Von Neumann.
These two techniques do most of the heavy lifting in keeping a pipeline full.
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.