You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
If you watch a single instruction make its way through a processor, it spends most of its life waiting. While the execute unit is busy with one instruction, the fetch hardware sits idle; while the next instruction is being fetched, the execute unit sits idle. Pipelining is the technique that puts all that idle hardware to work: it lets several instructions occupy different stages of the cycle at the same time, so that — once the pipeline is full — one instruction completes every clock cycle even though each individual instruction still takes several cycles from start to finish.
Pipelining is one of the most heavily examined topics in A-Level computer architecture, because it ties together the FDE cycle, the RISC/CISC distinction and the Modified Harvard cache split, and because it has a satisfyingly precise timing model you can calculate with. This lesson develops the timing diagram, derives the speed-up formula, and then tackles the part candidates find hardest: the three kinds of pipeline hazard (data, control and structural), with concrete instruction-level examples and the techniques used to overcome each.
This lesson maps to the AQA A-Level Computer Science (7517) specification, §4.7.3 Structure and role of the processor and its components:
Without pipelining, the processor completes one full Fetch-Decode-Execute cycle before starting the next instruction. The hardware for each stage is used one-third of the time:
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Instruction 1 | F | D | E | ||||||
| Instruction 2 | F | D | E | ||||||
| Instruction 3 | F | D | E |
Three instructions take 9 cycles — one instruction completes every 3 cycles.
With pipelining, as soon as instruction 1 leaves the Fetch stage, instruction 2 can enter it. The stages overlap:
| Cycle | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Instruction 1 | F | D | E | ||
| Instruction 2 | F | D | E | ||
| Instruction 3 | F | D | E |
Now three instructions take just 5 cycles. Once the pipeline is full (from cycle 3 onwards), one instruction completes on every clock cycle, even though each instruction still individually takes three cycles end-to-end. Pipelining does not make any single instruction faster — it raises throughput (instructions finished per unit time).
A laundry analogy makes the idea intuitive. Suppose washing one load involves three stages — wash (30 min), dry (30 min), fold (30 min) — and you have three loads.
Crucially, each load still takes 90 minutes from start to finish — pipelining did not speed up any individual load. What improved is the rate at which loads complete once the system is full: one every 30 minutes. That is precisely the distinction between latency (time for one item, unchanged) and throughput (items per unit time, increased) that pipelining delivers in a processor.
flowchart LR
F["Fetch"] --> D["Decode"] --> E["Execute"]
note["At any instant: one instruction in F,<br/>another in D, another in E"]
| Without pipelining | With 3-stage pipeline | |
|---|---|---|
| 3 instructions take | 9 cycles | 5 cycles |
| Throughput (steady state) | 1 instruction per 3 cycles | 1 instruction per cycle |
For a pipeline of n stages executing k instructions:
Cycles without pipelining=n×k Cycles with pipelining=n+(k−1)
The first n cycles fill the pipeline; thereafter each new cycle retires one instruction.
To make the formula concrete, here is a full 5-instruction program through a 3-stage (F, D, E) pipeline:
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Instruction 1 | F | D | E | ||||
| Instruction 2 | F | D | E | ||||
| Instruction 3 | F | D | E | ||||
| Instruction 4 | F | D | E | ||||
| Instruction 5 | F | D | E |
This takes n+(k−1)=3+(5−1)=7 cycles, versus n×k=3×5=15 without pipelining. From cycle 3 onwards one instruction completes every cycle (instruction 1 finishes in cycle 3, instruction 2 in cycle 4, and so on). The first two cycles are the unavoidable fill cost during which no instruction has yet completed.
As k grows large the speed-up approaches n (the number of stages). For example, with a 3-stage pipeline and 100 instructions:
Exam Tip: You may be asked to calculate the cycles or speed-up for given
nandk. Memorise the two formulae — n×k versus n+(k−1) — and remember the speed-up tends towardsnonly for long instruction streams with no stalls.
The deeper the pipeline (more stages), the higher the potential speed-up — but the greater the penalty when a hazard forces the pipeline to be flushed, because more partially completed instructions are lost.
Real processors split the work into many more than three stages so each stage is shorter and the clock can run faster. A common five/six-stage RISC pipeline is:
Some designs add a separate Operand Fetch stage. Historically, the Intel Pentium 4 (NetBurst) pushed this idea to extremes with pipelines of around 20–31 stages to chase very high clock speeds — at the cost of huge flush penalties on a mispredicted branch.
The benefit of more stages is subtle: the clock speed is limited by the slowest single stage, because every stage must complete within one clock tick. By dividing the work into more, smaller stages, each stage does less, so it can complete faster, so the clock can tick faster. For example, splitting one 4-nanosecond stage into two 2-nanosecond stages could double the clock rate. This is why the drive for higher GHz in the late 1990s and early 2000s went hand-in-hand with ever-deeper pipelines.
The catch — as the worked calculation later shows — is that deeper pipelines lose more cycles per branch misprediction, so there is an optimum depth beyond which real-world performance falls. Stage balance matters too: if the stages are unequal, the clock is throttled by the slowest one and the faster stages sit partly idle, so designers work hard to divide the pipeline into stages of roughly equal duration.
Pipelining only delivers its ideal throughput if a fresh instruction can enter every cycle and proceed without interference. A hazard is any situation that prevents the next instruction from executing in the next cycle, forcing the processor to insert a stall (also called a bubble — effectively a wasted cycle). There are three types.
A data hazard occurs when an instruction needs the result of an earlier instruction that has not yet been written back.
ADD R1, R2, R3 # R1 = R2 + R3
SUB R4, R1, R5 # R4 = R1 - R5 <-- needs the NEW R1, not yet written back
The SUB reaches its execute stage before the ADD has written R1 back, so without intervention it would read a stale value.
| Cycle | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
ADD R1,R2,R3 | IF | ID | EX | MEM | WB |
SUB R4,R1,R5 | IF | ID | EX(needs R1!) | … |
Solutions:
ADD directly to the input of SUB as soon as it is computed, without waiting for write-back. This is the primary hardware fix and removes most data-hazard stalls.To see the cost of a stall, take the same dependent pair on a 5-stage pipeline (IF, ID, EX, MEM, WB) without forwarding. The SUB cannot read R1 until the ADD has reached write-back (WB), so the processor must insert bubbles (shown as ░) to delay SUB:
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
ADD R1,R2,R3 | IF | ID | EX | MEM | WB | ||
SUB R4,R1,R5 | IF | ░ | ░ | ID | EX | MEM |
The SUB is held in its decode/operand-read stage for two cycles until R1 is written back, wasting two cycles. Forwarding removes this entirely: the ALU result of ADD (available at the end of cycle 3) is fed straight into the SUB's execute input in cycle 4, so SUB proceeds with no bubble. This is exactly why forwarding hardware is built into real pipelines — it converts a common, expensive stall into no stall at all.
A control hazard occurs at a branch: the processor does not know which instruction to fetch next until the branch condition has been evaluated, yet the pipeline wants to fetch something now.
CMP R1, R2
BEQ skip # branch if equal -- fetch the next line, or jump to 'skip'?
ADD R6, R6, #1 # only runs if the branch is NOT taken
skip:
SUB R7, R7, #1
If the processor guesses wrong and has already fetched and partly executed instructions from the wrong path, the pipeline must be flushed: every partially completed instruction on the wrong path is discarded and fetching restarts from the correct address. In a deep pipeline this throws away many cycles of work.
Solutions:
A structural hazard occurs when two instructions need the same hardware resource in the same cycle — for example, one instruction fetching from memory while another performs a load/store, when there is only one memory port.
Solution: duplicate the resource. The classic example is the Modified Harvard split into separate instruction (L1-I) and data (L1-D) caches, so an instruction fetch and a data access can proceed in the same cycle without contention. Providing multiple ALUs likewise removes ALU structural hazards.
| Hazard type | Cause | Example | Solutions |
|---|---|---|---|
| Data | Instruction depends on a not-yet-written result of an earlier instruction | ADD R1,R2,R3 then SUB R4,R1,R5 | Forwarding, stalling, compiler reordering |
| Control | Branch — next address unknown until condition resolved | BEQ skip | Branch prediction, delayed branching, pipeline flush |
| Structural | Two instructions need the same hardware at once | Two instructions both need the single memory port | Duplicate hardware (split I/D caches, extra ALUs) |
Exam Tip: For each hazard you should be able to (1) define it, (2) give a concrete instruction-level example, and (3) name at least one solution. The most common errors are confusing a data hazard (dependency on a result) with a structural hazard (contention for hardware), and forgetting that a control hazard specifically arises from branches.
Examiners may give you a short sequence and ask you to identify the hazard. Work through this example:
1 LDR R1, X # load X into R1
2 ADD R2, R1, R3 # R2 = R1 + R3
3 STR R2, Y # store R2 into Y
4 BEQ done # branch if zero flag set
5 ADD R5, R5, #1
done:
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.