For decades, computers got faster simply because each new generation of processor ran at a higher clock speed. Around the mid-2000s that "free lunch" ended: clock speeds stopped climbing because of heat and power limits. Since then, performance has come from doing many things at once rather than doing one thing faster. This lesson explains the forms parallelism takes, how a multi-core processor is organised, the crucial difference between parallel and concurrent execution, why Flynn's taxonomy (SISD/SIMD/MISD/MIMD) is the standard way to classify parallel machines, and why Amdahl's law means that adding cores almost never gives proportional speedup. By the end you should be able to classify a system using Flynn's taxonomy, perform an Amdahl's-law calculation, and argue precisely about when parallelism helps and when it does not.
This lesson addresses the AQA A-Level Computer Science (7517) specification within §4.7 Fundamentals of computer organisation and architecture, drawing on §4.7.3 Structure and role of the processor and its components — specifically the factors that affect processor performance, the use of multiple cores, and parallel processing.
The treatment of Flynn's taxonomy, the parallel-versus-concurrent distinction, and Amdahl's law develops the spec's requirement to understand how the number of cores and the nature of a task affect performance. It links forwards to ideas of concurrency met in algorithm and operating-system contexts, and backwards to §4.7.2 the stored program concept, since each core is itself a stored-program processor running its own Fetch-Decode-Execute cycle.
There are hard physical limits on how fast a single core can run:
The industry response was to stop chasing higher clock speeds for one core and instead place several cores on one chip. The performance now comes from running work in parallel — but only software that can be split into independent pieces benefits, which is the central tension of this whole topic.
These two words are routinely confused, and examiners reward the precise difference.
| | Concurrency | Parallelism |
|---|---|---|
| Definition | Multiple tasks in progress over an interval | Multiple tasks executing at the same instant |
| Hardware needed | Works on a single core (time-slicing) | Requires multiple cores / execution units |
| Relationship | The broader idea | A specific way of realising concurrency |
| Example | One core rapidly switching between a browser and a music player | Two cores, each running one of those programs simultaneously |
A clean exam sentence: "All parallel execution is concurrent, but not all concurrency is parallel."
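The distinction can be made concrete with a small simulation. The sketch below does not run real threads; it only models which task steps occupy each time slot. The task names, the helper functions `time_slice_schedule` and `parallel_schedule`, and the idea of a uniform "time slot" are illustrative assumptions, not a description of any real scheduler.

```python
from itertools import zip_longest

def time_slice_schedule(tasks):
    """Simulate ONE core: round-robin, one step of one task per time slot."""
    timeline = []
    # zip_longest walks the tasks in lock-step; None marks a finished task
    for steps in zip_longest(*tasks.values()):
        for name, step in zip(tasks.keys(), steps):
            if step is not None:
                timeline.append([f"{name}:{step}"])  # only one task per slot
    return timeline

def parallel_schedule(tasks):
    """Simulate one core PER task: every unfinished task advances each slot."""
    timeline = []
    for steps in zip_longest(*tasks.values()):
        slot = [f"{name}:{step}"
                for name, step in zip(tasks.keys(), steps)
                if step is not None]
        timeline.append(slot)
    return timeline

# Hypothetical workloads: three browser steps and two music-player steps
tasks = {"browser": ["b1", "b2", "b3"], "music": ["m1", "m2"]}

single_core = time_slice_schedule(tasks)
two_cores = parallel_schedule(tasks)

print(len(single_core), single_core)
print(len(two_cores), two_cores)
```

Both schedules are concurrent, because both tasks are in progress over the same interval. Only the second is parallel, because only there do two steps share a time slot.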
ILP extracts parallelism from within a single instruction stream by overlapping or reordering independent instructions:
ILP is largely invisible to the programmer — the hardware finds the parallelism automatically.
DLP applies the same operation to many data items at once. This is the SIMD idea: one instruction, many data. DLP suits graphics, audio, image processing and scientific computing, and is exposed through instruction-set extensions (e.g. SSE and AVX on x86, NEON on ARM) and, at large scale, through GPUs.
To see why it helps, consider brightening an image by adding a constant to every pixel value. The scalar (one-at-a-time) approach processes a single value per instruction:
# Scalar: one addition per loop iteration
FOR i ← 0 TO n - 1
    pixel[i] ← pixel[i] + brightness
ENDFOR
A SIMD instruction loads several adjacent pixels into one wide register and adds the constant to all of them in a single operation. If a 128-bit register holds four 32-bit values, four additions complete in the time of one, so the loop needs only about a quarter as many iterations:
# SIMD: four additions per instruction (128-bit register, 4 x 32-bit lanes)
# Assumes n is a multiple of 4; leftover pixels would need a scalar clean-up loop
FOR i ← 0 TO n - 1 STEP 4
    pixel[i..i+3] ← pixel[i..i+3] + brightness   # one vector add, four lanes
ENDFOR
The crucial point is that the same instruction acts on multiple data items — this is exactly what makes it SIMD in Flynn's taxonomy, and exactly why image, audio and video processing (where one operation is applied uniformly across huge arrays) gain so much from it. It also explains why a GPU, with thousands of lanes, is the extreme expression of this idea.
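The instruction-count saving can be modelled in plain Python. This is only a sketch: Python itself does not issue vector instructions here, so the code merely counts how many "instructions" each approach would need. The function names, the `LANES` constant and the sample pixel values are assumptions made for illustration, and overflow or clamping of pixel values is ignored.

```python
LANES = 4  # assumption: a 128-bit register holding four 32-bit values

def brighten_scalar(pixels, brightness):
    """One addition per 'instruction'; returns (result, instruction_count)."""
    result = []
    instructions = 0
    for value in pixels:
        result.append(value + brightness)
        instructions += 1
    return result, instructions

def brighten_simd(pixels, brightness, lanes=LANES):
    """Model a vector add: each 'instruction' handles up to `lanes` values."""
    result = []
    instructions = 0
    for start in range(0, len(pixels), lanes):
        chunk = pixels[start:start + lanes]            # load adjacent pixels
        result.extend(v + brightness for v in chunk)   # one modelled vector add
        instructions += 1
    return result, instructions

pixels = [10, 20, 30, 40, 50, 60, 70, 80]  # hypothetical pixel values
scalar_out, scalar_count = brighten_scalar(pixels, 5)
simd_out, simd_count = brighten_simd(pixels, 5)

print(scalar_count, simd_count)
```

Both functions produce the same output list; what changes is the number of modelled instructions, which falls by a factor equal to the lane count when the data divides evenly.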
TLP runs different threads or processes on different cores at the same time. The operating-system scheduler allocates threads to cores; each core runs its own independent instruction stream. TLP is the headline benefit of a multi-core processor and the form most visible to application programmers, who must explicitly split their work into threads.
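As a rough illustration of the programmer explicitly splitting work into threads, the sketch below divides a list into four chunks and hands each to a worker thread using Python's standard `concurrent.futures` module. The helper names `sum_of_squares` and `split_into_chunks` and the choice of four workers are assumptions for the example. One caveat: in standard CPython builds the global interpreter lock stops threads from running CPU-bound Python code truly in parallel, so this shows the structure of the decomposition rather than a guaranteed speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(chunk):
    """The work done by one thread on its own slice of the data."""
    return sum(x * x for x in chunk)

def split_into_chunks(data, parts):
    """Divide data into at most `parts` contiguous chunks of similar size."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

data = list(range(1, 1001))
chunks = split_into_chunks(data, 4)

# Each chunk is submitted to the pool; the OS scheduler decides which core
# runs each worker thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_of_squares, chunks))

total = sum(partial_sums)  # combine step: inherently sequential
print(total)
```

The final `sum(partial_sums)` is a small sequential step that no number of threads can remove, which is the kind of cost Amdahl's law accounts for.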
| Form | What is parallelised | Granularity | Who exploits it | Example |
|---|---|---|---|---|
| ILP | Independent instructions in one stream | Finest | The hardware, automatically | Pipelining, superscalar issue |
| DLP | The same operation over many data items | Medium | Compiler / vector instructions | SIMD brightness filter, GPU shading |
| TLP | Whole threads/processes | Coarsest | The programmer + OS scheduler | Two apps on two cores |
A well-engineered system uses all three at once: each core pipelines and superscalar-issues instructions (ILP), runs vector instructions on arrays (DLP), and the OS spreads independent threads across cores (TLP). They are complementary layers, not alternatives.
A multi-core processor places two or more independent cores on a single chip, each able to execute its own Fetch-Decode-Execute cycle.
flowchart TB
subgraph Chip["CPU Chip"]
C0["Core 0<br/>L1-I + L1-D"]
C1["Core 1<br/>L1-I + L1-D"]
C2["Core 2<br/>L1-I + L1-D"]
C3["Core 3<br/>L1-I + L1-D"]
L3["Shared L3 Cache"]
C0 --> L3
C1 --> L3
C2 --> L3
C3 --> L3
end
L3 <--> RAM["Main Memory (RAM)"]
Each core typically owns its L1 cache (split into instruction and data parts — the Modified Harvard idea from lesson 1) and a private L2 cache, while a larger L3 cache is shared between cores so they can exchange data quickly.
The power advantage deserves a closer look, because it is the reason the industry moved to multi-core. Dynamic power rises roughly with the square of the voltage and with the clock frequency, and pushing a single core to a very high clock also needs a higher voltage. Doubling a single core's clock can therefore more than quadruple its power and heat. Splitting the same work across two cores each running at the original clock and voltage delivers similar throughput for far less power — which is why a quad-core chip at 3 GHz is practical where a single core at 12 GHz would be impossible to cool. This is the engineering reality behind the "power wall": once clock scaling became thermally unaffordable, adding cores was the only way to keep raising total performance.
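The argument can be checked with a few lines of arithmetic. The sketch below uses the proportionality stated above (dynamic power scales with voltage squared times frequency) and compares one core at double the clock with two cores at the original clock. The 1.5× voltage increase is a hypothetical figure chosen only to illustrate the effect; real chips differ, and the function name is an assumption.

```python
def relative_dynamic_power(voltage_scale, frequency_scale, cores=1):
    """Dynamic power relative to one baseline core, using P ∝ V^2 * f per core."""
    return cores * (voltage_scale ** 2) * frequency_scale

VOLTAGE_BUMP = 1.5  # hypothetical: extra voltage needed to double the clock

# Option A: one core, clock doubled, voltage raised to keep it stable
fast_single = relative_dynamic_power(VOLTAGE_BUMP, 2.0)

# Option B: two cores, each at the original clock and voltage
dual_core = relative_dynamic_power(1.0, 1.0, cores=2)

print(fast_single, dual_core)
```

Under these assumed figures the single fast core draws 4.5 times the baseline power while the dual-core option draws 2 times, for a similar peak throughput on work that parallelises well.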
A subtle consequence is that a faster clock and more cores are not interchangeable. A higher clock speeds up every program, including purely sequential ones, because each instruction completes sooner. Extra cores speed up only code that has been written to run in parallel. This is why a user running one old single-threaded application may see no benefit at all from upgrading to a CPU with twice as many cores, even though a multi-threaded video encoder on the same machine runs dramatically faster.
Amdahl's law quantifies the maximum speedup obtainable by parallelising a program, given that some fraction must run sequentially. If P is the parallelisable proportion of the work and N is the number of cores, the speedup is:
$$\text{Speedup} = \frac{1}{(1 - P) + \frac{P}{N}}$$

where:

- $P$ is the fraction of the program that can be parallelised, with $0 \le P \le 1$;
- $(1 - P)$ is the sequential fraction that cannot be sped up;
- $N$ is the number of processor cores.

As $N \to \infty$ the term $\frac{P}{N} \to 0$, so the speedup approaches a hard ceiling:

$$\text{Speedup}_{\max} = \frac{1}{1 - P}$$
Suppose 80% of a program can be parallelised (P = 0.8) and we run it on 4 cores:
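$$\text{Speedup} = \frac{1}{(1 - 0.8) + \frac{0.8}{4}} = \frac{1}{0.2 + 0.2} = \frac{1}{0.4} = 2.5$$

So four cores give only a 2.5× speedup, not 4×, because the 20% sequential portion is not sped up at all and now accounts for half of the remaining run time. Pushing to infinitely many cores caps the speedup at $\frac{1}{1 - 0.8} = 5$: you can never exceed 5× for this program no matter how much hardware you buy.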
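The same calculation can be wrapped in a short function, which is handy for checking hand-worked answers. This is a minimal sketch; the function names `amdahl_speedup` and `amdahl_ceiling` are illustrative choices, with `p` and `n` standing for the parallelisable fraction and the number of cores.

```python
def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's law for parallel fraction p on n cores."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)

def amdahl_ceiling(p):
    """Limit of the speedup as the number of cores grows without bound."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p == 1.0:
        return float("inf")  # a fully parallel program has no ceiling
    return 1.0 / (1.0 - p)

print(round(amdahl_speedup(0.8, 4), 2))
print(round(amdahl_ceiling(0.8), 2))
```

Results are rounded for display because floating-point arithmetic can leave tiny errors in the last decimal places.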
| P (parallel fraction) | N = 2 | N = 4 | N = 8 | N → ∞ |
|---|---|---|---|---|
| 0.50 | 1.33× | 1.60× | 1.78× | 2.00× |
| 0.75 | 1.60× | 2.29× | 2.91× | 4.00× |
| 0.90 | 1.82× | 3.08× | 4.71× | 10.00× |
| 0.95 | 1.90× | 3.48× | 5.93× | 20.00× |
The pattern is stark: the higher the sequential fraction, the lower the ceiling. This is exactly why doubling the cores does not double the performance — a phrase examiners love to test.
It often helps to think in concrete times rather than ratios. Suppose a job takes 100 seconds on one core, of which 20 seconds is irreducibly sequential and 80 seconds is perfectly parallelisable (so P = 0.8). On N cores the parallel part is divided among the cores while the sequential part is unchanged:
| Cores N | Sequential time | Parallel time (80 s ÷ N) | Total time | Speedup |
|---|---|---|---|---|
| 1 | 20 s | 80 s | 100 s | 1.0× |
| 2 | 20 s | 40 s | 60 s | 1.67× |
| 4 | 20 s | 20 s | 40 s | 2.5× |
| 8 | 20 s | 10 s | 30 s | 3.33× |
| ∞ | 20 s | 0 s | 20 s | 5.0× |
The 20-second sequential block is a fixed floor the total time can never fall below, however many cores you add — which is why the speedup converges on 5×. This time-based picture is the same Amdahl result seen from the other side, and it makes the "diminishing returns" obvious: going from 1 to 2 cores saves 40 seconds, but going from 4 to 8 cores saves only 10. Each doubling buys less because the parallel part is already small while the sequential floor stays put.
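The time-based view is simple to reproduce. The sketch below recomputes the total times for the 20 s sequential and 80 s parallel split, then measures how many seconds each doubling of cores saves. The function name `total_time` and the variable names are assumptions for illustration, and the model assumes the parallel part divides perfectly with no overhead.

```python
def total_time(sequential_s, parallel_s, cores):
    """Run time when only the parallel part is divided among the cores."""
    return sequential_s + parallel_s / cores

core_counts = (1, 2, 4, 8)
times = {n: total_time(20.0, 80.0, n) for n in core_counts}

# Seconds saved by each doubling of the core count
savings = {
    (a, b): times[a] - times[b]
    for a, b in zip(core_counts, core_counts[1:])
}

for n in core_counts:
    print(n, times[n], round(times[1] / times[n], 2))
print(savings)
```

Each doubling halves the saving of the one before it, while the 20-second sequential term never changes.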
Amdahl's law gives an optimistic upper bound. In practice, the measured speedup is usually lower still, because of overheads the formula ignores:
These overheads tend to grow with the number of cores, so beyond a certain point adding cores can even make a program slower. The exam-ready takeaway: Amdahl's law sets the ceiling, and real-world overheads pull the actual result below it.
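A toy model shows how overhead can turn extra cores into a slowdown. The sketch below adds a coordination cost to the earlier 20 s sequential and 80 s parallel job. The cost model (a fixed 1 second per additional core) is purely hypothetical and chosen for easy arithmetic; real overheads are not this simple, and the function name is an assumption.

```python
def time_with_overhead(sequential_s, parallel_s, cores, overhead_per_core_s):
    """Amdahl-style run time plus a HYPOTHETICAL linear coordination cost."""
    coordination = overhead_per_core_s * (cores - 1)  # assumed cost model
    return sequential_s + parallel_s / cores + coordination

core_counts = (1, 2, 4, 8, 16, 32)
times = {n: time_with_overhead(20.0, 80.0, n, 1.0) for n in core_counts}

best = min(times, key=times.get)  # core count with the shortest run time

for n in core_counts:
    print(n, times[n])
print("fastest at", best, "cores")
```

Under this assumed model the run time falls until 8 cores and then rises again, so 16 and 32 cores are slower than 8 even though Amdahl's law alone would predict a continued small gain.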
Exam Tip: Show the substitution explicitly. Write the formula, substitute $P$ and $N$, then evaluate. A bare numerical answer with no working risks losing method marks even if the figure is right.