Parallel Processing and Multi-core Systems

For decades, computers got faster simply because each new generation of processor ran at a higher clock speed. Around the mid-2000s that "free lunch" ended: clock speeds stopped climbing because of heat and power limits. Since then, performance has come from doing many things at once rather than doing one thing faster. This lesson explains the forms parallelism takes, how a multi-core processor is organised, the crucial difference between parallel and concurrent execution, why Flynn's taxonomy (SISD/SIMD/MISD/MIMD) is the standard way to classify parallel machines, and why Amdahl's law means that adding cores almost never gives proportional speedup. By the end you should be able to classify a system using Flynn's taxonomy, perform an Amdahl's-law calculation, and argue precisely about when parallelism helps and when it does not.

Spec Mapping

This lesson addresses the AQA A-Level Computer Science (7517) specification within §4.7 Fundamentals of computer organisation and architecture, drawing on §4.7.3 Structure and role of the processor and its components — specifically the factors that affect processor performance, the use of multiple cores, and parallel processing.

The treatment of Flynn's taxonomy, the parallel-versus-concurrent distinction, and Amdahl's law develops the spec's requirement to understand how the number of cores and the nature of a task affect performance. It links forwards to ideas of concurrency met in algorithm and operating-system contexts, and backwards to §4.7.2 the stored program concept, since each core is itself a stored-program processor running its own Fetch-Decode-Execute cycle.

Why Parallelism Became Necessary

There are hard physical limits on how fast a single core can run:

The power wall. Dynamic power consumed by a chip rises sharply with clock frequency and voltage. Beyond roughly 3–4 GHz the extra heat becomes impractical to remove, so pushing the clock higher gives diminishing returns for rapidly rising power and temperature.
Signal propagation. Electrical signals travel at a finite speed. As clock periods shrink, the distance a signal can cross in one cycle shrinks too, constraining how large and how fast a single core can be.
Diminishing instruction-level returns. Techniques that speed up a single instruction stream (deeper pipelines, wider superscalar issue) eventually hit dependency and branch limits.

The industry response was to stop chasing higher clock speeds for one core and instead place several cores on one chip. The performance now comes from running work in parallel — but only software that can be split into independent pieces benefits, which is the central tension of this whole topic.

Parallel versus Concurrent: a Precise Distinction

These two words are routinely confused, and examiners reward the precise difference.

Concurrency means managing several tasks that are all in progress over the same period, but not necessarily executing at literally the same instant. A single-core CPU achieves concurrency by time-slicing — rapidly switching between tasks so they all make progress. At any single instant, exactly one task is actually running.
Parallelism means executing two or more tasks at literally the same instant, which requires more than one execution unit (e.g. multiple cores). Parallelism is a way of achieving concurrency, but concurrency does not require parallelism.

	Concurrency	Parallelism
Definition	Multiple tasks in progress over an interval	Multiple tasks executing at the same instant
Hardware needed	Works on a single core (time-slicing)	Requires multiple cores / execution units
Relationship	The broader idea	A specific way of realising concurrency
Example	One core rapidly switching between a browser and a music player	Two cores, each running one of those programs simultaneously

A clean exam sentence: "All parallel execution is concurrent, but not all concurrency is parallel."
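The single-core case can be sketched in a few lines of Python. This is only an illustrative model, not how a real operating system is written: the `task` generator and the `run_time_sliced` scheduler are hypothetical names, and each `yield` stands in for the end of one time slice. The point it shows is that both tasks make progress over the same interval, yet only one is ever running at a given instant.

```python
from collections import deque


def task(name, steps):
    """A hypothetical task that pauses (yields) after each unit of work."""
    for step in range(1, steps + 1):
        yield f"{name} step {step}"


def run_time_sliced(tasks):
    """Round-robin scheduler modelling ONE core: one task runs per time slice."""
    ready = deque(tasks)
    trace = []
    while ready:
        current = ready.popleft()
        try:
            trace.append(next(current))  # run the task for one time slice
            ready.append(current)        # not finished: rejoin the back of the queue
        except StopIteration:
            pass                         # finished: do not requeue
    return trace


trace = run_time_sliced([task("browser", 2), task("music", 3)])
for line in trace:
    print(line)
```

The printed trace alternates between the browser and the music player until the browser finishes. That interleaving is concurrency without parallelism; true parallelism would need a second core executing the other task at the same moment.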

Types of Parallelism

Instruction-Level Parallelism (ILP)

ILP extracts parallelism from within a single instruction stream by overlapping or reordering independent instructions:

Pipelining (previous lesson) overlaps the fetch/decode/execute stages of consecutive instructions.
Superscalar execution provides multiple execution units (several ALUs, load/store units) so more than one instruction can be issued per clock cycle.
Out-of-order execution lets independent instructions execute as soon as their operands are ready, rather than strictly in program order.

ILP is largely invisible to the programmer — the hardware finds the parallelism automatically.

Data-Level Parallelism (DLP)

DLP applies the same operation to many data items at once. This is the SIMD idea: one instruction, many data. DLP suits graphics, audio, image processing and scientific computing, and is exposed through instruction-set extensions (e.g. SSE and AVX on x86, NEON on ARM) and, at large scale, through GPUs.

To see why it helps, consider brightening an image by adding a constant to every pixel value. The scalar (one-at-a-time) approach processes a single value per instruction:

# Scalar: one addition per loop iteration
FOR i ← 0 TO n - 1
    pixel[i] ← pixel[i] + brightness
ENDFOR

A SIMD instruction loads several adjacent pixels into one wide register and adds the constant to all of them in a single operation. If a 128-bit register holds four 32-bit values, four additions complete in the time of one, so the loop needs only about a quarter as many iterations:

# SIMD: four additions per instruction (128-bit register, 4 x 32-bit lanes)
# Assumes n is a multiple of 4; any leftover pixels would need a scalar loop
FOR i ← 0 TO n - 1 STEP 4
    pixel[i..i+3] ← pixel[i..i+3] + brightness   # one vector add, four lanes
ENDFOR

The crucial point is that the same instruction acts on multiple data items — this is exactly what makes it SIMD in Flynn's taxonomy, and exactly why image, audio and video processing (where one operation is applied uniformly across huge arrays) gain so much from it. It also explains why a GPU, with thousands of lanes, is the extreme expression of this idea.
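The saving in instruction count can be modelled in plain Python. This is a sketch under stated assumptions rather than real SIMD code: Python still performs the additions one at a time, `LANES = 4` assumes the 128-bit register with four 32-bit values described above, and the function names are hypothetical. The code simply counts how many add instructions each approach would issue.

```python
LANES = 4  # assumed: a 128-bit register holding four 32-bit values


def brighten_scalar(pixels, brightness):
    """One addition per pixel; returns (result, number of add instructions)."""
    result = []
    for value in pixels:
        result.append(value + brightness)
    return result, len(pixels)


def brighten_vector(pixels, brightness, lanes=LANES):
    """Models one vector add per group of `lanes` pixels.

    A final partial group still costs one vector add in this model.
    """
    result = []
    vector_adds = 0
    for start in range(0, len(pixels), lanes):
        group = pixels[start:start + lanes]
        # The next line stands in for ONE vector instruction acting on the whole group.
        result.extend(value + brightness for value in group)
        vector_adds += 1
    return result, vector_adds


pixels = list(range(10))
scalar_result, scalar_ops = brighten_scalar(pixels, 5)
vector_result, vector_ops = brighten_vector(pixels, 5)
print(scalar_ops, vector_ops)  # 10 3
```

Both functions produce identical pixel values; only the number of instructions differs. Ten pixels need ten scalar adds but only three vector adds, the last of which uses just two of its four lanes.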

Task-Level Parallelism (TLP)

TLP runs different threads or processes on different cores at the same time. The operating-system scheduler allocates threads to cores; each core runs its own independent instruction stream. TLP is the headline benefit of a multi-core processor and the form most visible to application programmers, who must explicitly split their work into threads.
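As a rough illustration of what "explicitly split their work into threads" means, the sketch below divides a list into chunks, sums each chunk in its own thread, and then combines the partial results. The helper names are hypothetical. One important caveat: in standard CPython the global interpreter lock stops threads from executing Python code at literally the same instant, so this example shows the structure of a task-parallel program rather than a guaranteed speedup. Languages with true multi-threading, or Python's process-based pools, would run the chunks on separate cores.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, parts):
    """Divide data into `parts` contiguous slices of near-equal length."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def threaded_sum(data, workers=4):
    """Sum each chunk in its own thread, then combine the partial sums."""
    chunks = split_into_chunks(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))  # independent pieces of work
    return sum(partial_sums)                        # sequential combine step


print(threaded_sum(list(range(1_000)), workers=4))  # 499500
```

The chunks are independent, which is what makes the work suitable for TLP. The final combining step has to wait for every thread to finish, so even this small example contains a sequential part.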

The three forms at a glance

Form	What is parallelised	Granularity	Who exploits it	Example
ILP	Independent instructions in one stream	Finest	The hardware, automatically	Pipelining, superscalar issue
DLP	The same operation over many data items	Medium	Compiler / vector instructions	SIMD brightness filter, GPU shading
TLP	Whole threads/processes	Coarsest	The programmer + OS scheduler	Two apps on two cores

A well-engineered system uses all three at once: each core pipelines and superscalar-issues instructions (ILP), runs vector instructions on arrays (DLP), and the OS spreads independent threads across cores (TLP). They are complementary layers, not alternatives.

Multi-Core Processors

A multi-core processor places two or more independent cores on a single chip, each able to execute its own Fetch-Decode-Execute cycle.

flowchart TB
    subgraph Chip["CPU Chip"]
      C0["Core 0<br/>L1-I + L1-D"]
      C1["Core 1<br/>L1-I + L1-D"]
      C2["Core 2<br/>L1-I + L1-D"]
      C3["Core 3<br/>L1-I + L1-D"]
      L3["Shared L3 Cache"]
      C0 --> L3
      C1 --> L3
      C2 --> L3
      C3 --> L3
    end
    L3 <--> RAM["Main Memory (RAM)"]

Each core typically owns its L1 cache (split into instruction and data parts, the Modified Harvard idea from lesson 1) and a private L2 cache (omitted from the diagram for simplicity), while a larger L3 cache is shared between cores so they can exchange data quickly.

Advantages

Higher throughput — independent threads or processes run genuinely in parallel.
Better multitasking — the OS can place different applications on different cores so a heavy task on one core does not stall the rest.
Better performance-per-watt — two cores at a moderate clock often deliver more work for less power than one core driven to a very high clock, because power rises faster than frequency.

The power advantage deserves a closer look, because it is the reason the industry moved to multi-core. Dynamic power rises roughly with the square of the voltage and with the clock frequency, and pushing a single core to a very high clock also needs a higher voltage. Doubling a single core's clock can therefore more than quadruple its power and heat. Splitting the same work across two cores each running at the original clock and voltage delivers similar throughput for far less power — which is why a quad-core chip at 3 GHz is practical where a single core at 12 GHz would be impossible to cool. This is the engineering reality behind the "power wall": once clock scaling became thermally unaffordable, adding cores was the only way to keep raising total performance.
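The arithmetic behind this comparison can be checked with a very small model. It assumes the simplified relationship stated above (dynamic power proportional to voltage squared times frequency, per core) and, purely for illustration, that doubling the clock needs 1.5 times the voltage. That voltage figure is a made-up value chosen to show the effect, not a measurement of any real chip.

```python
def relative_dynamic_power(cores, freq_scale, voltage_scale):
    """Power relative to one baseline core, using P proportional to V^2 * f per core."""
    return cores * (voltage_scale ** 2) * freq_scale


# Hypothetical assumption: doubling the clock requires 1.5x the supply voltage.
one_fast_core = relative_dynamic_power(cores=1, freq_scale=2.0, voltage_scale=1.5)

# Two cores, each at the original clock and voltage.
two_baseline_cores = relative_dynamic_power(cores=2, freq_scale=1.0, voltage_scale=1.0)

print(one_fast_core, two_baseline_cores)  # 4.5 2.0
```

Under these assumptions the single fast core draws 4.5 times the baseline power, while two baseline cores draw only twice the baseline for a similar amount of work, provided the work can actually be split between them.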

Limitations

Not all software parallelises. Inherently sequential code, where each step depends on the previous result, gains nothing from extra cores.
Synchronisation and communication overhead — cores must coordinate access to shared data (locks, cache-coherence traffic), and this overhead grows with the number of cores.
Amdahl's law — the sequential fraction of a program places a hard ceiling on the achievable speedup, however many cores are added (see below).
Operating-system and programmer burden — the OS scheduler must distribute threads across cores intelligently, and the programmer must explicitly divide a task into threads and manage shared data safely. A single-threaded program simply will not use the extra cores no matter how many are present.

A subtle consequence is that a faster clock and more cores are not interchangeable. A higher clock speeds up almost every program, including purely sequential ones, because each instruction completes sooner (the gain is smaller when a program spends much of its time waiting on memory or I/O). Extra cores speed up only code that has been written to run in parallel. This is why a user running one old single-threaded application may see no benefit at all from upgrading to a CPU with twice as many cores, even though a multi-threaded video encoder on the same machine runs dramatically faster.

Amdahl's Law

Amdahl's law quantifies the maximum speedup obtainable by parallelising a program, given that some fraction must run sequentially. If P is the parallelisable proportion of the work and N is the number of cores, the speedup is:

$\text{Speedup} = \frac{1}{(1 - P) + \dfrac{P}{N}}$

where:

P is the fraction of the program that can be parallelised, with $0 \le P \le 1$;
(1 - P) is the sequential fraction that cannot be sped up;
N is the number of processor cores.

As $N \to \infty$ the term $\frac{P}{N} \to 0$, so the speedup approaches a hard ceiling:

$\text{Speedup}_{\max} = \frac{1}{1 - P}$

Worked calculation

Suppose 80% of a program can be parallelised (P = 0.8) and we run it on 4 cores:

$\text{Speedup} = \frac{1}{(1 - 0.8) + \dfrac{0.8}{4}} = \frac{1}{0.2 + 0.2} = \frac{1}{0.4} = 2.5$

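So four cores give only a 2.5× speedup, not 4×, because the 20% sequential portion is not sped up at all: on four cores it takes as long as the entire parallel portion and so accounts for half of the new running time. Pushing to infinitely many cores caps the speedup at $\frac{1}{1 - 0.8} = 5$, so you can never exceed 5× for this program no matter how much hardware you buy.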
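The formula translates directly into code, which is a convenient way to check hand calculations. The function names below are illustrative choices rather than part of any standard library, and the results are rounded when printed because floating-point arithmetic can leave tiny errors in the last decimal places.

```python
def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's law for parallel fraction p on n cores."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 / ((1 - p) + p / n)


def amdahl_ceiling(p):
    """Limit of the speedup as n grows without bound (infinite if p is exactly 1)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if p == 1:
        return float("inf")
    return 1 / (1 - p)


print(round(amdahl_speedup(0.8, 4), 2))  # 2.5
print(round(amdahl_ceiling(0.8), 2))     # 5.0
```

Calling `amdahl_speedup` with other values of `p` and `n` reproduces any other speedup figure, which makes it a useful check on exam-style working.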

Speedup table

P (parallel fraction)	N = 2	N = 4	N = 8	N → ∞
0.50	1.33×	1.60×	1.78×	2.00×
0.75	1.60×	2.29×	2.91×	4.00×
0.90	1.82×	3.08×	4.71×	10.00×
0.95	1.90×	3.48×	5.93×	20.00×

The pattern is stark: the higher the sequential fraction, the lower the ceiling. This is exactly why doubling the cores does not double the performance — a phrase examiners love to test.

A time-based view of the same result

It often helps to think in concrete times rather than ratios. Suppose a job takes 100 seconds on one core, of which 20 seconds is irreducibly sequential and 80 seconds is perfectly parallelisable (so P = 0.8). On N cores the parallel part is divided among the cores while the sequential part is unchanged:

Cores N	Sequential time	Parallel time (80 s ÷ N)	Total time	Speedup
1	20 s	80 s	100 s	1.0×
2	20 s	40 s	60 s	1.67×
4	20 s	20 s	40 s	2.5×
8	20 s	10 s	30 s	3.33×
∞	20 s	0 s	20 s	5.0×

The 20-second sequential block is a fixed floor the total time can never fall below, however many cores you add, which is why the speedup converges on 5×. This time-based picture is the same Amdahl result seen from the other side, and it makes the "diminishing returns" obvious: going from 1 to 2 cores saves 40 seconds, but going from 4 to 8 cores saves only 10. Each doubling halves only the remaining parallel time, which is itself shrinking, while the sequential floor stays put.
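The same time-based reasoning can be written as a short calculation. This sketch assumes, as the table does, that the parallel part divides perfectly among the cores with no overhead; `total_time` is a hypothetical helper name.

```python
def total_time(sequential_s, parallel_s, cores):
    """Run time when only the parallel part is divided among the cores."""
    return sequential_s + parallel_s / cores


previous = total_time(20, 80, 1)
for cores in (2, 4, 8, 16):
    current = total_time(20, 80, cores)
    speedup = total_time(20, 80, 1) / current
    print(f"{cores} cores: {current:.0f} s, saved {previous - current:.0f} s, "
          f"speedup {speedup:.2f}x")
    previous = current
```

Each doubling saves half as much time as the one before (40 s, 20 s, 10 s, then 5 s), while the total never drops below the 20-second sequential floor.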

Why real speedups fall short of even the Amdahl figure

Amdahl's law gives an optimistic upper bound. In practice, the measured speedup is usually lower still, because of overheads the formula ignores:

Synchronisation — threads must wait at barriers and acquire locks, idling cores.
Communication — cores exchange data and keep their caches coherent, which consumes bus bandwidth.
Load imbalance — if the work does not divide evenly, some cores finish early and sit idle.

These overheads tend to grow with the number of cores, so beyond a certain point adding cores can even make a program slower. The exam-ready takeaway: Amdahl's law sets the ceiling, and real-world overheads pull the actual result below it.
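To see how overheads can turn extra cores into a slowdown, the toy model below adds a coordination cost to the time-based view. The linear cost of 0.5 seconds for each core beyond the first is an invented assumption purely for illustration; real overheads depend on the program and the hardware and are rarely this tidy. The function names are hypothetical.

```python
def time_with_overhead(sequential_s, parallel_s, cores, overhead_per_extra_core_s):
    """Amdahl-style run time plus an ASSUMED overhead that grows linearly with cores."""
    overhead = overhead_per_extra_core_s * (cores - 1)
    return sequential_s + parallel_s / cores + overhead


def best_core_count(sequential_s, parallel_s, overhead_per_extra_core_s, max_cores=64):
    """Core count (1..max_cores) that gives the shortest modelled run time."""
    return min(
        range(1, max_cores + 1),
        key=lambda n: time_with_overhead(
            sequential_s, parallel_s, n, overhead_per_extra_core_s
        ),
    )


best = best_core_count(20, 80, 0.5)
print(best)  # 13
for cores in (4, best, 64):
    print(cores, round(time_with_overhead(20, 80, cores, 0.5), 2))
```

With these made-up numbers the run time is shortest at 13 cores, and 64 cores is slower than 4. The exact figures mean nothing in themselves; the shape of the result is the point, because once the overhead added by a new core exceeds the parallel time it saves, more cores make things worse.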

Exam Tip: Show the substitution explicitly. Write the formula, substitute P and N, then evaluate. A bare numerical answer with no working risks losing method marks even if the figure is right.
