Not all processors are built the same way. This lesson compares the two great instruction-set philosophies — RISC and CISC — then surveys the specialised and parallel hardware that defines modern computing: GPUs, multicore processors, co-processors, and the mathematics that governs how much parallelism can actually buy you, Amdahl's Law.
This lesson develops OCR H446 sections 1.1.1 and 1.1.2 (types of processor and parallel processing). It contrasts RISC and CISC characteristics and trade-offs, explains GPUs and the data-parallel workloads they suit, covers multicore and parallel processing and co-processors, and introduces Amdahl's Law as a quantitative limit on parallel speed-up, with a worked calculation. It links forward to the Pipelining lesson, where RISC's fixed-length instructions prove decisive.
A CISC processor offers a large, varied instruction set in which a single instruction may perform a complex, multi-step operation — for example, an instruction that reads two operands from memory, multiplies them, and writes the result back, all in one. CISC arose when memory was scarce and expensive: packing more work into each instruction kept programs small and reduced the number of slow memory fetches.
| Feature | Detail |
|---|---|
| Instruction set | Large — hundreds of different instructions |
| Instruction length | Variable — instructions occupy different numbers of bytes |
| Instruction complexity | A single instruction can perform a complex task (e.g. memory-to-memory multiply) |
| Clock cycles per instruction | Variable — a complex instruction may take many cycles |
| Addressing modes | Many, giving the compiler flexibility |
| Registers | Fewer general-purpose registers |
| Hardware | Complex — typically uses microcode to decompose complex instructions |
| Compiler complexity | Simpler — many powerful instructions to choose from |
Examples: Intel x86 / x86-64 in most desktops and laptops; AMD's x86-64 processors.
A RISC processor uses a small, highly optimised instruction set in which each instruction does one simple thing and ideally completes in a single clock cycle. Memory access is restricted to dedicated LOAD and STORE instructions (a load-store architecture); all arithmetic operates register-to-register. The simplicity moves complexity from the silicon to the compiler — and, crucially, makes instruction timing predictable.
| Feature | Detail |
|---|---|
| Instruction set | Small — typically fewer than 100 instructions |
| Instruction length | Fixed — every instruction is the same width |
| Instruction complexity | Each instruction performs one simple operation |
| Clock cycles per instruction | Ideally one |
| Addressing modes | Few |
| Registers | Many general-purpose registers (often 32+) |
| Hardware | Simpler — instructions are hardwired, no microcode |
| Compiler complexity | More complex — it must break tasks into sequences of simple instructions |
Examples: ARM (smartphones, tablets, Raspberry Pi); Apple M-series (ARM-based); MIPS (embedded systems).
| Feature | CISC | RISC |
|---|---|---|
| Number of instructions | Many (hundreds) | Few (under 100) |
| Instruction length | Variable | Fixed |
| Cycles per instruction | Multiple | Ideally one |
| Registers | Fewer | Many (32+) |
| Pipelining efficiency | Harder (variable-length, unpredictable) | Easier (fixed-length, predictable) |
| Power consumption | Higher | Lower |
| Code density | Higher (fewer instructions per task) | Lower (more instructions per task) |
| Compiler complexity | Simpler | More complex |
| Microcode | Yes | No (hardwired) |
| Typical use | Desktop PCs, servers | Mobile, embedded, increasingly laptops/servers |
The central trade-off: CISC achieves high code density and shifts effort onto the hardware, which suited memory-scarce eras and eases the compiler's job. RISC accepts larger code in exchange for simpler, faster, lower-power hardware with uniform instruction timing. That uniformity is the killer feature for pipelining: because every RISC instruction is the same length and passes through the same predictable sequence of stages, the processor can overlap instructions cleanly. The low power draw is also why RISC/ARM dominates battery-powered devices.
The pipelining link deserves spelling out, because it is the heart of why RISC won in low-power computing. A pipeline overlaps instructions by splitting each into stages (fetch, decode, execute, …) and working on several instructions at once, each in a different stage. For this to flow smoothly, the processor must know where the next instruction begins the instant it starts fetching the current one. With RISC's fixed-length instructions, the next instruction is always a fixed number of bytes ahead — the fetch unit simply advances the PC by a constant and never has to wait. The decoder, too, always sees fields (opcode, registers) in the same bit positions, so decoding is fast and uniform.
CISC's variable-length instructions break this. The processor cannot know where instruction n+1 starts until it has at least partly decoded instruction n to discover n's length. Fetch therefore cannot run cleanly ahead of decode, and the stages are harder to balance because a complex multi-cycle instruction and a one-cycle instruction occupy a pipeline stage for very different times — creating structural and timing complications. RISC's uniform length and single-cycle ideal mean each stage takes a predictable time, so the pipeline stays full. This is precisely why the comparison table lists "pipelining efficiency" as easier for RISC: it is a direct consequence of fixed-length instructions, not a coincidence. (The mechanics of pipelining and its hazards are developed fully in the next lesson.)
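The fetch problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 4-byte fixed width and the toy variable-length encoding (where the first byte of each instruction stores its length) are invented for the example and do not correspond to any real instruction set.

```python
FIXED_WIDTH = 4  # assumed instruction width in bytes for the toy RISC encoding


def fixed_length_starts(code: bytes) -> list[int]:
    """Every boundary is known up front (0, 4, 8, ...) with no decoding."""
    return list(range(0, len(code), FIXED_WIDTH))


def variable_length_starts(code: bytes) -> list[int]:
    """Toy CISC encoding: the first byte of each instruction holds its length.

    The start of instruction n+1 is unknown until instruction n has been
    inspected, so the boundaries can only be found by walking in order.
    """
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        length = code[pc]  # must read (partly decode) this instruction first
        if length == 0:
            raise ValueError("invalid zero-length instruction")
        pc += length
    return starts


risc_program = bytes(12)  # three 4-byte instructions
cisc_program = bytes([2, 0, 5, 0, 0, 0, 0, 1, 3, 0, 0])  # lengths 2, 5, 1, 3

print(fixed_length_starts(risc_program))    # [0, 4, 8]
print(variable_length_starts(cisc_program))  # [0, 2, 7, 8]
```

The fixed-length function never looks at the bytes at all, which mirrors a fetch unit that simply adds a constant to the PC. The variable-length function has a data dependency from each instruction to the next, which is the serial step that makes running fetch ahead of decode harder.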
The line has blurred. Modern x86 (CISC) chips decode each complex instruction into a stream of internal micro-operations (micro-ops) that a RISC-like back-end executes and pipelines. So a contemporary processor presents a CISC interface but runs a RISC-style engine underneath. This combines strengths of both approaches: programs keep the dense, backward-compatible CISC instruction encoding, while the fast, pipelined RISC-style core does the actual work on the simple micro-ops. It also explains why the RISC-versus-CISC question is no longer mainly about raw speed but about power efficiency and design philosophy, which is the reason ARM (a clean RISC design with no legacy CISC front-end to decode) dominates phones and is now competitive in laptops and servers.
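As a rough illustration of the micro-op idea, the sketch below decodes one complex instruction into simple load-store steps and runs them. Everything here is hypothetical: the `MULM` mnemonic, the temporary register names `t0` to `t2`, and the dictionary used as memory are invented for the example, and real x86 decoders are far more elaborate.

```python
def decode_to_micro_ops(instruction: tuple) -> list[tuple]:
    """Split one toy CISC-style instruction into RISC-like micro-ops."""
    opcode, *operands = instruction
    if opcode == "MULM":  # hypothetical: mem[dst] = mem[a] * mem[b]
        dst, a, b = operands
        return [
            ("LOAD", "t0", a),          # fetch first operand from memory
            ("LOAD", "t1", b),          # fetch second operand from memory
            ("MUL", "t2", "t0", "t1"),  # register-to-register arithmetic
            ("STORE", "t2", dst),       # write the result back to memory
        ]
    return [instruction]  # simple instructions pass through unchanged


def execute(micro_ops: list[tuple], memory: dict) -> dict:
    """Run micro-ops on a tiny register file; memory is updated in place."""
    regs = {}
    for op, *args in micro_ops:
        if op == "LOAD":
            reg, addr = args
            regs[reg] = memory[addr]
        elif op == "MUL":
            dest, x, y = args
            regs[dest] = regs[x] * regs[y]
        elif op == "STORE":
            reg, addr = args
            memory[addr] = regs[reg]
        else:
            raise ValueError(f"unknown micro-op: {op}")
    return memory


memory = {0x10: 6, 0x14: 7, 0x18: 0}
micro_ops = decode_to_micro_ops(("MULM", 0x18, 0x10, 0x14))
execute(micro_ops, memory)
print(len(micro_ops), memory[0x18])  # 4 42
```

One dense instruction in the program becomes four uniform steps inside the core, each of which only touches memory or only does arithmetic, never both.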
Exam Tip: To compare RISC and CISC, give at least four crisp differences (instruction count, length fixed/variable, cycles per instruction, register count), then note that RISC pipelines more easily because instructions are fixed-length, and finish with the modern micro-op convergence for top marks.
A multicore processor places two or more independent cores on one chip, each able to fetch, decode and execute its own instruction stream. This delivers true parallelism (genuinely simultaneous execution), unlike time-sliced concurrency on a single core.
| Aspect | Explanation |
|---|---|
| True parallelism | Each core runs a separate thread at the same instant |
| Shared cache | Cores usually share a last-level cache (typically L3; L2 is often private to each core), enabling fast inter-core data sharing |
| Multitasking | The OS schedules different processes/threads onto different cores |
The two most examinable parallel models, SIMD and MIMD, come from Flynn's taxonomy, which classifies architectures by how many instruction streams and how many data streams they handle at once:
| Class | Instruction streams | Data streams | Meaning | Typical hardware |
|---|---|---|---|---|
| SISD | One | One | A classic single-core processor doing one thing at a time | Traditional uniprocessor |
| SIMD | One | Many | One instruction applied simultaneously across many data items | GPU shader cores; CPU vector units (AVX) |
| MISD | Many | One | Several operations on the same data (rare) | Fault-tolerant pipelines (very rare) |
| MIMD | Many | Many | Independent cores each running their own instructions on their own data | Multicore CPUs; clusters |
The distinction matters because SIMD and MIMD suit different problems. SIMD shines when the same calculation must hit a huge, regular dataset — shading a million pixels, multiplying two large matrices, applying a filter to every sample in an audio buffer. Because every data lane runs the identical instruction in lockstep, the control hardware is shared and the silicon spent per lane is tiny, which is why a GPU can pack thousands of SIMD lanes. The weakness is branch divergence: if a conditional sends some lanes one way and the rest another, the hardware must run both paths with the inactive lanes masked off, wasting throughput. SIMD therefore hates irregular, branchy data.
MIMD is the opposite trade-off. Each core has its own control unit and program counter, so cores can run entirely different code on entirely different data — one core compressing a file, another decoding a video, a third running the operating system. This flexibility suits task-level parallelism and irregular workloads, but it costs far more silicon per core (each needs its own fetch/decode machinery) and demands synchronisation when cores share data. The practical upshot: a modern PC uses MIMD at the level of its handful of fat CPU cores and SIMD inside both the GPU and the CPU's vector units — the two models are layered, not rivals.
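Branch divergence can be made concrete with a small emulation. The sketch below is plain sequential Python that only mimics lockstep behaviour; the function name `simd_if_else` and the pass-counting scheme are assumptions made for illustration, not how any particular GPU reports its work.

```python
def simd_if_else(data, condition, then_op, else_op):
    """Emulate a SIMD unit evaluating `then_op if condition else else_op`.

    Returns (results, passes), where passes counts how many times the
    whole set of lanes had to be stepped through.
    """
    mask = [condition(x) for x in data]  # one predicate bit per lane
    results = list(data)
    passes = 0

    if any(mask):  # at least one lane takes the 'then' path
        passes += 1
        for lane, active in enumerate(mask):
            if active:  # inactive lanes are masked off and sit idle
                results[lane] = then_op(data[lane])

    if not all(mask):  # at least one lane takes the 'else' path
        passes += 1
        for lane, active in enumerate(mask):
            if not active:
                results[lane] = else_op(data[lane])

    return results, passes


pixels = [1, 2, 3, 4, 5, 6, 7, 8]

# Uniform branch: every lane agrees, so one pass suffices.
uniform, uniform_passes = simd_if_else(
    pixels, lambda x: x > 0, lambda x: x * 10, lambda x: -x
)

# Divergent branch: odd and even lanes disagree, so both paths must run.
divergent, divergent_passes = simd_if_else(
    pixels, lambda x: x % 2 == 0, lambda x: x * 10, lambda x: -x
)

print(uniform_passes, divergent_passes)  # 1 2
```

In the divergent case each pass keeps only half the lanes busy, so the same eight results cost twice as many lockstep steps. An MIMD machine would not pay this penalty, because each core follows its own branch independently.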
| Limitation | Explanation |
|---|---|
| Software must be multi-threaded | A single-threaded program uses only one core; the rest sit idle |
| Synchronisation overhead | Coordinating threads, locks and shared data costs time |
| Diminishing returns | Doubling cores rarely doubles performance |
| Amdahl's Law | The serial fraction caps the achievable speed-up (below) |
It is tempting to imagine that doubling the cores doubles the speed, but three real costs intervene. First, the software must be written to use them: a single-threaded program runs on exactly one core and leaves the rest idle, so the gain depends entirely on how much of the work can be split into independent threads. Second, threads that share data must be synchronised — using locks, semaphores or message passing — and this coordination is itself serial work that does not parallelise; the more cores, the more coordination, so synchronisation overhead grows with scale. Third, the cores compete for shared resources: they share the L3 cache and the single path to main memory, so adding cores can saturate memory bandwidth and starve every core at once. A closely related issue is cache coherence — when several cores cache the same memory location, the hardware must ensure that a write by one core is seen by the others, using a coherence protocol whose traffic also grows with core count. All three costs are why real speed-up falls short of the core count, and they are the practical backdrop against which Amdahl's Law sets the theoretical ceiling.
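The synchronisation cost can be seen in a minimal Python sketch using the standard `threading` module. The helper name `run_counter` is hypothetical. Note one assumption: in standard CPython, threads do not execute Python bytecode in parallel, so this shows how a lock serialises access to shared data rather than demonstrating a real multicore speed-up.

```python
import threading


def run_counter(num_threads: int, increments: int) -> int:
    """Have several threads increment one shared counter under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # Only one thread at a time may hold the lock, so this
            # read-modify-write is serial however many threads exist.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(run_counter(4, 10_000))  # 40000
```

The lock guarantees the correct total, but every increment passes through a single-file section. That protected region is exactly the kind of serial work that does not shrink as cores are added.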
Amdahl's Law quantifies the ceiling on speed-up when only part of a program can be parallelised. If a fraction p of the work is parallelisable and (1−p) is inherently serial, then running the parallel part across N processors gives an overall speed-up of:
$$S(N) = \frac{1}{(1-p) + \dfrac{p}{N}}$$
As N→∞, the parallel term vanishes and the speed-up is bounded by the serial fraction alone:
$$S_{\max} = \frac{1}{1-p}$$
Question: A program spends 80% of its time in a part that can be parallelised and 20% in a strictly serial part. Find the speed-up on (a) 4 cores and (b) an unlimited number of cores.
Here p=0.8 and 1−p=0.2.
(a) With N=4:
$$S(4) = \frac{1}{0.2 + \dfrac{0.8}{4}} = \frac{1}{0.2 + 0.2} = \frac{1}{0.4} = 2.5$$
So four cores give only a 2.5× speed-up, not 4× — the serial 20% holds it back.
(b) As N→∞:
$$S_{\max} = \frac{1}{1 - 0.8} = \frac{1}{0.2} = 5$$
Even with infinite cores the program can never run more than 5× faster, because one fifth of it must run serially. This is the fundamental reason adding cores yields diminishing returns, and why reducing the serial fraction often matters more than adding hardware.
Exam Tip: In Amdahl's Law questions, identify p and 1−p first, substitute into S(N)=1/((1−p)+p/N), and remember the limit 1/(1−p) for "infinite processors". Always interpret the number: state plainly that the serial portion is the bottleneck.
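The formula S(N)=1/((1−p)+p/N) is easy to check numerically. The following is a minimal sketch; the function names `amdahl_speedup` and `amdahl_limit` are chosen for this example and are not part of any library.

```python
import math


def amdahl_speedup(p: float, n: float) -> float:
    """Speed-up on n processors when a fraction p of the work parallelises."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Ceiling on speed-up as the processor count grows without bound."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return math.inf if p == 1.0 else 1.0 / (1.0 - p)


# Reproduce the worked example: 80% parallelisable, 20% serial.
for n in (1, 2, 4, 8, 16, 1024):
    print(f"N={n:>4}: speed-up = {amdahl_speedup(0.8, n):.2f}")
print(f"limit as N grows: {amdahl_limit(0.8):.2f}")
```

Printing the speed-up for a growing number of cores shows the values creeping towards the limit of 5 without reaching it, which is the diminishing-returns pattern the worked example describes.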
The first example fixed the serial fraction and varied the cores. This second example does the opposite — it fixes the cores and varies the serial fraction — to reveal the more important lesson Amdahl's Law teaches.
Question: A program is run on 8 cores. Compare the speed-up when the serial fraction is (a) 10% and (b) 5%.
(a) With 1−p=0.10, so p=0.90, and N=8:
$$S(8) = \frac{1}{0.10 + \dfrac{0.90}{8}} = \frac{1}{0.10 + 0.1125} = \frac{1}{0.2125} \approx 4.71$$
(b) Now halve the serial fraction to 1−p=0.05, so p=0.95, still on N=8:
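Substituting these values into the same formula, S(N)=1/((1−p)+p/N), the arithmetic works out as follows (a sketch of the calculation, assuming only the values stated in the question):

$$S(8) = \frac{1}{0.05 + \dfrac{0.95}{8}} = \frac{1}{0.05 + 0.11875} = \frac{1}{0.16875} \approx 5.93$$

On the same 8 cores, halving the serial fraction from 10% to 5% lifts the speed-up from about 4.71 to about 5.93.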