CPU Architecture

This lesson examines the structural blueprints behind every processor you will ever program: how the CPU is wired internally, where instructions and data live, and how a handful of tiny registers orchestrate the entire machine. Master this and the fetch-decode-execute cycle, processor performance, and pipelining all fall into place.

Spec Mapping

This lesson develops OCR H446 section 1.1.1 (Structure and function of the processor). It establishes the Von Neumann stored-program model, contrasts it with the Harvard and modified Harvard alternatives, sets out the purpose of the core registers (PC, ACC, MAR, MDR, CIR and the status register), and explains the address, data and control buses — including how bus width and word length influence performance. These ideas are the foundation for the fetch-decode-execute cycle (1.1.1), processor performance factors (1.1.1) and pipelining (1.1.1) developed in later lessons, and they connect outward to data representation in module 1.4.

The Von Neumann Model

The Von Neumann architecture is named after the mathematician John von Neumann, whose 1945 report describing the design of the EDVAC set out the principle that still dominates general-purpose computing. Its defining idea is the stored-program concept: a program is not hard-wired into the machine but is held as data in the same memory that holds the values it operates on. Instructions and operands are both fetched across a single shared system bus.

This was revolutionary. Earlier machines such as ENIAC were rewired by hand to change their task; the stored-program idea meant a computer could be reprogrammed simply by loading new instructions into memory — exactly the way you load a new application today.

Key Components

Component	Role
Central Processing Unit (CPU)	Fetches, decodes and executes instructions; contains the control unit, ALU and registers
Main Memory (RAM)	Stores both program instructions and data in one single, uniform address space
System Bus	Connects the CPU to memory and I/O — comprises the address bus, data bus and control bus
Input/Output (I/O)	Peripheral devices connected through I/O controllers

Because instructions and data live in one address space, a single addressing scheme suffices and memory can be flexibly partitioned between code and data at run time. This flexibility is precisely why operating systems, compilers and self-loading programs are possible: code is just bytes that happen to be interpreted as instructions.
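To make "code is just bytes" concrete, here is a minimal Python sketch. It assumes a hypothetical LMC-style encoding in which a memory word equals an opcode digit times 100 plus an address, so 164 reads as ADD 64. Real processors use binary bit patterns, but the principle is the same. One list stands in for the single address space, and the same cell can be decoded as an instruction or altered as a number.

```python
# A minimal sketch of the stored-program idea. The encoding is a
# hypothetical LMC-style one: word = opcode_digit * 100 + address.
OPCODES = {1: "ADD", 2: "SUB", 3: "STA", 5: "LDA"}


def decode(word: int) -> tuple[str, int]:
    """Interpret a memory word as an instruction: (mnemonic, operand address)."""
    return OPCODES[word // 100], word % 100


# One list models the single, uniform address space: instructions and
# data are just numbers stored side by side.
memory = [0] * 100
memory[0] = 164   # read as an instruction, this is ADD 64
memory[64] = 7    # read as data, this is simply the value 7

print(decode(memory[0]))   # ('ADD', 64)

# Because code is data, the program can do ordinary arithmetic on its own
# instruction word, changing which address the instruction refers to.
memory[0] += 1
print(decode(memory[0]))   # ('ADD', 65)
```

A Harvard machine keeps instruction memory separate, which is exactly why this kind of self-modification becomes awkward there.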

The Von Neumann Bottleneck

The price of this elegance is the Von Neumann bottleneck. Because instructions and data share a single bus, the CPU cannot fetch an instruction and transfer an operand in the same bus cycle: the two requests must take turns. As processors became far faster than memory, the bus increasingly became the limiting factor, and the CPU sits idle, starved of work, while it waits for the next transfer to complete. The widening gap between processor speed and memory speed that causes this waiting is often called the memory wall.

Exam Tip: When describing the Von Neumann bottleneck, state clearly that the limitation arises because instructions and data share a single bus — not merely because they share the same memory. Many candidates write only "they share memory" and lose the mark. The bus is the contended resource.

Caching (covered in the primary-storage lesson) is the main mitigation: a small, fast cache near the CPU services most requests, so the slow main-memory bus is used less often.
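A simple weighted average gives a rough estimate of what caching buys. The sketch below assumes a single cache level, hypothetical timings (1 ns for a cache hit, 100 ns for main memory) and a 95% hit rate. Real memory hierarchies have several levels and extra overheads on a miss, so treat the figures as illustrative only.

```python
def average_access_time(hit_rate: float, cache_ns: float, memory_ns: float) -> float:
    """Weighted average time per memory access in a simplified one-level model.

    Hits are served by the cache in cache_ns; misses go over the main-memory
    bus and cost memory_ns. Real hierarchies have several levels and extra
    miss overheads, which this sketch ignores.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


def bus_uses(accesses: int, hit_rate: float) -> int:
    """Approximate number of accesses that still need the main-memory bus."""
    return round(accesses * (1.0 - hit_rate))


# Hypothetical figures: 1 ns cache, 100 ns main memory, 95% hit rate.
print(f"{average_access_time(0.95, 1.0, 100.0):.2f} ns")   # 5.95 ns
print(bus_uses(1000, 0.95))                                  # 50
```

With these assumed figures, only about 50 of every 1000 accesses still use the main-memory bus. The average access time falls from 100 ns (no cache) to under 6 ns.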

The Harvard Architecture

The Harvard architecture takes the opposite design decision: it uses physically separate memories and buses for instructions and data. The name comes from the Harvard Mark I relay computer, which kept its program on punched tape separate from its data store.

Key Components

Component	Role
Processor (CPU)	Fetches, decodes and executes instructions
Instruction Memory	Dedicated memory for storing program instructions
Data Memory	Dedicated memory for storing data
Instruction Bus	Dedicated bus between CPU and instruction memory
Data Bus	Dedicated bus between CPU and data memory

Advantages of Harvard over Von Neumann

No bottleneck — the CPU can fetch the next instruction while simultaneously reading or writing data, because the buses are independent. Two transfers happen in parallel, not in sequence.
Different bus widths — instruction and data buses can be sized independently (for example 14-bit instructions but 8-bit data on a small microcontroller), saving silicon where wide buses are unnecessary.
Improved security and reliability — if instruction memory is read-only, a running program cannot corrupt its own code, deliberately or by accident. This determinism is valued in embedded control.
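Counting bus cycles illustrates the first advantage. This toy model rests on three assumptions, made purely for illustration: every memory transfer takes exactly one bus cycle, nothing else slows the CPU, and each instruction needs one fetch plus at most one data access. The mnemonics are LMC-style labels, not a real instruction set.

```python
# Each entry is (instruction, needs_data_access). The mnemonics are
# illustrative labels only.
program = [
    ("LDA 60", True),
    ("ADD 61", True),
    ("STA 62", True),
    ("BRA 0", False),   # a branch needs no data access
]


def von_neumann_cycles(prog: list[tuple[str, bool]]) -> int:
    """One shared bus: the fetch and any data access must take turns."""
    return sum(2 if needs_data else 1 for _, needs_data in prog)


def harvard_cycles(prog: list[tuple[str, bool]]) -> int:
    """Separate buses: the fetch and the data access share the same cycle."""
    return len(prog)


print(von_neumann_cycles(program))  # 7
print(harvard_cycles(program))      # 4
```

The gap grows with the proportion of instructions that touch data, which is why data-heavy workloads such as DSP code benefit most from separate buses.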

Disadvantages of Harvard

More complex and expensive — two separate bus systems and memory modules increase cost and silicon area.
Wasted space — if instruction memory is full but data memory has spare capacity (or vice versa), the unused space cannot be shared between them.
Less flexible — code cannot easily be treated as data, so self-modifying programs, just-in-time compilers and conventional general-purpose operating systems are awkward or impossible.

Where Harvard Is Used

Pure Harvard architecture is rare in desktops but common where its determinism pays off:

Digital Signal Processors (DSPs) — audio and video processing demand continuous data streaming alongside instruction execution, so parallel buses are ideal.
Microcontrollers — many embedded controllers such as PIC and AVR (the family inside classic Arduino boards) use Harvard with separate Flash (instructions) and SRAM (data).
Embedded real-time systems — predictable, deterministic instruction timing matters more than flexibility.

Modified Harvard Architecture

Most modern desktop, laptop and smartphone CPUs use a modified Harvard architecture — a pragmatic hybrid. At the main-memory level, instructions and data share a single RAM (Von Neumann style, for flexibility). But inside the CPU, the first level of cache is split into a separate instruction cache (L1-I) and data cache (L1-D), Harvard style, so the core can fetch an instruction and access data in the same cycle without contention.

flowchart TB
    subgraph CPU
      L1I["L1-I Cache<br/>(Instructions)"]
      L1D["L1-D Cache<br/>(Data)"]
      L2["L2 Cache<br/>(unified)"]
      L1I --> L2
      L1D --> L2
    end
    L2 --> RAM["Unified Main Memory<br/>(Von Neumann-style)"]

This gives the speed of Harvard (parallel instruction and data access at the cache level, where it matters most for throughput) together with the flexibility of Von Neumann (one main memory that programs, operating systems and compilers can manage freely). It is the best-of-both-worlds compromise that dominates general-purpose computing today.

Comparison Table

Feature	Von Neumann	Harvard	Modified Harvard
Memory	Shared for instructions and data	Separate memories	Shared main memory, split caches
Buses	Single shared bus	Separate instruction and data buses	Shared main bus, split at cache level
Bottleneck	Yes	No	Reduced (mitigated by split caches)
Cost / complexity	Lower	Higher	Moderate
Self-modifying code	Possible	Difficult	Possible (via unified main memory)
Typical use	General-purpose PCs (historically)	DSPs, microcontrollers	Modern desktop/laptop/phone CPUs

Key CPU Registers

A register is a tiny, extremely fast storage location built directly into the CPU. Registers are orders of magnitude faster than main memory because they sit on the processor die and can typically be read or written within a single clock cycle. The processor contains several special-purpose registers that each play a specific role during instruction processing.

Register	Full Name	Purpose
PC	Program Counter	Holds the address of the next instruction to be fetched. Incremented during each fetch; overwritten by jumps and branches
MAR	Memory Address Register	Holds the address of the memory location about to be read from or written to. Connected to the address bus
MDR	Memory Data Register	Holds the data/instruction just read from memory, or the data about to be written. Connected to the data bus. (Sometimes called the MBR — Memory Buffer Register)
CIR	Current Instruction Register	Holds the instruction currently being decoded and executed
ACC	Accumulator	A general-purpose register holding the working value and the result of ALU operations
Status Register	(Flags / PSW)	Holds individual condition flags set by the ALU — e.g. Zero (Z), Negative (N), Carry (C) and Overflow (V) — used by conditional branch instructions

The status register (also called the flags register or program status word) deserves special attention because students often forget it. After an arithmetic or logic operation, the ALU sets individual bits according to the result: the Zero flag is set if the result was 0, the Negative flag if the result was negative, the Carry flag if an unsigned operation produced a carry out of the most significant bit (unsigned overflow), and the Overflow flag if a signed operation overflowed. A conditional branch such as "jump if equal" works by testing a flag. For instance, a compare instruction subtracts the two values, discarding the result but keeping the flags, and a following branch inspects the Zero flag. In a flag-based processor like the one described here, every decision in machine code ultimately rests on the status register.
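The flag rules above can be sketched for an n-bit addition. This Python function is an illustration, not any particular processor's ALU. It assumes two's-complement signed values and an 8-bit word by default, and it computes Z, N, C and V from the result.

```python
def add_with_flags(a: int, b: int, bits: int = 8) -> tuple[int, dict[str, int]]:
    """Add two values as an n-bit ALU would, returning (result, flags)."""
    mask = (1 << bits) - 1          # e.g. 0xFF for 8 bits
    sign = 1 << (bits - 1)          # most significant bit, e.g. 0x80
    a &= mask                       # negative inputs become two's complement
    b &= mask
    full = a + b
    result = full & mask            # keep only the bits that fit in the word
    flags = {
        "Z": int(result == 0),
        "N": int(bool(result & sign)),
        "C": int(full > mask),      # carry out of the most significant bit
        # Signed overflow: both inputs have the same sign, the result differs.
        "V": int(bool(~(a ^ b) & (a ^ result) & sign)),
    }
    return result, flags


print(add_with_flags(0x7F, 0x01))  # (128, {'Z': 0, 'N': 1, 'C': 0, 'V': 1})
print(add_with_flags(0xFF, 0x01))  # (0, {'Z': 1, 'N': 0, 'C': 1, 'V': 0})
# Comparing 5 with 5 by adding the two's complement of the second value:
print(add_with_flags(5, -5)[1]["Z"])  # 1
```

The last line models a compare. Adding the two's complement of 5 to 5 gives a zero result, so Z is set, and Z is what a "jump if equal" tests. Real instruction sets differ in how they set C after a subtraction, so only Z is relied on here.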

The CPU Datapath

The registers do not work in isolation. The diagram below shows the datapath: how registers, the Arithmetic Logic Unit (ALU), the Control Unit (CU) and the buses are wired together. The CU decodes instructions and emits the control signals that gate data through the datapath, sending those aimed at memory and I/O (such as memory read and memory write) along the control bus. The ALU performs arithmetic and logic and updates the status register. The registers stage addresses and data on and off the buses.

flowchart TB
    subgraph CPU
      CU["Control Unit<br/>(decodes, emits control signals)"]
      PC["PC<br/>(next instruction addr)"]
      CIR["CIR<br/>(current instruction)"]
      MAR["MAR"]
      MDR["MDR"]
      ACC["ACC"]
      ALU["ALU<br/>(arithmetic & logic)"]
      SR["Status Register<br/>(Z N C V flags)"]
      PC --> MAR
      MDR --> CIR
      CIR --> CU
      MDR --> ALU
      ACC --> ALU
      ALU --> ACC
      ALU --> SR
      CU -. control signals .-> ALU
      CU -. control signals .-> PC
    end
    MAR -- "Address Bus (unidirectional)" --> MEM["Main Memory / I/O"]
    MEM -- "Data Bus (bidirectional)" --> MDR
    MDR -- "Data Bus (write)" --> MEM
    CU -- "Control Bus (bidirectional)" --> MEM

The MAR drives the address bus (unidirectional — CPU to memory).
The MDR connects to the data bus (bidirectional — data flows both ways).
The CU sends and receives signals along the control bus (e.g. memory read, memory write, interrupt request).

The System Buses

A bus is a set of parallel conducting wires that carries data, addresses or control signals between components. The system bus is conventionally divided into three: the address bus, the data bus and the control bus.

Address Bus

Property	Detail
Direction	Unidirectional — from CPU to memory/I/O. The CPU specifies where; it never receives an address back
Width	Determines the maximum addressable memory. An n-bit address bus can address $2^n$ distinct locations
Purpose	Carries the address of the memory location or I/O port the CPU wants to access

Data Bus

Property	Detail
Direction	Bidirectional — data travels both from CPU to memory and from memory to CPU
Width	Determines how many bits can be transferred in a single operation. Common widths: 8, 16, 32, 64 bits. Often matches the word length of the processor
Purpose	Carries the actual data or instructions being transferred

Control Bus

Property	Detail
Direction	Bidirectional — some signals go from CPU to devices (e.g. memory write), others from devices to CPU (e.g. interrupt request)
Key signals	Memory Read, Memory Write, I/O Read, I/O Write, Interrupt Request, Bus Request, Bus Grant, Clock
Purpose	Coordinates and synchronises the activities of every component on the bus

Bus Width, Word Length and Performance

Three related factors shape a processor's memory capacity and performance:

Address bus width sets the size of the addressable memory space. The relationship is $\text{locations} = 2^n$ for an n-bit address bus. Widening it lets the system use more RAM, but does nothing for transfer speed.
Data bus width sets how many bits move per transfer. A 64-bit data bus shifts twice as much per cycle as a 32-bit bus, raising throughput.
Word length is the natural unit of data the CPU processes at once (e.g. a 64-bit processor). A wider word means more data per instruction and a larger range of values handled in one operation, but it also widens registers and the data bus, increasing cost and power.
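The range of values a single word can hold follows directly from its width. The sketch below assumes two's complement for signed values, the usual choice and the one met again in module 1.4, and tabulates the ranges for common word lengths.

```python
def word_ranges(word_bits: int) -> dict[str, tuple[int, int]]:
    """Value ranges for one word, assuming two's complement for signed values."""
    return {
        "unsigned": (0, 2 ** word_bits - 1),
        "signed": (-(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1),
    }


for bits in (8, 16, 32, 64):
    ranges = word_ranges(bits)
    print(bits, ranges["unsigned"], ranges["signed"])
```

Each doubling of word length roughly squares the number of representable values. That is the "larger range of values handled in one operation" described above.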

Exam Tip: A frequent question asks you to contrast the effect of widening the address bus versus the data bus. Widening the address bus increases the amount of memory that can be addressed; widening the data bus increases the amount of data transferred per operation (throughput). State both effects precisely and do not conflate them — they do different jobs.

Worked Example

Question: A computer has a 24-bit address bus and a 16-bit data bus. Calculate (a) the maximum number of addressable locations and the memory size if each location stores one byte, and (b) the amount of data transferred in one bus operation.

Answer:

(a) Number of locations is given by the address bus width:

$\text{locations} = 2^{24} = 16{,}777{,}216$

If each location stores 1 byte, total addressable memory is

$16{,}777{,}216 \text{ bytes} = 16 \text{ MiB.}$

(b) Data per operation equals the data bus width:

$16 \text{ bits} = 2 \text{ bytes per transfer.}$

So this machine can address 16 MiB of byte-addressable memory and move 2 bytes per bus cycle. To move a 32-bit value it would need two transfers — a concrete illustration of how a narrow data bus throttles throughput.
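The arithmetic in this worked example is easy to check in code. The helper names below are illustrative. `transfers_needed` uses ceiling division because a value that does not completely fill its final transfer still needs a whole bus operation.

```python
MIB = 2 ** 20  # bytes in one mebibyte


def addressable_locations(address_bits: int) -> int:
    """An n-bit address bus can select 2**n distinct locations."""
    return 2 ** address_bits


def transfers_needed(value_bits: int, data_bus_bits: int) -> int:
    """Bus operations needed to move one value (ceiling division)."""
    return -(-value_bits // data_bus_bits)


address_bits, data_bus_bits = 24, 16
locations = addressable_locations(address_bits)
print(locations)                             # 16777216
print(locations // MIB)                      # 16 (MiB at one byte per location)
print(data_bus_bits // 8)                    # 2 (bytes per transfer)
print(transfers_needed(32, data_bus_bits))   # 2 (a 32-bit value takes two transfers)
```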

A Second Worked Example — Designing a Bus

Question: A systems designer wants a processor that can address at least 4 GiB of byte-addressable memory and transfer a 64-bit value in a single operation. What is the minimum width of (a) the address bus and (b) the data bus?

Answer. For (a), 4 GiB is $4 \times 2^{30} = 2^{32}$ bytes, so the address bus must be able to select $2^{32}$ locations. Since an n-bit bus addresses $2^n$ locations, we need

$2^n \geq 2^{32} \implies n \geq 32.$

A 32-bit address bus is therefore the minimum. (This is why 32-bit systems were typically limited to 4 GiB of addressable RAM unless an extension such as PAE was used, and one reason 64-bit processors were needed to go comfortably beyond it.) For (b), to move a 64-bit value in one transfer the data bus must be 64 bits wide. Notice how the two requirements are independent: the address-bus width is fixed by the amount of memory, the data-bus width by the size of each transfer. A designer can trade these off separately, a point worth making explicitly in an exam.
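Part (a) generalises to any memory size. The minimum address-bus width is the number of bits in the highest address. This Python sketch (the function name is hypothetical) uses the built-in `int.bit_length()` to find it. Part (b) needs no calculation, since the data bus simply has to be at least as wide as the value being moved.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def min_address_bits(locations: int) -> int:
    """Smallest n such that 2**n >= locations (addresses 0 .. locations-1)."""
    if locations < 1:
        raise ValueError("need at least one location")
    # The highest address is locations - 1; its bit length is the width needed.
    return (locations - 1).bit_length()


print(min_address_bits(4 * GIB))      # 32
print(min_address_bits(4 * GIB + 1))  # 33: one extra byte forces another line
print(min_address_bits(16 * GIB))     # 34
```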

How the Registers Cooperate — A Narrative

It helps to see the registers not as an isolated list but as a relay team. The program counter is the bookmark that never loses its place: it always names where the next instruction lives, and at the end of every fetch it nudges itself forward so the machine marches sequentially through memory unless a branch redirects it. When the processor is ready to fetch, it cannot send the program counter's value to memory directly — memory listens only to the memory address register, so the address is first copied there. The memory address register is the only register wired to the address bus, which is why every memory access, whether for an instruction or for data, must funnel through it.

Once the address is on the address bus and the control unit has raised the read line on the control bus, memory responds by placing the requested word on the data bus. That word lands in the memory data register, the sole register connected to the (bidirectional) data bus. The memory data register is genuinely two-way: on a read it receives from memory, and on a write it holds the value the processor is about to store. Because this one register, on one data bus, carries every transfer (instructions and data alike), instruction fetches and operand reads or writes must queue behind one another. This is the Von Neumann bottleneck seen at the level of a single register.

If the word just fetched is an instruction, it is copied onward into the current instruction register, where it sits, stable, for the entire decode and execute phases while the control unit picks it apart into opcode and operand. If instead the word is data destined for a calculation, it is routed to the ALU. The accumulator is the ALU's scratchpad: it both supplies one operand and catches the result, which is why a sequence of additions accumulates in it (hence the name). Finally, as the ALU finishes, it does not silently discard information about the result — it sets the status register flags, so that a later conditional branch can ask "was the last result zero?" or "did it overflow?" and steer the program counter accordingly. Seen this way, the registers form a pipeline of custody: address out, data in, instruction held, operands combined, condition recorded — and the whole loop begins again.

A Worked Register-Transfer Trace

Prose explains the idea; register-transfer notation (RTN) pins down the mechanism step by step. RTN uses the arrow ← to mean "the value on the right-hand side is copied into the left-hand side". [X] means "the contents of X", where X may be a register such as PC or a memory location. Bracketing contents and naming registers explicitly turns the relay-team story into something you can verify line by line, which is exactly what an examiner rewards.

Suppose location 30 holds the single instruction ADD 64 ("add the contents of address 64 to the accumulator"), location 64 holds the value 7, and the accumulator already holds 5. The program counter currently holds 30. The table below traces every register on every micro-step, so you can see precisely what each register holds and why at each moment.

Step	Micro-operation (RTN)	PC	MAR	MDR	CIR	ACC	What is happening
Start	—	30	–	–	–	5	PC points at the instruction; ACC holds a partial sum
F1	`MAR ← [PC]`	30	30	–	–	5	Address of next instruction copied to MAR (drives address bus)
F2	`PC ← [PC] + 1`	31	30	–	–	5	PC incremented so it already names the following instruction
F3	`MDR ← [memory at MAR]`	31	30	ADD 64	–	5	Memory places the instruction on the data bus into MDR
F4	`CIR ← [MDR]`	31	30	ADD 64	ADD 64	5	Instruction copied to CIR, ready for decoding
D1	decode `CIR`	31	30	ADD 64	ADD 64	5	CU splits CIR into opcode `ADD` and operand `64`
E1	`MAR ← operand (64)`	31	64	ADD 64	ADD 64	5	Operand address loaded to MAR — note MAR is reused for data
E2	`MDR ← [memory at MAR]`	31	64	7	ADD 64	5	The data value 7 is fetched into MDR
E3	`ACC ← [ACC] + [MDR]`	31	64	7	ADD 64	12	ALU adds; result lands in ACC; status flags updated

Three subtleties become visible only in the trace. First, the PC is incremented during fetch (step F2), not after execute. So even a jump instruction has already moved the PC past itself before the jump overwrites it. Second, the MAR is used twice in a single instruction: once to fetch the instruction (F1) and again to fetch the operand (E1). It is a shared funnel, not a dedicated instruction-address register. Third, the MDR likewise carries first an instruction and then a data value within the same instruction cycle. On this single shared data bus those two transfers cannot overlap: the Von Neumann bottleneck made concrete in two lines of RTN.
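The trace can be replayed mechanically. This minimal simulator is a sketch under several simplifying assumptions:

- memory is a Python dict;
- an instruction is stored as an (opcode, operand) tuple rather than a bit pattern;
- only ADD is executed;
- the status flags are not modelled.

Each micro-operation updates one register and records a snapshot, reproducing the rows of the table from F1 to E3.

```python
def run_add_trace(memory: dict, pc: int, acc: int) -> list[dict]:
    """Fetch, decode and execute one ADD instruction, logging each micro-step."""
    regs = {"PC": pc, "MAR": None, "MDR": None, "CIR": None, "ACC": acc}
    log = []

    def record(step: str, rtn: str) -> None:
        # Snapshot every register after this micro-operation.
        log.append({"step": step, "rtn": rtn, **regs})

    # Fetch
    regs["MAR"] = regs["PC"]
    record("F1", "MAR ← [PC]")
    regs["PC"] = regs["PC"] + 1
    record("F2", "PC ← [PC] + 1")
    regs["MDR"] = memory[regs["MAR"]]
    record("F3", "MDR ← [memory at MAR]")
    regs["CIR"] = regs["MDR"]
    record("F4", "CIR ← [MDR]")

    # Decode: split the instruction into opcode and operand
    opcode, operand = regs["CIR"]
    record("D1", "decode CIR")
    if opcode != "ADD":
        raise NotImplementedError("this sketch only executes ADD")

    # Execute: the MAR and MDR are reused, this time for data
    regs["MAR"] = operand
    record("E1", "MAR ← operand")
    regs["MDR"] = memory[regs["MAR"]]
    record("E2", "MDR ← [memory at MAR]")
    regs["ACC"] = regs["ACC"] + regs["MDR"]
    record("E3", "ACC ← [ACC] + [MDR]")
    return log


memory = {30: ("ADD", 64), 64: 7}
for row in run_add_trace(memory, pc=30, acc=5):
    print(row["step"], row["rtn"], row["PC"], row["MAR"], row["MDR"], row["CIR"], row["ACC"])
```

Printing the log shows the PC reaching 31 at F2, well before execute, and the MAR switching from 30 to 64 at E1. Those are the first two subtleties listed above.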

Exam Tip: When asked for RTN, write MAR ← [PC] followed by PC ← [PC] + 1 as part of the fetch phase, as in steps F1 and F2 of the trace. Candidates commonly drop marks on a register-transfer question in two ways: leaving the increment until after execute, or omitting the bracket notation for register and memory contents.

Contrast this with a STORE instruction such as STA 80 ("store the accumulator into address 80"), which reverses the direction of the data bus during execute. The fetch phase is identical: MAR ← [PC], PC ← [PC] + 1, MDR ← [memory at MAR], CIR ← [MDR]. The execute phase then runs in three steps. First the operand 80 is loaded into the MAR (MAR ← operand). Next the value to be stored is copied from the accumulator into the MDR (MDR ← [ACC]). Finally memory is written from the MDR (memory at MAR ← [MDR]), while the control unit asserts MEMORY WRITE instead of MEMORY READ. Notice the symmetry: a LOAD or ADD reads memory into the MDR, whereas a STORE writes memory from the MDR. The MDR is the same register in both cases. Only the direction of travel on the data bus changes, governed by which control line the control unit raises. This is exactly why the data bus is described as bidirectional while the address bus is unidirectional: addresses only ever leave the CPU, but data must flow both ways depending on the opcode.
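The STORE/LOAD symmetry can be sketched as a single memory-access function whose direction is chosen by the control signal. The names `memory_access`, `execute_sta` and `execute_lda` are hypothetical, and only the execute phase is modelled. The fetch phase would be the same as in the ADD trace.

```python
def memory_access(memory: dict, mar: int, mdr, control: str):
    """One access over the buses; returns what the MDR holds afterwards."""
    if control == "MEMORY READ":
        return memory[mar]        # data bus carries memory -> MDR
    if control == "MEMORY WRITE":
        memory[mar] = mdr         # data bus carries MDR -> memory
        return mdr
    raise ValueError(f"unknown control signal: {control}")


def execute_sta(memory: dict, operand: int, acc: int) -> None:
    """Execute phase of STA: store the accumulator at the operand address."""
    mar = operand                                    # MAR ← operand
    mdr = acc                                        # MDR ← [ACC]
    memory_access(memory, mar, mdr, "MEMORY WRITE")  # memory at MAR ← [MDR]


def execute_lda(memory: dict, operand: int) -> int:
    """Execute phase of a load: the same MDR, with the data bus reversed."""
    mar = operand                                          # MAR ← operand
    mdr = memory_access(memory, mar, None, "MEMORY READ")  # MDR ← [memory at MAR]
    return mdr                                             # ACC ← [MDR]


memory = {80: 0}
execute_sta(memory, 80, acc=12)
print(memory[80])               # 12
print(execute_lda(memory, 80))  # 12
```

The address always flows one way (into `memory_access` as `mar`). The data flows in whichever direction the control string selects, mirroring the unidirectional address bus and the bidirectional data bus.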

Synoptic Links

Fetch-decode-execute (1.1.1): every register introduced here has a precise job in the FDE cycle. The PC supplies the next address, the MAR and MDR shuttle it across the buses, the CIR holds it for decoding, and the ACC and status register record results. The next lesson develops the complete cycle in full.
Processor performance & pipelining (1.1.1): the Von Neumann bottleneck is one of the central performance limits that caches, wider buses and (in later lessons) pipelining are designed to alleviate.
Data representation (1.4): address bus width links directly to $2^n$ addressing, and the status-register Carry/Overflow flags are the binary-arithmetic overflow conditions met in two's-complement work.
Assembly / LMC (1.2): the PC, ACC and an instruction register are exactly the components a Little Man Computer or assembly-language model exposes — this architecture is what your assembly programs run on.

Common Errors & A-Level Misconceptions

"The bottleneck is because they share memory." The contended resource is the single bus, not the shared memory per se. Always name the bus.
Confusing the MAR and MDR. The MAR holds an address (it talks to the address bus); the MDR holds the data or instruction (it talks to the data bus). A quick memory aid: the A in MAR pairs with the Address bus, the D in MDR with the Data bus.
Thinking the PC holds the current instruction. It holds the address of the next instruction. The instruction itself lives in the CIR.
Forgetting the status register entirely. It is examinable and underpins all conditional branching — list it among the registers.
Claiming modern PCs are "pure Harvard". They are modified Harvard: unified main memory, split L1 caches. Pure Harvard is for DSPs and microcontrollers.
Saying a wider address bus makes the computer faster. It increases addressable capacity, not transfer speed. Transfer speed comes from the data bus width and the clock speed.

Specimen Question

A microcontroller used in a real-time motor controller uses a Harvard architecture. A general-purpose laptop uses a modified Harvard architecture.

(a) State the purpose of the Memory Address Register (MAR) and the Memory Data Register (MDR). [2]

(b) Explain one advantage of the Harvard architecture for the real-time motor controller. [2]

(c) Discuss why a general-purpose laptop uses a modified Harvard architecture rather than a pure Von Neumann or pure Harvard design. [6]

AO Breakdown

(a) AO1 [2] — knowledge of register roles.
(b) AO2 [2] — application of Harvard's properties to the embedded scenario.
(c) AO3 [6] — analyse and evaluate the trade-offs, reaching a justified conclusion.

Model Answers

(a) Mid-band. The MAR holds the address of a memory location. The MDR holds the data going to or from memory.

(a) Stronger. The MAR holds the address of the memory location the CPU wants to access, and the MDR holds the data or instruction transferred to or from that location. The MAR feeds the address bus and the MDR connects to the data bus, so together they form the interface through which every memory access passes.

(a) Top-band. The MAR holds the address of the memory location the CPU is about to read from or write to, and is connected to the address bus. The MDR holds the data or instruction just read from memory, or the value about to be written, and is connected to the (bidirectional) data bus.

(b) Mid-band. In a Harvard machine the instruction and data buses are separate, so the controller can fetch instructions and read data at the same time, which makes it faster.

(b) Stronger. Because the instruction and data memories have separate buses, the controller can fetch its next instruction while simultaneously reading a sensor value, instead of the two accesses taking turns on one bus. This parallel access avoids the Von Neumann bottleneck, so each instruction takes a fixed, predictable time.

(b) Top-band. In a Harvard machine the instruction and data buses are physically separate, so the controller can fetch its next instruction at the same time as reading a sensor value from data memory. This parallelism removes the Von Neumann bottleneck and gives predictable, deterministic timing — essential for a real-time motor controller that must respond within a fixed deadline.

(c) Mid-band. Von Neumann uses one memory and one bus, which is cheap and flexible but suffers the bottleneck. Harvard uses separate memories and buses, which is faster but more expensive and wastes space. Modified Harvard combines them by using one main memory but splitting the cache, so a laptop gets the speed of Harvard and the flexibility of Von Neumann.

(c) Stronger. A pure Von Neumann laptop would be cheap and easy to program — one memory means the operating system can freely allocate space between code and data — but the shared bus forces instructions and data to contend, stalling the fast CPU. A pure Harvard laptop would remove that contention with separate buses, but the duplicated memory wastes capacity (spare data memory cannot be lent to code) and makes self-modifying code and just-in-time compilation, which real software relies on, difficult. Modified Harvard takes the useful part of each: a single, flexible main memory keeps the Von Neumann programming model, while splitting the L1 cache into instruction and data halves restores Harvard-style parallel access at the level where most accesses actually occur. The laptop therefore gets most of Harvard's speed without losing Von Neumann's flexibility, which is why it is the chosen design.

(c) Top-band. A pure Von Neumann design is simple and flexible — one memory, one bus, easy to manage with an operating system and compilers — but suffers the Von Neumann bottleneck: instructions and data contend for a single bus, so the fast CPU stalls waiting for memory. A pure Harvard design removes that contention with separate instruction and data memories, but it is more expensive (duplicated buses and memory), wastes capacity because spare data memory cannot be lent to instructions, and makes self-modifying code, JIT compilation and conventional operating systems awkward — all of which a general-purpose laptop relies on. The modified Harvard compromise resolves the tension: it keeps a single, flexible main memory (Von Neumann) so the OS and applications can manage code and data freely, while splitting the L1 cache into separate instruction and data caches (Harvard) so the core can fetch and load in parallel where it matters most for throughput. Because the overwhelming majority of accesses hit the fast split caches, the laptop gains most of Harvard's parallelism without sacrificing Von Neumann's flexibility — which is why modified Harvard, not either pure form, dominates general-purpose machines.

Examiner-style commentary: Part (a) is pure recall but candidates frequently swap the two registers — the strongest answers anchor each to its bus. Part (b) must apply to the scenario; a generic "it is faster" would not reach top band, whereas linking parallel access to deterministic real-time deadlines does. Part (c) is an AO3 discussion: top-band responses weigh all three designs, identify the specific weakness each suffers, and conclude with a justified reason for the chosen architecture rather than merely listing pros and cons.

Going Further

Cache coherence and the unified L2: modified Harvard splits L1 but unifies L2/L3. Investigate how the hardware keeps the separate L1-I and L1-D caches consistent when self-modifying code writes, through the data cache, to an instruction that is already held in the instruction cache. This is one form of the cache coherence problem.
Address space vs physical memory: a 64-bit address bus could theoretically address 16 EiB ($2^{64}$ bytes), but real chips wire far fewer address lines because no machine has that much RAM. Look into how virtual memory (a later topic) decouples the logical address space from physical RAM.
DSP architectures: explore why specialised DSPs sometimes use multiple data buses (a "super-Harvard" arrangement) to fetch two operands and an instruction simultaneously for a multiply-accumulate — the workhorse operation of signal processing.
