A translator is system software that converts program code written by a human into a form the CPU can execute. This lesson develops the three kinds of translator — assemblers, compilers and interpreters — explains when and why each is chosen, and traces the stages of compilation in detail, finishing with the role of bytecode and the virtual machine in modern hybrid languages.
This lesson develops OCR H446 section 1.2.2 (Applications Generation: translators and the stages of compilation). It requires you to describe the operation of assemblers, compilers and interpreters; to compare them and justify the choice of one over another for a given scenario; and to describe the stages of compilation (lexical analysis, syntax analysis, semantic analysis, code generation and optimisation), explaining what each stage consumes and produces. The levels of language (low-level vs high-level, why different languages exist) belong to the Types of Programming Language lesson and are only referenced here; this lesson owns the translators and the compilation pipeline. The material connects directly to the fetch–decode–execute cycle and assembly language from the architecture topic: a translator's whole purpose is to end up with the opcodes and operands that the CPU's FDE cycle consumes.
A CPU can only execute machine code — binary instructions made of an opcode (what to do) and one or more operands (the data or address to act on). Humans, however, write in assembly language or high-level languages because machine code is unreadable and architecture-specific. A translator bridges that gap. The choice of translator is not arbitrary: it shapes how errors are reported, how fast the program runs, whether you must ship your source code, and whether the program is portable.
| Source Language | Translator | Target / Effect |
|---|---|---|
| Assembly language | Assembler | Machine code (object file) — one mnemonic to one instruction |
| High-level language | Compiler | Machine code (or bytecode) produced before the program runs |
| High-level language | Interpreter | Executes the source directly, during the run, statement by statement |
Cross-reference: the Types of Programming Language lesson explains why low-level and high-level languages coexist; here we assume those levels and concentrate on the machinery that translates them.
An assembler translates assembly language (a low-level language with one mnemonic per machine instruction) into machine code. Because assembly maps almost directly onto the instruction set, the translation is essentially one-to-one: each mnemonic such as ADD, LDR or STR becomes exactly one machine-code instruction.
| Feature | Detail |
|---|---|
| Input | Assembly language source file |
| Output | Machine code (object file) |
| Translation | One-to-one — each assembly mnemonic maps to exactly one machine-code instruction |
| Passes | Most assemblers are two-pass so that forward references (labels used before they are defined) can be resolved |
Consider a branch that jumps forward to a label the assembler has not yet reached:
            BNE skip          ; jump to "skip", but where is it?
            ADD R1, R1, #1
    skip:   STR R1, total     ; this label's address is only known here
On a single pass the assembler would not yet know the address of skip when it encounters BNE skip. The solution is to scan the source twice.
| Pass | Action |
|---|---|
| Pass 1 | Scan the source. Record every label and the memory address it will occupy in a symbol table. Calculate instruction sizes so addresses are correct. Flag obvious syntax errors |
| Pass 2 | Translate each mnemonic into its machine-code opcode. Replace every label reference with the address looked up from the symbol table. Emit the final object code |
The symbol table built here is the same idea you will meet again in compilation: a structure that maps a name (label, variable, function) to information about it (here, an address).
Take this short routine that counts down from 3, assuming each instruction occupies one address starting at address 0:
            LDR R0, three     ; addr 0
    loop:   SUB R0, R0, #1    ; addr 1
            CMP R0, #0        ; addr 2
            BNE loop          ; addr 3 -> needs address of "loop"
            STR R0, result    ; addr 4
    three:  DAT 3             ; addr 5 (data)
    result: DAT 0             ; addr 6 (data)
On pass 1 the assembler walks the lines, tracks the current address, and fills the symbol table the moment it meets each label:
| Symbol | Address |
|---|---|
| loop | 1 |
| three | 5 |
| result | 6 |
On pass 2 it can now resolve every reference: BNE loop becomes "branch-if-not-equal to address 1", LDR R0, three reads address 5, and STR R0, result writes address 6. The backward branch BNE loop at address 3 could have been encoded in a single pass, because loop had already been defined at address 1. The forward references are the real problem: at address 0 the assembler has not yet seen three, and at address 4 it has not yet seen result. The two-pass design handles backward and forward references uniformly.
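The two passes over the countdown routine can be sketched in Python. This is a minimal illustration, not a real assembler: it keeps the lesson's assumption that each line occupies one address starting at 0, the helper names (`split_line`, `pass_one`, `pass_two`) are invented for the sketch, and it stops short of emitting real opcodes because the encoding depends on the instruction set.

```python
# Toy two-pass assembler for the countdown routine.
# Assumption: every source line occupies exactly one address, starting at 0.
SOURCE = """
        LDR R0, three
loop:   SUB R0, R0, #1
        CMP R0, #0
        BNE loop
        STR R0, result
three:  DAT 3
result: DAT 0
"""


def split_line(line):
    """Return (label or None, instruction text) for one source line."""
    if ":" in line:
        label, rest = line.split(":", 1)
        return label.strip(), rest.strip()
    return None, line.strip()


def pass_one(lines):
    """Pass 1: record the address of every label in a symbol table."""
    symbols = {}
    for address, line in enumerate(lines):
        label, _ = split_line(line)
        if label is not None:
            symbols[label] = address
    return symbols


def pass_two(lines, symbols):
    """Pass 2: replace each label reference with its address."""
    resolved = []
    for line in lines:
        _, instruction = split_line(line)
        mnemonic, _, operand_text = instruction.partition(" ")
        operands = [op.strip() for op in operand_text.split(",")]
        # Operands that are labels become addresses; registers and
        # literals are not in the symbol table and pass through unchanged.
        operands = [str(symbols.get(op, op)) for op in operands]
        resolved.append((mnemonic, operands))
    return resolved


lines = [ln for ln in SOURCE.splitlines() if ln.strip()]
symbols = pass_one(lines)
program = pass_two(lines, symbols)

print(symbols)
for address, (mnemonic, operands) in enumerate(program):
    print(address, mnemonic, ", ".join(operands))
```

The symbol table printed first matches the table above, and the line for address 0 shows the forward reference to three resolved to 5, which is only possible because pass 1 has already finished.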
A compiler translates the entire high-level source program into machine code (or an intermediate form) before the program runs. The output is typically a standalone executable file that can be distributed and run without the compiler or the original source.
| Feature | Detail |
|---|---|
| Input | High-level source code (e.g. a .c, .java or .cpp file) |
| Output | Machine-code executable (e.g. .exe) or bytecode |
| When translation happens | Once, before execution — the whole program is analysed and translated together |
| Error reporting | Produces a list of all errors found across the program; a single syntax error is usually enough to make the build fail |
| Execution speed | Fast at run time — the CPU executes native instructions with no translation overhead |
| Distribution | The executable can be shipped without the source code, protecting intellectual property |
| Portability of output | The executable is tied to one architecture/OS; you must recompile for a different target |
Because the compiler sees the whole program, it can perform global optimisations (see the optimisation stage below) that an interpreter, working a line at a time, cannot.
An interpreter translates and executes high-level source code one statement at a time, with no separate executable produced. It is the engine behind the interactive "REPL" you see in Python.
| Feature | Detail |
|---|---|
| Input | High-level source code |
| Output | No separate file — statements are executed as they are read |
| When translation happens | Continuously, during execution; each statement is processed as the run reaches it |
| Error reporting | Halts at the first run-time error encountered, reporting it immediately — convenient for development |
| Execution speed | Slower — a statement inside a loop is re-translated on every iteration |
| Distribution | The source (or an equivalent) must be shipped, and the target machine must have a compatible interpreter |
| Portability | The same source runs anywhere a compatible interpreter exists |
The re-translation cost is the crucial weakness: a loop body executed a million times is, in a naive interpreter, analysed a million times. This is exactly the problem that the hybrid bytecode model (below) is designed to soften.
Consider this fragment:
    for i in range(1000):
        x = i * i + 1
Compare what each translator does with the body x = i * i + 1:
| Translator | Translation work for the body | Run-time work |
|---|---|---|
| Compiler | Translated to machine code once, at compile time | The CPU executes the same native instructions 1000 times — no re-translation |
| Interpreter | Re-examined on every iteration — tokens recognised, structure rechecked, then executed | 1000 translate-and-execute cycles for the body alone |
| Bytecode + VM | Compiled to bytecode once; a JIT may later compile the hot loop to native code | After the first iterations the JIT can run the rest at native speed |
This is the single clearest way to articulate the speed gap in an exam: it is not that interpreting one statement is dramatically slow, it is that repeated statements pay the translation cost repeatedly, whereas a compiler pays it once.
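The "pay once versus pay repeatedly" idea can be approximated with Python's built-in `compile()` and `eval()`. This is only a model of the naive interpreter described above: passing the source string to `eval()` makes Python parse and compile it on every call, whereas a code object produced by `compile()` is translated once and then reused. The function names are invented for the sketch, and the code object holds bytecode rather than native machine code.

```python
# Contrast re-translating the loop body on every iteration with
# translating it once and reusing the result.
BODY = "i * i + 1"


def run_reparsing(n):
    """Hand the source text to eval() each time, so the body is
    parsed and compiled again on every iteration."""
    x = None
    for i in range(n):
        x = eval(BODY, {"i": i})
    return x


def run_translated_once(n):
    """Translate the body to a code object once, then execute it n times."""
    code = compile(BODY, "<body>", "eval")
    x = None
    for i in range(n):
        x = eval(code, {"i": i})
    return x


print(run_reparsing(1000), run_translated_once(1000))
```

Both functions finish with the same value of x (998002, from i = 999); the difference is how many times the text of the body had to be analysed to get there.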
| Feature | Compiler | Interpreter |
|---|---|---|
| Unit translated | Entire program at once | One statement at a time, during execution |
| Output | Standalone executable file | None — direct execution |
| Run-time speed | Fast (already native code) | Slower (translated repeatedly at run time) |
| Error reporting | All errors listed after a build | Stops at the first error reached |
| Edit–test cycle | Slower — recompile after every change | Faster — re-run immediately |
| Source distribution | Not required: ship the executable | Required: ship the source, and the user needs a compatible interpreter |
| Memory at translate time | Must hold/analyse the whole program | Needs only the current statement plus state |
| Optimisation scope | Whole-program (global) optimisation possible | Limited — no whole-program view |
| Choose a Compiler When... | Choose an Interpreter When... |
|---|---|
| The program is finished and shipped to end users | You are actively developing and debugging |
| Run-time speed is critical (games, numerical code) | You are prototyping and want a fast edit–run loop |
| You must protect the source from users | Cross-platform portability matters more than raw speed |
| The program runs many times unchanged | The program is a short script run occasionally |
In an exam scenario question, justify your choice by naming the dominant pressure (speed, portability, IP protection, debugging convenience) rather than reciting the whole table.
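The error-reporting difference is worth seeing concretely. Suppose a program contains two faults: a missing colon on line 4 and an undeclared variable on line 9. A compiler analyses the whole program and reports both faults in one build. An interpreter halts when it reaches line 4, so the fault on line 9 only surfaces after line 4 has been fixed and the program is re-run.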
This is why compiled workflows suit a "fix the whole list, then rebuild" rhythm, while interpreted workflows suit "fix, re-run, fix, re-run" — and it is the precise distinction a 1-mark "state a difference" question is testing.
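The two reporting styles can be modelled with a small sketch. It assumes a hypothetical ten-line, straight-line program whose faults are already known and stored by line number; a real translator would discover them by analysing the code, which this sketch does not attempt.

```python
# Hypothetical faults keyed by line number, matching the two-fault scenario.
FAULTS = {4: "missing colon", 9: "undeclared variable"}
LINE_COUNT = 10  # assumed program length


def compiler_style(faults, line_count):
    """Analyse every line before running anything; report the full list."""
    return [(n, faults[n]) for n in range(1, line_count + 1) if n in faults]


def interpreter_style(faults, line_count):
    """Process lines in order and halt at the first fault reached."""
    for n in range(1, line_count + 1):
        if n in faults:
            return [(n, faults[n])]
    return []


print(compiler_style(FAULTS, LINE_COUNT))
print(interpreter_style(FAULTS, LINE_COUNT))

# After fixing line 4, a re-run under the interpreter reveals line 9.
print(interpreter_style({9: "undeclared variable"}, LINE_COUNT))
```

The first call returns both faults at once; the second returns only line 4, and line 9 appears only on the re-run.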
Three terms are easily muddled and frequently examined:
| Term | What it is |
|---|---|
| Source code | The human-written high-level (or assembly) program |
| Object code | The machine-code output of a single translation unit, before linking |
| Executable code | The final, linked, runnable machine-code program |
A compiler produces object code; a linker then combines object files and library code into the executable. (Linking is the natural follow-on, though the translator's own job ends at code generation/optimisation.)
Several mainstream languages — Java, Python and C# among them — deliberately sit between the compiler and interpreter models to get the best of both.
    flowchart LR
        SRC["Source code<br/>(e.g. .java / .py)"] --> BC["Bytecode<br/>(platform-independent)"]
        BC --> VM["Virtual Machine<br/>(JVM / Python VM)"]
        VM --> NAT["Native execution<br/>on this CPU + OS"]
This buys two things at once. Portability: the same bytecode runs on any machine that has the right VM, so you compile once and run anywhere ("write once, run anywhere"). Performance: because the heavy lexical/syntax/semantic analysis was done once at compile time, the run-time work is far lighter than re-parsing source text, and a JIT can specialise frequently executed code to native speed.
| Term | Meaning |
|---|---|
| Bytecode | Compact intermediate instructions for an abstract machine, not a real CPU |
| Virtual machine | A program that executes bytecode, providing a uniform runtime across platforms |
| JIT compilation | Translating hot bytecode into native machine code during execution to gain speed |
A handy way to remember the trade-off: a pure compiler is fast but tied to one platform; a pure interpreter is portable but slow; bytecode + VM is portable and reasonably fast, at the cost of needing the VM installed.
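To make "instructions for an abstract machine" concrete, here is a toy stack-based virtual machine running hand-written bytecode for x = i * i + 1. The instruction names (LOAD, PUSH, MUL, ADD) are invented for this sketch and do not correspond to real JVM or Python bytecode; the point is only that the same instruction list runs wherever the `run` function exists.

```python
# Toy bytecode for the expression i * i + 1.
# Each instruction is (opcode, argument); the opcode names are hypothetical.
BYTECODE = [
    ("LOAD", "i"),   # push the value of variable i
    ("LOAD", "i"),   # push it again
    ("MUL", None),   # pop two values, push their product
    ("PUSH", 1),     # push the constant 1
    ("ADD", None),   # pop two values, push their sum
]


def run(bytecode, variables):
    """Execute bytecode on a stack and return the value left on top."""
    stack = []
    for opcode, argument in bytecode:
        if opcode == "LOAD":
            stack.append(variables[argument])
        elif opcode == "PUSH":
            stack.append(argument)
        elif opcode == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack.pop()


print(run(BYTECODE, {"i": 7}))  # 7 * 7 + 1 = 50
```

Nothing in `BYTECODE` mentions a real CPU, which is what makes it portable; the cost is that `run` (the VM) must be present on every machine that executes it.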
When a compiler (or the front end of a bytecode toolchain) processes source code, it does so through a sequence of well-defined stages. The first three are analysis (understanding the program); the last two are synthesis (producing and improving the output).