Context-Free Languages and BNF

Regular languages — the territory of finite state machines and regular expressions — hit a hard wall: they cannot describe nested or recursive structure. Balanced brackets, the language $a^n b^n$ , and the syntax of every real programming language all lie beyond them. Context-free languages (CFLs) are the next, more powerful class, and they are exactly what we need to describe such structure. They are generated by context-free grammars (CFGs), which in computer science are most often written in Backus-Naur Form (BNF). The decisive new ingredient is recursion: a grammar rule that refers, directly or indirectly, to itself, giving the unbounded "memory" that finite machines lack.

Spec Mapping

This lesson sits in the Theory of Computation strand (around section 4.4.3) of AQA A-Level Computer Science (7517). You are expected to understand the limits of regular languages — that some sets of strings (matched brackets, $a^n b^n$ ) cannot be defined by a regular expression or recognised by an FSM — and to know that such languages can instead be described by Backus-Naur Form (BNF). You must be able to read and write BNF production rules, recognise and use recursion within BNF, derive example strings from a grammar, and relate BNF to syntax (railroad) diagrams. The treatment below is descriptive rather than a verbatim quotation of the specification.

Where Regular Languages Run Out

Recall the FSM/regex limitation: a finite machine has only finitely many states, so it can remember only a bounded amount of information. Two canonical languages defeat it:

Matched brackets — strings like (()) and ((())) where every opening bracket has a matching close in the right order. Checking this requires tracking the current nesting depth, which can grow without limit.
$a^n b^n$ — equal numbers of as then bs: ab, aabb, aaabbb, … . This requires counting the as to match against the bs.

Both need unbounded memory, so no regex describes them. The escape is recursion: a grammar rule that contains itself can "wrap" a structure around a smaller copy of the same structure, expanding to any depth. That single idea is what lifts us from regular to context-free.

The Chomsky Hierarchy

Formal languages sit in a strict ladder of expressive power, the Chomsky hierarchy:

Type	Language class	Recognised by	Characteristic example
3	Regular	Finite state machine	`ab`
2	Context-free	Pushdown automaton (FSM + a stack)	$a^n b^n$ , balanced brackets
1	Context-sensitive	Linear-bounded automaton	$a^n b^n c^n$
0	Recursively enumerable	Turing machine	any computable language

Each level strictly contains the one below: Regular ⊂ Context-Free ⊂ Context-Sensitive ⊂ Recursively Enumerable. The jump from Type 3 to Type 2 is precisely the jump this lesson is about, and the machine that recognises Type 2 languages — a pushdown automaton — is just a finite state machine equipped with a stack. That stack is the unbounded memory the FSM was missing, and it is no coincidence that the stack reappears when we evaluate expressions in the Reverse Polish Notation lesson. The pattern of the whole hierarchy is worth internalising: each rung is the machine of the rung below with one more capability added — a stack to get from regular to context-free, a bounded read/write tape to get to context-sensitive, and an unbounded tape to reach the full power of a Turing machine. Memory, in increasingly flexible forms, is the single dimension along which computational power grows.

Context-Free Grammars

A context-free grammar has four components:

Component	Notation	Description
Terminals	lowercase letters, digits, symbols	The actual characters that appear in the language (e.g. `a`, `b`, `+`, `(`, `)`)
Non-terminals	UPPERCASE, or names in `< >`	"Variables" standing for sub-structures, to be rewritten
Production rules	`→` or `::=`	Rules rewriting a non-terminal as a sequence of terminals and non-terminals
Start symbol	usually `S` or `<program>`	The non-terminal a derivation begins from

The word context-free means each rule rewrites a single non-terminal regardless of what surrounds it — its "context" is irrelevant. That restriction is what keeps the class tractable while still allowing recursion.

Worked grammar: $a^n b^n$

S → a S b | ε

Here S is the start symbol, a and b are terminals, and ε is the empty string. The rule says: an S is either an a, then another S, then a b (the recursive case), or nothing at all (the base case that stops the recursion). Watch a full derivation of aaabbb, rewriting S at each step:

Step	Sentential form	Rule applied
0	S	start
1	a S b	S → a S b
2	a a S b b	S → a S b (applied to inner S)
3	a a a S b b b	S → a S b
4	a a a ε b b b	S → ε (base case)
5	a a a b b b = aaabbb	collapse ε

Each recursive step adds exactly one a on the left and exactly one b on the right, so the counts can never get out of step — the grammar generates precisely { ε, ab, aabb, aaabbb, … } = $\{ a^n b^n : n \geq 0 \}$ . An FSM cannot do this: it would need to count the as, and no fixed number of states can count to an arbitrary $n$ . The recursion in the grammar is the unbounded memory.

A Second Worked Grammar: Balanced Brackets

The other classic non-regular language is the set of correctly balanced brackets — strings like (), (()), ()() and (()()) where every ( has a matching ) in the right order. A context-free grammar captures it neatly:

S → ( S ) | S S | ε

Read the three alternatives as the three ways a balanced string can be formed: a balanced string wrapped in a fresh pair of brackets (( S )); two balanced strings written one after another (S S); or the empty string (the base case). Watch a derivation of (())():

Step	Sentential form	Rule applied
0	S	start
1	S S	S → S S
2	( S ) S	left S → ( S )
3	( ( S ) ) S	inner S → ( S )
4	( ( ) ) S	inner S → ε
5	( ( ) ) ( S )	right S → ( S )
6	( ( ) ) ( )	last S → ε

Reading the leaves gives (())() — correctly balanced. The grammar can never produce an unmatched bracket because brackets are only ever introduced as a pair by the ( S ) rule. This is exactly the language a finite state machine cannot recognise, because checking balance requires tracking the current nesting depth, which is unbounded; the grammar's recursion supplies precisely that unbounded capability. The very same structure underlies the matching of begin/end, {/} and opening/closing HTML tags in real languages, which is why this little grammar is far more than a toy.

Backus-Naur Form (BNF)

BNF is the standard metalanguage for writing context-free grammars and is used to define the syntax of programming languages, network protocols and file formats.

Symbol	Meaning
`::=`	"is defined as" (the BNF form of →)
`	`
`< >`	enclose a non-terminal's name

Worked grammar: arithmetic expressions with correct precedence

<expression> ::= <term> | <expression> + <term> | <expression> - <term>
<term>       ::= <factor> | <term> * <factor> | <term> / <factor>
<factor>     ::= <number> | ( <expression> )
<number>     ::= <digit> | <number> <digit>
<digit>      ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

This single grammar achieves three things that reward close study:

Operator precedence is baked into the structure. * and / live at the <term> level, while + and - live at the higher <expression> level. Because an expression is built from terms, multiplication binds more tightly than addition — without any explicit precedence table. The grammar's shape is the precedence.
Recursion gives unbounded length. <number> ::= <digit> | <number> <digit> lets a number be any string of digits, and <expression> referring to itself lets expressions be arbitrarily long.
Nesting via brackets. <factor> ::= ( <expression> ) makes a whole parenthesised expression usable wherever a single factor can appear — so 3 + 4 * (2 - 1) is valid, with the brackets forcing 2 - 1 to be evaluated first.

Recursion: left, right and the base case

Recursion in BNF comes in two shapes, and both need a non-recursive base case or the definition never terminates:

Right recursion: the non-terminal appears at the end, as in S → a S b (also <digits> ::= <digit> | <digit> <digits>).
Left recursion: the non-terminal appears at the start, as in <expression> ::= <expression> + <term>.

Always include at least one alternative with no recursion (here <digit>, or ε) — it is the exit from the recursion. A grammar with only recursive alternatives generates nothing, because there is no way to terminate a derivation.

Parse Trees (Derivation Trees)

A parse tree displays how a string is derived: the root is the start symbol, internal nodes are non-terminals, and the leaves, read left to right, spell out the terminal string. The shape of the tree captures the structure (and hence the meaning) of the input.

Parsing "3 + 4 * 2"

flowchart TB
    E["&lt;expression&gt;"] --> EL["&lt;expression&gt; (= 3)"]
    E --> PLUS["+"]
    E --> TR["&lt;term&gt;"]
    EL --> TL["&lt;term&gt; → &lt;factor&gt; → 3"]
    TR --> TRL["&lt;term&gt; (= 4)"]
    TR --> STAR["*"]
    TR --> TRF["&lt;factor&gt; → 2"]

Because 4 * 2 is grouped together as a single <term> below the +, the tree forces 4 * 2 to be combined before the addition — exactly matching the convention that 3 + 4 * 2 = 11, not 14. The grammar's structure, made visible in the parse tree, is the precedence rule. A grammar that allows two different parse trees for the same string is called ambiguous; the precedence-layered grammar above is unambiguous, which is precisely why compilers prefer such designs.

Why Ambiguity Matters

To see why an unambiguous grammar is so important, imagine a careless grammar that just said <expr> ::= <expr> + <expr> | <expr> * <expr> | <number>. For the input 3 + 4 * 2 this grammar permits two different parse trees:

one in which the + is the top operator, grouping 4 * 2 together → value 11 (correct);
one in which the * is the top operator, grouping 3 + 4 together → value 14 (wrong).

Since the parse tree determines how the expression is evaluated, an ambiguous grammar would let a compiler compute either answer — clearly unacceptable. The well-designed layered grammar (<expression> → <term> → <factor>) eliminates this by forcing * and / to sit lower in the tree than + and -, so only the correct grouping is derivable. The lesson generalises: when you design a grammar for anything that will be evaluated, you must ensure each valid string has exactly one parse tree, because that tree is the meaning. This is one of the central practical concerns of programming-language design, and it connects directly to the Reverse Polish Notation lesson, where the (unique) expression tree is flattened to postfix for evaluation.

Syntax (Railroad) Diagrams

A syntax diagram (also called a railroad diagram) is a graphical equivalent of a BNF rule: you read it by following the "track" from left to right, and any complete path through the diagram spells a valid string. Loops in the track represent repetition (recursion); branches represent alternatives (|).

flowchart LR
    start([start]) --> L["letter"]
    L --> branch{ }
    branch -->|"letter"| branch
    branch -->|"digit"| branch
    branch --> done([valid identifier])

This diagram says an identifier is a letter followed by any number of further letters or digits — the loop back into the branch is the diagrammatic form of recursion. Syntax diagrams and BNF carry the same information; language reference manuals often print both because each suits different readers.

Extended BNF (EBNF)

EBNF adds shorthand that makes common patterns less verbose, without changing the class of languages describable:

EBNF notation	Meaning	Plain-BNF equivalent
`{ }`	zero or more repetitions	a recursive rule
`[ ]`	optional (zero or one)	two alternatives, one with `ε`
`( )`	grouping	grouping with `

For example, in EBNF an identifier is:

<identifier> ::= <letter> { <letter> | <digit> }

which is exactly equivalent to the recursive plain BNF:

<identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>

The { } simply hides the recursion that the plain form makes explicit — useful, but adding no new power.

Context-Free Languages and BNF

Context-Free Languages and BNF

Spec Mapping

Where Regular Languages Run Out

The Chomsky Hierarchy

Context-Free Grammars

Worked grammar: $a^n b^n$

A Second Worked Grammar: Balanced Brackets

Backus-Naur Form (BNF)

Worked grammar: arithmetic expressions with correct precedence

Recursion: left, right and the base case

Parse Trees (Derivation Trees)

Parsing "3 + 4 * 2"

Why Ambiguity Matters

Syntax (Railroad) Diagrams

Extended BNF (EBNF)

Why Context-Free Grammars Matter

More in Computer Science

Context-Free Languages and BNF

Context-Free Languages and BNF

Spec Mapping

Where Regular Languages Run Out

The Chomsky Hierarchy

Context-Free Grammars

Worked grammar: anbna^n b^nanbn

A Second Worked Grammar: Balanced Brackets

Backus-Naur Form (BNF)

Worked grammar: arithmetic expressions with correct precedence

Recursion: left, right and the base case

Parse Trees (Derivation Trees)

Parsing "3 + 4 * 2"

Why Ambiguity Matters

Syntax (Railroad) Diagrams

Extended BNF (EBNF)

Why Context-Free Grammars Matter

More in Computer Science

Worked grammar: $a^n b^n$