You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
Regular languages — the territory of finite state machines and regular expressions — hit a hard wall: they cannot describe nested or recursive structure. Balanced brackets, the language anbn, and the syntax of every real programming language all lie beyond them. Context-free languages (CFLs) are the next, more powerful class, and they are exactly what we need to describe such structure. They are generated by context-free grammars (CFGs), which in computer science are most often written in Backus-Naur Form (BNF). The decisive new ingredient is recursion: a grammar rule that refers, directly or indirectly, to itself, giving the unbounded "memory" that finite machines lack.
This lesson sits in the Theory of Computation strand (around section 4.4.3) of AQA A-Level Computer Science (7517). You are expected to understand the limits of regular languages — that some sets of strings (matched brackets, anbn) cannot be defined by a regular expression or recognised by an FSM — and to know that such languages can instead be described by Backus-Naur Form (BNF). You must be able to read and write BNF production rules, recognise and use recursion within BNF, derive example strings from a grammar, and relate BNF to syntax (railroad) diagrams. The treatment below is descriptive rather than a verbatim quotation of the specification.
Recall the FSM/regex limitation: a finite machine has only finitely many states, so it can remember only a bounded amount of information. Two canonical languages defeat it:
(()) and ((())) where every opening bracket has a matching close in the right order. Checking this requires tracking the current nesting depth, which can grow without limit.as then bs: ab, aabb, aaabbb, … . This requires counting the as to match against the bs.Both need unbounded memory, so no regex describes them. The escape is recursion: a grammar rule that contains itself can "wrap" a structure around a smaller copy of the same structure, expanding to any depth. That single idea is what lifts us from regular to context-free.
Formal languages sit in a strict ladder of expressive power, the Chomsky hierarchy:
| Type | Language class | Recognised by | Characteristic example |
|---|---|---|---|
| 3 | Regular | Finite state machine | a*b* |
| 2 | Context-free | Pushdown automaton (FSM + a stack) | anbn, balanced brackets |
| 1 | Context-sensitive | Linear-bounded automaton | anbncn |
| 0 | Recursively enumerable | Turing machine | any computable language |
Each level strictly contains the one below: Regular ⊂ Context-Free ⊂ Context-Sensitive ⊂ Recursively Enumerable. The jump from Type 3 to Type 2 is precisely the jump this lesson is about, and the machine that recognises Type 2 languages — a pushdown automaton — is just a finite state machine equipped with a stack. That stack is the unbounded memory the FSM was missing, and it is no coincidence that the stack reappears when we evaluate expressions in the Reverse Polish Notation lesson. The pattern of the whole hierarchy is worth internalising: each rung is the machine of the rung below with one more capability added — a stack to get from regular to context-free, a bounded read/write tape to get to context-sensitive, and an unbounded tape to reach the full power of a Turing machine. Memory, in increasingly flexible forms, is the single dimension along which computational power grows.
A context-free grammar has four components:
| Component | Notation | Description |
|---|---|---|
| Terminals | lowercase letters, digits, symbols | The actual characters that appear in the language (e.g. a, b, +, (, )) |
| Non-terminals | UPPERCASE, or names in < > | "Variables" standing for sub-structures, to be rewritten |
| Production rules | → or ::= | Rules rewriting a non-terminal as a sequence of terminals and non-terminals |
| Start symbol | usually S or <program> | The non-terminal a derivation begins from |
The word context-free means each rule rewrites a single non-terminal regardless of what surrounds it — its "context" is irrelevant. That restriction is what keeps the class tractable while still allowing recursion.
S → a S b | ε
Here S is the start symbol, a and b are terminals, and ε is the empty string. The rule says: an S is either an a, then another S, then a b (the recursive case), or nothing at all (the base case that stops the recursion). Watch a full derivation of aaabbb, rewriting S at each step:
| Step | Sentential form | Rule applied |
|---|---|---|
| 0 | S | start |
| 1 | a S b | S → a S b |
| 2 | a a S b b | S → a S b (applied to inner S) |
| 3 | a a a S b b b | S → a S b |
| 4 | a a a ε b b b | S → ε (base case) |
| 5 | a a a b b b = aaabbb | collapse ε |
Each recursive step adds exactly one a on the left and exactly one b on the right, so the counts can never get out of step — the grammar generates precisely { ε, ab, aabb, aaabbb, … } = {anbn:n≥0}. An FSM cannot do this: it would need to count the as, and no fixed number of states can count to an arbitrary n. The recursion in the grammar is the unbounded memory.
The other classic non-regular language is the set of correctly balanced brackets — strings like (), (()), ()() and (()()) where every ( has a matching ) in the right order. A context-free grammar captures it neatly:
S → ( S ) | S S | ε
Read the three alternatives as the three ways a balanced string can be formed: a balanced string wrapped in a fresh pair of brackets (( S )); two balanced strings written one after another (S S); or the empty string (the base case). Watch a derivation of (())():
| Step | Sentential form | Rule applied |
|---|---|---|
| 0 | S | start |
| 1 | S S | S → S S |
| 2 | ( S ) S | left S → ( S ) |
| 3 | ( ( S ) ) S | inner S → ( S ) |
| 4 | ( ( ) ) S | inner S → ε |
| 5 | ( ( ) ) ( S ) | right S → ( S ) |
| 6 | ( ( ) ) ( ) | last S → ε |
Reading the leaves gives (())() — correctly balanced. The grammar can never produce an unmatched bracket because brackets are only ever introduced as a pair by the ( S ) rule. This is exactly the language a finite state machine cannot recognise, because checking balance requires tracking the current nesting depth, which is unbounded; the grammar's recursion supplies precisely that unbounded capability. The very same structure underlies the matching of begin/end, {/} and opening/closing HTML tags in real languages, which is why this little grammar is far more than a toy.
BNF is the standard metalanguage for writing context-free grammars and is used to define the syntax of programming languages, network protocols and file formats.
| Symbol | Meaning |
|---|---|
::= | "is defined as" (the BNF form of →) |
| ` | ` |
< > | enclose a non-terminal's name |
<expression> ::= <term> | <expression> + <term> | <expression> - <term>
<term> ::= <factor> | <term> * <factor> | <term> / <factor>
<factor> ::= <number> | ( <expression> )
<number> ::= <digit> | <number> <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
This single grammar achieves three things that reward close study:
* and / live at the <term> level, while + and - live at the higher <expression> level. Because an expression is built from terms, multiplication binds more tightly than addition — without any explicit precedence table. The grammar's shape is the precedence.<number> ::= <digit> | <number> <digit> lets a number be any string of digits, and <expression> referring to itself lets expressions be arbitrarily long.<factor> ::= ( <expression> ) makes a whole parenthesised expression usable wherever a single factor can appear — so 3 + 4 * (2 - 1) is valid, with the brackets forcing 2 - 1 to be evaluated first.Recursion in BNF comes in two shapes, and both need a non-recursive base case or the definition never terminates:
S → a S b (also <digits> ::= <digit> | <digit> <digits>).<expression> ::= <expression> + <term>.Always include at least one alternative with no recursion (here <digit>, or ε) — it is the exit from the recursion. A grammar with only recursive alternatives generates nothing, because there is no way to terminate a derivation.
A parse tree displays how a string is derived: the root is the start symbol, internal nodes are non-terminals, and the leaves, read left to right, spell out the terminal string. The shape of the tree captures the structure (and hence the meaning) of the input.
flowchart TB
E["<expression>"] --> EL["<expression> (= 3)"]
E --> PLUS["+"]
E --> TR["<term>"]
EL --> TL["<term> → <factor> → 3"]
TR --> TRL["<term> (= 4)"]
TR --> STAR["*"]
TR --> TRF["<factor> → 2"]
Because 4 * 2 is grouped together as a single <term> below the +, the tree forces 4 * 2 to be combined before the addition — exactly matching the convention that 3 + 4 * 2 = 11, not 14. The grammar's structure, made visible in the parse tree, is the precedence rule. A grammar that allows two different parse trees for the same string is called ambiguous; the precedence-layered grammar above is unambiguous, which is precisely why compilers prefer such designs.
To see why an unambiguous grammar is so important, imagine a careless grammar that just said <expr> ::= <expr> + <expr> | <expr> * <expr> | <number>. For the input 3 + 4 * 2 this grammar permits two different parse trees:
+ is the top operator, grouping 4 * 2 together → value 11 (correct);* is the top operator, grouping 3 + 4 together → value 14 (wrong).Since the parse tree determines how the expression is evaluated, an ambiguous grammar would let a compiler compute either answer — clearly unacceptable. The well-designed layered grammar (<expression> → <term> → <factor>) eliminates this by forcing * and / to sit lower in the tree than + and -, so only the correct grouping is derivable. The lesson generalises: when you design a grammar for anything that will be evaluated, you must ensure each valid string has exactly one parse tree, because that tree is the meaning. This is one of the central practical concerns of programming-language design, and it connects directly to the Reverse Polish Notation lesson, where the (unique) expression tree is flattened to postfix for evaluation.
A syntax diagram (also called a railroad diagram) is a graphical equivalent of a BNF rule: you read it by following the "track" from left to right, and any complete path through the diagram spells a valid string. Loops in the track represent repetition (recursion); branches represent alternatives (|).
flowchart LR
start([start]) --> L["letter"]
L --> branch{ }
branch -->|"letter"| branch
branch -->|"digit"| branch
branch --> done([valid identifier])
This diagram says an identifier is a letter followed by any number of further letters or digits — the loop back into the branch is the diagrammatic form of recursion. Syntax diagrams and BNF carry the same information; language reference manuals often print both because each suits different readers.
EBNF adds shorthand that makes common patterns less verbose, without changing the class of languages describable:
| EBNF notation | Meaning | Plain-BNF equivalent |
|---|---|---|
{ } | zero or more repetitions | a recursive rule |
[ ] | optional (zero or one) | two alternatives, one with ε |
( ) | grouping | grouping with ` |
For example, in EBNF an identifier is:
<identifier> ::= <letter> { <letter> | <digit> }
which is exactly equivalent to the recursive plain BNF:
<identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>
The { } simply hides the recursion that the plain form makes explicit — useful, but adding no new power.
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.