You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
Backus-Naur Form (BNF) is a notation for describing the syntax — the grammatical structure — of a formal language, most often a programming language. Where a regular expression describes a flat pattern of characters, BNF describes a nested, recursive structure: an expression made of sub-expressions, a statement containing other statements, a list of lists. Syntax diagrams (also called railroad diagrams) are the picture form of the same rules. Together they are how language designers write down exactly what counts as valid code, and how compiler-writers build the parser that checks it.
The reason BNF earns its own lesson — rather than being a footnote to regular expressions — is one decisive fact: BNF can express things no regular expression can. A regex, being equivalent to a finite-state machine, has only finite memory and cannot count without bound; it therefore cannot describe balanced brackets, matched if…else nesting, or arithmetic with parentheses to arbitrary depth. BNF can, because a BNF rule may refer to itself — recursion — which gives it the unbounded structure that regular expressions lack. That single capability, recursion, is what lifts BNF from the regular languages up to the context-free languages, and it is the conceptual bridge to the Chomsky hierarchy in the final lesson. This lesson covers terminals and non-terminals, production rules, recursion (and why it matters), syntax/railroad diagrams, parsing and parse trees, EBNF, and — crucially — what BNF can express that a regular expression cannot.
This lesson addresses the BNF and syntax-diagram content of H446 section 2.1 (computational methods). The specification expects you to be able to:
::= ("is defined as") and | ("or"), and understand the role of recursion in a rule for defining structures of arbitrary length;It builds directly on regular expressions (where BNF goes beyond them) and connects forward to the classification of languages (BNF describes the context-free languages) and to syntax analysis / parsing in the Translators topic.
Natural language is riddled with ambiguity; a programming language must not be. A formal grammar pins down, with no room for argument, exactly which sequences of symbols are legal.
| Reason | Explanation |
|---|---|
| Unambiguous specification | Every possible input is either valid or invalid according to fixed rules — no human judgement |
| Compiler construction | The parser is generated directly from the grammar; the grammar is the specification it implements |
| Meaningful error messages | Knowing the expected structure lets a compiler say "expected ) here" rather than just "error" |
| Portability | Different compilers for the same language agree on what is legal because they share one grammar |
| Documentation | The grammar is a precise reference for anyone learning or implementing the language |
The key word is structure. A grammar does not just say "these characters are allowed"; it says how they may be arranged and nested — and that structural, recursive aspect is what regular expressions cannot reach.
Every grammar is built from two kinds of symbol:
| Type | Definition | Convention |
|---|---|---|
| Terminal | An actual symbol or token that appears literally in the language; it cannot be broken down further | written in quotes (or bold): "if", "+", "(", "0" |
| Non-terminal | A named rule that stands for a structure and can be replaced by a sequence of terminals and/or non-terminals | written in angle brackets: <digit>, <expression>, <statement> |
A useful analogy: terminals are the actual words of the language, while non-terminals are grammatical categories ("noun phrase", "sentence") that describe how words may be combined. In a real grammar:
"if", "(", ")", "{", "}", "=", ";", "0", "1", …, "9"<if-statement>, <expression>, <condition>, <digit>A grammar always has one distinguished start symbol (a non-terminal) representing the whole thing being defined — <program>, say. A string is valid exactly when, starting from that symbol and repeatedly replacing non-terminals using the rules, you can produce that string with only terminals left.
A BNF grammar is a list of production rules. Each rule names a non-terminal on the left and gives its possible expansions on the right:
<non-terminal> ::= expansion
The notation uses just a few symbols:
| Symbol | Meaning |
|---|---|
::= | "is defined as" |
| | "or" — separates alternative expansions |
<name> | a non-terminal |
"text" | a terminal |
Defining a digit is the simplest rule — a digit is one of ten terminals:
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
This says a <digit> may be expanded to any one of the ten terminal characters. So far this is no more than a regular expression could do ([0-9]). The power appears the moment a rule mentions itself.
Defining a natural number of any length needs exactly that self-reference:
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<natural-number> ::= <digit> | <digit> <natural-number>
Read the second rule as: a <natural-number> is either a single <digit>, or a <digit> followed by another <natural-number>. The rule refers to itself in its second alternative — recursion — which is what lets it generate "5", "42", "1000", or a number with a million digits. There is no upper bound built into the rule; the recursion can unfold as many times as needed.
Recursion is the idea that makes BNF more powerful than regular expressions, so it deserves a careful look. A recursive rule is one whose right-hand side mentions the non-terminal being defined. This is what expresses "a structure of arbitrary size".
A list of items shows the two standard shapes. Left-recursive (the rule refers to itself on the left of the new material):
<list> ::= <item> | <list> "," <item>
This builds a list as "a list, then a comma, then one more item", generating "a", "a,b", "a,b,c", … . Right-recursive (self-reference on the right):
<list> ::= <item> | <item> "," <list>
This describes the same language but grows the other way ("one item, a comma, then the rest of the list"). Both are valid; the choice affects how a parser builds the structure (and which parsing algorithms cope with it), but not which strings are legal.
The deep point is that a finite set of recursive rules describes an infinite language with unbounded structure — exactly what a regular expression, with its finite memory, cannot do. Whenever you need "any number of these, possibly nested", recursion in BNF is the tool.
A genuinely useful grammar describes arithmetic with + - * / and brackets, and gets operator precedence right:
<expression> ::= <term> | <expression> "+" <term> | <expression> "-" <term>
<term> ::= <factor> | <term> "*" <factor> | <term> "/" <factor>
<factor> ::= <number> | "(" <expression> ")"
<number> ::= <digit> | <digit> <number>
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
This single grammar handles:
<number>;<expression> level;<term> level;<factor> can be "(" <expression> ")", and <expression> can contain more brackets, recursively, without limit;* and / live in <term> and <term> is "lower" (more deeply nested) than <expression>, multiplication and division bind tighter than addition and subtraction — they are forced to combine first.That last point is the elegance of layered grammars: precedence is built into the shape of the rules, not bolted on afterwards. And the bracket rule "(" <expression> ")" is the textbook example of something no regular expression can express — arbitrarily deep nesting — captured in one short recursive production.
Grammars describe statements as readily as expressions:
<if-statement> ::= "if" "(" <condition> ")" "{" <statements> "}"
| "if" "(" <condition> ")" "{" <statements> "}" "else" "{" <statements> "}"
<condition> ::= <expression> <comparator> <expression>
<comparator> ::= "==" | "!=" | "<" | ">" | "<=" | ">="
<statements> ::= <statement> | <statement> <statements>
The first rule offers two alternatives — an if with or without an else — and <statements> is recursive, so a block may contain any number of statements. Because a <statement> could itself be another <if-statement> (in a fuller grammar), this directly supports nested if-statements to any depth, again something a regular expression could never enforce because it cannot track how many opening { await a closing }.
Identifiers (variable names) sit on an interesting boundary — they are flat, so they fall within the reach of a regular expression, yet they are equally easy to express in BNF, which lets us compare the two notations directly. A typical rule: an identifier starts with a letter, then has any number of letters or digits.
<identifier> ::= <letter> | <letter> <rest>
<rest> ::= <alphanumeric> | <alphanumeric> <rest>
<alphanumeric> ::= <letter> | <digit>
<letter> ::= "a" | "b" | "c"
<digit> ::= "0" | "1" | "2"
Here <identifier> is "a letter, optionally followed by a <rest>", and <rest> recursively allows one or more further letters-or-digits. This generates "a", "ab", "a1", "abc", "b2c", and rejects "1a" (starts with a digit) or "" (empty). The equivalent regular expression is [a-c][a-c0-2]* — much shorter. The comparison is instructive: because an identifier has no nested structure, both a regex and a grammar describe it, and the regex is the natural choice (which is exactly why a compiler's lexer uses regular expressions for tokens like identifiers). The grammar's extra power — recursion that nests rather than merely repeats — is simply not needed here, and that is the whole point of the regex-vs-BNF division of labour: use the weaker, simpler tool where it suffices, and reach for grammars only when genuine nesting appears.
A syntax diagram is the pictorial form of a BNF rule. They are nicknamed railroad diagrams because a valid string corresponds to any path a "train" can take through the track from entry to exit. The drawing conventions:
| Symbol | Meaning |
|---|---|
| rounded box / circle | a terminal (the literal text inside it) |
| rectangle | a non-terminal (another rule, by name) |
| line with arrow | direction of travel, left to right |
| a fork | a choice (alternatives) — the train may take either branch |
| a loop back | repetition — the train may go round again |
Syntax diagram for <digit> — a single fork choosing one of ten terminals (every path goes through exactly one digit and out):
flowchart LR
IN(("in")) --> J{choose one}
J --> D0(("0"))
J --> D1(("1"))
J --> D2(("2"))
J --> D3(("3"))
J --> D4(("4"))
J --> D5(("5"))
J --> D6(("6"))
J --> D7(("7"))
J --> D8(("8"))
J --> D9(("9"))
D0 --> OUT(("out"))
D1 --> OUT
D2 --> OUT
D3 --> OUT
D4 --> OUT
D5 --> OUT
D6 --> OUT
D7 --> OUT
D8 --> OUT
D9 --> OUT
Syntax diagram for <natural-number> — go through one digit, then optionally loop back to take another, capturing "one or more digits":
flowchart LR
IN(("in")) --> DG["digit"] --> OUT(("out"))
DG -- "repeat" --> DG
The loop arrow is the diagrammatic equivalent of the recursion in the BNF rule — both say "as many digits as you like, but at least one".
The two forms carry identical information, and converting is a routine task:
| BNF construct | Syntax-diagram equivalent |
|---|---|
terminal "x" | a rounded box containing x |
non-terminal <x> | a rectangle labelled x |
A | B (alternatives) | two parallel branches that fork and rejoin |
A B (sequence) | A then B along the same line |
<x> ::= … <x> (recursion) | a loop arrow back to the relevant point |
So alternation becomes a fork, concatenation becomes a straight run, and recursion/repetition becomes a loop. (These are the same three shapes — choice, sequence, repetition — that appeared in regular expressions; the extra power of grammars comes purely from one rule being allowed to invoke another, including itself.)
Parsing is the process of analysing a string against a grammar to (a) decide whether it is valid and (b) reveal its structure. The program that does this is a parser; the structure it produces is a parse tree (or derivation tree); a string that fits no derivation produces a syntax error.
| Concept | Definition |
|---|---|
| Parser | a program that checks a string against a grammar and builds its structure |
| Parse tree | a tree showing how the string is derived from the start symbol via the rules |
| Syntax error | raised when no sequence of rule-applications yields the string |
A parse tree for "3 + 5 * 2" under the arithmetic grammar makes the precedence visible — 5 * 2 is grouped as a single <term> below the +:
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.