Regular Expressions

A regular expression (regex) is a concise, formal notation for describing a set of strings — a pattern. The set of strings a regex describes is called the language of that expression, and any language describable this way is called a regular language. The headline result of this lesson is one of the most elegant in the whole of computer science: regular expressions and finite state machines are exactly equivalent in power. Every regex can be turned into an FSM that accepts precisely the matching strings, and every FSM's language can be written as a regex. The languages they share are the regular languages — the simplest, best-understood class in the Chomsky hierarchy. Understanding what regex can and cannot describe is just as important as the syntax itself.

Spec Mapping

This lesson sits in the Theory of Computation strand (around section 4.4.2) of AQA A-Level Computer Science (7517). You need to know the regular-expression metacharacters required by the specification — concatenation, alternation |, the Kleene star *, the Kleene plus +, the optional ?, and grouping with ( ) — and be able to write a regex for a simple language and decide whether a given string matches one. You must understand that regular expressions and finite state machines are equivalent (they describe the same regular languages), and appreciate the limits of regular languages: certain patterns, such as matched/balanced brackets and $a^n b^n$ , cannot be described by any regular expression. As elsewhere, the treatment is descriptive; learn the ideas rather than any exact wording.

Why Regular Expressions Matter

Regular expressions are one of the most heavily used tools in practical computing:

Compilers — the lexical analysis stage uses regex to define and recognise tokens (keywords, identifiers, numbers).
Text editors and IDEs — find-and-replace with patterns rather than literal text.
Data validation — checking that an email address, postcode or phone number has the right form.
Log analysis and web scraping — extracting structured fields from unstructured text.
Command-line tools — grep, sed and awk are built around regular expressions.

What makes regular expressions so valuable is the combination of conciseness and power within their class: a single short line of notation can capture a pattern that would otherwise need pages of hand-written conditional logic, and — because that line compiles to a finite state machine — it can be checked against text extremely efficiently. The same notation appears, with minor syntactic variations, in virtually every programming language and text-processing tool, so the concepts you learn here transfer directly into professional practice. Equally important for this course is knowing the boundary of that power: a great deal of the examinable understanding is about recognising the patterns regular expressions can describe versus those they fundamentally cannot.

Regex Syntax (the operators you must know)

Operator	Name	Meaning	Example	Matches
(none)	Concatenation	One pattern immediately followed by another	`ab`	"ab"
`	`	Alternation (OR)	One pattern or another	`a\|b`
`*`	Kleene star	Zero or more repetitions	`a*`	"", "a", "aa", …
`+`	Kleene plus	One or more repetitions	`a+`	"a", "aa", "aaa", …
`?`	Optional	Zero or one occurrence	`a?`	"" or "a"
`( )`	Grouping	Bracket a sub-expression so operators apply to it as a unit	`(ab)*`	"", "ab", "abab", …

The single most common slip is confusing * with +: a* includes the empty string whereas a+ requires at least one a. Equivalently, a+ is just shorthand for aa*, and a? is shorthand for "a or nothing".

Operator precedence (highest to lowest)

The postfix operators *, +, ?
Concatenation
Alternation |

So ab*|c parses as (a(b*))|c — "an a followed by zero or more bs" OR "a c". If you actually want "a or c, repeated", you must add brackets: (a|c)*. Brackets exist precisely to override this default precedence.

A Gallery of Examples

Regex	Language described
`a`	{ "a" }
`ab`	{ "ab" }
`a\|b`	{ "a", "b" }
`a*`	{ "", "a", "aa", "aaa", … }
`(a\|b)*`	Every string over {a, b}, including the empty string
`a+b`	{ "ab", "aab", "aaab", … }
`(ab)+`	{ "ab", "abab", "ababab", … }
`a(b\|c)d`	{ "abd", "acd" }
`(0\|1)*01`	All binary strings ending in "01"
`1(0\|1)*1`	All binary strings that start and end with 1 (length ≥ 2)
`(0\|1)(0\|1)`	All binary strings of exactly length 2

Spend a moment on 1(0\|1)*1: the leading 1 and trailing 1 are fixed, and (0\|1)* allows anything in between, so the shortest match is "11" — there is no way to match a single "1" because two fixed 1s are required.

Reading a Regex Backwards (Describing the Language)

Just as useful as writing a regex is reading one — describing in words the language a given expression denotes. This is a common exam task and a good test of whether you really understand the operators. Work outwards from the fixed parts:

Regex	Language in words	Two example matches	A non-match
`a(a\|b)*`	starts with `a`, then anything	"a", "abba"	"b"
`(a\|b)aa(a\|b)`	contains "aa" somewhere	"aa", "baab"	"abab"
`babab*`	exactly two `a`s, any `b`s around/between them	"aa", "babab"	"aaa"
`(ab)*`	zero or more copies of "ab"	"", "abab"	"aba"

The third row repays close study. b*ab*ab* reads as "some bs, an a, some bs, another a, some bs" — and because the only two literal as are fixed and every other position is a b* (which may be empty), the language is precisely the strings containing exactly two as. This pattern — fixing the things you must have and padding the gaps with * — is the workhorse technique for both reading and writing regular expressions, and recognising it lets you decode an unfamiliar regex quickly under exam pressure.

The Equivalence with FSMs (Kleene's Theorem)

Kleene's theorem states that the following three statements describe exactly the same class of languages:

The language is regular — recognised by some DFA.
The language can be described by a regular expression.
The language is recognised by some NFA.

In other words, FSMs and regular expressions are two faces of the same coin. For any regex you can mechanically build an FSM that accepts precisely the matching strings, and for any FSM you can write a regex for its language. This is why a regex engine can run efficiently: under the hood it compiles your pattern into a finite automaton.

Regex → FSM, by example

The pattern (0\|1)*01 (binary strings ending in "01") corresponds to this DFA — the very machine you met in the FSM lesson:

stateDiagram-v2
    [*] --> S0
    S0 --> S1 : 0
    S0 --> S0 : 1
    S1 --> S1 : 0
    S1 --> S2 : 1
    S2 --> S1 : 0
    S2 --> S0 : 1
    S2 --> [*]

The (0\|1)* part is the "absorb anything" behaviour of the early states, and the trailing 01 is captured by the path S0 → S1 → S2 into the accepting state. This concrete correspondence is Kleene's theorem in miniature.

Thompson's construction (regex → NFA, conceptually)

There is a systematic recipe — Thompson's construction — that builds an NFA piece by piece from a regex, one rule per operator:

Regex fragment	NFA fragment built
`a` (a single symbol)	Two states joined by an edge labelled `a`
$r_1 r_2$ (concatenation)	Join the accepting state of $r_1$ 's NFA to the start of $r_2$ 's with an ε-transition
$r_1 \mid r_2$ (alternation)	A new start state with ε-transitions into both sub-NFAs
$r^*$ (Kleene star)	A new start/accept state with ε-transitions allowing the sub-NFA to be skipped or repeated

An ε-transition (epsilon transition) is a move the machine may take without consuming any input symbol — the device that lets Thompson's construction "glue" fragments together. The resulting NFA can then be converted to a DFA by the subset construction, completing the chain regex → NFA → DFA.

What Regular Expressions Cannot Describe

This is the conceptually richest part of the topic, and a perennial exam favourite. Because a regex is equivalent to a finite state machine, it inherits the FSM's fundamental limitation: no unbounded memory. A regex can therefore describe any pattern that needs only a fixed, finite amount of "remembering", but no more.

Language	Regular?	Reason
Strings over {a,b} ending in "ab"	Yes	`(a\|b)*ab` — only a bounded amount of recent history matters
Binary strings with an even number of 0s	Yes	Two states (a regex `(1010)1`) track parity
$a^n b^n$ (equal numbers of `a` then `b`)	No	Requires counting the `a`s to match the `b`s — unbounded memory
Balanced/matched brackets, e.g. `((()))`	No	Requires tracking unbounded nesting depth — needs a stack
Palindromes over {a,b}	No	Requires comparing the first half against the reversed second half

The crucial intuition: a regex like a*b* is not the same as $a^n b^n$ . The pattern a*b* matches "aaabb" and "ab" and "aabbb" — any number of as followed by any number of bs, with no requirement that the counts be equal. To force equality you would need to count, and counting to an arbitrary value is exactly what a finite machine cannot do. Likewise, no regex can match balanced brackets, because correct nesting requires remembering how many brackets are currently open — an unbounded quantity. These patterns live one rung higher in the hierarchy, in the context-free languages handled by grammars and pushdown automata (the next lesson).

The Pumping Lemma (the why, conceptually)

The pumping lemma is the formal tool used to prove a language is not regular. The intuition: if a string in a regular language is at least as long as the number of states in its FSM, then by the pigeonhole principle the machine must revisit some state — so part of the string forms a loop. That looped section can be repeated ("pumped") any number of times, and every result must still be in the language. For $a^n b^n$ , pumping the looping section would add extra as (or bs) without matching the other side, producing a string not in the language — a contradiction. Hence $a^n b^n$ is not regular. You are not expected to write formal pumping-lemma proofs at A-Level, but you are expected to explain why such languages exceed regular power.

How to Build a Regular Expression Methodically

Writing a regex from a description becomes far easier with a small repertoire of patterns. Most exam tasks are variations on these building blocks:

Requirement	Pattern idea	Example regex (alphabet {a, b})
"contains the substring `X`"	`.X.` (here `.` = `(a\|b)`)	`(a\|b)ab(a\|b)` for "contains ab"
"starts with `X`"	`X.*`	`a(a\|b)*` for "starts with a"
"ends with `X`"	`.*X`	`(a\|b)*b` for "ends with b"
"exactly `X`"	just `X`	`aba` matches only "aba"
"any string over the alphabet"	`(symbols)*`	`(a\|b)*`
"one or more of `X`"	`X+` (or `XX*`)	`(ab)+` for "ab" repeated
"optional `X`"	`X?`	`a?b` matches "b" or "ab"

The general strategy is: (1) identify the fixed parts the string must contain and where; (2) decide what is allowed around and between those fixed parts; (3) wrap the flexible regions in * or +; (4) test against two or three example strings — and crucially against a string that should fail — to confirm you have not been too permissive or too strict.

Worked construction

Write a regex over {a, b} for "every a is immediately followed by a b". Think about what a valid string looks like: it is a sequence of pieces, where each piece is either a lone b or the pair ab (an a is never allowed to stand without a following b). So each unit is (ab\|b), repeated any number of times: (ab\|b)*. Test it: "abb" = ab+b ✓; "b" = b ✓; the empty string ✓; "ba" should fail — and indeed there is no way to produce a trailing lone a, so it correctly does not match. The deliberate failing test is what gives confidence the regex is not merely accepting the right strings but also rejecting the wrong ones.

Matching as Running the Equivalent FSM

Because a regex compiles to an FSM, deciding whether a string matches is exactly running that FSM and checking the final state. This is worth seeing once, as it ties the two halves of the topic together. Take the regex (0\|1)*01 (binary strings ending in "01") and its DFA from earlier (states S0, S1, S2; accepting S2). Matching "1101":

Step	State	Symbol	Next state
1	S0	1	S0
2	S0	1	S0
3	S0	0	S1
4	S1	1	S2

Final state S2 is accepting, so "1101" matches (0\|1)*01 — and indeed it ends in "01". Now "1100":

Regular Expressions

Regular Expressions

Spec Mapping

Why Regular Expressions Matter

Regex Syntax (the operators you must know)

Operator precedence (highest to lowest)

A Gallery of Examples

Reading a Regex Backwards (Describing the Language)

The Equivalence with FSMs (Kleene's Theorem)

Regex → FSM, by example

Thompson's construction (regex → NFA, conceptually)

What Regular Expressions Cannot Describe

The Pumping Lemma (the why, conceptually)

How to Build a Regular Expression Methodically

Worked construction

Matching as Running the Equivalent FSM

More in Computer Science