Regular Expressions

A regular expression (regex) is a compact piece of notation that describes a set of strings — a pattern. Instead of listing every string you want ("cat", "cats", "catt", …) you write a short formula that generates or matches them all. Regular expressions are the everyday tool for pattern matching: validating that an email address has the right shape, finding every date in a document, extracting the numbers from a log file, or splitting source code into tokens. They are so useful that almost every programming language, text editor and command-line tool understands them.

But for an A-Level Computer Scientist the real significance of regular expressions is theoretical, and it is the thread that ties this whole part of the course together: a regular expression and a finite state machine are two different notations for exactly the same thing. Every regular expression can be converted into an FSM that accepts precisely the strings the regex matches, and every FSM can be described by a regular expression. The set of languages they describe — the regular languages — is the same. That equivalence, and the boundary it draws (the things regex cannot describe, like balanced brackets), is the heart of this lesson and the launch-pad for BNF and the Chomsky hierarchy that follow. This lesson covers the regex metacharacters and what each matches, builds up worked patterns from parts, then makes the regex ↔ FSM equivalence concrete by drawing the machine for a regex.

Spec Mapping

This lesson addresses the regular-expression content of H446 section 2.1 (computational methods). The specification expects you to be able to:

understand what a regular expression is and what it is used for (searching, validation, extraction, replacement, and lexical analysis);
know the meaning of the common metacharacters — ., *, +, ?, the anchors ^ and $, alternation |, character classes [...], ranges, negation [^...], grouping (...), and the counted quantifiers {n}, {n,}, {n,m};
construct a regular expression to match a stated pattern, and interpret a given regular expression by describing the strings it matches (and giving examples that do and do not match);
understand the equivalence between regular expressions and finite state machines — that both describe exactly the regular languages — and that some languages (e.g. balanced brackets, $a^n b^n$ ) are not regular and so lie beyond both.

It connects backward to finite state machines (the previous lesson — the machine form of the same idea) and forward to the classification of languages (Type-3 regular languages) and lexical analysis in the Translators topic.

What a Regular Expression Describes

It helps to fix the idea before the symbols. An alphabet is the finite set of symbols we are working over (often {0,1}, or the letters and digits). A string is a finite sequence of those symbols. A language is a set of strings. A regular expression is a formula that picks out one particular language — the set of all strings that "match" it.

Three primitive building blocks generate every regular expression, and it is worth seeing them because they explain why the operators behave as they do:

Concatenation — writing patterns next to each other means "this, then that". ab matches "ab".
Alternation — | means "this or that". a|b matches "a" or "b".
Repetition (Kleene star) — * means "zero or more of the preceding". a* matches "", "a", "aa", ….

Everything else — +, ?, character classes, counted quantifiers — is convenience built on top of these three. That matters for the theory: because the primitives are exactly concatenation, alternation and star, the languages regular expressions describe are precisely the regular languages, no more.

Basic Metacharacters

A metacharacter is a character that has special meaning rather than standing for itself.

Metacharacter	Meaning	Example	Matches	Does not match
`.`	any single character (except newline)	`a.c`	"abc", "a1c", "a c"	"ac", "abbc"
`*`	zero or more of the preceding element	`ab*c`	"ac", "abc", "abbbc"	"adc"
`+`	one or more of the preceding element	`ab+c`	"abc", "abbc"	"ac"
`?`	zero or one of the preceding element	`ab?c`	"ac", "abc"	"abbc"
`^`	start of string (anchor)	`^Hi`	"Hi there"	"Say Hi"
`$`	end of string (anchor)	`end$`	"the end"	"ending"
`\|`	alternation (OR)	`cat\|dog`	"cat", "dog"	"cot"

The three repetition operators are the ones most often muddled, so be precise: * allows none, + demands at least one, and ? allows at most one. The difference between ab*c and ab+c is exactly whether the bare "ac" is included.

The anchors ^ and $ are different in kind — they match a position, not a character. ^ asserts "we are at the start", $ asserts "we are at the end". They consume nothing; they simply pin the rest of the pattern to the boundary, which is how you force a regex to match a whole string rather than just a substring of it.

Character Classes

A character class, written in square brackets, matches exactly one character chosen from a set:

Syntax	Meaning	Example	Matches
`[abc]`	any one of a, b, c	`[aeiou]`	a single vowel
`[a-z]`	any character in the range a–z	`[a-z]`	any lowercase letter
`[A-Z]`	any uppercase letter	`[A-Z]`	"A" … "Z"
`[0-9]`	any digit	`[0-9]`	"0" … "9"
`[a-zA-Z]`	any letter	`[a-zA-Z]`	any letter, either case
`[^abc]`	any character except a, b, c	`[^0-9]`	any non-digit

Two points trip students up. First, a range like a-z uses the hyphen specially — it means "every character from a to z inclusive", relying on the underlying character ordering. Second, a ^ as the very first character inside the brackets negates the class: [^0-9] means "any character that is not a digit".

Key Term: The ^ symbol has two unrelated meanings. Outside square brackets it is the start-of-string anchor; immediately inside square brackets it negates the character class. ^[0-9] means "a digit at the start of the string"; [^0-9] means "a single non-digit character". Reading the two correctly is a common exam discriminator.

Counted Quantifiers

The bare *, + and ? cover "any number", "at least one" and "optional". For specific counts, use brace quantifiers:

Quantifier	Meaning	Example	Matches
`*`	0 or more	`a*`	"", "a", "aa", …
`+`	1 or more	`a+`	"a", "aa", … (not "")
`?`	0 or 1	`a?`	"", "a"
`{n}`	exactly n	`a{3}`	"aaa"
`{n,}`	n or more	`a{2,}`	"aa", "aaa", …
`{n,m}`	between n and m	`a{2,4}`	"aa", "aaa", "aaaa"

These are pure convenience — a{3} is just shorthand for aaa, and a{2,4} for aa(a(a)?)? — so they add no expressive power, only readability. They are invaluable when a real format demands a fixed length, such as "exactly three digits" in a phone code.

Grouping and Alternation

Parentheses group a sub-pattern so an operator applies to the whole group, and | chooses between alternatives:

Syntax	Meaning	Example	Matches
`(abc)`	treat "abc" as one unit	`(ab)+`	"ab", "abab", "ababab"
`a\|b`	"a" or "b"	`cat\|dog`	"cat", "dog"
`(cat\|dog)s`	alternation inside a group	`(cat\|dog)s`	"cats", "dogs"

Grouping is what makes repetition of multi-character patterns possible: ab+ repeats only the b (matching "ab", "abb", …), whereas (ab)+ repeats the whole pair (matching "ab", "abab", …). Grouping also scopes alternation: (cat|dog)s adds the trailing "s" to both alternatives, while the un-grouped cat|dogs would mean "cat" or "dogs" — a genuinely different language. Mis-scoping alternation is one of the commonest regex bugs.

A Library of Useful Patterns

Purpose	Pattern	Notes
any digit	`[0-9]`	one character
any letter	`[a-zA-Z]`	either case
alphanumeric	`[a-zA-Z0-9]`	letters and digits
an integer	`[0-9]+`	one or more digits
a binary string	`[01]+`	one or more 0/1
an identifier	`[a-zA-Z_][a-zA-Z0-9_]*`	first char letter/underscore, then any
simplified UK postcode	`[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}`	e.g. "SW1A 1AA"
simple email	`[a-zA-Z0-9.]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}`	basic shape only

The identifier pattern repays study because it is exactly how a compiler's lexer recognises a variable name: [a-zA-Z_] forces a letter or underscore first (so "2x" is rejected), and [a-zA-Z0-9_]* then allows any run of letters, digits or underscores. Notice the email pattern contains \. — a backslash-dot, the escaped form of . — because an unescaped . would match any character, allowing "user@exampleXcom"; escaping it forces a literal full stop.

Worked Examples — Building Patterns From Parts

The reliable method is decompose, write each part, concatenate, then test with strings that should and should not match.

Example 1 — a UK mobile in the format 07xxx xxxxxx. The parts are: literal "07", then three digits, a space, then six digits.

07[0-9]{3} [0-9]{6}

Matches "07123 456789". Does not match "07123456789" (no space) or "08123 456789" (does not start "07").

Example 2 — a 24-hour time HH:MM. The hours are 00–23, which splits into "00–19" ([01][0-9]) or "20–23" (2[0-3]); minutes are 00–59 ([0-5][0-9]):

([01][0-9]|2[0-3]):[0-5][0-9]

Matches "09:30", "23:59", "00:00". Does not match "24:00" (hour out of range), "9:30" (single-digit hour) or "12:60" (minute out of range). The alternation-inside-a-group is doing the real work: it is the only way to express "00–23" as a regular pattern, since the leading digit constrains what the second digit may be.

Example 3 — binary strings containing at least one 1. Lead with any number of 0s, then a compulsory 1, then anything:

0*1[01]*

Matches "1", "01", "10", "1010", "0001". Does not match "" (no 1) or "000" (no 1). This same language was recognised by a two-state FSM in the previous lesson — a first hint of the equivalence we develop next.

Example 4 — a real (decimal) number with an optional sign and optional fractional part, such as "42", "-3", "3.14", "+0.5". Decompose: an optional sign, then one or more digits, then optionally a dot followed by one or more digits.

[+-]?[0-9]+(\.[0-9]+)?

Reading the parts: [+-]? is an optional leading + or − (the ? makes the whole sign optional); [0-9]+ is the compulsory integer part; (\.[0-9]+)? is an optional group — an escaped dot followed by one or more digits — so the fractional part is present or absent as a whole. Matches "42", "-3", "3.14", "+0.5". Does not match "3." (the group demands at least one digit after the dot), ".5" (no integer part) or "1.2.3" (the group occurs at most once). This pattern is essentially how a lexer recognises a numeric literal — and it shows how ? on a group expresses "this whole sub-structure is optional".

Example 5 — a string over {a,b} that starts and ends with the same letter (and has length ≥ 1). There is no single neat formula, so use alternation between the two cases — "starts and ends with a" or "starts and ends with b" — plus the single-letter strings "a" and "b":

a([ab]*a)?|b([ab]*b)?

The left alternative a([ab]*a)? is "an a, optionally followed by (anything then another a)", covering "a", "aa", "aba", "abba"…; the right alternative does the same for b. Matches "a", "b", "aba", "bab", "abba". Does not match "ab" or "ba" (different first and last letters) or "" (empty). The lesson here is a useful exam tactic: when a single pattern is awkward, split the language into cases with | and handle each case separately.

Reading a Given Regex (the reverse skill)

Examiners test interpretation as often as construction: given a regular expression, describe the language and give examples. The reliable method is to read left to right, naming each piece, then state the overall shape.

Take [A-Z][a-z]*( [A-Z][a-z]*)*. Read it in chunks: [A-Z] is one capital letter; [a-z]* is zero or more lowercase letters — so [A-Z][a-z]* is a capitalised word ("A", "Hello", "Cat"). The trailing ( [A-Z][a-z]*)* is "zero or more of (space + another capitalised word)". So the whole expression matches a sequence of one or more capitalised words separated by single spaces — e.g. "Hello", "Hello World", "The Quick Brown Fox". It does not match "hello" (no leading capital) or "Hello World" (double space) or "HelloWorld" (no separating space). Naming each fragment before summarising is what turns a baffling line of symbols into a clear English description.

A second worked reading: (00|11)* over {0,1}. The group (00|11) matches "00" or "11"; the * repeats it zero or more times. So the language is "any concatenation of the blocks 00 and 11" — equivalently, all even-length binary strings in which every maximal run has even length… but the cleaner description examiners want is simply "strings built by stuck-together pairs, each pair being 00 or 11", e.g. "", "00", "11", "0011", "1100", "111100". It does not match "01", "0", or "000" (odd length / not a whole number of identical pairs). Whenever you see (X)*, say "zero or more copies of X" and then give the empty string as your first example — examiners reward spotting that * includes the zero case.

The Equivalence Between Regular Expressions and FSMs

This is the central idea of the topic. Regular expressions and finite state machines have exactly the same expressive power. Formally:

every regular expression can be converted into a finite state machine that accepts precisely the strings the regex matches (the standard algorithm is Thompson's construction);
every finite state machine can be converted into a regular expression that describes precisely the strings it accepts (e.g. by state elimination);
consequently the languages describable by regular expressions are identical to the languages recognisable by FSMs: the regular languages.

Regular Expressions

Regular Expressions

Spec Mapping

What a Regular Expression Describes

Basic Metacharacters

Character Classes

Counted Quantifiers

Grouping and Alternation

A Library of Useful Patterns

Worked Examples — Building Patterns From Parts

Reading a Given Regex (the reverse skill)

The Equivalence Between Regular Expressions and FSMs

More in Computer Science