You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
A regular expression (regex) is a concise, formal notation for describing a set of strings — a pattern. The set of strings a regex describes is called the language of that expression, and any language describable this way is called a regular language. The headline result of this lesson is one of the most elegant in the whole of computer science: regular expressions and finite state machines are exactly equivalent in power. Every regex can be turned into an FSM that accepts precisely the matching strings, and every FSM's language can be written as a regex. The languages they share are the regular languages — the simplest, best-understood class in the Chomsky hierarchy. Understanding what regex can and cannot describe is just as important as the syntax itself.
This lesson sits in the Theory of Computation strand (around section 4.4.2) of AQA A-Level Computer Science (7517). You need to know the regular-expression metacharacters required by the specification — concatenation, alternation |, the Kleene star *, the Kleene plus +, the optional ?, and grouping with ( ) — and be able to write a regex for a simple language and decide whether a given string matches one. You must understand that regular expressions and finite state machines are equivalent (they describe the same regular languages), and appreciate the limits of regular languages: certain patterns, such as matched/balanced brackets and anbn, cannot be described by any regular expression. As elsewhere, the treatment is descriptive; learn the ideas rather than any exact wording.
Regular expressions are one of the most heavily used tools in practical computing:
grep, sed and awk are built around regular expressions.What makes regular expressions so valuable is the combination of conciseness and power within their class: a single short line of notation can capture a pattern that would otherwise need pages of hand-written conditional logic, and — because that line compiles to a finite state machine — it can be checked against text extremely efficiently. The same notation appears, with minor syntactic variations, in virtually every programming language and text-processing tool, so the concepts you learn here transfer directly into professional practice. Equally important for this course is knowing the boundary of that power: a great deal of the examinable understanding is about recognising the patterns regular expressions can describe versus those they fundamentally cannot.
| Operator | Name | Meaning | Example | Matches |
|---|---|---|---|---|
| (none) | Concatenation | One pattern immediately followed by another | ab | "ab" |
| ` | ` | Alternation (OR) | One pattern or another | a|b |
* | Kleene star | Zero or more repetitions | a* | "", "a", "aa", … |
+ | Kleene plus | One or more repetitions | a+ | "a", "aa", "aaa", … |
? | Optional | Zero or one occurrence | a? | "" or "a" |
( ) | Grouping | Bracket a sub-expression so operators apply to it as a unit | (ab)* | "", "ab", "abab", … |
The single most common slip is confusing * with +: a* includes the empty string whereas a+ requires at least one a. Equivalently, a+ is just shorthand for aa*, and a? is shorthand for "a or nothing".
*, +, ?|So ab*|c parses as (a(b*))|c — "an a followed by zero or more bs" OR "a c". If you actually want "a or c, repeated", you must add brackets: (a|c)*. Brackets exist precisely to override this default precedence.
| Regex | Language described |
|---|---|
a | { "a" } |
ab | { "ab" } |
a|b | { "a", "b" } |
a* | { "", "a", "aa", "aaa", … } |
(a|b)* | Every string over {a, b}, including the empty string |
a+b | { "ab", "aab", "aaab", … } |
(ab)+ | { "ab", "abab", "ababab", … } |
a(b|c)d | { "abd", "acd" } |
(0|1)*01 | All binary strings ending in "01" |
1(0|1)*1 | All binary strings that start and end with 1 (length ≥ 2) |
(0|1)(0|1) | All binary strings of exactly length 2 |
Spend a moment on 1(0\|1)*1: the leading 1 and trailing 1 are fixed, and (0\|1)* allows anything in between, so the shortest match is "11" — there is no way to match a single "1" because two fixed 1s are required.
Just as useful as writing a regex is reading one — describing in words the language a given expression denotes. This is a common exam task and a good test of whether you really understand the operators. Work outwards from the fixed parts:
| Regex | Language in words | Two example matches | A non-match |
|---|---|---|---|
a(a|b)* | starts with a, then anything | "a", "abba" | "b" |
(a|b)*aa(a|b)* | contains "aa" somewhere | "aa", "baab" | "abab" |
b*ab*ab* | exactly two as, any bs around/between them | "aa", "babab" | "aaa" |
(ab)* | zero or more copies of "ab" | "", "abab" | "aba" |
The third row repays close study. b*ab*ab* reads as "some bs, an a, some bs, another a, some bs" — and because the only two literal as are fixed and every other position is a b* (which may be empty), the language is precisely the strings containing exactly two as. This pattern — fixing the things you must have and padding the gaps with * — is the workhorse technique for both reading and writing regular expressions, and recognising it lets you decode an unfamiliar regex quickly under exam pressure.
Kleene's theorem states that the following three statements describe exactly the same class of languages:
In other words, FSMs and regular expressions are two faces of the same coin. For any regex you can mechanically build an FSM that accepts precisely the matching strings, and for any FSM you can write a regex for its language. This is why a regex engine can run efficiently: under the hood it compiles your pattern into a finite automaton.
The pattern (0\|1)*01 (binary strings ending in "01") corresponds to this DFA — the very machine you met in the FSM lesson:
stateDiagram-v2
[*] --> S0
S0 --> S1 : 0
S0 --> S0 : 1
S1 --> S1 : 0
S1 --> S2 : 1
S2 --> S1 : 0
S2 --> S0 : 1
S2 --> [*]
The (0\|1)* part is the "absorb anything" behaviour of the early states, and the trailing 01 is captured by the path S0 → S1 → S2 into the accepting state. This concrete correspondence is Kleene's theorem in miniature.
There is a systematic recipe — Thompson's construction — that builds an NFA piece by piece from a regex, one rule per operator:
| Regex fragment | NFA fragment built |
|---|---|
a (a single symbol) | Two states joined by an edge labelled a |
| r1r2 (concatenation) | Join the accepting state of r1's NFA to the start of r2's with an ε-transition |
| r1∣r2 (alternation) | A new start state with ε-transitions into both sub-NFAs |
| r∗ (Kleene star) | A new start/accept state with ε-transitions allowing the sub-NFA to be skipped or repeated |
An ε-transition (epsilon transition) is a move the machine may take without consuming any input symbol — the device that lets Thompson's construction "glue" fragments together. The resulting NFA can then be converted to a DFA by the subset construction, completing the chain regex → NFA → DFA.
This is the conceptually richest part of the topic, and a perennial exam favourite. Because a regex is equivalent to a finite state machine, it inherits the FSM's fundamental limitation: no unbounded memory. A regex can therefore describe any pattern that needs only a fixed, finite amount of "remembering", but no more.
| Language | Regular? | Reason |
|---|---|---|
| Strings over {a,b} ending in "ab" | Yes | (a|b)*ab — only a bounded amount of recent history matters |
| Binary strings with an even number of 0s | Yes | Two states (a regex (1*01*0)*1*) track parity |
anbn (equal numbers of a then b) | No | Requires counting the as to match the bs — unbounded memory |
Balanced/matched brackets, e.g. ((())) | No | Requires tracking unbounded nesting depth — needs a stack |
| Palindromes over {a,b} | No | Requires comparing the first half against the reversed second half |
The crucial intuition: a regex like a*b* is not the same as anbn. The pattern a*b* matches "aaabb" and "ab" and "aabbb" — any number of as followed by any number of bs, with no requirement that the counts be equal. To force equality you would need to count, and counting to an arbitrary value is exactly what a finite machine cannot do. Likewise, no regex can match balanced brackets, because correct nesting requires remembering how many brackets are currently open — an unbounded quantity. These patterns live one rung higher in the hierarchy, in the context-free languages handled by grammars and pushdown automata (the next lesson).
The pumping lemma is the formal tool used to prove a language is not regular. The intuition: if a string in a regular language is at least as long as the number of states in its FSM, then by the pigeonhole principle the machine must revisit some state — so part of the string forms a loop. That looped section can be repeated ("pumped") any number of times, and every result must still be in the language. For anbn, pumping the looping section would add extra as (or bs) without matching the other side, producing a string not in the language — a contradiction. Hence anbn is not regular. You are not expected to write formal pumping-lemma proofs at A-Level, but you are expected to explain why such languages exceed regular power.
Writing a regex from a description becomes far easier with a small repertoire of patterns. Most exam tasks are variations on these building blocks:
| Requirement | Pattern idea | Example regex (alphabet {a, b}) |
|---|---|---|
"contains the substring X" | .*X.* (here . = (a|b)) | (a|b)*ab(a|b)* for "contains ab" |
"starts with X" | X.* | a(a|b)* for "starts with a" |
"ends with X" | .*X | (a|b)*b for "ends with b" |
"exactly X" | just X | aba matches only "aba" |
| "any string over the alphabet" | (symbols)* | (a|b)* |
"one or more of X" | X+ (or XX*) | (ab)+ for "ab" repeated |
"optional X" | X? | a?b matches "b" or "ab" |
The general strategy is: (1) identify the fixed parts the string must contain and where; (2) decide what is allowed around and between those fixed parts; (3) wrap the flexible regions in * or +; (4) test against two or three example strings — and crucially against a string that should fail — to confirm you have not been too permissive or too strict.
Write a regex over {a, b} for "every a is immediately followed by a b". Think about what a valid string looks like: it is a sequence of pieces, where each piece is either a lone b or the pair ab (an a is never allowed to stand without a following b). So each unit is (ab\|b), repeated any number of times: (ab\|b)*. Test it: "abb" = ab+b ✓; "b" = b ✓; the empty string ✓; "ba" should fail — and indeed there is no way to produce a trailing lone a, so it correctly does not match. The deliberate failing test is what gives confidence the regex is not merely accepting the right strings but also rejecting the wrong ones.
Because a regex compiles to an FSM, deciding whether a string matches is exactly running that FSM and checking the final state. This is worth seeing once, as it ties the two halves of the topic together. Take the regex (0\|1)*01 (binary strings ending in "01") and its DFA from earlier (states S0, S1, S2; accepting S2). Matching "1101":
| Step | State | Symbol | Next state |
|---|---|---|---|
| 1 | S0 | 1 | S0 |
| 2 | S0 | 1 | S0 |
| 3 | S0 | 0 | S1 |
| 4 | S1 | 1 | S2 |
Final state S2 is accepting, so "1101" matches (0\|1)*01 — and indeed it ends in "01". Now "1100":
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.