String Handling and Regular Expressions

This lesson covers string handling — the manipulation, traversal and processing of text — and regular expressions, the compact pattern-matching language used to search, validate and transform text. Text is the most common data type a program handles: usernames, postcodes, file contents, network messages and user input are all strings, so fluent string manipulation is a core A-Level skill. Underpinning all of it is the idea that a character is fundamentally a number (its character code), which is what makes both encryption and validation possible.

Spec Mapping

This lesson addresses string-handling operations within the Fundamentals of programming section of the AQA A-Level Computer Science (7517) specification (subject content area 4.1.1), together with the representation of characters using ASCII and Unicode (links to data representation in 4.5.3). It covers: string length, position/indexing, slicing (substring), concatenation, case conversion and searching; converting between character and character code (ord/chr); the structure of ASCII and the extension to Unicode; and practical pattern matching with regular expressions for validation and extraction. You are expected to write code that manipulates strings character by character, use string operations correctly, and interpret or construct simple regular expressions.

What is a string?

A string is an ordered sequence of characters. In most modern languages — including Python, Java and C# — strings are immutable: once created they cannot be altered in place. Any operation that appears to change a string in fact constructs a new string and leaves the original untouched.

greeting = "Hello"
greeting.upper()        # returns "HELLO" but does NOT change greeting
print(greeting)         # still "Hello"
greeting = greeting.upper()   # to keep the result, reassign it
print(greeting)         # now "HELLO"

Immutability matters for two reasons. First, it explains why greeting.upper() on its own seems to "do nothing" — a very common beginner bug. Second, it means that building a long string by repeated concatenation in a loop can be inefficient, because each += creates a brand-new string; for heavy work, building a list and using join at the end is preferred.
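The two approaches to building a long string can be compared side by side. This is a minimal sketch; the function names build_with_concat and build_with_join are hypothetical and chosen only for illustration.

```python
# Hypothetical helpers: build "0,1,2,...,n-1" in two different ways.

def build_with_concat(n: int) -> str:
    """Repeated concatenation: conceptually, every += makes a new string."""
    result = ""
    for i in range(n):
        if i > 0:
            result += ","
        result += str(i)
    return result


def build_with_join(n: int) -> str:
    """Collect the pieces in a list, then join them once at the end."""
    parts = []
    for i in range(n):
        parts.append(str(i))
    return ",".join(parts)


print(build_with_concat(5))   # 0,1,2,3,4
print(build_with_join(5))     # 0,1,2,3,4
```

Both functions return the same text; the join version simply avoids creating an intermediate string on every pass through the loop.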

Core string operations

Operation	Purpose	Python	Result
Length	Count characters	`len("Hello")`	`5`
Indexing	One character by position (0-based)	`"Hello"[1]`	`"e"`
Slicing	Extract a substring	`"Hello"[1:4]`	`"ell"`
Concatenation	Join strings	`"Hi" + " there"`	`"Hi there"`
Repetition	Repeat	`"Ha" * 3`	`"HaHaHa"`
Membership	Is a substring present?	`"ell" in "Hello"`	`True`
Case	Change case	`"Hello".upper()`	`"HELLO"`
Strip	Remove surrounding whitespace	`" Hi ".strip()`	`"Hi"`
Split	Break into a list	`"a,b,c".split(",")`	`["a","b","c"]`
Join	Combine a list into a string	`",".join(["a","b"])`	`"a,b"`
Replace	Substitute occurrences	`"cat".replace("c","b")`	`"bat"`
Find	Position of a substring	`"Hello".find("ll")`	`2`

Indexing and slicing in detail

Positions are zero-based, so the first character of "COMPUTER" is at index 0 and the last (the eighth character) is at index 7. A slice s[start:stop] returns characters from start up to but not including stop — the half-open interval that catches so many students out.

word = "COMPUTER"
#       01234567   index of each character
print(word[0])      # "C"        first character
print(word[7])      # "R"        last character
print(word[0:4])    # "COMP"     indices 0,1,2,3 — NOT 4
print(word[4:])     # "UTER"     from index 4 to the end
print(word[-1])     # "R"        negative index counts from the end
print(len(word))    # 8

The slice word[0:4] has length 4 - 0 = 4; this "stop minus start equals length" rule is the quickest way to predict a slice's result in an exam.
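The "stop minus start" rule can be checked directly. The sketch below is illustrative only; the helper predicted_length is a hypothetical name, and the rule as written assumes 0 <= start <= stop <= len(word).

```python
word = "COMPUTER"


def predicted_length(start: int, stop: int) -> int:
    """Length of word[start:stop], assuming 0 <= start <= stop <= len(word)."""
    return stop - start


for start, stop in [(0, 4), (2, 5), (4, 8)]:
    piece = word[start:stop]
    # The final column is True when the rule predicts the real length
    print(start, stop, piece, len(piece) == predicted_length(start, stop))
# 0 4 COMP True
# 2 5 MPU True
# 4 8 UTER True
```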

Traversing a string

Processing text character by character is the workhorse of string handling — counting vowels, reversing, encrypting and validating all rest on it.

AQA-style pseudocode

# AQA-style pseudocode: count the vowels in a word
text = "COMPUTER"
count = 0
FOR i = 0 TO LENGTH(text) - 1
    IF text[i] IN ['A','E','I','O','U'] THEN
        count = count + 1
    ENDIF
NEXT i
OUTPUT count

Python

text = "COMPUTER"

# Direct iteration over characters (preferred)
vowels = 0
for char in text:
    if char in "AEIOU":
        vowels += 1
print(vowels)          # 3

# Indexed iteration when the position is needed
for i in range(len(text)):
    print(f"Index {i}: {text[i]}")

Characters are numbers: ASCII and Unicode

Every character is stored as a number — its character code. The original scheme, ASCII (American Standard Code for Information Interchange), uses 7 bits to encode 128 characters: the control codes, digits, upper- and lower-case Latin letters and common punctuation. ord returns a character's code; chr converts a code back to its character.

ord("A")     # 65
chr(65)      # "A"
ord("a")     # 97
ord("0")     # 48

Character	Code	Character	Code
`' '` (space)	32	`'A'`	65
`'!'`	33	`'Z'`	90
`'0'`	48	`'a'`	97
`'9'`	57	`'z'`	122

Three facts about this table are worth memorising because they enable character arithmetic:

The digits '0'–'9' occupy a contiguous block 48–57, so ord(d) - 48 converts a digit character to its numeric value.
Upper-case letters occupy 65–90 and lower-case 97–122, each contiguous and in alphabetical order, so ord('C') - ord('A') gives a letter's zero-based position in the alphabet (here 2, with 'A' at position 0).
The same letter differs by exactly 32 between cases — ord('a') - ord('A') == 32 — which is why a single bit (value 32) toggles case.
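The three facts translate into three one-line calculations. This is a sketch under the assumption that the input is a single ASCII digit or letter; the function names are hypothetical.

```python
def digit_value(d: str) -> int:
    """Convert one digit character '0'..'9' to its numeric value."""
    return ord(d) - ord("0")          # ord("0") is 48


def alphabet_position(letter: str) -> int:
    """Zero-based position of an ASCII letter: 'A' or 'a' -> 0, 'Z' or 'z' -> 25."""
    return ord(letter.upper()) - ord("A")


def toggle_case(letter: str) -> str:
    """Flip the bit worth 32; only meaningful for ASCII letters A-Z and a-z."""
    return chr(ord(letter) ^ 32)


print(digit_value("7"))                    # 7
print(alphabet_position("C"))              # 2
print(toggle_case("a"), toggle_case("Q"))  # A q
```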

ASCII is too small for the world's writing systems and symbols, so modern systems use Unicode, which assigns a unique code point to each character. Its code space has room for over a million code points (1,114,112), of which well over 100,000 are currently assigned (Latin, Greek, Cyrillic, CJK, emoji and more). The first 128 Unicode code points are identical to ASCII, so ASCII text is valid Unicode. Unicode is most commonly stored using the UTF-8 encoding, in which a code point occupies between one and four bytes. ASCII characters still take a single byte, which is why UTF-8 is backward-compatible and efficient for English text.
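The one-to-four-byte behaviour of UTF-8 can be observed by encoding a few characters. The sample characters and the helper name utf8_length are assumptions made for this sketch; escape sequences are used so the code does not depend on the file's own encoding.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes needed to store one character in UTF-8."""
    return len(ch.encode("utf-8"))


samples = [
    "A",            # U+0041, plain ASCII
    "\u00e9",       # U+00E9, e with acute accent
    "\u20ac",       # U+20AC, euro sign
    "\U0001F600",   # U+1F600, an emoji
]

for ch in samples:
    print(hex(ord(ch)), utf8_length(ch))
# 0x41 1
# 0xe9 2
# 0x20ac 3
# 0x1f600 4
```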

Exam Tip: Learn the ASCII ranges — digits 48–57, upper-case 65–90, lower-case 97–122, and the 32 gap between cases. These appear directly in character-manipulation and Caesar-cipher questions, and a single correct ord/chr calculation often secures the mark.

Worked example: the Caesar cipher

The Caesar cipher shifts each letter a fixed number of places along the alphabet, wrapping round from Z to A. It is a perfect demonstration of character arithmetic.

def caesar_encrypt(text: str, shift: int) -> str:
    result = ""
    for char in text:
        if char.isalpha():
            base = ord('A') if char.isupper() else ord('a')
            # 1. ord(char) - base  -> position 0..25 in the alphabet
            # 2. + shift           -> move along
            # 3. % 26              -> wrap round (Z -> A)
            # 4. + base            -> back to a character code
            shifted = (ord(char) - base + shift) % 26 + base
            result += chr(shifted)
        else:
            result += char          # leave spaces/punctuation unchanged
    return result

print(caesar_encrypt("Hello, World", 3))   # "Khoor, Zruog"

Tracing one character

Take 'H' with shift = 3:

Step	Expression	Value
Base for upper-case	`ord('A')`	65
Position in alphabet	`ord('H') - 65 = 72 - 65`	7
Apply shift	`7 + 3`	10
Wrap with modulo	`10 % 26`	10
Back to a code	`10 + 65`	75
Resulting character	`chr(75)`	`'K'`

The % 26 is the crucial step: without it, shifting 'Y' (position 24) by 3 would give position 27, which is off the end of the alphabet; 27 % 26 = 1 correctly wraps to 'B'.
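The same modulo step also handles decryption, because in Python the result of % 26 is never negative. The sketch below is a compact variant rather than the lesson's own function: the names shift_letter and caesar are hypothetical, and it deliberately restricts itself to ASCII letters instead of using isalpha().

```python
def shift_letter(char: str, shift: int) -> str:
    """Shift one ASCII letter by `shift` places, wrapping within the alphabet."""
    if not ("A" <= char <= "Z" or "a" <= char <= "z"):
        return char                      # leave everything else unchanged
    base = ord("A") if char <= "Z" else ord("a")
    return chr((ord(char) - base + shift) % 26 + base)


def caesar(text: str, shift: int) -> str:
    """Apply shift_letter to every character of text."""
    return "".join(shift_letter(c, shift) for c in text)


encrypted = caesar("Hello, World", 3)
print(encrypted)                 # Khoor, Zruog
print(caesar(encrypted, -3))     # Hello, World
print(caesar("Y", 3))            # B
```

Decrypting is simply encrypting with the negated shift: for 'A' shifted by -1, (0 - 1) % 26 evaluates to 25, which wraps back to 'Z'.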

Validation without regular expressions

Many checks can be written using only the core string operations, which makes the logic explicit and easy to trace.

def is_valid_email(email: str) -> bool:
    if email.count("@") != 1:        # exactly one @
        return False
    local, domain = email.split("@")
    if len(local) == 0 or len(domain) == 0:
        return False                 # something each side of the @
    if "." not in domain:
        return False                 # domain must contain a dot
    if domain.startswith(".") or domain.endswith("."):
        return False                 # dot not at the very edge
    return True

This is readable and easy to test, but notice how it grows: each new rule adds another if. When the rules become numerous, a regular expression expresses the whole pattern in a single line.

Regular expressions

A regular expression (regex) is a string that defines a search pattern. It is a small declarative language — you describe the shape of the text you want, and the regex engine finds it. Regexes drive validation, search-and-replace, tokenising and data extraction.

Core syntax

Symbol	Meaning	Example	Matches
`.`	Any single character	`c.t`	"cat", "cot", "cut"
`*`	Zero or more of the preceding	`ab*c`	"ac", "abc", "abbc"
`+`	One or more of the preceding	`ab+c`	"abc", "abbc" (not "ac")
`?`	Zero or one of the preceding	`colou?r`	"color", "colour"
`^`	Start of string (anchor)	`^Hello`	"Hello world" (not "Say Hello")
`$`	End of string (anchor)	`end$`	"the end" (not "ending")
`[abc]`	Any one character in the set	`[aeiou]`	any vowel
`[a-z]`	Any one character in the range	`[A-Z]`	any upper-case letter
`[^abc]`	Any character not in the set	`[^0-9]`	any non-digit
`\d`	Any digit (0–9)	`\d{3}`	"123"
`\w`	Any word character (letter, digit, _)	`\w+`	"test_1"
`\s`	Any whitespace character	`\s+`	a run of spaces/tabs
`{n}`	Exactly n of the preceding	`a{3}`	"aaa"
`{n,m}`	Between n and m of the preceding	`a{2,4}`	"aa", "aaa", "aaaa"
`()`	Group sub-patterns	`(ab)+`	"ab", "abab"
`|`	Alternation (OR)	`cat|dog`	"cat" or "dog"

Anchoring is the key idea for validation

To validate a whole string (rather than merely find a pattern inside it), anchor the pattern with ^ at the start and $ at the end so the entire input must match. Without anchors, \d{4} matches any string that contains four digits anywhere, including "abc1234xyz"; with anchors, ^\d{4}$ matches only a string that is exactly four digits.
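The difference between finding and validating can be seen by running the same input through both patterns. This is a minimal sketch; the two function names are hypothetical.

```python
import re


def contains_four_digits(s: str) -> bool:
    """Unanchored: true if four consecutive digits appear anywhere in s."""
    return re.search(r"\d{4}", s) is not None


def is_exactly_four_digits(s: str) -> bool:
    """Anchored: true only if the whole of s is four digits."""
    return re.search(r"^\d{4}$", s) is not None


for candidate in ["1234", "abc1234xyz", "12345", "12a4"]:
    print(candidate, contains_four_digits(candidate), is_exactly_four_digits(candidate))
# 1234 True True
# abc1234xyz True False
# 12345 True False
# 12a4 False False
```

One caveat worth knowing: in Python, $ also matches just before a single trailing newline, so "1234\n" would pass the anchored check. Using re.fullmatch(r"\d{4}", s) avoids that edge case.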

Using regular expressions in Python

import re

text = "Contact: 07123 456789 or 07999 111222"

# findall: return every match
numbers = re.findall(r"\d{5}\s\d{6}", text)
print(numbers)            # ['07123 456789', '07999 111222']

# search: find the first match anywhere
m = re.search(r"\d{5}", text)
if m:
    print(m.group())      # '07123'

# sub: search-and-replace (collapse runs of whitespace to one space)
print(re.sub(r"\s+", " ", "Hello     World"))   # 'Hello World'

# match + anchors: validate a simple UK postcode
def is_valid_postcode(postcode: str) -> bool:
    pattern = r"^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$"
    return bool(re.match(pattern, postcode, re.IGNORECASE))

print(is_valid_postcode("SW1A 1AA"))   # True
print(is_valid_postcode("12345"))      # False

Reading a pattern: the postcode example

The pattern ^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$ decomposes as:

Fragment	Meaning
`^`	start of string
`[A-Z]{1,2}`	one or two letters (the area, e.g. "SW")
`\d`	one digit (the district, e.g. "1")
`[A-Z\d]?`	an optional letter or digit (e.g. "A")
`\s?`	an optional space
`\d[A-Z]{2}`	a digit then two letters (the inward code, e.g. "1AA")
`$`	end of string

Being able to narrate a pattern fragment by fragment like this is exactly what "explain what this regular expression matches" questions reward.

A library of common patterns

Pattern	Purpose	Example match
`^[A-Za-z]+$`	letters only	"Hello"
`^\d+$`	digits only	"12345"
`^\d{2}/\d{2}/\d{4}$`	date DD/MM/YYYY	"25/12/2026"
`^[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}$`	email address	"user@example.com"
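The patterns in the table can be exercised with a small helper. This is a sketch only; the dictionary PATTERNS, its keys and the function matches are hypothetical names introduced for illustration.

```python
import re

# The four patterns from the table, keyed by a short label
PATTERNS = {
    "letters": r"^[A-Za-z]+$",
    "digits": r"^\d+$",
    "date": r"^\d{2}/\d{2}/\d{4}$",
    "email": r"^[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}$",
}


def matches(name: str, text: str) -> bool:
    """True if text matches the named pattern from PATTERNS."""
    return re.search(PATTERNS[name], text) is not None


print(matches("letters", "Hello"))           # True
print(matches("letters", "Hello1"))          # False
print(matches("date", "25/12/2026"))         # True
print(matches("date", "99/99/9999"))         # True: shape only, not a real date
print(matches("email", "user@example.com"))  # True
print(matches("email", "user@example"))      # False: no dot in the domain
```

The date result is a reminder that a regular expression checks the shape of the text, not its meaning; rejecting day 99 would need a separate range check.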

Exam Tip: You may be asked to write a simple validation regex or interpret one. Focus on the everyday symbols — ., *, +, ?, character classes [...], \d/\w/\s, and the anchors ^ and $. Always anchor a validation pattern at both ends, and remember to escape a literal dot as \. since a bare . matches any character.

Worked example: tokenising and analysing a sentence

A common practical task combines several string operations: take a sentence, split it into words, and report statistics. The following function illustrates traversal, splitting, case-handling and accumulation working together.

def analyse(sentence: str) -> dict:
    # Normalise: lower-case and strip surrounding whitespace
    cleaned = sentence.strip().lower()
    # Split on any run of whitespace to get a list of words (tokens);
    # unlike split(" "), this yields no empty strings for repeated spaces
    words = cleaned.split()
    # Build statistics
    longest = ""
    total_letters = 0
    for word in words:
        # Remove leading and trailing punctuation a word at a time
        word = word.strip(".,!?;:")
        if len(word) > len(longest):
            longest = word
        total_letters += len(word)
    return {
        "word_count": len(words),
        "longest": longest,
        "average_length": total_letters / len(words) if words else 0,
    }

print(analyse("The quick brown fox jumps!"))
# {'word_count': 5, 'longest': 'quick', 'average_length': 4.2}

Tracing the longest-word logic

For the input above, after splitting and stripping punctuation the words are the, quick, brown, fox, jumps. The variable longest updates only when a strictly longer word is found:

Word examined	`len(word)`	`len(longest)` before	`longest` after
the	3	0	"the"
quick	5	3	"quick"
brown	5	5	"quick" (no change)
fox	3	5	"quick"
jumps	5	5	"quick"

The key subtlety is the strict > comparison: because "brown" and "jumps" are the same length as the current longest, they do not replace it, so the function returns the first longest word encountered. Changing > to >= would return the last — a one-character edit with a real effect on behaviour, and exactly the kind of distinction a "what does this output?" question probes.
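The effect of that one-character change can be demonstrated with two otherwise identical functions. The names first_longest and last_longest are hypothetical and used only for this sketch.

```python
def first_longest(words: list) -> str:
    """Strict > keeps the first word of maximum length."""
    longest = ""
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest


def last_longest(words: list) -> str:
    """>= lets a later word of equal length replace the current one."""
    longest = ""
    for word in words:
        if len(word) >= len(longest):
            longest = word
    return longest


words = ["the", "quick", "brown", "fox", "jumps"]
print(first_longest(words))   # quick
print(last_longest(words))    # jumps
```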

String formatting and building output

Programs constantly assemble strings for output by inserting values into a template. Modern Python offers f-strings, which embed expressions directly inside a string literal prefixed with f.

name = "Alice"
score = 87.456
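Values like these could be embedded in an f-string as follows. This is an illustrative sketch only: the variable names line and padded and the particular format specifiers are assumptions, not taken from the lesson.

```python
name = "Alice"
score = 87.456

# {score:.1f} rounds to one decimal place when the string is built
line = f"{name} scored {score:.1f}"
print(line)       # Alice scored 87.5

# <10 left-aligns in a field 10 wide; >8.2f right-aligns with two decimals
padded = f"{name:<10}|{score:>8.2f}"
print(padded)     # Alice     |   87.46
```

The format specifier after the colon changes only how the value is displayed; score itself is still 87.456.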
