This lesson covers string handling — the manipulation, traversal and processing of text — and regular expressions, the compact pattern-matching language used to search, validate and transform text. Text is the most common data type a program handles: usernames, postcodes, file contents, network messages and user input are all strings, so fluent string manipulation is a core A-Level skill. Underpinning all of it is the idea that a character is fundamentally a number (its character code), which is what makes both encryption and validation possible.
This lesson addresses string-handling operations within the Fundamentals of programming section of the AQA A-Level Computer Science (7517) specification (subject content area 4.1.1), together with the representation of characters using ASCII and Unicode (links to data representation in 4.5.5). It covers: string length, position/indexing, slicing (substring), concatenation, case conversion and searching; converting between character and character code (ord/chr); the structure of ASCII and the extension to Unicode; and practical pattern matching with regular expressions for validation and extraction. You are expected to write code that manipulates strings character by character, use string operations correctly, and interpret or construct simple regular expressions.
A string is an ordered sequence of characters. In most modern languages — including Python, Java and C# — strings are immutable: once created they cannot be altered in place. Any operation that appears to change a string in fact constructs a new string and leaves the original untouched.
greeting = "Hello"
greeting.upper() # returns "HELLO" but does NOT change greeting
print(greeting) # still "Hello"
greeting = greeting.upper() # to keep the result, reassign it
print(greeting) # now "HELLO"
Immutability matters for two reasons. First, it explains why greeting.upper() on its own seems to "do nothing" — a very common beginner bug. Second, it means that building a long string by repeated concatenation in a loop can be inefficient, because each += creates a brand-new string; for heavy work, building a list and using join at the end is preferred.
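The list-and-join approach mentioned above can be sketched as follows. This is a minimal illustration, not a benchmark; the function names `build_with_concat` and `build_with_join` are invented for this example.

```python
def build_with_concat(n: int) -> str:
    """Build "0,1,2,..." by repeated concatenation.

    Because strings are immutable, each += conceptually produces a new string.
    """
    result = ""
    for i in range(n):
        if i > 0:
            result += ","
        result += str(i)
    return result


def build_with_join(n: int) -> str:
    """Build the same text by collecting the pieces in a list and joining once."""
    parts = []
    for i in range(n):
        parts.append(str(i))
    return ",".join(parts)


print(build_with_concat(5))  # 0,1,2,3,4
print(build_with_join(5))    # 0,1,2,3,4
```

Both functions return the same text; the difference is only in how many intermediate strings are created along the way.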
| Operation | Purpose | Python | Result |
|---|---|---|---|
| Length | Count characters | len("Hello") | 5 |
| Indexing | One character by position (0-based) | "Hello"[1] | "e" |
| Slicing | Extract a substring | "Hello"[1:4] | "ell" |
| Concatenation | Join strings | "Hi" + " there" | "Hi there" |
| Repetition | Repeat | "Ha" * 3 | "HaHaHa" |
| Membership | Is a substring present? | "ell" in "Hello" | True |
| Case | Change case | "Hello".upper() | "HELLO" |
| Strip | Remove surrounding whitespace | " Hi ".strip() | "Hi" |
| Split | Break into a list | "a,b,c".split(",") | ["a","b","c"] |
| Join | Combine a list into a string | ",".join(["a","b"]) | "a,b" |
| Replace | Substitute occurrences | "cat".replace("c","b") | "bat" |
| Find | Position of a substring | "Hello".find("ll") | 2 |
Positions are zero-based, so the first character of "COMPUTER" is at index 0 and the last (the eighth character) is at index 7. A slice s[start:stop] returns characters from start up to but not including stop — the half-open interval that catches so many students out.
word = "COMPUTER"
#       01234567   index of each character
print(word[0]) # "C" first character
print(word[7]) # "R" last character
print(word[0:4]) # "COMP" indices 0,1,2,3 — NOT 4
print(word[4:]) # "UTER" from index 4 to the end
print(word[-1]) # "R" negative index counts from the end
print(len(word)) # 8
The slice word[0:4] has length 4 - 0 = 4; this "stop minus start equals length" rule is the quickest way to predict a slice's result in an exam.
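The "stop minus start" rule can be checked directly. The short sketch below assumes both indices are non-negative and lie within the string; the helper `predicted_length` is a name made up for this illustration.

```python
word = "COMPUTER"


def predicted_length(start: int, stop: int) -> int:
    """Length predicted by the rule, valid for 0 <= start <= stop <= len(word)."""
    return stop - start


for start, stop in [(0, 4), (2, 5), (4, 8), (3, 3)]:
    piece = word[start:stop]
    # The actual slice length agrees with the prediction in every case.
    assert len(piece) == predicted_length(start, stop)
    print(f"word[{start}:{stop}] = {piece!r} (length {len(piece)})")
```

Note that `word[3:3]` is the empty string: start and stop are equal, so the predicted length is 0.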
Processing text character by character is the workhorse of string handling — counting vowels, reversing, encrypting and validating all rest on it.
# AQA-style pseudocode: count the vowels in a word
text = "COMPUTER"
count = 0
FOR i = 0 TO LENGTH(text) - 1
    IF text[i] IN ['A','E','I','O','U'] THEN
        count = count + 1
    ENDIF
NEXT i
OUTPUT count
text = "COMPUTER"

# Direct iteration over characters (preferred)
vowels = 0
for char in text:
    if char in "AEIOU":
        vowels += 1
print(vowels)  # 3

# Indexed iteration when the position is needed
for i in range(len(text)):
    print(f"Index {i}: {text[i]}")
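Reversing, mentioned above as another traversal task, uses the same indexed loop run backwards. The sketch below is one possible way to write it; the function name `reverse_text` is an assumption for illustration.

```python
def reverse_text(text: str) -> str:
    """Reverse a string by visiting indices from the last down to the first."""
    reversed_text = ""
    # range(len(text) - 1, -1, -1) yields len-1, len-2, ..., 1, 0
    for i in range(len(text) - 1, -1, -1):
        reversed_text += text[i]
    return reversed_text


print(reverse_text("COMPUTER"))  # RETUPMOC
print("COMPUTER"[::-1])          # RETUPMOC (slice with a step of -1)
```

The explicit loop mirrors what exam pseudocode would do; the slice `[::-1]` is the shorter Python idiom for the same result.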
Every character is stored as a number — its character code. The original scheme, ASCII (American Standard Code for Information Interchange), uses 7 bits to encode 128 characters: the control codes, digits, upper- and lower-case Latin letters and common punctuation. ord returns a character's code; chr converts a code back to its character.
ord("A") # 65
chr(65) # "A"
ord("a") # 97
ord("0") # 48
| Character | Code | Character | Code |
|---|---|---|---|
| ' ' (space) | 32 | 'A' | 65 |
| '!' | 33 | 'Z' | 90 |
| '0' | 48 | 'a' | 97 |
| '9' | 57 | 'z' | 122 |
Three facts about this table are worth memorising because they enable character arithmetic:
- '0'–'9' occupy a contiguous block 48–57, so ord(d) - 48 converts a digit character to its numeric value.
- ord('C') - ord('A') gives a letter's zero-based position in the alphabet (here, 2).
- ord('a') - ord('A') == 32, which is why a single bit (value 32) toggles case.

ASCII is too small for the world's writing systems and symbols, so modern systems use Unicode, which has room for over a million code points and assigns them to characters from many scripts (Latin, Greek, Cyrillic, CJK, emoji and more). The first 128 Unicode code points are identical to ASCII, so ASCII text is valid Unicode. Unicode is most commonly stored using the UTF-8 encoding, in which a code point occupies between one and four bytes. ASCII characters still take a single byte, which is why UTF-8 is backward-compatible with ASCII and efficient for English text.
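The three character-arithmetic facts and the UTF-8 byte counts can be tried out in a few lines. This is a small sketch assuming ASCII letters and digits as input; the helper names `digit_value`, `alphabet_position` and `toggle_case` are invented for this example.

```python
def digit_value(d: str) -> int:
    """Convert a digit character '0'..'9' to its numeric value."""
    return ord(d) - ord("0")  # ord("0") is 48


def alphabet_position(letter: str) -> int:
    """Zero-based position of an ASCII letter in the alphabet (A/a -> 0)."""
    return ord(letter.upper()) - ord("A")


def toggle_case(letter: str) -> str:
    """Flip the case of an ASCII letter by toggling the bit worth 32."""
    return chr(ord(letter) ^ 32)


print(digit_value("7"))                    # 7
print(alphabet_position("C"))              # 2
print(toggle_case("a"), toggle_case("Q"))  # A q

# UTF-8 uses one to four bytes per code point.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(ord(ch), len(ch.encode("utf-8")))
# 65 1
# 233 2
# 8364 3
# 128512 4
```

The last loop uses escape sequences for é, € and an emoji so the example does not depend on the encoding of the source file.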
Exam Tip: Learn the ASCII ranges: digits 48–57, upper-case 65–90, lower-case 97–122, and the gap of 32 between cases. These appear directly in character-manipulation and Caesar-cipher questions, and a single correct ord/chr calculation often secures the mark.
The Caesar cipher shifts each letter a fixed number of places along the alphabet, wrapping round from Z to A. It is a perfect demonstration of character arithmetic.
def caesar_encrypt(text: str, shift: int) -> str:
    result = ""
    for char in text:
        if char.isalpha():
            base = ord('A') if char.isupper() else ord('a')
            # 1. ord(char) - base  -> position 0..25 in the alphabet
            # 2. + shift           -> move along
            # 3. % 26              -> wrap round (Z -> A)
            # 4. + base            -> back to a character code
            shifted = (ord(char) - base + shift) % 26 + base
            result += chr(shifted)
        else:
            result += char  # leave spaces/punctuation unchanged
    return result

print(caesar_encrypt("Hello, World", 3))  # "Khoor, Zruog"
Take 'H' with shift = 3:
| Step | Expression | Value |
|---|---|---|
| Base for upper-case | ord('A') | 65 |
| Position in alphabet | ord('H') - 65 = 72 - 65 | 7 |
| Apply shift | 7 + 3 | 10 |
| Wrap with modulo | 10 % 26 | 10 |
| Back to a code | 10 + 65 | 75 |
| Resulting character | chr(75) | 'K' |
The % 26 is the crucial step: without it, shifting 'Y' (position 24) by 3 would give position 27, which is off the end of the alphabet; 27 % 26 = 1 correctly wraps to 'B'.
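The same modulo step also makes decryption possible: shifting by a negative amount wraps backwards from A to Z, because Python's % always returns a non-negative result for a positive divisor. The sketch below restates the cipher as a self-contained function named `caesar_shift` (a name chosen for this example) and adds an `isascii()` guard, which is an assumption beyond the original code, so that accented letters are left unchanged.

```python
def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` places; a negative shift decrypts."""
    result = []
    for char in text:
        if char.isalpha() and char.isascii():
            base = ord("A") if char.isupper() else ord("a")
            # In Python, (-3) % 26 == 23, so backwards shifts wrap correctly.
            result.append(chr((ord(char) - base + shift) % 26 + base))
        else:
            result.append(char)  # spaces, punctuation, non-ASCII unchanged
    return "".join(result)


encrypted = caesar_shift("Hello, World", 3)
print(encrypted)                    # Khoor, Zruog
print(caesar_shift(encrypted, -3))  # Hello, World
print(caesar_shift("XYZ", 3))       # ABC
```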
Many checks can be written using only the core string operations, which makes the logic explicit and easy to trace.
def is_valid_email(email: str) -> bool:
    if email.count("@") != 1:                 # exactly one @
        return False
    local, domain = email.split("@")
    if len(local) == 0 or len(domain) == 0:
        return False                          # something each side of the @
    if "." not in domain:
        return False                          # domain must contain a dot
    if domain.startswith(".") or domain.endswith("."):
        return False                          # dot not at the very edge
    return True
This is readable and easy to test, but notice how it grows: each new rule adds another if. When the rules become numerous, a regular expression expresses the whole pattern in a single line.
A regular expression (regex) is a string that defines a search pattern. It is a small declarative language — you describe the shape of the text you want, and the regex engine finds it. Regexes drive validation, search-and-replace, tokenising and data extraction.
| Symbol | Meaning | Example | Matches |
|---|---|---|---|
| . | Any single character | c.t | "cat", "cot", "cut" |
| * | Zero or more of the preceding | ab*c | "ac", "abc", "abbc" |
| + | One or more of the preceding | ab+c | "abc", "abbc" (not "ac") |
| ? | Zero or one of the preceding | colou?r | "color", "colour" |
| ^ | Start of string (anchor) | ^Hello | "Hello world" (not "Say Hello") |
| $ | End of string (anchor) | end$ | "the end" (not "ending") |
| [abc] | Any one character in the set | [aeiou] | any vowel |
| [a-z] | Any one character in the range | [A-Z] | any upper-case letter |
| [^abc] | Any character not in the set | [^0-9] | any non-digit |
| \d | Any digit (0–9) | \d{3} | "123" |
| \w | Any word character (letter, digit, _) | \w+ | "test_1" |
| \s | Any whitespace character | \s+ | a run of spaces/tabs |
| {n} | Exactly n of the preceding | a{3} | "aaa" |
| {n,m} | Between n and m of the preceding | a{2,4} | "aa", "aaa", "aaaa" |
| () | Group sub-patterns | (ab)+ | "ab", "abab" |
| \| | Alternation (OR) | cat\|dog | "cat" or "dog" |
To validate a whole string (rather than merely find a pattern inside it), anchor the pattern with ^ at the start and $ at the end so the entire input must match. Without anchors, \d{4} matches any string that contains four digits anywhere, including "abc1234xyz"; with anchors, ^\d{4}$ matches only a string that is exactly four digits.
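The effect of the anchors is easy to see by running both patterns over the same inputs. This is a small sketch; the function names are invented for illustration.

```python
import re


def contains_four_digits(text: str) -> bool:
    """Unanchored: True if four consecutive digits appear anywhere."""
    return re.search(r"\d{4}", text) is not None


def is_exactly_four_digits(text: str) -> bool:
    """Anchored: True only if the whole string is four digits."""
    return re.search(r"^\d{4}$", text) is not None


for sample in ["1234", "abc1234xyz", "12345", "12a4"]:
    print(f"{sample!r}: contains={contains_four_digits(sample)}, "
          f"exact={is_exactly_four_digits(sample)}")
# '1234': contains=True, exact=True
# 'abc1234xyz': contains=True, exact=False
# '12345': contains=True, exact=False
# '12a4': contains=False, exact=False
```

One Python-specific detail: `$` also matches just before a single trailing newline, so `"1234\n"` would pass the anchored check. Where that matters, `re.fullmatch(r"\d{4}", text)` is a stricter alternative.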
import re

text = "Contact: 07123 456789 or 07999 111222"

# findall: return every match
numbers = re.findall(r"\d{5}\s\d{6}", text)
print(numbers)  # ['07123 456789', '07999 111222']

# search: find the first match anywhere
m = re.search(r"\d{5}", text)
if m:
    print(m.group())  # '07123'

# sub: search-and-replace (collapse runs of whitespace to one space)
print(re.sub(r"\s+", " ", "Hello    World"))  # 'Hello World'

# match + anchors: validate a simple UK postcode
def is_valid_postcode(postcode: str) -> bool:
    pattern = r"^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$"
    return bool(re.match(pattern, postcode, re.IGNORECASE))

print(is_valid_postcode("SW1A 1AA"))  # True
print(is_valid_postcode("12345"))     # False
The pattern ^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$ decomposes as:
| Fragment | Meaning |
|---|---|
| ^ | start of string |
| [A-Z]{1,2} | one or two letters (the area, e.g. "SW") |
| \d | one digit (the district, e.g. "1") |
| [A-Z\d]? | an optional letter or digit (e.g. "A") |
| \s? | an optional space |
| \d[A-Z]{2} | a digit then two letters (the inward code, e.g. "1AA") |
| $ | end of string |
Being able to narrate a pattern fragment by fragment like this is exactly what "explain what this regular expression matches" questions reward.
| Pattern | Purpose | Example match |
|---|---|---|
| ^[A-Za-z]+$ | letters only | "Hello" |
| ^\d+$ | digits only | "12345" |
| ^\d{2}/\d{2}/\d{4}$ | date DD/MM/YYYY | "25/12/2026" |
| ^[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}$ | email address | "user@example.com" |
Exam Tip: You may be asked to write a simple validation regex or interpret one. Focus on the everyday symbols: ., *, +, ?, character classes [...], \d / \w / \s, and the anchors ^ and $. Always anchor a validation pattern at both ends, and remember to escape a literal dot as \. since a bare . matches any character.
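The validation patterns tabulated above can be exercised together. The sketch below stores them in a dictionary (the name `PATTERNS` and the helper `matching_patterns` are assumptions for this example) and reports which ones each sample satisfies.

```python
import re

# Patterns copied from the table; each is anchored at both ends.
PATTERNS = {
    "letters only": r"^[A-Za-z]+$",
    "digits only": r"^\d+$",
    "date DD/MM/YYYY": r"^\d{2}/\d{2}/\d{4}$",
    "email address": r"^[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}$",
}


def matching_patterns(text: str) -> list:
    """Return the names of every pattern that the whole of `text` matches."""
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, text)]


samples = ["Hello", "12345", "25/12/2026", "user@example.com",
           "99/99/9999", "user@examplecom"]
for sample in samples:
    print(f"{sample!r}: {matching_patterns(sample)}")
# 'Hello': ['letters only']
# '12345': ['digits only']
# '25/12/2026': ['date DD/MM/YYYY']
# 'user@example.com': ['email address']
# '99/99/9999': ['date DD/MM/YYYY']
# 'user@examplecom': []
```

The result for "99/99/9999" is worth noticing: a regex of this kind checks the shape of the text, not whether the day and month are in range, so a real date check would need further logic.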
A common practical task combines several string operations: take a sentence, split it into words, and report statistics. The following function illustrates traversal, splitting, case-handling and accumulation working together.
def analyse(sentence: str) -> dict:
    # Normalise: lower-case and strip surrounding whitespace
    cleaned = sentence.strip().lower()
    # Split on runs of whitespace to get a list of words (tokens);
    # split() with no argument returns [] for an empty sentence
    words = cleaned.split()
    # Build statistics
    longest = ""
    total_letters = 0
    for word in words:
        # Remove leading/trailing punctuation a word at a time
        word = word.strip(".,!?;:")
        if len(word) > len(longest):
            longest = word
        total_letters += len(word)
    return {
        "word_count": len(words),
        "longest": longest,
        "average_length": total_letters / len(words) if words else 0,
    }

print(analyse("The quick brown fox jumps!"))
# {'word_count': 5, 'longest': 'quick', 'average_length': 4.2}
For the input above, after splitting and stripping punctuation the words are the, quick, brown, fox, jumps. The variable longest updates only when a strictly longer word is found:
| Word examined | len(word) | len(longest) before | longest after |
|---|---|---|---|
| the | 3 | 0 | "the" |
| quick | 5 | 3 | "quick" |
| brown | 5 | 5 | "quick" (no change) |
| fox | 3 | 5 | "quick" |
| jumps | 5 | 5 | "quick" |
The key subtlety is the strict > comparison: because "brown" and "jumps" are the same length as the current longest, they do not replace it, so the function returns the first longest word encountered. Changing > to >= would return the last — a one-character edit with a real effect on behaviour, and exactly the kind of distinction a "what does this output?" question probes.
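The difference between the two comparisons can be demonstrated side by side. In this sketch the function `longest_word` and its `keep_last` flag are hypothetical names introduced only to switch between > and >=.

```python
def longest_word(words: list, keep_last: bool = False) -> str:
    """Return the longest word; ties go to the first word, or the last if keep_last."""
    longest = ""
    for word in words:
        if keep_last:
            replace = len(word) >= len(longest)  # ties replace the current longest
        else:
            replace = len(word) > len(longest)   # only strictly longer words replace
        if replace:
            longest = word
    return longest


words = ["the", "quick", "brown", "fox", "jumps"]
print(longest_word(words))                  # quick
print(longest_word(words, keep_last=True))  # jumps
```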
Programs constantly assemble strings for output by inserting values into a template. Modern Python offers f-strings, which embed expressions directly inside a string literal prefixed with f.
name = "Alice"
score = 87.456