Computers store only numbers, so to handle text every character — letter, digit, punctuation mark or emoji — must be mapped to a number. A character set defines that mapping. This lesson explains why such mappings exist, works through 7-bit ASCII (and the deliberate design that makes 'A' = 65 and '0' = 48 useful, not arbitrary), explains extended ASCII, and then develops Unicode with its UTF-8 and UTF-16 encodings to A-Level depth. Crucially, it shows how the ordering of character codes makes alphabetical sorting (collation) and character arithmetic possible — a frequently examined synoptic point.
This lesson covers the character-encoding strand of the AQA A-Level Computer Science (7517) Fundamentals of data representation area:
This material connects directly to string handling and data types in programming, and to the error-detection content elsewhere in data representation.
A character set (charset) is an agreed mapping between the characters used in written communication and the numeric codes that represent them inside a computer. Without a shared standard, a file written on one machine would be meaningless on another: machine A might store 'H' as 72, machine B as 200, and the text would scramble. The whole purpose of a character set, then, is interoperability — agreeing the numbers in advance so that any machine can write text another machine can read back faithfully. This is the same standardisation imperative seen with units of information: representation only works when both ends agree the convention. The historical drift away from that agreement — many incompatible 8-bit code pages — is precisely the problem that drove the creation of a single universal standard, and understanding why a shared mapping is essential is the foundation for appreciating what ASCII achieved and why Unicode had to replace the patchwork that followed it.
It is worth separating two ideas that students often conflate:
ASCII (American Standard Code for Information Interchange) uses 7 bits, giving $2^7 = 128$ codes (0–127). These are allocated to a deliberate, structured layout rather than at random:
| Code range (denary) | Contents |
|---|---|
| 0–31 | Control characters (non-printing: e.g. carriage return, line feed, tab, null) |
| 32 | Space |
| 33–47, 58–64, 91–96, 123–126 | Punctuation and symbols |
| 48–57 | Digits '0'–'9' (contiguous) |
| 65–90 | Uppercase 'A'–'Z' (contiguous) |
| 97–122 | Lowercase 'a'–'z' (contiguous) |
The exam-critical point is not the specific numbers but their structure:
Because the digits are contiguous from 48, to convert the character '7' to the integer 7 you compute $55 - 48 = 7$.
Convert the character '5' to its numeric value:
$\text{ord}('5') = 53,\quad 53 - 48 = 5$
Convert uppercase 'G' to lowercase:
$\text{ord}('G') = 71,\quad 71 + 32 = 103 = \text{ord}('g')$
Determine which of 'apple' and 'apricot' sorts first: compare character by character. 'a'='a' (equal), 'p'='p' (equal), then 'p' (112) vs 'r' (114) — 112 < 114, so 'apple' sorts before 'apricot'.
```python
def char_to_digit(c: str) -> int:
    return ord(c) - ord('0')  # '7' -> 55 - 48 = 7


def to_lower(c: str) -> str:
    if 'A' <= c <= 'Z':
        return chr(ord(c) + 32)  # add the fixed case offset
    return c


print(char_to_digit('7'))  # 7
print(to_lower('G'))  # g
```
Exam Tip: You are not expected to memorise the ASCII table, but you are expected to know that letters and digits are contiguous, that digits start at 48, that uppercase and lowercase differ by 32, and to use these facts to do character arithmetic and explain collation. State the relationship, not just a number.
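The character-by-character comparison used above for 'apple' and 'apricot' can be sketched in Python. This is a minimal illustration that assumes plain ASCII strings; `sorts_before` is a hypothetical helper name chosen for this example, and Python's built-in `<` on strings already compares by code point in the same way.

```python
def sorts_before(a: str, b: str) -> bool:
    """Return True if a sorts strictly before b when compared by character code."""
    for x, y in zip(a, b):
        if ord(x) != ord(y):
            # the first differing character decides the order
            return ord(x) < ord(y)
    # every shared position is equal, so the shorter string sorts first
    return len(a) < len(b)


print(sorts_before("apple", "apricot"))    # True: 'p' (112) < 'r' (114)
print(sorts_before("Zebra", "apple"))      # True: 'Z' (90) < 'a' (97)
print(sorted(["pear", "Apple", "apple"]))  # ['Apple', 'apple', 'pear']
```

Note that raw code order places every uppercase letter before every lowercase one, which is why a case-insensitive sort usually converts to a single case first.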
Although ASCII is a 7-bit code, characters are almost always stored in a whole byte (8 bits), with the spare top bit set to 0 (or historically used as a parity bit for error detection — a synoptic link to error checking).
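As a rough sketch of the parity idea mentioned above, the spare top bit can be set so that the whole byte contains an even number of 1s. The function name `add_even_parity` and the choice of even (rather than odd) parity are assumptions made for this illustration only.

```python
def add_even_parity(c: str) -> int:
    """Return the 8-bit value of a 7-bit ASCII character with an even-parity top bit."""
    code = ord(c)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    ones = bin(code).count("1")   # number of 1 bits in the 7-bit code
    parity = ones % 2             # 1 only when the count is odd
    return (parity << 7) | code   # place the parity bit in bit 7


print(format(add_even_parity("A"), "08b"))  # 01000001 (two 1s already, parity bit 0)
print(format(add_even_parity("C"), "08b"))  # 11000011 (three 1s, parity bit 1)
```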
Extended ASCII uses the 8th bit to provide a further 128 codes (128–255), doubling the total to 256. The lower 128 (0–127) remain identical to standard ASCII, preserving compatibility; the upper 128 were used for accented letters (é, ñ, ü), box-drawing characters and additional symbols.
The problem: there was no single agreement on what the upper 128 codes meant. Many different "code pages" (e.g. for Western European, Cyrillic or Greek text) reused the same 128–255 range for different characters. A document created with one code page displayed as gibberish (mojibake) under another, and 256 codes are nowhere near enough for the world's writing systems — Chinese alone has tens of thousands of characters. This fragmentation is precisely the problem Unicode was created to solve.
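The code-page clash can be reproduced in a few lines. The sketch below decodes the same single byte under three of Python's standard codecs; the byte value 0xE9 is chosen purely as an illustration.

```python
raw = bytes([0xE9])  # one byte in the extended range 128-255

print(raw.decode("latin-1"))    # é  (Western European, ISO 8859-1)
print(raw.decode("cp1251"))     # й  (Cyrillic, Windows-1251)
print(raw.decode("iso8859_7"))  # ι  (Greek, ISO 8859-7)

# The lower 128 codes agree in all three code pages
ascii_bytes = b"Hi"
print(ascii_bytes.decode("latin-1") == ascii_bytes.decode("cp1251"))  # True
```

The byte never changes; only the reader's assumption about the code page does, which is exactly how mojibake arises.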
Unicode is a single, universal character set designed to assign a unique number — a code point — to every character in every writing system, plus symbols and emoji. A code point is written in the form U+ followed by a hexadecimal number, e.g. U+0041 for 'A', U+00E9 for 'é', U+20AC for the euro sign '€', U+1F600 for a smiley.
Crucially, Unicode is backwards-compatible with ASCII: the first 128 code points (U+0000 to U+007F) are exactly the ASCII characters with the same numeric values. So 'A' is still 65 (U+0041).
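The U+ notation and the ASCII compatibility described above can be checked directly in Python, where `ord` returns a character's code point. `code_point_label` is a hypothetical helper written for this example.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(ch, ord(ch), code_point_label(ch))
# A 65 U+0041
# é 233 U+00E9
# € 8364 U+20AC
# 😀 128512 U+1F600

# An ASCII character is the same single byte in ASCII and in UTF-8
print("A".encode("ascii") == "A".encode("utf-8"))  # True
```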
Unicode currently defines well over a hundred thousand code points, far more than fit in one or even two bytes. So Unicode separates the code point (the abstract number for a character) from the encoding scheme (how that number is represented as actual bytes). The two main A-Level schemes are UTF-8 and UTF-16.
UTF-8 is a variable-length encoding using 1 to 4 bytes per character. Its design is elegant:
The benefit is compactness for ASCII-dominated text (English, source code, much of the web) while still being able to represent every Unicode character. The cost is that characters no longer have a fixed width, so indexing the nth character is not a simple byte offset. UTF-8 is by far the dominant encoding on the web.
UTF-16 uses 2 bytes (16 bits) for the most common characters (the "Basic Multilingual Plane", roughly U+0000 to U+FFFF) and 4 bytes for the rarer ones (via "surrogate pairs"). It is therefore also variable-length, but with a 2-byte minimum.
The trade-off versus UTF-8:
| Aspect | UTF-8 | UTF-16 |
|---|---|---|
| Minimum bytes/char | 1 | 2 |
| ASCII text size | Compact (1 byte/char) | Larger (2 bytes/char) |
| Asian-script text | Often 3 bytes/char | Often 2 bytes/char |
| ASCII-compatible | Yes (byte-for-byte) | No |
| Typical use | Web, files, Linux | Windows internals, Java/.NET strings |
For predominantly English or markup text UTF-8 is smaller; for text dominated by characters that sit in the 2-byte BMP range (such as many East-Asian scripts) UTF-16 can be more compact. Neither is universally "better" — it depends on the content.
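A quick way to see that the verdict depends on the content is to encode two sample strings both ways. This is a small sketch; the sample strings and the helper name `sizes` are chosen for illustration, and `utf-16-le` is used so that no byte-order mark is counted.

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


english = "hello"
chinese = "\u4f60\u597d\u4e16\u754c"  # four Chinese characters, all in the BMP

print(sizes(english))  # (5, 10)  -> UTF-8 is smaller
print(sizes(chinese))  # (12, 8)  -> UTF-16 is smaller
```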
A simple comparison frequently asked in exams: a fixed 7-bit code can distinguish only $2^7 = 128$ characters; even 8-bit extended ASCII reaches only 256. Representing every script on Earth requires far more: Unicode's code space allows over a million code points ($2^{21}$ is more than enough), which is why a variable-length byte encoding is necessary rather than a single fixed-width byte.
```mermaid
graph TD
    A["Text character"] --> B["Character set: assigns a code point e.g. A is U+0041"]
    B --> C["Encoding scheme: lays the code point out as bytes"]
    C --> D["UTF-8: 1 to 4 bytes, ASCII-compatible"]
    C --> E["UTF-16: 2 or 4 bytes"]
```
UTF-8's cleverness is that the leading bits of each byte announce how many bytes the character uses, so a decoder never gets lost even mid-stream. At A-Level you are not expected to encode by hand, but understanding the structure deepens the explanation marks:
| Code point range | Bytes | Byte pattern (x = code-point bits) |
|---|---|---|
| U+0000–U+007F | 1 | 0xxxxxxx |
| U+0080–U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800–U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000–U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
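The byte patterns in the table can be turned into a short encoder. The sketch below is for understanding only (hand-encoding is not required at A-Level); `utf8_encode` is a hypothetical function name, and in practice Python's built-in `str.encode("utf-8")` does this job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by following the table's bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


print(utf8_encode(0x41).hex())    # 41
print(utf8_encode(0xE9).hex())    # c3a9
print(utf8_encode(0x20AC).hex())  # e282ac
```

Each branch simply slices the code point into 6-bit groups for the continuation bytes and puts whatever bits remain into the leading byte.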
Two design consequences worth stating:
Every continuation byte begins with the bits 10, and no leading byte does, so the decoder can resynchronise after a corrupted byte rather than mangling the whole rest of the file. This self-synchronising property is a genuine engineering advantage over a naïve fixed-width multi-byte scheme.
How many bytes does the string "café" occupy in (a) UTF-8 and (b) UTF-16? The characters are c (U+0063), a (U+0061), f (U+0066) and é (U+00E9).
(a) UTF-8: c, a, f are all in the ASCII range so take 1 byte each (3 bytes); é is U+00E9, which falls in the 2-byte range U+0080–U+07FF, so takes 2 bytes. Total $= 3 + 2 = 5$ bytes for 4 characters.
(b) UTF-16: all four characters lie in the Basic Multilingual Plane, so each takes 2 bytes. Total $= 4 \times 2 = 8$ bytes for 4 characters.
Notice two crucial teaching points: (i) the character count (4) is not the byte count in either scheme, which is fatal if you assume one byte per character; and (ii) for this mostly-ASCII string UTF-8 (5 bytes) is more compact than UTF-16 (8 bytes), illustrating exactly why the web favours UTF-8. Had the string been four Chinese characters instead, UTF-8 would use $4 \times 3 = 12$ bytes while UTF-16 would use only $4 \times 2 = 8$, and the verdict would flip.
```python
s = "café"
print(len(s))                      # 4 -> number of characters (code points)
print(len(s.encode("utf-8")))      # 5 -> bytes in UTF-8
print(len(s.encode("utf-16-le")))  # 8 -> bytes in UTF-16 (no BOM)
```
Exam Tip: When a question asks for the size of a string, decide first whether it wants characters or bytes, and in which encoding. State the per-character byte cost (ASCII = 1 byte in UTF-8; BMP = 2 bytes in UTF-16) and multiply — never assume one byte equals one character outside pure ASCII.
It is reasonable to ask: why not give every character a fixed 4 bytes (the so-called UTF-32 scheme), so that the nth character always sits at byte offset 4n and indexing is trivial? The answer is storage cost. For predominantly English or source-code text — the overwhelming majority of stored text — UTF-32 would quadruple the file size versus UTF-8's one byte per ASCII character, for no benefit to the content. Variable-width encodings trade a little decoding complexity for a large, consistent saving on real-world text. This is the same range-versus-cost reasoning that runs through the whole data-representation topic: you choose the representation that best fits the expected data, not the one that is simplest in the abstract.
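The storage cost and the indexing convenience of a fixed 4-byte scheme can both be seen in a short sketch. `nth_char_utf32` is a hypothetical helper for this example, and `utf-32-le` is used so that no byte-order mark is added.

```python
text = "hello"
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")

print(len(utf8))   # 5  -> one byte per ASCII character
print(len(utf32))  # 20 -> four bytes per character, four times larger


def nth_char_utf32(data: bytes, n: int) -> str:
    """Fetch the nth character (0-based) directly from byte offset 4n."""
    return data[4 * n : 4 * n + 4].decode("utf-32-le")


print(nth_char_utf32(utf32, 1))  # e
```

The direct slice works only because every character has the same width; with UTF-8 the bytes would have to be scanned from the start to find the nth character.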
This also clarifies a subtle point students often miss: an encoding's "minimum bytes per character" is not its "always bytes per character". UTF-8's minimum is 1, but a single emoji can occupy 4 bytes; UTF-16's minimum is 2, but the same emoji also occupies 4 (via a surrogate pair). The encoding adapts to the code point, which is precisely what lets one scheme cover every character of every script without wasting space on the common case.
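The emoji case can be checked with a small sketch that splits a code point into UTF-16 code units. `utf16_units` is a hypothetical helper for this illustration; the arithmetic follows the standard surrogate-pair rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```python
def utf16_units(cp: int) -> list[int]:
    """Return the UTF-16 code units (16-bit values) for one code point."""
    if cp <= 0xFFFF:
        return [cp]                    # BMP: a single 2-byte unit
    offset = cp - 0x10000              # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]


emoji = "\U0001F600"  # the smiley at U+1F600
print(len(emoji.encode("utf-8")))                 # 4
print(len(emoji.encode("utf-16-le")))             # 4
print([hex(u) for u in utf16_units(ord(emoji))])  # ['0xd83d', '0xde00']
```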
The single most useful consequence of a well-ordered character set is that comparing and sorting text reduces to comparing numbers. Because ASCII (and the ASCII-compatible start of Unicode) lays letters out contiguously and in order: