Character Encoding: ASCII and Unicode

Computers store only numbers, so to handle text every character — letter, digit, punctuation mark or emoji — must be mapped to a number. A character set defines that mapping. This lesson explains why such mappings exist, works through 7-bit ASCII (and the deliberate design that makes 'A' = 65 and '0' = 48 useful, not arbitrary), explains extended ASCII, and then develops Unicode with its UTF-8 and UTF-16 encodings to A-Level depth. Crucially, it shows how the ordering of character codes makes alphabetical sorting (collation) and character arithmetic possible — a frequently examined synoptic point.

Spec Mapping

This lesson covers the character-encoding strand of the AQA A-Level Computer Science (7517) Fundamentals of data representation area:

Character sets — the concept of mapping characters to numeric codes, and the distinction between a character set and a character-encoding scheme.
ASCII — the 7-bit code, the structure of its ranges, and why the contiguous layout of letters and digits matters.
Extended ASCII — using the 8th bit for a further 128 codes, and the resulting incompatibilities.
Unicode — the universal character set, code points, and the UTF-8 and UTF-16 encoding schemes.

This material connects directly to string handling and data types in programming, and to the error-detection content elsewhere in data representation.

Character Sets

A character set (charset) is an agreed mapping between the characters used in written communication and the numeric codes that represent them inside a computer. Without a shared standard, a file written on one machine would be meaningless on another: machine A might store 'H' as 72, machine B as 200, and the text would scramble. The whole purpose of a character set, then, is interoperability — agreeing the numbers in advance so that any machine can write text another machine can read back faithfully. This is the same standardisation imperative seen with units of information: representation only works when both ends agree the convention. The historical drift away from that agreement — many incompatible 8-bit code pages — is precisely the problem that drove the creation of a single universal standard, and understanding why a shared mapping is essential is the foundation for appreciating what ASCII achieved and why Unicode had to replace the patchwork that followed it.

It is worth separating two ideas that students often conflate:

A character set assigns each character an abstract number (its code point in Unicode terms) — e.g. 'A' ↔ 65.
A character-encoding scheme decides how that number is actually laid out as bytes, e.g. UTF-8 versus UTF-16. For ASCII the two coincide (each 7-bit code is stored directly, normally in a single byte), but for Unicode they are genuinely different layers.
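The two layers can be seen side by side in a minimal Python sketch. It assumes Python 3's built-in `ord` and `str.encode`; the variable names are illustrative only. The code point is a single abstract number, while the bytes depend on which encoding scheme is chosen.

```python
ch = "é"

code_point = ord(ch)                   # character-set layer: the abstract number
utf8_bytes = ch.encode("utf-8")        # encoding layer: one possible byte layout
utf16_bytes = ch.encode("utf-16-be")   # encoding layer: a different byte layout

print(code_point)          # 233
print(utf8_bytes.hex())    # c3a9
print(utf16_bytes.hex())   # 00e9
```

The same character has one code point but two different byte sequences, which is the distinction the two definitions above draw.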

ASCII (7-bit)

ASCII (American Standard Code for Information Interchange) uses 7 bits, giving $2^7 = 128$ codes (0–127). These are allocated to a deliberate, structured layout rather than at random:

Code range (denary)	Contents
0–31, 127	Control characters (non-printing: e.g. null, tab, line feed, carriage return; 127 is delete)
32	Space
33–47, 58–64, 91–96, 123–126	Punctuation and symbols
48–57	Digits '0'–'9' (contiguous)
65–90	Uppercase 'A'–'Z' (contiguous)
97–122	Lowercase 'a'–'z' (contiguous)
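As a rough illustration of the layout in the table, the sketch below names the range a given code falls in. The function `ascii_category` is a hypothetical helper written for this lesson, not a standard library call, and it treats code 127 (delete) as a control character.

```python
def ascii_category(code: int) -> str:
    """Name the ASCII range a code falls in (hypothetical helper)."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 31 or code == 127:
        return "control"
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    return "punctuation"   # everything left over: 33-47, 58-64, 91-96, 123-126

for ch in ["\n", " ", "7", "Q", "q", "!"]:
    print(repr(ch), ord(ch), ascii_category(ord(ch)))
```

Because letters and digits each occupy one contiguous block, every test is a simple range comparison.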

Why 'A' = 65 and '0' = 48 matter

The exam-critical point is not the specific numbers but their structure:

The letters are contiguous and in alphabetical order. Because 'A' = 65, 'B' = 66, …, 'Z' = 90 are consecutive, the numeric order of the codes is exactly the alphabetical order of the letters. This is what makes string sorting (collation) reduce to comparing numbers.
The digits are contiguous starting at 48. Because '0' = 48, '1' = 49, …, '9' = 57, the numeric value of a digit character is simply $\text{code} - 48$ . So to convert the character '7' to the integer 7 you compute $55 - 48 = 7$ .
Case differs by a fixed offset of 32. 'a' (97) − 'A' (65) = 32 for every letter. Converting case is therefore adding or subtracting 32 — equivalently, toggling bit 5 with a mask (linking to the previous lesson).
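The fixed offset of 32 can also be expressed as a bit mask. The following is a minimal sketch, assuming single ASCII characters as input; `CASE_BIT` and `toggle_case` are names invented for this illustration.

```python
CASE_BIT = 0b0010_0000   # bit 5 (counting from bit 0), value 32

def toggle_case(c: str) -> str:
    """Flip the case of one ASCII letter by XOR-ing bit 5; leave other characters alone."""
    if ("A" <= c <= "Z") or ("a" <= c <= "z"):
        return chr(ord(c) ^ CASE_BIT)
    return c

print(format(ord("G"), "08b"))   # 01000111
print(format(ord("g"), "08b"))   # 01100111
print(toggle_case("G"), toggle_case("g"), toggle_case("5"))   # g G 5
```

The two bit patterns differ only in bit 5, which is why a single XOR converts in either direction.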

Worked character arithmetic

Convert the character '5' to its numeric value:

$\text{ord}(\texttt{'5'}) = 53, \qquad 53 - 48 = \mathbf{5}$

Convert uppercase 'G' to lowercase:

$\text{ord}(\texttt{'G'}) = 71, \qquad 71 + 32 = 103 = \text{ord}(\texttt{'g'})$

Determine which of 'apple' and 'apricot' sorts first: compare character by character. 'a'='a' (equal), 'p'='p' (equal), then 'p' (112) vs 'r' (114) — 112 < 114, so 'apple' sorts before 'apricot'.

```python
def char_to_digit(c: str) -> int:
    return ord(c) - ord('0')      # '7' -> 55 - 48 = 7

def to_lower(c: str) -> str:
    if 'A' <= c <= 'Z':
        return chr(ord(c) + 32)   # add the fixed case offset
    return c

print(char_to_digit('7'))   # 7
print(to_lower('G'))        # g
```
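The third worked example, deciding whether 'apple' sorts before 'apricot', can be sketched the same way. `sorts_before` is a hypothetical helper that spells out the character-by-character comparison; Python's own `<` operator on strings performs an equivalent code-point comparison.

```python
def sorts_before(a: str, b: str) -> bool:
    """Return True if a comes strictly before b when comparing character codes."""
    for ca, cb in zip(a, b):
        if ord(ca) != ord(cb):
            return ord(ca) < ord(cb)   # the first differing position decides
    return len(a) < len(b)             # one is a prefix of the other: shorter first

print(sorts_before("apple", "apricot"))   # True  ('p' = 112 < 'r' = 114)
print(sorts_before("app", "apple"))       # True  (a prefix sorts first)
print(sorts_before("apple", "apple"))     # False
```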

Exam Tip: You are not expected to memorise the ASCII table, but you are expected to know that letters and digits are contiguous, that digits start at 48, that uppercase and lowercase differ by 32, and to use these facts to do character arithmetic and explain collation. State the relationship, not just a number.

A note on storage

Although ASCII is a 7-bit code, characters are almost always stored in a whole byte (8 bits), with the spare top bit set to 0 (or historically used as a parity bit for error detection — a synoptic link to error checking).
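To illustrate the historical use of the spare bit, here is a minimal sketch that places a parity bit in bit 7. It assumes even parity (the total number of 1s in the byte is made even); real systems varied, and `with_even_parity` is a name made up for this example.

```python
def with_even_parity(c: str) -> int:
    """Return the 8-bit value: 7-bit ASCII code plus an even-parity top bit."""
    code = ord(c)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    ones = bin(code).count("1")
    parity = ones % 2              # 1 only when the 7-bit code has an odd number of 1s
    return (parity << 7) | code

print(format(with_even_parity("A"), "08b"))   # 01000001  (two 1s already, parity bit 0)
print(format(with_even_parity("C"), "08b"))   # 11000011  (three 1s, parity bit 1)
```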

Extended ASCII (8-bit)

Extended ASCII uses the 8th bit to provide a further 128 codes (128–255), doubling the total to 256. The lower 128 (0–127) remain identical to standard ASCII, preserving compatibility; the upper 128 were used for accented letters (é, ñ, ü), box-drawing characters and additional symbols.

The problem: there was no single agreement on what the upper 128 codes meant. Many different "code pages" (e.g. for Western European, Cyrillic or Greek text) reused the same 128–255 range for different characters. A document created with one code page displayed as gibberish (mojibake) under another, and 256 codes are nowhere near enough for the world's writing systems — Chinese alone has tens of thousands of characters. This fragmentation is precisely the problem Unicode was created to solve.
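The code-page problem is easy to reproduce. The sketch below decodes one and the same byte under three 8-bit code pages, using codec names that ship with Python (`latin-1`, `cp1251`, `iso8859_7`); the choice of byte 0xE9 is just an example.

```python
raw = bytes([0xE9])   # a single byte from the upper 128-255 range

western = raw.decode("latin-1")     # Western European code page
cyrillic = raw.decode("cp1251")     # Cyrillic code page
greek = raw.decode("iso8859_7")     # Greek code page

for label, ch in [("latin-1", western), ("cp1251", cyrillic), ("iso8859_7", greek)]:
    print(label, ch, f"U+{ord(ch):04X}")
```

One stored byte yields three unrelated characters (é, й and ι), which is how text ends up as mojibake when the reader guesses the wrong code page.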

Unicode

Unicode is a single, universal character set designed to assign a unique number — a code point — to every character in every writing system, plus symbols and emoji. A code point is written in the form U+ followed by a hexadecimal number, e.g. U+0041 for 'A', U+00E9 for 'é', U+20AC for the euro sign '€', U+1F600 for a smiley.

Crucially, Unicode is backwards-compatible with ASCII: the first 128 code points (U+0000 to U+007F) are exactly the ASCII characters with the same numeric values. So 'A' is still 65 (U+0041).
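A short sketch of the U+ notation and the ASCII overlap, assuming Python 3 strings (which are sequences of code points). `code_point_label` is a hypothetical helper for formatting only.

```python
def code_point_label(ch: str) -> str:
    """Format one character's code point in U+XXXX notation (hypothetical helper)."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "é", "€", "😀"]:
    print(code_point_label(ch))   # U+0041, U+00E9, U+20AC, U+1F600 in turn

# The first 128 code points match ASCII: the ASCII byte for 'A' and its
# Unicode code point are the same number.
print(ord("A") == "A".encode("ascii")[0] == 65)   # True
```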

Unicode currently defines well over a hundred thousand code points, far more than fit in one or even two bytes. So Unicode separates the code point (the abstract number for a character) from the encoding scheme (how that number is represented as actual bytes). The two main A-Level schemes are UTF-8 and UTF-16.

UTF-8

UTF-8 is a variable-length encoding using 1 to 4 bytes per character. Its design is elegant:

Code points 0–127 (the ASCII range) are stored in a single byte, identical to ASCII. This means any plain ASCII file is already valid UTF-8 — a huge practical advantage.
Code points beyond 127 use 2, 3 or 4 bytes, with leading bits in each byte signalling how many bytes the character occupies.

The benefit is compactness for ASCII-dominated text (English, source code, much of the web) while still being able to represent every Unicode character. The cost is that characters no longer have a fixed width, so indexing the nth character is not a simple byte offset. UTF-8 is by far the dominant encoding on the web.
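The variable width can be observed directly. This is a small sketch using Python's built-in `utf-8` codec; the sample characters are chosen only to hit each of the four lengths.

```python
samples = ["A", "é", "€", "😀"]   # U+0041, U+00E9, U+20AC, U+1F600

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex())   # lengths 1, 2, 3, 4

# Plain ASCII text produces identical bytes under both codecs.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))   # True

# Character count and byte count diverge once a non-ASCII character appears.
text = "naïve"
print(len(text), len(text.encode("utf-8")))   # 5 6
```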

UTF-16

UTF-16 uses 2 bytes (16 bits) for the most common characters (the "Basic Multilingual Plane", U+0000 to U+FFFF) and 4 bytes for the rarer ones beyond it, U+10000 to U+10FFFF (via "surrogate pairs"). It is therefore also variable-length, but with a 2-byte minimum.
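For readers curious how a surrogate pair is formed, the sketch below follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. `utf16_units` is a hypothetical helper, it goes beyond what A-Level requires, and it does no validation of its input.

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit code unit(s) UTF-16 uses for one character (sketch)."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return [cp]                      # BMP: a single 2-byte unit
    offset = cp - 0x10000                # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]                   # surrogate pair: two 2-byte units

print([hex(u) for u in utf16_units("A")])    # ['0x41']
print([hex(u) for u in utf16_units("😀")])   # ['0xd83d', '0xde00']
print(len("😀".encode("utf-16-le")))         # 4
```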

The trade-off versus UTF-8:

Aspect	UTF-8	UTF-16
Minimum bytes/char	1	2
ASCII text size	Compact (1 byte/char)	Larger (2 bytes/char)
Asian-script text	Often 3 bytes/char	Often 2 bytes/char
ASCII-compatible	Yes (byte-for-byte)	No
Typical use	Web, files, Linux	Windows internals, Java/.NET strings

For predominantly English or markup text UTF-8 is smaller; for text dominated by characters that sit in the 2-byte BMP range (such as many East-Asian scripts) UTF-16 can be more compact. Neither is universally "better" — it depends on the content.
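The content-dependence of the trade-off can be checked with a quick measurement. This is a sketch only; `sizes` is a hypothetical helper, and the `utf-16-le` codec is used so that no byte-order mark is counted.

```python
def sizes(text: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 bytes) for text, with no byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

print(sizes("<p>Hello</p>"))   # (12, 24)  markup and English: UTF-8 is smaller
print(sizes("你好世界"))        # (12, 8)   four BMP Chinese characters: UTF-16 is smaller
```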

Why Unicode needs more bits than ASCII

A simple comparison frequently asked in exams: a fixed 7-bit code can distinguish only $2^7 = 128$ characters; even 8-bit extended ASCII reaches only $2^8 = 256$. Representing every script on Earth requires far more. Unicode's code space runs from U+0000 to U+10FFFF, which is 1,114,112 code points, and any of them fits in 21 bits because $2^{21} = 2{,}097{,}152$. A single byte per character therefore cannot suffice, which is why Unicode text needs multi-byte encodings such as UTF-8 and UTF-16.
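The arithmetic behind this comparison can be verified in a few lines. `MAX_CODE_POINT` is simply a name chosen here for the top of Unicode's code space, U+10FFFF.

```python
MAX_CODE_POINT = 0x10FFFF   # highest Unicode code point

print(2 ** 7)                        # 128      codes in 7-bit ASCII
print(2 ** 8)                        # 256      codes in 8-bit extended ASCII
print(MAX_CODE_POINT + 1)            # 1114112  code points in Unicode's code space
print(MAX_CODE_POINT.bit_length())   # 21       bits needed for the largest one
```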

```mermaid
graph TD
    A["Text character"] --> B["Character set: assigns a code point e.g. A is U+0041"]
    B --> C["Encoding scheme: lays the code point out as bytes"]
    C --> D["UTF-8: 1 to 4 bytes, ASCII-compatible"]
    C --> E["UTF-16: 2 or 4 bytes"]
```

How UTF-8 signals its byte count

UTF-8's cleverness is that the leading bits of a character's first byte announce how many bytes the character uses, while every following byte carries a distinct continuation marker, so a decoder never gets lost even mid-stream. At A-Level you are not expected to encode by hand, but understanding the structure strengthens an explanation answer:

Code point range	Bytes	Byte pattern (x = code-point bits)
`U+0000`–`U+007F`	1	`0xxxxxxx`
`U+0080`–`U+07FF`	2	`110xxxxx 10xxxxxx`
`U+0800`–`U+FFFF`	3	`1110xxxx 10xxxxxx 10xxxxxx`
`U+10000`–`U+10FFFF`	4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`
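The byte patterns in the table can be turned into a small hand-written encoder. This is a sketch for understanding only: `utf8_encode` is a hypothetical function, it does not reject the surrogate range as a strict encoder would, and in practice Python's built-in `str.encode("utf-8")` should be used.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the four UTF-8 byte patterns (sketch)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside Unicode's code space")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x20AC).hex())   # e282ac  (the euro sign, 3 bytes)

# Cross-check the sketch against the built-in codec.
for ch in "Aé€😀":
    print(utf8_encode(ord(ch)) == ch.encode("utf-8"))   # True each time
```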

Two design consequences worth stating:

A single-byte character always has a leading 0 — exactly the ASCII range — so ASCII bytes are valid UTF-8 unchanged.
Every continuation byte begins 10, and no leading byte does, so the decoder can resynchronise after a corrupted byte rather than mangling the whole rest of the file. This self-synchronising property is a genuine engineering advantage over a naïve fixed-width multi-byte scheme.
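The self-synchronising property can be sketched in a few lines. The helpers `is_continuation` and `next_char_start` are hypothetical names; they rely only on the fact stated above that continuation bytes, and no others, begin with the bits 10.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000

def next_char_start(data: bytes, i: int) -> int:
    """Skip forward from index i to the next byte that starts a character."""
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i

data = "a€b".encode("utf-8")     # bytes: 61 | e2 82 ac | 62
print(next_char_start(data, 2))   # 4  (index 2 is mid-character; the next start is 'b')
print(sum(not is_continuation(b) for b in data))   # 3  characters, counted without decoding
```

A decoder that lands in the middle of a character loses at most that one character before it finds the next valid start.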

Worked example: counting bytes for a short string

How many bytes does the string "café" occupy in (a) UTF-8 and (b) UTF-16? The characters are c (U+0063), a (U+0061), f (U+0066) and é (U+00E9).

UTF-8: c, a, f are all in the ASCII range so take 1 byte each (3 bytes); é is U+00E9, which falls in the 2-byte range U+0080–U+07FF, so takes 2 bytes. Total $= 3 + 2 = \mathbf{5}$ bytes for 4 characters.
UTF-16: every one of these characters sits in the Basic Multilingual Plane, so each takes 2 bytes. Total $= 4 \times 2 = \mathbf{8}$ bytes.

Notice two crucial teaching points: (i) the character count (4) is not the byte count in either scheme — fatal if you assume one byte per character; and (ii) for this mostly-ASCII string UTF-8 (5 bytes) is more compact than UTF-16 (8 bytes), illustrating exactly why the web favours UTF-8. Had the string been four Chinese characters instead, UTF-8 would use $4 \times 3 = 12$ bytes while UTF-16 used only $4 \times 2 = 8$ — and the verdict would flip.

```python
s = "café"
print(len(s))                       # 4  -> number of characters (code points)
print(len(s.encode("utf-8")))       # 5  -> bytes in UTF-8
print(len(s.encode("utf-16-le")))   # 8  -> bytes in UTF-16 (no BOM)
```

Exam Tip: When a question asks for the size of a string, decide first whether it wants characters or bytes, and in which encoding. State the per-character byte cost (ASCII = 1 byte in UTF-8; BMP = 2 bytes in UTF-16) and multiply — never assume one byte equals one character outside pure ASCII.

A note on fixed-width versus variable-width

It is reasonable to ask: why not give every character a fixed 4 bytes (the so-called UTF-32 scheme), so that the nth character always sits at byte offset $4n$ and indexing is trivial? The answer is storage cost. For predominantly English or source-code text — the overwhelming majority of stored text — UTF-32 would quadruple the file size versus UTF-8's one byte per ASCII character, for no benefit to the content. Variable-width encodings trade a little decoding complexity for a large, consistent saving on real-world text. This is the same range-versus-cost reasoning that runs through the whole data-representation topic: you choose the representation that best fits the expected data, not the one that is simplest in the abstract.
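A brief sketch of this trade-off, using Python's `utf-32-le` codec (little-endian, so no byte-order mark is added). The sample text and the index `n` are arbitrary.

```python
text = "hello"

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")    # fixed 4 bytes per character

print(len(utf8), len(utf32))        # 5 20  -> four times the size for ASCII text

n = 1
nth = utf32[4 * n : 4 * n + 4].decode("utf-32-le")   # fixed width: slice at offset 4n
print(nth)                          # e
```

Indexing is a simple slice, but every ASCII character costs three extra bytes.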

This also clarifies a subtle point students often miss: an encoding's minimum bytes per character is not the number of bytes every character takes. UTF-8's minimum is 1, but a single emoji can occupy 4 bytes; UTF-16's minimum is 2, but the same emoji also occupies 4 (via a surrogate pair). The encoding adapts to the code point, which is precisely what lets one scheme cover every character of every script without wasting space on the common case.

Character Codes, Collation and Arithmetic

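The single most useful consequence of a well-ordered character set is that comparing and sorting text reduces to comparing numbers. Because ASCII (and the ASCII-compatible start of Unicode) lays letters out contiguously and in order, the numeric order of the codes matches alphabetical order within one case, and the contiguous digits make character arithmetic a matter of subtraction. Note that every uppercase letter (65–90) has a lower code than every lowercase letter (97–122), so a plain code comparison places 'Zebra' before 'apple'.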
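As a hedged illustration of collation by character code, the sketch below sorts a short list with Python's default string ordering, which compares code points. The word list is invented for the example, and sorting on lowercase copies is just one common workaround for mixed case, not the only approach.

```python
words = ["banana", "Cherry", "apple", "Apple"]

# Default sort compares code points, so every uppercase letter (65-90)
# comes before every lowercase letter (97-122).
print(sorted(words))                  # ['Apple', 'Cherry', 'apple', 'banana']

# Assumption for illustration: compare lowercase copies to ignore case.
print(sorted(words, key=str.lower))   # ['apple', 'Apple', 'banana', 'Cherry']

# Character arithmetic: position of a letter in the alphabet.
print(ord("D") - ord("A") + 1)        # 4
```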
