Computers ultimately store everything as numbers, so before any text can be saved, displayed or transmitted it must be turned into numbers by an agreed scheme. This lesson covers character sets and code points, the historical progression from ASCII through extended ASCII to Unicode, the encoding formats UTF-8 / UTF-16 / UTF-32, and the fundamental relationship between the number of bits and the number of representable characters.
This lesson addresses the H446 1.4.1 Data Types content on representing characters:
(This is a paraphrase of the specification content, not a verbatim quotation.)
A character set is an agreed table that maps each character (a letter, digit, punctuation mark, control code or symbol) to a unique number. That number is called the character's code point, and it is what the computer ultimately stores, in binary, in place of the character. The link between abstract character and stored bits is therefore a two-stage idea: the character set maps each character to a code point, and an encoding then determines how that code point is stored as bits.
For simple schemes such as ASCII these two stages collapse into one (the code point is the stored byte), but for Unicode they are genuinely separate, which is the single most important idea in this topic.
The reason a shared standard matters is interoperability. If one computer saved 'A' as 65 and another expected 'A' to be 200, text copied between them would be gibberish. A common character set means any system can store text that any other system can read back correctly — exactly the kind of agreed convention that lets the internet move text between billions of devices.
The number of characters a fixed-width scheme can represent is governed by the same power-of-two relationship met throughout this unit:
$$\text{distinct characters} = 2^n \quad \text{for } n \text{ bits per character}$$
so 7 bits give $2^7 = 128$ characters, 8 bits give $2^8 = 256$, and 16 bits give $2^{16} = 65{,}536$. To support more characters you fundamentally need more bits, and that single fact drives the whole history below.
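The power-of-two relationship can be checked with a few lines of Python. This is a minimal sketch; the function names `distinct_characters` and `bits_needed` are illustrative choices, not part of any standard.

```python
def distinct_characters(bits_per_character: int) -> int:
    """Number of distinct codes a fixed-width scheme of this width can hold."""
    return 2 ** bits_per_character


def bits_needed(character_count: int) -> int:
    """Smallest fixed width (in bits) that gives every character its own code."""
    # Codes run from 0 to character_count - 1, so size the largest code.
    return max(1, (character_count - 1).bit_length())


for n in (7, 8, 16):
    print(f"{n} bits -> {distinct_characters(n)} characters")

print(bits_needed(128))       # 7: exactly fills a 7-bit scheme
print(bits_needed(129))       # 8: one character too many for 7 bits
print(bits_needed(0x110000))  # 21: enough for code points 0 to 0x10FFFF
```

Running the relationship in reverse, as `bits_needed` does, shows why adding even one character beyond a power of two forces an extra bit.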
ASCII, standardised in the 1960s, was the dominant early scheme for English-language text.
| Feature | Detail |
|---|---|
| Bits per character | 7 |
| Number of characters | $2^7 = 128$ |
| Contents | Uppercase A–Z, lowercase a–z, digits 0–9, punctuation, and 33 control characters |
| Character | Denary | Binary (7-bit) |
|---|---|---|
| Space | 32 | 0100000 |
| '0' | 48 | 0110000 |
| '9' | 57 | 0111001 |
| 'A' | 65 | 1000001 |
| 'Z' | 90 | 1011010 |
| 'a' | 97 | 1100001 |
| 'z' | 122 | 1111010 |
Suppose a file contains the bytes (shown in denary) 72, 105, 33. Reading each through the ASCII table: 72 = 'H', 105 = 'i', 33 = '!'. So the text is Hi!. Notice that decoding is unambiguous precisely because sender and receiver share the same character set — without that shared agreement, 72 could mean anything. This is the whole point of standardisation made concrete in three bytes.
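The same three-byte decoding can be reproduced in Python, which is a convenient way to check answers to this style of question. The variable names are illustrative.

```python
# The three bytes from the worked example, given in denary.
raw = bytes([72, 105, 33])

# Decoding applies the shared ASCII table to each byte in turn.
text = raw.decode("ascii")
print(text)  # Hi!

# Show each byte as denary, 7-bit binary, and the character it maps to.
for value in raw:
    print(value, format(value, "07b"), chr(value))
```

Asking Python to decode with `"ascii"` makes the shared agreement explicit: the bytes alone do not say which table to use.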
The ordering of ASCII is deliberate, and it pays dividends. Because 'A'…'Z' and 'a'…'z' are each contiguous blocks in code-point order, alphabetical sorting of same-case text reduces to numeric sorting of code points: a computer can order a list of words simply by comparing their byte values, with no special knowledge of the alphabet. (Mixed-case text needs a little more care, because every uppercase letter, 65 to 90, sorts before every lowercase one, 97 to 122.) Designing the character set so that useful operations (case conversion, digit-to-number, sorting) become simple arithmetic on the codes is an early example of choosing a representation to make the common operations cheap.
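The three operations named above can each be sketched as plain arithmetic on code points. The helper names `to_upper` and `digit_value` are hypothetical, written for illustration; real programs would normally use the language's built-in string methods.

```python
def to_upper(ch: str) -> str:
    """Convert one ASCII lowercase letter to uppercase by subtracting 32."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 32)  # 'a' is 97 and 'A' is 65, a gap of 32
    return ch


def digit_value(ch: str) -> int:
    """Turn an ASCII digit character into the number it represents."""
    return ord(ch) - ord("0")  # '0' is 48, so '7' (55) gives 7


# Sorting by the list of code points puts lowercase words in alphabetical order.
words = ["pear", "apple", "fig"]
by_code_point = sorted(words, key=lambda w: [ord(c) for c in w])

print(to_upper("h"))      # H
print(digit_value("7"))   # 7
print(by_code_point)      # ['apple', 'fig', 'pear']
```

The gap of 32 between cases is a single bit in binary, which is why case conversion was so cheap on early hardware.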
ASCII is fundamentally too small. With only 128 code points it can cover unaccented English and basic punctuation, but it has no room for accented Latin letters, no Greek, Cyrillic, Arabic or Hebrew, none of the tens of thousands of Chinese, Japanese or Korean characters, and no emoji or technical symbols. For a global, multilingual computing world, 128 characters is nowhere near enough.
The obvious first fix was to use the eighth bit that most hardware already handled, doubling the code space.
| Feature | Detail |
|---|---|
| Bits per character | 8 |
| Number of characters | $2^8 = 256$ |
| Extra characters (128–255) | Accented letters, currency symbols, line-drawing characters, and more |
The lower half (codes 0–127) is kept identical to ASCII, so all existing ASCII text remained valid. The upper half (128–255) holds the new characters. The fatal flaw, however, is that there is no single agreed meaning for codes 128–255: different regions defined different "code pages" — ISO 8859-1 (Western European), ISO 8859-5 (Cyrillic), Windows-1252, and dozens more. The same byte, say 0xE9, might mean 'é' on one system and a completely different character on another.
This ambiguity has a real cost. A file is just bytes; it does not inherently know which code page it was written with. So when a document created with one code page is opened assuming a different one, every byte above 127 is silently re-interpreted as the wrong character — producing the familiar scrambled output known as mojibake (for example, a French word's accented letters turning into apparently random symbols). There is no general way to recover the original text without knowing the intended code page, because the information needed to disambiguate was never stored. Extended ASCII therefore traded one problem (too few characters) for another (incompatible, ambiguous standards) — and even at its best it still could not handle large scripts: a single code page of 256 codes cannot begin to hold the tens of thousands of Chinese characters, so East Asian computing needed entirely separate multi-byte schemes. The result by the early 1990s was a chaos of dozens of mutually incompatible encodings, which is precisely the mess Unicode was created to end.
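The code-page problem is easy to reproduce. The sketch below saves a word using ISO 8859-1 and then reads the same bytes back assuming ISO 8859-5; the word "café" and the variable names are chosen purely for illustration.

```python
original = "café"

# Save using the Western European code page: 'é' becomes the single byte 0xE9.
stored = original.encode("latin-1")
print(stored)  # b'caf\xe9'

# Read the same bytes back assuming the Cyrillic code page instead.
misread = stored.decode("iso8859_5")
print(misread)  # the byte 0xE9 now appears as a Cyrillic letter

# Only decoding with the code page that was actually used recovers the text.
recovered = stored.decode("latin-1")
print(recovered == original)  # True
```

Nothing in `stored` records which code page produced it, so both decodings succeed without error, and that is exactly why the corruption is silent.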
Unicode is the universal standard designed to give every character in every writing system a single, unambiguous code point.
| Feature | Detail |
|---|---|
| Characters defined | Well over 100,000 and growing with each revision |
| Scripts covered | Latin, Greek, Cyrillic, Arabic, Hebrew, the CJK (Chinese/Japanese/Korean) characters, Devanagari, emoji, mathematical and musical symbols, and many more |
| Code point range | U+0000 to U+10FFFF — over 1.1 million possible code points |
Code points are written in hexadecimal with a U+ prefix, e.g. U+0041 for 'A' (65) and U+1F600 for a grinning-face emoji. Crucially, the first 128 Unicode code points are identical to ASCII (U+0041 is still 65), which preserves backward compatibility with decades of existing text.
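Python's built-in `ord` and `chr` convert between a character and its code point, which makes the U+ notation easy to explore. The helper `u_notation` is an illustrative name, not a standard function.

```python
def u_notation(ch: str) -> str:
    """Format a character's code point in U+ notation (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


print(u_notation("A"))           # U+0041
print(ord("A"))                  # 65
print(u_notation("\U0001F600"))  # U+1F600 (the grinning-face emoji)
print(chr(0x41))                 # A

# The first 128 code points agree with ASCII: decoding each ASCII byte
# gives a character whose Unicode code point is the same number.
matches_ascii = all(ord(bytes([i]).decode("ascii")) == i for i in range(128))
print(matches_ascii)  # True
```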
A subtle but examinable point: Unicode itself is a character set (a mapping of characters to code points), not a storage format. It says "this character has this number"; it does not say how that number is stored as bytes. That second job belongs to the encoding formats.
Why separate the two layers at all? The split is what gives Unicode its flexibility. By fixing which number each character has just once, every script in the world gets a single unambiguous identity — no more clashing code pages. By leaving how those numbers are stored to interchangeable encodings, the same text can be saved compactly for the web (UTF-8), handled conveniently inside a particular platform (UTF-16 in Windows or Java), or processed with simple fixed-width logic (UTF-32), all without ever changing the underlying character identities. The character set is the stable agreement; the encoding is an implementation choice you make per situation. Recognising that one character set can be realised by several encodings — and that converting between encodings does not change the text, only its byte layout — is exactly the conceptual step examiners are probing when they ask you to "explain the difference between Unicode and UTF-8".
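The separation of character set from encoding can be seen directly by encoding one string three ways. In this sketch the sample text is arbitrary, and the little-endian codec variants (`utf-16-le`, `utf-32-le`) are used only so that Python does not prepend a byte-order mark to the output.

```python
# Three characters: 'A' (U+0041), 'é' (U+00E9) and an emoji (U+1F600).
text = "A\u00e9\U0001F600"

sizes = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(encoding)
    sizes[encoding] = len(data)
    # Decoding with the matching encoding always returns the same text.
    assert data.decode(encoding) == text
    print(encoding, len(data), data.hex(" "))

# The code points are identical whichever encoding was used for storage.
print([f"U+{ord(c):04X}" for c in text])  # ['U+0041', 'U+00E9', 'U+1F600']
```

The byte counts differ (7, 8 and 12 here) while the list of code points never changes, which is the distinction between Unicode and its encodings in miniature.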
Because Unicode code points range up to U+10FFFF, the largest need 21 bits, far more than one byte can hold, but storing every character at full width would waste enormous space on simple text. The encoding formats are different answers to "how do we store these code points as bytes?".
| Feature | Detail |
|---|---|
| Length | 1 to 4 bytes per character, chosen by code point |
| ASCII compatibility | Code points 0–127 use a single byte, identical to ASCII |
| Typical sizes | Western European: 1–2 bytes; most Asian scripts: 3 bytes; rarest characters and emoji: 4 bytes |
| Adoption | The dominant encoding of the web by a wide margin |
UTF-8's design is clever: the number of leading 1-bits in the first byte announces how many bytes the character uses, and continuation bytes all begin 10. The byte templates are:
| Code point range | Bytes | Bit pattern |
|---|---|---|
| U+0000–U+007F | 1 | 0xxxxxxx |
| U+0080–U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800–U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000–U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
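The templates in the table can be turned into a small hand-written encoder. This is a teaching sketch, not a replacement for Python's built-in `str.encode`: the function name `utf8_encode` is hypothetical, and the sketch does not reject the reserved surrogate range U+D800 to U+DFFF as a full implementation would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x positions of the matching template."""
    if code_point < 0:
        raise ValueError("code point must not be negative")
    if code_point <= 0x7F:          # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:         # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0x10FFFF:      # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11110000 | (code_point >> 18),
            0b10000000 | ((code_point >> 12) & 0b111111),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    raise ValueError("code point is beyond U+10FFFF")


# Compare the sketch with Python's own encoder, one example per table row.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "), chr(cp).encode("utf-8").hex(" "))
```

Each continuation byte carries six bits of the code point, which is why the ranges in the table end at 7, 11, 16 and 21 bits of payload.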
The x positions are filled with the bits of the code point. A single-byte character that starts 0 is therefore exactly an ASCII byte, which is why ASCII files are already valid UTF-8. This design makes UTF-8 self-synchronising — if you jump into the middle of a UTF-8 stream you can find the next character boundary by scanning for a byte that does not start 10. Its three headline advantages are:
| Feature | Detail |
|---|---|
| Length | 2 bytes for most common characters, 4 bytes for the rarest (via "surrogate pairs") |
| ASCII compatibility | No — even 'A' takes 2 bytes |
| Used by | Windows internal APIs, Java and JavaScript strings |
UTF-16 is a reasonable compromise for text drawn mostly from the range U+0800 to U+FFFF (such as East Asian text), where it needs 2 bytes per character against UTF-8's 3, but it wastes space on Latin text and is not ASCII-compatible. The "surrogate pair" mechanism is worth a sentence: code points beyond U+FFFF (such as most emoji) cannot fit in a single 16-bit unit, so UTF-16 represents them with two special 16-bit units reserved for this purpose. This is why UTF-16 is genuinely variable-width despite most characters taking two bytes, and it is a common source of bugs where a program assumes "two bytes = one character" and mishandles emoji.
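For readers who want to see how a surrogate pair is formed, the sketch below applies the standard UTF-16 arithmetic, which goes a little beyond what this lesson requires. The function name `to_surrogate_pair` is illustrative, and the result is checked against Python's own `utf-16-be` encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its two 16-bit UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the first unit
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the second unit
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Python's encoder produces the same four bytes for the emoji.
print("\U0001F600".encode("utf-16-be").hex(" "))  # d8 3d de 00

# 'A' needs one 16-bit unit; the emoji needs two.
print(len("A".encode("utf-16-be")), len("\U0001F600".encode("utf-16-be")))  # 2 4
```

A program that counts 16-bit units rather than characters would report this single emoji as having length 2, which is the bug described above.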
| Feature | Detail |
|---|---|
| Length | Exactly 4 bytes per character, always |
| Mapping | Trivial — the stored value is the code point |
| Cost | Wasteful: even 'A' occupies 4 bytes |
| Used by | Internal processing where fixed-width indexing matters |
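The "trivial mapping" row can be demonstrated by reading UTF-32 data four bytes at a time. In this sketch the helper names `utf32_units` and `nth_character` are hypothetical, and the big-endian codec `utf-32-be` is an arbitrary choice made so that no byte-order mark is added.

```python
def utf32_units(text: str) -> list[int]:
    """Return the 32-bit values stored for each character of the text."""
    data = text.encode("utf-32-be")
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


def nth_character(data: bytes, n: int) -> str:
    """Fetch character n directly: fixed width means it starts at byte 4 * n."""
    return chr(int.from_bytes(data[4 * n:4 * n + 4], "big"))


print(utf32_units("Hi!"))  # [72, 105, 33], exactly the code points

encoded = "A\U0001F600Z".encode("utf-32-be")
print(len(encoded))               # 12: four bytes each, even for 'A'
print(nth_character(encoded, 2))  # Z, found without scanning earlier characters
```

The direct jump in `nth_character` is the fixed-width indexing benefit from the table; the cost is the twelve bytes used for three characters.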