Character Encoding

This lesson covers how computers represent text using character encoding schemes. You need to understand ASCII, Extended ASCII, and Unicode (including UTF-8, UTF-16, and UTF-32) for the OCR H446 specification.

What is Character Encoding?

A character encoding is a system that maps characters (letters, digits, symbols) to numerical codes that a computer can store and process. Each character is assigned a unique binary number.

Without an agreed encoding scheme, two computers exchanging text data would not be able to interpret the characters correctly.
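As a small illustration of this mapping, Python's built-in `ord` and `chr` functions convert between a character and its numeric code. The sample string below is an arbitrary choice for illustration.

```python
# Map each character to its numeric code and back again.
text = "Hi!"
codes = [ord(ch) for ch in text]            # characters -> numbers
restored = "".join(chr(n) for n in codes)   # numbers -> characters

print(codes)     # [72, 105, 33]
print(restored)  # Hi!
```

Both sides of an exchange must apply the same mapping, otherwise the numbers in `codes` would be turned back into different characters.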

ASCII (American Standard Code for Information Interchange)

ASCII was developed in the 1960s as a standard for text communication.

Feature	Detail
Bits per character	7 bits
Number of characters	2^7 = 128
Characters included	Uppercase A-Z, lowercase a-z, digits 0-9, punctuation, control characters (e.g., newline, tab)

Key ASCII Values

Character	Denary Code	Binary
Space	32	0100000
0	48	0110000
9	57	0111001
A	65	1000001
Z	90	1011010
a	97	1100001
z	122	1111010
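The rows of this table can be reproduced with a short sketch. The helper name `ascii_row` is hypothetical, chosen only for this example; it uses Python's `format` with the `"07b"` specifier to pad the binary pattern to 7 bits.

```python
def ascii_row(ch):
    """Return (character, denary code, 7-bit binary string) for one character."""
    code = ord(ch)
    return ch, code, format(code, "07b")

for ch in [" ", "0", "9", "A", "Z", "a", "z"]:
    print(ascii_row(ch))
# First line printed: (' ', 32, '0100000')
```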

Key Properties

Uppercase letters are 65-90; lowercase are 97-122.
The difference between upper and lower case of the same letter is always 32 (e.g., A=65, a=97).
Digits 0-9 are coded 48-57.
The first 32 codes (0-31) are non-printable control characters, as is code 127 (delete).
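These properties make simple character arithmetic possible. The sketch below assumes plain ASCII input; the names `CASE_OFFSET`, `to_upper`, and `digit_value` are hypothetical and exist only for this example.

```python
CASE_OFFSET = ord("a") - ord("A")  # 97 - 65 = 32

def to_upper(ch):
    """Convert a lowercase ASCII letter to uppercase; leave other characters unchanged."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - CASE_OFFSET)
    return ch

def digit_value(ch):
    """Return the numeric value of a digit character by subtracting the code for '0' (48)."""
    return ord(ch) - ord("0")

print(CASE_OFFSET)       # 32
print(to_upper("g"))     # G
print(digit_value("7"))  # 7
```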

Limitations of ASCII

Only 128 characters: sufficient for English but not for other languages.
No accented characters (e.g., é, e with an acute accent, or ü, u with an umlaut).
No characters for languages like Chinese, Japanese, Korean, Arabic, Hindi, etc.
No emoji, and no symbols beyond basic punctuation and a few signs such as $ and @.
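One way to see these limits in practice is to try encoding text as ASCII in Python, which raises `UnicodeEncodeError` for any character outside the 128 defined codes. The helper name `fits_in_ascii` is hypothetical, introduced only for this sketch.

```python
def fits_in_ascii(text):
    """Return True if every character in text has an ASCII code (0-127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

print(fits_in_ascii("cafe"))  # True
print(fits_in_ascii("café"))  # False: é has no ASCII code
```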

Extended ASCII

Extended ASCII uses 8 bits per character, doubling the available characters.

Feature	Detail
Bits per character	8 bits
Number of characters	2^8 = 256
Extra characters	Accented letters, additional symbols, box-drawing characters

Key Points

The first 128 characters (0-127) are identical to standard ASCII.
Characters 128-255 vary between different extended ASCII standards (e.g., ISO 8859-1, Windows-1252).
Problem: Multiple incompatible extended ASCII standards exist, causing garbled text when files are shared between systems using different standards.
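The incompatibility can be sketched by decoding the same bytes under the two standards named above. The byte values 0x80 and 0xE9 are chosen for illustration: the first differs between the two standards, the second happens to agree.

```python
raw = bytes([0x80, 0xE9])

as_latin1 = raw.decode("latin-1")  # ISO 8859-1
as_cp1252 = raw.decode("cp1252")   # Windows-1252

# Byte 0x80 is an invisible control code in ISO 8859-1 but the euro sign in Windows-1252.
print(repr(as_latin1))  # '\x80é'
print(repr(as_cp1252))  # '€é'
```

A file written under one standard and read under the other shows the wrong character wherever the two tables disagree.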

Unicode

Unicode is a universal character encoding standard designed to represent every character from every writing system in the world.

Feature	Detail
Characters defined	Over 149,000 (as of 2023)
Scripts covered	Latin, Cyrillic, Arabic, Chinese, Japanese, Korean, Devanagari, emoji, and many more
Code points	U+0000 to U+10FFFF (over 1.1 million possible)

Why Unicode Matters

Globalisation: The internet connects users worldwide who use different scripts.
Consistency: One standard replaces hundreds of incompatible encodings.
Emoji and symbols: Unicode includes emoji, mathematical symbols, musical notation, and more.
Backward compatibility: The first 128 Unicode code points are identical to ASCII.
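The code point notation and the backward-compatibility point can both be checked with a short sketch. The helper name `code_point` is hypothetical, used only for this example.

```python
def code_point(ch):
    """Format a character's Unicode code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

print(code_point("A"))  # U+0041 (65 in denary, the same as ASCII)
print(code_point("€"))  # U+20AC

# Every ASCII code 0-127 decodes to the character with the same Unicode code point.
ascii_matches_unicode = all(
    bytes([n]).decode("ascii") == chr(n) for n in range(128)
)
print(ascii_matches_unicode)  # True
```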

Unicode Encoding Formats

Unicode defines code points (abstract numbers for characters), but the actual encoding format determines how these code points are stored as bytes.
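To make the distinction concrete, the sketch below stores one code point, U+20AC (the euro sign), in three encoding formats. The big-endian codec names are used here so that Python does not prepend a byte order mark; that choice is an assumption made to keep the output easy to compare.

```python
ch = "€"  # code point U+20AC

utf8 = ch.encode("utf-8")       # variable length: 3 bytes for this character
utf16 = ch.encode("utf-16-be")  # 2 bytes for this character
utf32 = ch.encode("utf-32-be")  # always 4 bytes per character

print(utf8.hex())   # e282ac
print(utf16.hex())  # 20ac
print(utf32.hex())  # 000020ac
```

The code point is the same in every case; only the bytes used to store it change.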
