Data Compression and Encryption

The previous lessons showed that raw media is large: a single Full-HD photograph is around 6 MiB uncompressed, and a three-minute stereo track around 30 MiB. Storing and transmitting data efficiently demands compression — making files smaller. Quite separately, sending data securely demands encryption — making files unreadable to anyone without the key. The two are often confused but solve entirely different problems: compression attacks size, encryption attacks secrecy. This lesson develops both rigorously, with fully worked examples of run-length encoding and dictionary-based compression, the lossy vs lossless distinction and when each is acceptable, and the Caesar and Vernam ciphers — culminating in the profound difference between the Vernam cipher's perfect secrecy and the computational security of every practical cipher.

Spec Mapping

This lesson covers the compression and encryption strands of the AQA A-Level Computer Science (7517) Fundamentals of data representation area:

Compression — its purpose, and the distinction between lossy and lossless compression and when each is appropriate.
Run-length encoding (RLE) — a worked lossless technique exploiting runs of repeated values.
Dictionary-based compression — a worked lossless technique replacing repeated sequences with codes from a dictionary.
Encryption — its purpose and the idea of a key.
The Caesar cipher — a worked shift cipher and why it is weak.
The Vernam cipher (one-time pad) — its method, the requirements on its key, and why it offers perfect security.
The distinction between perfect security (Vernam) and computational security (every practical cipher).

These ideas connect directly to image/sound representation (what is being compressed) and to network security.

Compression: Why and Two Kinds

Compression is the process of encoding data so that it occupies fewer bits than its original representation. The benefits are concrete: less storage space, faster transmission, and lower bandwidth/cost. There are two fundamentally different categories.

Lossless compression reduces size with no loss of information — the original data is reconstructed exactly when decompressed. It works by removing statistical redundancy (repetition and predictability) rather than discarding content.
Lossy compression reduces size by permanently discarding information judged less important (often using models of human perception). The decompressed data is an approximation of the original — the discarded detail is gone forever — but the files are far smaller.

The decisive question is whether the lost data matters:

Use lossless where every bit is essential and must survive intact: text and source code, executable programs, spreadsheets, databases, archive (ZIP) files, and images where exactness matters (PNG, medical scans). A single flipped character in a program or a legal document is unacceptable.
Use lossy where some imperceptible or tolerable loss is an acceptable price for much smaller files: photographs (JPEG), streamed music (MP3/AAC) and video (H.264). The eye and ear cannot perceive the discarded detail, so the saving is "free" in practice.

Exam Tip: "Lossless" does not mean "no compression" — it means no loss of information while still shrinking the file. And lossy compression is not "low quality"; it is a deliberate, often imperceptible trade of fidelity for size. Justify the choice by asking whether the discarded data would matter: a contract → lossless; a holiday photo → lossy is fine.

Run-Length Encoding (RLE)

Run-length encoding is a simple lossless technique that exploits runs — consecutive repetitions of the same value. Each run is replaced by a (count, value) pair, so a long run collapses to two items. It is highly effective on data with large uniform regions (simple graphics, icons, scanned line-art) and ineffective — even counter-productive — on data with little repetition.

Worked example 1 — a string

Take the 12-character string AAAAAABBBCCD. Scanning left to right, the runs are: six A's, three B's, two C's, one D. RLE encodes each as a count followed by the symbol:

Run	Encoded
AAAAAA	6A
BBB	3B
CC	2C
D	1D

Encoded output: 6A 3B 2C 1D. The original 12 symbols become 8 items (4 counts + 4 symbols), a saving of one third — and the original is recoverable exactly, so it is lossless.

Worked example 2 — a row of bitmap pixels

RLE shines on bitmap images with blocks of identical colour. Consider one row of 20 pixels in a two-colour (black B / white W) image:

WWWWWWWWBBBBWWWWWWWW

The runs are eight W, four B, eight W, encoded as the (count, colour) pairs (8, W) (4, B) (8, W) — three pairs in place of twenty pixel values. The compression ratio here is dramatic because the row is so uniform.

When RLE fails

RLE only helps when runs exist. Apply it to ABABAB (no repeats): every "run" is length 1, giving 1A 1B 1A 1B 1A 1B — twice the original size. This is the key limitation: on data with little consecutive repetition RLE can expand rather than compress, which is why real formats only apply it where it pays.

def rle_encode(data: str) -> list[tuple[int, str]]:
    if not data:
        return []
    runs = []
    count = 1
    for i in range(1, len(data)):
        if data[i] == data[i - 1]:
            count += 1                 # extend the current run
        else:
            runs.append((count, data[i - 1]))  # close the run
            count = 1
    runs.append((count, data[-1]))     # close the final run
    return runs

print(rle_encode("AAAAAABBBCCD"))
# [(6, 'A'), (3, 'B'), (2, 'C'), (1, 'D')]

Dictionary-Based Compression

Dictionary-based compression is a lossless technique that exploits repeated sequences (not just consecutive ones). It builds a dictionary mapping each recurring sequence to a short code, replaces occurrences in the data with their codes, and stores (or rebuilds) the dictionary so the decoder can reverse the substitution. This is the family behind ZIP, GIF and PNG (the LZW/LZ77 algorithms).

Worked example — compressing text

Take the phrase:

the cat sat on the mat the cat ran

The word the recurs three times and cat twice. Build a dictionary assigning each repeated token a code:

Code	Sequence
1	`the`
2	`cat`
3	`sat`
4	`on`
5	`mat`
6	`ran`

Replacing every word by its code, the message becomes the token stream:

1 2 3 4 1 5 1 2 6

Instead of repeatedly spelling out the (3 characters) and cat each time, we store each once in the dictionary and reference it by a compact code thereafter. The decoder reads the codes and looks each up in the dictionary to reconstruct the exact original text — hence lossless. The more (and the longer) the repeated sequences, the greater the saving; on data with little repetition the dictionary overhead can mean little or no gain.

A practical refinement used by LZW is that the dictionary is built dynamically as the data is scanned and can be reconstructed by the decoder from the compressed stream itself, so it need not be transmitted separately — a subtlety worth knowing for the top band.

RLE vs dictionary compression

	Run-length encoding	Dictionary-based
Exploits	consecutive repeats (runs)	repeated sequences anywhere
Replaces	a run with (count, value)	a sequence with a dictionary code
Best for	large uniform regions (simple graphics)	text, data with recurring patterns
Lossless?	yes	yes
Fails when	few consecutive repeats (can expand)	little repetition (dictionary overhead)

Measuring compression

The effectiveness of any compression is quantified by the compression ratio, the original size divided by the compressed size:

$\text{compression ratio} = \frac{\text{uncompressed size}}{\text{compressed size}}$

A ratio of 4:1 means the compressed file is a quarter of the original. For the RLE pixel-row example earlier (WWWWWWWWBBBBWWWWWWWW), twenty pixel values became three (count, colour) pairs; if each pixel value and each count occupies one unit, that is roughly $20 \div 6 \approx 3.3{:}1$ . The figure depends entirely on the data: a uniform image compresses enormously under RLE, while noisy or already-compressed data may achieve a ratio near 1:1 (or worse). This is why you cannot meaningfully "compress a file twice" to keep shrinking it — once the redundancy a method targets has been removed, there is nothing left for it to exploit, and a second pass typically achieves nothing or slightly expands the file owing to its own overhead. The compression ratio is therefore a property of the data-and-method together, not of the method alone, and it is the headline figure used to compare formats (lossy JPEG might reach 10:1 on a photograph where lossless PNG manages only 2:1, precisely because the lossy method is permitted to discard data).

Encryption and the Key

Encryption transforms readable plaintext into unreadable ciphertext so that only an authorised party — one holding the secret key — can reverse the process (decryption) and recover the plaintext. An eavesdropper who intercepts the ciphertext without the key learns nothing useful. Critically, encryption is about secrecy, not size — encrypted data is typically the same size as (or slightly larger than) the original, the exact opposite goal of compression.

The key is the secret parameter that controls the transformation. A core security principle (Kerckhoffs's principle) is that the algorithm should be assumed public; all the security must rest in the secrecy of the key. A good cipher is one that stays secure even when the attacker knows exactly how it works and has only to find the key.

A practical point that ties this lesson's two halves together: when data must be both compressed and encrypted, it should be compressed first, then encrypted. Compression works by removing the redundancy and patterns in data — but good encryption deliberately produces output that looks like random noise, with no patterns left to exploit, so trying to compress after encrypting achieves almost nothing. Doing it in the right order (compress → encrypt) gives both a smaller and a secure result, and is exactly what archive tools and secure protocols do.

The Caesar Cipher

The Caesar cipher is a substitution cipher that shifts every letter a fixed number of places along the alphabet. The key is the shift amount. With a shift of 3, A→D, B→E, …, X→A, Y→B, Z→C (wrapping around the end).

Encryption and decryption are modular arithmetic on letter positions (A = 0 … Z = 25):

$\text{cipher} = (\text{plain} + k) \bmod 26, \qquad \text{plain} = (\text{cipher} - k) \bmod 26$

Worked example — encrypting with a shift of 3

Encrypt HELLO with key $k = 3$ . Using A = 0:

Letter	H	E	L	L	O
Position	7	4	11	11	14
+3 (mod 26)	10	7	14	14	17
Cipher letter	K	H	O	O	R

So HELLO → KHOOR. Decryption subtracts 3: $10 - 3 = 7 \to$ H, and so on, recovering HELLO. The wrap-around matters: encrypting Z (25) with shift 3 gives $(25 + 3) \bmod 26 = 2 \to$ C.

def caesar_encrypt(text: str, key: int) -> str:
    result = ""
    for ch in text.upper():
        if ch.isalpha():
            pos = ord(ch) - ord('A')            # 0..25
            shifted = (pos + key) % 26           # wrap with mod 26
            result += chr(shifted + ord('A'))
        else:
            result += ch                          # leave non-letters
    return result

print(caesar_encrypt("HELLO", 3))   # KHOOR

Why the Caesar cipher is weak

The Caesar cipher is trivially broken:

Tiny key space. There are only 25 useful shifts (a shift of 0 or 26 changes nothing), so an attacker can simply try all of them — a brute-force attack — and read off the shift that yields sensible English.
It preserves letter frequencies. Because each letter always maps to the same cipher letter, the statistical fingerprint of the language survives. Frequency analysis — the most common ciphertext letter is very likely 'E', the next 'T', and so on — cracks it even without brute force, and works on any simple substitution cipher, not just shifts.

The Caesar cipher is therefore of historical and educational value only; it offers no real security.

The Vernam Cipher (One-Time Pad)

The Vernam cipher, or one-time pad, is the polar opposite of the Caesar cipher: under strict conditions it is mathematically unbreakable. It combines the plaintext with a key that is a truly random sequence at least as long as the message, using the XOR operation bit by bit (equivalently, modular addition of letters). The same key XORed back recovers the plaintext, because XOR is its own inverse:

$\text{cipher} = \text{plain} \oplus \text{key}, \qquad \text{plain} = \text{cipher} \oplus \text{key}$

The strict requirements on the key

The Vernam cipher is perfectly secure only if every one of these holds (break any one and it becomes breakable):

The key is truly random (from a genuine physical random source, not a pseudo-random algorithm).
The key is at least as long as the plaintext (never repeated within the message).
The key is used only once — never reused for another message (hence "one-time pad").
The key is kept completely secret and shared securely between sender and receiver.

Worked example — XOR on bits

Encrypt the 8-bit plaintext 01001000 (ASCII 'H') with a random 8-bit key 10110101, XORing bit by bit (XOR rule: output 1 when the inputs differ):

Data Compression and Encryption

Data Compression and Encryption

Spec Mapping

Compression: Why and Two Kinds

Run-Length Encoding (RLE)

Worked example 1 — a string

Worked example 2 — a row of bitmap pixels

When RLE fails

Dictionary-Based Compression

Worked example — compressing text

RLE vs dictionary compression

Measuring compression

Encryption and the Key

The Caesar Cipher

Worked example — encrypting with a shift of 3

Why the Caesar cipher is weak

The Vernam Cipher (One-Time Pad)

The strict requirements on the key

Worked example — XOR on bits

More in Computer Science