You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
The previous lessons showed that raw media is large: a single Full-HD photograph is around 6 MiB uncompressed, and a three-minute stereo track around 30 MiB. Storing and transmitting data efficiently demands compression — making files smaller. Quite separately, sending data securely demands encryption — making files unreadable to anyone without the key. The two are often confused but solve entirely different problems: compression attacks size, encryption attacks secrecy. This lesson develops both rigorously, with fully worked examples of run-length encoding and dictionary-based compression, the lossy vs lossless distinction and when each is acceptable, and the Caesar and Vernam ciphers — culminating in the profound difference between the Vernam cipher's perfect secrecy and the computational security of every practical cipher.
This lesson covers the compression and encryption strands of the AQA A-Level Computer Science (7517) Fundamentals of data representation area:
These ideas connect directly to image/sound representation (what is being compressed) and to network security.
Compression is the process of encoding data so that it occupies fewer bits than its original representation. The benefits are concrete: less storage space, faster transmission, and lower bandwidth/cost. There are two fundamentally different categories.
The decisive question is whether the lost data matters:
Exam Tip: "Lossless" does not mean "no compression" — it means no loss of information while still shrinking the file. And lossy compression is not "low quality"; it is a deliberate, often imperceptible trade of fidelity for size. Justify the choice by asking whether the discarded data would matter: a contract → lossless; a holiday photo → lossy is fine.
Run-length encoding is a simple lossless technique that exploits runs — consecutive repetitions of the same value. Each run is replaced by a (count, value) pair, so a long run collapses to two items. It is highly effective on data with large uniform regions (simple graphics, icons, scanned line-art) and ineffective — even counter-productive — on data with little repetition.
Take the 12-character string AAAAAABBBCCD. Scanning left to right, the runs are: six A's, three B's, two C's, one D. RLE encodes each as a count followed by the symbol:
| Run | Encoded |
|---|---|
| AAAAAA | 6A |
| BBB | 3B |
| CC | 2C |
| D | 1D |
Encoded output: 6A 3B 2C 1D. The original 12 symbols become 8 items (4 counts + 4 symbols), a saving of one third — and the original is recoverable exactly, so it is lossless.
RLE shines on bitmap images with blocks of identical colour. Consider one row of 20 pixels in a two-colour (black B / white W) image:
WWWWWWWWBBBBWWWWWWWW
The runs are eight W, four B, eight W, encoded as the (count, colour) pairs (8, W) (4, B) (8, W) — three pairs in place of twenty pixel values. The compression ratio here is dramatic because the row is so uniform.
RLE only helps when runs exist. Apply it to ABABAB (no repeats): every "run" is length 1, giving 1A 1B 1A 1B 1A 1B — twice the original size. This is the key limitation: on data with little consecutive repetition RLE can expand rather than compress, which is why real formats only apply it where it pays.
def rle_encode(data: str) -> list[tuple[int, str]]:
if not data:
return []
runs = []
count = 1
for i in range(1, len(data)):
if data[i] == data[i - 1]:
count += 1 # extend the current run
else:
runs.append((count, data[i - 1])) # close the run
count = 1
runs.append((count, data[-1])) # close the final run
return runs
print(rle_encode("AAAAAABBBCCD"))
# [(6, 'A'), (3, 'B'), (2, 'C'), (1, 'D')]
Dictionary-based compression is a lossless technique that exploits repeated sequences (not just consecutive ones). It builds a dictionary mapping each recurring sequence to a short code, replaces occurrences in the data with their codes, and stores (or rebuilds) the dictionary so the decoder can reverse the substitution. This is the family behind ZIP, GIF and PNG (the LZW/LZ77 algorithms).
Take the phrase:
the cat sat on the mat the cat ran
The word the recurs three times and cat twice. Build a dictionary assigning each repeated token a code:
| Code | Sequence |
|---|---|
| 1 | the |
| 2 | cat |
| 3 | sat |
| 4 | on |
| 5 | mat |
| 6 | ran |
Replacing every word by its code, the message becomes the token stream:
1 2 3 4 1 5 1 2 6
Instead of repeatedly spelling out the (3 characters) and cat each time, we store each once in the dictionary and reference it by a compact code thereafter. The decoder reads the codes and looks each up in the dictionary to reconstruct the exact original text — hence lossless. The more (and the longer) the repeated sequences, the greater the saving; on data with little repetition the dictionary overhead can mean little or no gain.
A practical refinement used by LZW is that the dictionary is built dynamically as the data is scanned and can be reconstructed by the decoder from the compressed stream itself, so it need not be transmitted separately — a subtlety worth knowing for the top band.
| Run-length encoding | Dictionary-based | |
|---|---|---|
| Exploits | consecutive repeats (runs) | repeated sequences anywhere |
| Replaces | a run with (count, value) | a sequence with a dictionary code |
| Best for | large uniform regions (simple graphics) | text, data with recurring patterns |
| Lossless? | yes | yes |
| Fails when | few consecutive repeats (can expand) | little repetition (dictionary overhead) |
The effectiveness of any compression is quantified by the compression ratio, the original size divided by the compressed size:
compression ratio=compressed sizeuncompressed size
A ratio of 4:1 means the compressed file is a quarter of the original. For the RLE pixel-row example earlier (WWWWWWWWBBBBWWWWWWWW), twenty pixel values became three (count, colour) pairs; if each pixel value and each count occupies one unit, that is roughly 20÷6≈3.3:1. The figure depends entirely on the data: a uniform image compresses enormously under RLE, while noisy or already-compressed data may achieve a ratio near 1:1 (or worse). This is why you cannot meaningfully "compress a file twice" to keep shrinking it — once the redundancy a method targets has been removed, there is nothing left for it to exploit, and a second pass typically achieves nothing or slightly expands the file owing to its own overhead. The compression ratio is therefore a property of the data-and-method together, not of the method alone, and it is the headline figure used to compare formats (lossy JPEG might reach 10:1 on a photograph where lossless PNG manages only 2:1, precisely because the lossy method is permitted to discard data).
Encryption transforms readable plaintext into unreadable ciphertext so that only an authorised party — one holding the secret key — can reverse the process (decryption) and recover the plaintext. An eavesdropper who intercepts the ciphertext without the key learns nothing useful. Critically, encryption is about secrecy, not size — encrypted data is typically the same size as (or slightly larger than) the original, the exact opposite goal of compression.
The key is the secret parameter that controls the transformation. A core security principle (Kerckhoffs's principle) is that the algorithm should be assumed public; all the security must rest in the secrecy of the key. A good cipher is one that stays secure even when the attacker knows exactly how it works and has only to find the key.
A practical point that ties this lesson's two halves together: when data must be both compressed and encrypted, it should be compressed first, then encrypted. Compression works by removing the redundancy and patterns in data — but good encryption deliberately produces output that looks like random noise, with no patterns left to exploit, so trying to compress after encrypting achieves almost nothing. Doing it in the right order (compress → encrypt) gives both a smaller and a secure result, and is exactly what archive tools and secure protocols do.
The Caesar cipher is a substitution cipher that shifts every letter a fixed number of places along the alphabet. The key is the shift amount. With a shift of 3, A→D, B→E, …, X→A, Y→B, Z→C (wrapping around the end).
Encryption and decryption are modular arithmetic on letter positions (A = 0 … Z = 25):
cipher=(plain+k)mod26,plain=(cipher−k)mod26
Encrypt HELLO with key k=3. Using A = 0:
| Letter | H | E | L | L | O |
|---|---|---|---|---|---|
| Position | 7 | 4 | 11 | 11 | 14 |
| +3 (mod 26) | 10 | 7 | 14 | 14 | 17 |
| Cipher letter | K | H | O | O | R |
So HELLO → KHOOR. Decryption subtracts 3: 10−3=7→ H, and so on, recovering HELLO. The wrap-around matters: encrypting Z (25) with shift 3 gives (25+3)mod26=2→ C.
def caesar_encrypt(text: str, key: int) -> str:
result = ""
for ch in text.upper():
if ch.isalpha():
pos = ord(ch) - ord('A') # 0..25
shifted = (pos + key) % 26 # wrap with mod 26
result += chr(shifted + ord('A'))
else:
result += ch # leave non-letters
return result
print(caesar_encrypt("HELLO", 3)) # KHOOR
The Caesar cipher is trivially broken:
The Caesar cipher is therefore of historical and educational value only; it offers no real security.
The Vernam cipher, or one-time pad, is the polar opposite of the Caesar cipher: under strict conditions it is mathematically unbreakable. It combines the plaintext with a key that is a truly random sequence at least as long as the message, using the XOR operation bit by bit (equivalently, modular addition of letters). The same key XORed back recovers the plaintext, because XOR is its own inverse:
cipher=plain⊕key,plain=cipher⊕key
The Vernam cipher is perfectly secure only if every one of these holds (break any one and it becomes breakable):
Encrypt the 8-bit plaintext 01001000 (ASCII 'H') with a random 8-bit key 10110101, XORing bit by bit (XOR rule: output 1 when the inputs differ):
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.