You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
Storage and bandwidth are finite, so we routinely compress data — re-expressing it in fewer bits — to save space and transmit it faster. This lesson explains the fundamental split between lossy and lossless compression, works through run-length encoding (RLE) and a dictionary / Huffman-style approach in full, and shows how to choose the right method for a given kind of data.
This lesson addresses the H446 1.3.1 content on compression:
(This is a paraphrase of the specification content, not a verbatim quotation.)
Compression exploits redundancy — predictability in data that lets the same information be described more briefly. A page that is mostly white space, an image with large flat areas, or English text where some letters are far commoner than others all contain redundancy that can be squeezed out.
| Benefit | Explanation |
|---|---|
| Less storage | Smaller files occupy less disk/SSD space |
| Faster transmission | Fewer bytes cross a network in less time |
| Lower bandwidth/cost | Less data sent means less network capacity (and money) consumed |
| Better user experience | Web pages and media load faster |
The flip side, examined often, is that random or already-compressed data has no redundancy, so it cannot be compressed further — and a naive scheme may even make it slightly larger.
Worked example — transfer time. Suppose a 30 MB uncompressed file is sent over a link that sustains 10 Mbit/s. Since 30 MB =30×8=240 Mbit, the transfer takes 240÷10=24 seconds. Compress it losslessly to 12 MB first and the same link carries 12×8=96 Mbit in 96÷10=9.6 seconds — under half the time. The saving in transfer time and bandwidth cost is paid for with a little CPU time at each end to compress and decompress; on any network where time or data volume costs money, that trade is overwhelmingly worth it.
Why some data cannot shrink. Compression works only by removing redundancy. A file of truly random bytes, or a file that is already compressed (a JPEG, an MP3, a ZIP), has essentially no redundancy left for a second pass to exploit — there is no repeated run for RLE, no skewed symbol frequency for Huffman, no recurring substring for a dictionary. Worse, a naive scheme still adds its own bookkeeping (run counts, a codebook, dictionary codes), so attempting to compress incompressible data can leave it slightly larger. This is why you never see "double-zipped" files shrink and why media formats already incorporate compression rather than relying on the file system to add it.
The single most important distinction is whether the original data can be recovered exactly.
| Feature | Lossy | Lossless |
|---|---|---|
| Data removed | Yes — permanently discarded | None |
| Reversible | No — original cannot be reconstructed | Yes — perfect reconstruction |
| Typical ratio | Very high (often 10x+) | Moderate |
| Resulting quality | Slightly reduced (often imperceptible) | Identical to original |
| Examples | JPEG, MP3, AAC, H.264 | PNG, FLAC, ZIP, GIF |
| Suited to | Images, audio, video for human consumption | Text, code, executables, archives, medical/legal data |
Lossy compression deliberately throws away detail the human eye or ear is unlikely to notice — high frequencies in audio, subtle colour gradations in photos — to achieve large savings. Lossless compression re-expresses the data so that the exact original bits return on decompression.
A useful rule: if a human consumes it approximately, lossy is usually fine; if a machine must read it exactly, it must be lossless.
The reason lossy methods achieve such large ratios is that they target a perceptual model — they discard precisely the detail human senses are least able to notice, so the measurable data shrinks far more than the perceived quality:
The crucial, examinable point is that lossy compression is a one-way trade: the discarded data is gone forever, so repeatedly re-saving a JPEG or re-encoding an MP3 degrades it further each time ("generation loss"). That is acceptable for a holiday photo, unacceptable for a master copy you may need to edit — hence professionals archive in lossless formats and only export to lossy at the end.
By contrast, lossless compression guarantees the exact original bits return, which is non-negotiable for whole classes of data:
| Data | Why lossless is required |
|---|---|
| Source code / executables | A single changed bit can stop a program compiling or running |
| Databases / spreadsheets | A flipped figure silently corrupts records or totals |
| Text documents / contracts | Every character must survive; meaning must not change |
| ZIP / archive files | Files must extract byte-for-byte identical to the originals |
| Medical / scientific images | A discarded detail could mask a tumour or skew a measurement |
| PNG screenshots, RAW masters | Crisp edges and full detail must be preserved for editing |
The unifying rule remains: lossy for approximate human consumption, lossless whenever a machine — or a clinician, or a court — must read the data exactly.
RLE is a simple lossless method that replaces each run of consecutive identical values with a single (count, value) pair. It is ideal for data with long stretches of repetition.
| Field | Value |
|---|---|
| Original (20 chars) | AAAAAABBCCCCDDDDDDDD |
| Encoded (4 pairs) | 6A 2B 4C 8D |
The 20-character string becomes four (count, character) pairs — 8 symbols instead of 20.
A row of a 1-bit (black/white) image, reading left to right:
| Field | Value |
|---|---|
| Original (20 pixels) | 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 |
| Runs | five 0s, three 1s, four 0s, six 1s, two 0s |
| Encoded (5 pairs) | (5,0) (3,1) (4,0) (6,1) (2,0) |
The original is 20 values; the encoding is 10 values (5 pairs) — a 50% reduction, and the original is recovered perfectly by expanding each pair.
Putting it as a ratio, and assuming each value (count or pixel) costs the same number of bits:
compression ratio=1020=2:1,saving=2020−10×100=50%
Decoding is trivial and lossless: read each pair and write its value that many times — (5,0) ⇒ 00000, (3,1) ⇒ 111, and so on — reproducing the original row exactly. This perfect reconstruction is what makes RLE a lossless method, suitable even for data where every bit must survive.
Worked Example — a longer run. A scanned document line is 100 consecutive white pixels. RLE encodes it as the single pair (100, white) — two values in place of 100, a 100:2=50:1 ratio and a 98% saving. The longer the runs, the more dramatic the win, which is exactly why RLE thrives on fax pages, simple icons and scanned text but is wasted on photographs.
| Scenario | Effectiveness |
|---|---|
| Long runs (simple graphics, fax, icons, scanned documents) | Very effective |
| Few or no runs (text, photographs) | Ineffective — often expands the data |
The failure case is worth understanding precisely. Consider ABCDEF — six distinct characters with no runs. Naive RLE encodes each as a (count, value) pair: (1,A)(1,B)(1,C)(1,D)(1,E)(1,F) — twelve symbols instead of six, doubling the size. This is why RLE is reserved for data known to contain long runs, and why photographs (where adjacent pixels usually differ) compress poorly with it.
RLE encode (pseudocode)
i ← 0
WHILE i < length(data)
value ← data[i]
count ← 1
WHILE i + count < length(data) AND data[i + count] = value
count ← count + 1
ENDWHILE
OUTPUT count, value
i ← i + count
ENDWHILE
Dictionary-based compression (the family that includes LZW, used in GIF and ZIP) replaces repeated substrings — not just single repeated symbols — with short codes that point into a dictionary of patterns. The dictionary is built up as the data is read, so frequently recurring patterns earn a single short code.
ABABABABInitial dictionary: A = 1, B = 2.
| Step | Longest match | Output | New entry added |
|---|---|---|---|
| 1 | A (code 1) | 1 | AB = 3 |
| 2 | B (code 2) | 2 | BA = 4 |
| 3 | AB (code 3) | 3 | ABA = 5 |
| 4 | AB (code 3) | 3 | — (end of input) |
Output: 1 2 3 3 — four codes for the eight-character input. Notice the dictionary learns the pattern AB after seeing it once, then reuses it: steps 3 and 4 both emit a single code (3) for a two-character chunk. The more repetitive the data, the longer the patterns the dictionary accumulates, so dictionary methods get better as repetition grows.
Decoding rebuilds the same dictionary. This is the elegant part, and the reason — unlike Huffman — no codebook need be sent: the decoder starts from the same single-symbol dictionary and reconstructs every entry as it reads codes.
| Code in | Output | New entry the decoder adds |
|---|---|---|
| 1 | A | (nothing yet — needs the next symbol) |
| 2 | B | AB = 3 (previous A + this B) |
| 3 | AB | BA = 4 (previous B + start of this) |
| 3 | AB | ABA = 5 |
The decoder outputs A, B, AB, AB = ABABABAB, exactly the original, having rebuilt the identical dictionary AB=3, BA=4, ABA=5 along the way. Because both ends grow the dictionary by the same rule, only the codes travel — the dictionary is implicit. This self-building dictionary is the heart of the LZ-family algorithms inside GIF, ZIP and PNG.
Why it scales. On longer, repetitive data the dictionary entries grow ever longer, so a single code can stand for a whole word or phrase. Compressing English prose, for example, soon yields dictionary entries for common strings like the or ing, each then sent as one code — which is why dictionary coding handles general text far better than RLE (which needs literal runs) and competes with Huffman without the codebook overhead.
Huffman coding is a lossless method based on character frequency: it gives the most frequent symbols the shortest binary codes and rare symbols longer codes, instead of ASCII's fixed 8 bits each. The codes come from a binary tree built bottom-up.
The 11-character message has these frequencies:
| Symbol | A | B | R | C | D |
|---|---|---|---|---|---|
| Frequency | 5 | 2 | 2 | 1 | 1 |
Step 1 — combine the two smallest each time. Each line removes the two lowest-frequency nodes and replaces them with a parent equal to their sum, which goes back into the pool:
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.