Data Compression

This lesson covers data compression — reducing the size of files to save storage space and reduce transmission time. You need to understand lossy and lossless compression, and specific algorithms: Run-Length Encoding (RLE), Huffman coding, and dictionary encoding (LZW).

Why Compress Data?

Benefit	Explanation
Reduced storage	Smaller files use less disk space
Faster transmission	Smaller files transfer more quickly over networks
Lower bandwidth usage	Less data to send = less network capacity needed
Cost savings	Less storage and bandwidth = lower costs

Lossy vs Lossless Compression

Feature	Lossy	Lossless
Data loss	Some data is permanently removed	No data is lost
File size	Typically much smaller	Smaller than original, but larger than lossy
Quality	Reduced (may be imperceptible)	Identical to original
Reversible	No — original cannot be recovered	Yes — original can be perfectly reconstructed
Examples	JPEG, MP3, AAC, H.264	PNG, FLAC, ZIP, GIF
Best for	Images, audio, video where small quality loss is acceptable	Text, programs, medical images, archives
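
The difference in reversibility can be illustrated with a short sketch. It uses Python's standard zlib module (DEFLATE, which is not one of the algorithms covered in this lesson) for the lossless case, and a toy quantisation step for the lossy case. The quantisation is only an illustration of discarding detail; it is not how JPEG or MP3 actually work, and the sample values are made up.

```python
import zlib

# Lossless: compress, decompress, and compare with the original.
original = b"ABABABAB" * 100            # 800 bytes of highly repetitive data
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))   # the compressed form is much smaller
print(restored == original)             # True: every byte is recovered

# Lossy (toy illustration): round 8-bit samples down to multiples of 16.
samples = [200, 203, 197, 64, 70]       # hypothetical sample values
quantised = [(s // 16) * 16 for s in samples]

print(quantised)                        # [192, 192, 192, 64, 64]
print(quantised == samples)             # False: the fine detail cannot be recovered
```

After quantisation the three distinct values 200, 203 and 197 all become 192, so there is no way to tell which was which. That is what "original cannot be recovered" means in the table above.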

When to Use Each

Lossy: streaming music (MP3), web images (JPEG), video calls. Small losses are usually imperceptible to human viewers and listeners.
Lossless: source code, databases, legal documents, medical scans. Every bit must be preserved.

Run-Length Encoding (RLE)

RLE is a lossless compression algorithm that replaces consecutive repeated values with a count and the value.

How RLE Works

Original: AAAAAABBCCCCDDDDDDDD
Encoded:  6A2B4C8D

Each run of identical characters is stored as (count, character).
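
A minimal sketch of this idea in Python follows. The function names rle_encode and rle_decode are illustrative, not part of any library, and the encoder stores each run as a (count, character) pair.

```python
from itertools import groupby

def rle_encode(text):
    """Return a list of (count, character) pairs, one per run."""
    return [(len(list(run)), char) for char, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (count, character) pairs."""
    return "".join(char * count for count, char in pairs)

pairs = rle_encode("AAAAAABBCCCCDDDDDDDD")
print(pairs)                                    # [(6, 'A'), (2, 'B'), (4, 'C'), (8, 'D')]
print("".join(f"{n}{c}" for n, c in pairs))     # 6A2B4C8D
print(rle_decode(pairs))                        # AAAAAABBCCCCDDDDDDDD
```

Keeping the pairs as a list avoids an ambiguity in the compact text form: if the data itself contained digits, a string such as "6A2B" could not be decoded reliably without extra rules.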

Worked Example: Image Data

Consider a row of pixels in a 1-bit black-and-white image:

Original: 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0
Encoded:  5,0  3,1  4,0  6,1  2,0

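Original: 20 values. Encoded: 10 values (5 pairs). That is a 50% reduction in the number of stored values; the saving in bits depends on how many bits are used to store each count.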
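
The sketch below encodes the same pixel row and then estimates the cost in bits. The 3-bit count width (COUNT_BITS) is an assumption chosen for illustration because the longest run here is 6; real formats choose their own widths.

```python
def rle_pixels(row):
    """Encode a row of pixel values as (count, value) pairs."""
    runs = []
    for pixel in row:
        if runs and runs[-1][1] == pixel:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, pixel])   # start a new run
    return [tuple(run) for run in runs]

row = [0] * 5 + [1] * 3 + [0] * 4 + [1] * 6 + [0] * 2
runs = rle_pixels(row)
print(runs)                           # [(5, 0), (3, 1), (4, 0), (6, 1), (2, 0)]

COUNT_BITS = 3                        # assumption: counts up to 7 fit in 3 bits
original_bits = len(row) * 1          # 1 bit per pixel
encoded_bits = len(runs) * (COUNT_BITS + 1)   # count + 1-bit pixel value per run
print(original_bits, encoded_bits)    # 20 20
```

Under this assumption the short row gains nothing in bits even though the number of values halves. RLE pays off when runs are long compared with the bits spent on each count, as in rows that are mostly one colour.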

When RLE Works Well

Scenario	Effectiveness
Long runs of repeated data	Very effective (e.g., simple graphics, fax documents)
Data with few repetitions	Ineffective — encoded data may be larger than original
Photographic images	Generally poor — pixel values vary frequently

Limitations of RLE

Only effective when there are long runs of repeated values.
For random data or complex images, RLE may increase file size.
It is a very simple algorithm — more sophisticated methods usually achieve better compression.
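
The second limitation is easy to demonstrate. The sketch below applies a simple text form of RLE (the helper name rle_text is illustrative) to one string with a long run and one with no repeats.

```python
from itertools import groupby

def rle_text(text):
    """Encode text as count-then-character, e.g. 'AAAB' -> '3A1B'."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

for sample in ["AAAAAAAAAA", "ABCDEFGHIJ"]:
    encoded = rle_text(sample)
    print(f"{sample} -> {encoded} ({len(sample)} -> {len(encoded)} characters)")
# AAAAAAAAAA -> 10A (10 -> 3 characters)
# ABCDEFGHIJ -> 1A1B1C1D1E1F1G1H1I1J (10 -> 20 characters)
```

With no runs to exploit, every character gains a count of 1 and the output doubles in size.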

Huffman Coding

Huffman coding is a lossless compression algorithm that assigns shorter binary codes to more frequently occurring characters and longer codes to less frequent characters.

The Key Idea

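ASCII is a fixed-length code: every character uses the same number of bits (7 in the standard, usually stored as 8). Huffman coding uses variable-length codes so that common characters use fewer bits. No code is a prefix of another code, so the encoded bit stream can be decoded unambiguously.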

Steps to Build a Huffman Code

Count the frequency of each character in the data.
Create a leaf node for each character with its frequency.
Build a binary tree by repeatedly combining the two lowest-frequency nodes into a parent node whose frequency is their sum, until a single root node remains.
Assign codes by traversing the tree: left branch = 0, right branch = 1.
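
The steps above can be sketched in Python with a priority queue. This is one possible implementation, not a definitive one: the function name huffman_codes and the tie-breaking rule (an integer counter) are choices made here, and a different tie-break can give different but equally short codes.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each character in text to its Huffman code."""
    if not text:
        return {}
    freq = Counter(text)                                   # step 1: count frequencies
    # Step 2: one heap entry per leaf: (frequency, tie_breaker, tree).
    # A tree is either a single character (leaf) or a (left, right) tuple.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:                                   # step 3: merge two smallest
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):                                # step 4: left = 0, right = 1
        if isinstance(tree, str):
            codes[tree] = prefix or "0"                    # single-symbol input gets "0"
        else:
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

message = "ABRACADABRA"
codes = huffman_codes(message)
encoded = "".join(codes[ch] for ch in message)
print(codes)
print(len(encoded), "bits, compared with", len(message) * 8, "bits at 8 bits per character")
```

The tie-breaker in each heap entry guarantees that two entries with equal frequency are ordered by the counter, so Python never has to compare the tree objects themselves.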

Worked Example

Message: "ABRACADABRA" (11 characters)

Step 1: Frequency count

Character	Frequency
A	5
B	2
R	2
C	1
D	1

Step 2: Build the tree

Repeatedly combine the two lowest-frequency nodes:

C(1) + D(1) -> CD(2)
B(2) + R(2) -> BR(4)
CD(2) + BR(4) -> CDBR(6)
A(5) + CDBR(6) -> root(11)

After the first merge there are three nodes with frequency 2 (B, R and CD), so any two of them may be combined next; B and R are chosen here. A is merged last because, at every earlier step, two nodes with lower frequencies are available.
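
The merge order can be checked with a few lines of Python. The sketch tracks frequencies only (not which characters sit under each node). It also uses a known property of Huffman trees: each merge adds one bit to every character beneath the new node, so the total encoded length equals the sum of the merged-node frequencies.

```python
import heapq

freqs = {"A": 5, "B": 2, "R": 2, "C": 1, "D": 1}   # from the frequency table

heap = list(freqs.values())
heapq.heapify(heap)
merges = []
while len(heap) > 1:
    a = heapq.heappop(heap)          # smallest remaining frequency
    b = heapq.heappop(heap)          # second smallest
    merges.append((a, b, a + b))
    heapq.heappush(heap, a + b)

print(merges)                        # [(1, 1, 2), (2, 2, 4), (2, 4, 6), (5, 6, 11)]

total_bits = sum(parent for _, _, parent in merges)
fixed_bits = sum(freqs.values()) * 8
print(total_bits, fixed_bits)        # 23 88
```

For this message the Huffman encoding needs 23 bits, against 88 bits at 8 bits per character. That figure excludes the cost of storing the code table or tree alongside the data.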
