Hash Tables

This lesson covers the hash table — the data structure that achieves constant average-time insertion, search, and deletion by using a hash function to compute the storage index directly from the key, rather than searching for it. The hash table is the implementation behind almost every dictionary, set, cache, and symbol table in modern software, and it is the structure you reach for whenever a problem says "look up by key, and look it up fast". The examinable core is the chain of ideas key → hash function → index → collision → resolution → load factor → rehash, and the ability to trace both linear probing and separate chaining by hand.

Spec Mapping

This lesson addresses AQA A-Level Computer Science (7517), section 4.2.5 (Hash tables). It covers the role of a hash function, what makes a hash function good (determinism, uniform distribution, speed), the inevitability of collisions and the two examinable resolution strategies — separate chaining and open addressing with linear probing — together with the load factor $\lambda = n/m$ and the rehashing that keeps it low. It connects forward to §4.10 (databases: hash indexing) and back to §4.2.6 (linked lists, used as the chains) and §4.2.1 (the static array that underlies the table). All wording below is original; AQA's published specification text is not reproduced.

What is a Hash Table?

A hash table (also called a hash map) stores key–value pairs in an underlying array. Instead of searching the array for an item, it computes where each pair belongs by passing the key through a hash function $h$ that returns an array index. To store a pair you compute $h(\text{key})$ and place the pair there; to retrieve it you compute $h(\text{key})$ again and read that slot. Because computing the index is a fixed amount of work and reaching an array slot is $O(1)$ (§4.2.1 address arithmetic), the whole lookup is $O(1)$ on average, with no traversal of the array and, while collisions are rare, at most a handful of key comparisons.

Term	Meaning
Key	The unique identifier used to find an item (e.g. a username, a product code).
Value	The data associated with the key (e.g. the user's record).
Hash function $h$	Maps a key to an integer index in the range $0 \dots m-1$ .
Bucket / slot	One position in the underlying array of size $m$ .
Load factor $\lambda$	The fullness ratio $n/m$ ( $n$ items, $m$ slots).

The defining promise of the hash table is average-case $O(1)$ for insert, search, and delete. That promise is conditional: it holds only while collisions are rare, which is why the load factor and the choice of hash function matter so much.

Hash Functions

A hash function takes a key and returns an integer index inside the table's bounds. Three properties make one fit for purpose:

Deterministic — the same key must always map to the same index, or you could never find what you stored.
Uniform distribution — keys should be spread evenly across all $m$ slots, so that collisions are minimised. A hash that piles many keys onto a few slots destroys the $O(1)$ behaviour.
Fast to compute — the hash must be $O(1)$ (or $O(k)$ in the key length $k$ for strings); a slow hash would defeat the point of avoiding a search.

A Simple String Hash

A standard teaching hash for string keys sums the ASCII (Unicode) codes of the characters and reduces the total modulo the table size $m$ . The modulo is essential: it folds an arbitrarily large sum back into a valid index $0 \dots m-1$ .

$h(k) = \left( \sum_{i} \text{ord}(k_i) \right) \bmod m$

Worked for a table of size 10:

$h(\texttt{"Cat"}) = (67 + 97 + 116) \bmod 10 = 280 \bmod 10 = 0$

$h(\texttt{"Dog"}) = (68 + 111 + 103) \bmod 10 = 282 \bmod 10 = 2$

def simple_hash(key: str, table_size: int) -> int:
    total = 0
    for char in key:
        total += ord(char)        # add Unicode code point of each character
    return total % table_size     # fold into a valid index 0 .. table_size-1

print(simple_hash("Cat", 10))     # 0
print(simple_hash("Dog", 10))     # 2

This sum-of-codes hash is easy to trace but distributes poorly. It ignores character order, so any anagram hashes to the same slot ("Cat" and "Act" both sum to 280 and give 0). Its sums also fall in a narrow range: a three-letter lowercase key always totals between 291 and 366, so in a large table short keys crowd into a small band of slots. Production-style hashes (such as the polynomial rolling hash $h = (h \times 31 + \text{ord}(c)) \bmod m$ ) multiply by a constant so that order affects the result, spreading keys far more evenly. You should be able to name why the naive hash is weak (order-independence and clustering) even if the exam only asks you to compute with it.
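To make the order-dependence concrete, the following minimal sketch compares the sum-of-codes hash with a polynomial hash. The function names `sum_hash` and `poly_hash`, the multiplier 31, and the table size 11 are illustrative assumptions, not anything the specification prescribes.

```python
def sum_hash(key: str, m: int) -> int:
    """Sum-of-codes hash from the lesson: ignores character order."""
    return sum(ord(c) for c in key) % m


def poly_hash(key: str, m: int, base: int = 31) -> int:
    """Polynomial hash: h = (h * base + ord(c)) mod m, applied per character."""
    h = 0
    for c in key:
        h = (h * base + ord(c)) % m
    return h


for word in ("Cat", "Act"):
    # With m = 11 the sum hash gives 5 for both words,
    # while the polynomial hash gives 3 for "Cat" and 2 for "Act".
    print(word, sum_hash(word, 11), poly_hash(word, 11))
```

The table size matters here as well. With $m = 10$ the multiplier 31 leaves remainder 1, so the polynomial hash collapses back into the plain sum and the anagrams collide again.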

Choosing a Table Size

A prime table size is preferred for modulo-based hashing because it reduces systematic collisions when keys share common factors with $m$ . If $m = 100$ and many keys are multiples of 10, every one of them lands in just ten slots (0, 10, 20, ..., 90) and the other ninety go unused. A prime $m$ shares no factor with such keys, so the remainder spreads them far more uniformly. This is a small but examinable design point.
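The effect is easy to check numerically. This short sketch counts how many different slots a set of keys occupies under $k \bmod m$ ; the helper name `distinct_slots` and the choice of 101 as the prime are assumptions made for illustration.

```python
def distinct_slots(keys, m: int) -> int:
    """Count how many different slots the keys occupy under k mod m."""
    return len({k % m for k in keys})


keys = range(0, 1000, 10)          # 100 keys, all multiples of 10

print(distinct_slots(keys, 100))   # 10  -> heavy collisions
print(distinct_slots(keys, 101))   # 100 -> every key gets its own slot
```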

Collisions

A collision occurs when two different keys hash to the same index. Collisions are not a bug; they are mathematically unavoidable. The set of possible keys (every string, say) is vastly larger than the $m$ slots, so by the pigeonhole principle many distinct keys must share each index. A collision is guaranteed once more than $m$ keys are stored, and in practice one usually appears long before the table is full. A hash table is therefore defined as much by how it resolves collisions as by its hash function. The two strategies you must know are separate chaining and open addressing (linear probing).
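The pigeonhole argument can be sketched in a few lines. The function below (a hypothetical helper named `first_collision`, assuming integer keys and the hash $k \bmod m$ ) reports the first pair of distinct keys that share a slot.

```python
def first_collision(keys, m: int):
    """Return the first pair of distinct keys sharing a slot under k mod m, or None."""
    seen = {}                          # slot -> first key that landed there
    for k in keys:
        slot = k % m
        if slot in seen and seen[slot] != k:
            return seen[slot], k
        seen.setdefault(slot, k)
    return None


print(first_collision(range(7), 7))    # None: 7 keys fit in 7 slots
print(first_collision(range(8), 7))    # (0, 7): the 8th key must collide
```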

Collision Resolution 1: Open Addressing (Linear Probing)

In open addressing, every item lives inside the array itself — there are no external lists. When the home slot $h(k)$ is occupied, the algorithm probes for the next free slot. Linear probing simply tries the following slots in order, wrapping around the end with modulo:

$\text{probe}(k, i) = (h(k) + i) \bmod m \quad \text{for } i = 0, 1, 2, \dots$

Traced Insertion

Table size $m = 7$ , hash function $h(k) = k \bmod 7$ . Insert the keys 10, 17, 24, 5 in that order:

Key	$h(k)$	Probe sequence	Lands in slot	Reason
10	3	slot 3	3	empty — placed
17	3	slot 3 (taken) → 4	4	collision at 3, probe +1
24	3	3 (taken) → 4 (taken) → 5	5	collision at 3 and 4, probe +2
5	5	slot 5 (taken) → 6	6	collision at 5, probe +1

The resulting array, drawn as buckets:

flowchart LR
    S0["slot 0\n(empty)"]
    S1["slot 1\n(empty)"]
    S2["slot 2\n(empty)"]
    S3["slot 3\nkey 10"]
    S4["slot 4\nkey 17"]
    S5["slot 5\nkey 24"]
    S6["slot 6\nkey 5"]
    S0 --- S1 --- S2 --- S3 --- S4 --- S5 --- S6
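The traced insertion can be reproduced with a minimal linear-probing sketch. The name `probe_insert` and the use of `None` as the empty marker are assumptions of this example, and it deliberately ignores duplicate keys and deletion.

```python
EMPTY = None   # marker for a never-used slot (an assumption of this sketch)


def probe_insert(table: list, key: int) -> int:
    """Insert key using linear probing; return the slot it landed in."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m          # probe formula, with wrap-around
        if table[slot] is EMPTY:
            table[slot] = key
            return slot
    raise OverflowError("table is full")


table = [EMPTY] * 7
landed = [probe_insert(table, k) for k in (10, 17, 24, 5)]
print(landed)   # [3, 4, 5, 6]
print(table)    # [None, None, None, 10, 17, 24, 5]
```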

Searching and the Deletion Problem

To search, you repeat the same probe sequence until you find the key or hit an empty slot (which proves the key is absent). This creates a subtlety: you cannot simply blank a slot on deletion. If key 17 were deleted by emptying slot 4, a later search for 24 would probe 3, then hit the now-empty slot 4 and wrongly conclude 24 is absent — even though it sits in slot 5. The fix is a tombstone: a special "deleted" marker that searches skip over but insertions may reuse. This deletion hazard is a classic discrimination point between probing and chaining.
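A sketch of search and tombstone deletion, under the same assumptions as the trace (integer keys, $h(k) = k \bmod m$ ). The names `probe_find`, `probe_delete` and the `TOMBSTONE` sentinel are illustrative, not a standard API.

```python
EMPTY = None
TOMBSTONE = object()   # unique sentinel meaning "deleted"


def probe_find(table: list, key: int) -> int:
    """Return the slot holding key, or -1 if the key is absent."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] is EMPTY:           # a never-used slot ends the search
            return -1
        if table[slot] is not TOMBSTONE and table[slot] == key:
            return slot                    # tombstones are skipped, never matched
    return -1


def probe_delete(table: list, key: int) -> bool:
    """Replace key with a tombstone; return False if the key was not found."""
    slot = probe_find(table, key)
    if slot == -1:
        return False
    table[slot] = TOMBSTONE
    return True


table = [EMPTY, EMPTY, EMPTY, 10, 17, 24, 5]   # state after the traced insertion
probe_delete(table, 17)
print(probe_find(table, 24))   # 5: the search steps over the tombstone in slot 4
```

Had slot 4 been reset to `EMPTY` instead, the same search for 24 would stop at slot 4 and return -1.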

Primary Clustering

Linear probing suffers from primary clustering: occupied slots tend to form long contiguous runs, because any key hashing anywhere into a cluster extends it. The longer the run, the longer the average probe, so performance degrades sharply as the table fills. Quadratic probing ( $+1, +4, +9, \dots$ ) and double hashing mitigate this, but linear probing is the version AQA expects you to trace.
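The difference between the probe sequences can be seen by listing them. This is a small sketch with hypothetical helper names; it shows only the order in which slots are tested, not a full table.

```python
def linear_sequence(home: int, m: int, count: int) -> list:
    """First `count` slots tested by linear probing from the home slot."""
    return [(home + i) % m for i in range(count)]


def quadratic_sequence(home: int, m: int, count: int) -> list:
    """First `count` slots tested by quadratic probing (offsets 0, 1, 4, 9, ...)."""
    return [(home + i * i) % m for i in range(count)]


print(linear_sequence(3, 7, 4))      # [3, 4, 5, 6]
print(quadratic_sequence(3, 7, 4))   # [3, 4, 0, 5]
```

Linear probing walks straight through the neighbouring slots, which is what lets a cluster grow; the quadratic sequence jumps away from the home slot after the first step.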

Exam Tip: Linear probing is the most commonly traced method. Always write out the probe formula $(h(k) + i) \bmod m$ , show each slot you test, and remember the wrap-around. If asked about deletion, mention tombstones.

Collision Resolution 2: Separate Chaining

In separate chaining, each array slot holds the head of a linked list (a "chain") of all items that hashed there. A collision simply appends to that slot's list, so the table can never "fill up". Inserting keys 10, 17, 24 (all hashing to slot 3 under $k \bmod 7$ ) and 15, 22 (both to slot 1) produces:

flowchart LR
    I0["slot 0"] --> NUL0["null"]
    I1["slot 1"] --> N15["15 | next"] --> N22["22 | next"] --> NUL1["null"]
    I2["slot 2"] --> NUL2["null"]
    I3["slot 3"] --> N10["10 | next"] --> N17["17 | next"] --> N24["24 | next"] --> NUL3["null"]
    I4["slot 4"] --> NUL4["null"]
    I5["slot 5"] --> NUL5["null"]
    I6["slot 6"] --> NUL6["null"]

To search, hash to the slot then walk its (usually short) chain. The two strategies trade off as follows:

Feature	Linear probing (open addressing)	Separate chaining
Where items live	Inside the array	In linked lists hanging off each slot
Can the table fill?	Yes — at $\lambda = 1$ it is full	No — chains grow on the heap
Clustering	Suffers primary clustering	None — only the relevant chain grows
Deletion	Needs tombstones	Trivial — unlink the node
Memory	One contiguous array, no pointers	Array + a pointer per node (overhead)
Cache performance	Better (contiguous probing)	Worse (pointer-chasing to scattered nodes)
Behaviour past $\lambda = 1$	Impossible	Degrades gracefully

This is a direct application of §4.2.6: the chains are singly linked lists, with all the heap-allocation and $O(n)$ -traversal properties from that lesson.

Hash Table Implementation (Chaining)

class HashTable:
    def __init__(self, size: int = 11):          # prime size reduces collisions
        self.__size = size
        self.__table = [[] for _ in range(size)]  # each slot is a chain (list)

    def __hash(self, key: str) -> int:
        total = 0
        for char in key:
            total += ord(char)
        return total % self.__size

    def put(self, key: str, value):
        index = self.__hash(key)
        for i, (k, v) in enumerate(self.__table[index]):
            if k == key:
                self.__table[index][i] = (key, value)   # key exists → update
                return
        self.__table[index].append((key, value))        # new key → append to chain

    def get(self, key: str):
        index = self.__hash(key)
        for k, v in self.__table[index]:                 # walk the chain
            if k == key:
                return v
        raise KeyError(f"Key not found: {key}")

    def delete(self, key: str):
        index = self.__hash(key)
        for i, (k, v) in enumerate(self.__table[index]):
            if k == key:
                del self.__table[index][i]               # unlink — no tombstone needed
                return
        raise KeyError(f"Key not found: {key}")

Note that put checks the chain first so that re-inserting an existing key updates rather than duplicates it — the behaviour every dictionary needs.

Load Factor and Rehashing

The load factor measures how full the table is:

$\lambda = \frac{n}{m} \qquad (n = \text{items stored},\; m = \text{table size})$

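Under chaining, $\lambda$ is the average chain length, so an unsuccessful search costs roughly $1 + \lambda$ steps. Under open addressing the cost grows much faster: the standard estimate for an unsuccessful search with linear probing is about $\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$ probes, which is 2.5 at $\lambda = 0.5$ but 50.5 at $\lambda = 0.9$ , and it grows without bound as $\lambda \to 1$ . Either way, performance is excellent while $\lambda$ is small and deteriorates as it climbs: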

Load factor $\lambda$	Effect
$< 0.5$	Excellent — collisions rare, lookups near $O(1)$
$0.5 - 0.75$	Good — collisions manageable
$> 0.75$	Poor — collisions frequent; resize the table

Worked example. A table with $m = 11$ slots holding $n = 6$ items has $\lambda = 6/11 \approx 0.55$ . Adding three more gives $\lambda = 9/11 \approx 0.82$ , past the usual 0.75 threshold — time to rehash.
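The worked example translates directly into code. The 0.75 threshold is the lesson's rule of thumb, and the function names are assumptions made for this sketch.

```python
def load_factor(n: int, m: int) -> float:
    """Load factor: n items stored in m slots."""
    return n / m


def needs_rehash(n: int, m: int, threshold: float = 0.75) -> bool:
    """True once the load factor has climbed past the threshold."""
    return load_factor(n, m) > threshold


print(round(load_factor(6, 11), 2), needs_rehash(6, 11))   # 0.55 False
print(round(load_factor(9, 11), 2), needs_rehash(9, 11))   # 0.82 True
```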

Rehashing restores a low load factor: allocate a larger array (commonly the next prime above double the size), then re-insert every existing item by recomputing its hash against the new $m$ . You cannot just copy slots across, because the index of each key depends on $m$ : with $h(k) = k \bmod m$ , key 14 sits in slot 3 of an 11-slot table but belongs in slot 14 of a 23-slot table. A single rehash is $O(n)$ , but because the table roughly doubles each time, rehashes become rarer as it grows and the amortised cost of insertion stays $O(1)$ . This is the same doubling argument used for dynamic arrays in §4.2.1.
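A minimal rehash sketch for a chained table of integer keys, assuming $h(k) = k \bmod m$ and the "next prime above double" growth rule. The helper names are hypothetical, and a real table would store key–value pairs rather than bare keys.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for the small table sizes used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime_above(n: int) -> int:
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


def rehash(chains: list) -> list:
    """Build a larger chained table and re-insert every key under the new size."""
    new_m = next_prime_above(2 * len(chains))
    new_chains = [[] for _ in range(new_m)]
    for chain in chains:
        for key in chain:
            new_chains[key % new_m].append(key)   # index recomputed against new m
    return new_chains


old = [[] for _ in range(11)]
for k in (14, 25, 36, 3):              # all four keys share slot 3 when m = 11
    old[k % 11].append(k)

new = rehash(old)
print(len(new))                                       # 23
print([i for i, chain in enumerate(new) if chain])    # [2, 3, 13, 14]
```

The four keys that crowded one chain in the small table each get their own slot in the larger one, which is why slots cannot simply be copied across.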

Time Complexity

Operation	Average case	Worst case
Insert	$O(1)$	$O(n)$ — all keys collide into one chain/cluster
Search	$O(1)$	$O(n)$
Delete	$O(1)$	$O(n)$

The worst case arises when every key hashes to the same slot: chaining degenerates into a single linked list, and probing into one long cluster, so all operations become $O(n)$ . This is why a good, uniform hash function plus a controlled load factor are not optional extras — they are what converts the worst case into the average case in practice. Note also that a hash table provides no efficient ordered access: to list keys in sorted order you must extract and sort them at $O(n \log n)$ , because the hash scatters keys with no relation to their order.
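The gap between the average and worst case shows up clearly in chain lengths. This sketch (with an assumed helper `chain_lengths` and a deliberately degenerate hash) inserts the same 100 keys under a spreading hash and under a constant one.

```python
def chain_lengths(keys, m: int, hash_fn) -> list:
    """Return the length of each chain after inserting keys with hash_fn."""
    lengths = [0] * m
    for k in keys:
        lengths[hash_fn(k) % m] += 1
    return lengths


keys = list(range(100))
good = chain_lengths(keys, 11, lambda k: k)   # spreads keys across all 11 slots
bad = chain_lengths(keys, 11, lambda k: 0)    # degenerate: every key goes to slot 0

print(max(good))   # 10  -> longest chain stays close to n / m
print(max(bad))    # 100 -> one chain holds everything, so search is O(n)
```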

End-to-End Worked Trace (String Keys, Chaining)

Putting the pieces together, trace a chained hash table of size $m = 7$ storing animal names with the sum-of-codes hash $h(k) = \left(\sum \text{ord}(c)\right) \bmod 7$ . First compute each key's home slot:

Key	ASCII sum	$\bmod 7$	Slot
`"ant"`	97 + 110 + 116 = 323	323 mod 7 = 1	1
`"bee"`	98 + 101 + 101 = 300	300 mod 7 = 6	6
`"cat"`	99 + 97 + 116 = 312	312 mod 7 = 4	4
`"owl"`	111 + 119 + 108 = 338	338 mod 7 = 2	2
`"eel"`	101 + 101 + 108 = 310	310 mod 7 = 2	2 (collision with `"owl"`)

"owl" and "eel" both land in slot 2, so under chaining slot 2's list holds both. The table now looks like:

flowchart LR
    H0["slot 0"] --> X0["null"]
    H1["slot 1"] --> A1["ant"] --> X1["null"]
    H2["slot 2"] --> A2["owl"] --> B2["eel"] --> X2["null"]
    H3["slot 3"] --> X3["null"]
    H4["slot 4"] --> A4["cat"] --> X4["null"]
    H5["slot 5"] --> X5["null"]
    H6["slot 6"] --> A6["bee"] --> X6["null"]

To get "eel": compute $h(\texttt{"eel"}) = 2$ , go to slot 2, walk its chain — first node is "owl" (no match), second is "eel" (match), return its value. Two comparisons, because of the one collision. To get "dog" (absent): $h(\texttt{"dog"}) = (100+111+103) \bmod 7 = 314 \bmod 7 = 6$ ; walk slot 6's chain, find only "bee", reach null, conclude absent. With five items in seven slots the load factor is $\lambda = 5/7 \approx 0.71$ , just under the rehash threshold — so far the chains stay short and look-up stays near $O(1)$ .

This trace shows the full machinery an exam can ask for: compute the hash with shown working, resolve the collision by chaining, and walk a chain to confirm presence or absence. The same keys under linear probing would instead push "eel" from its taken home slot 2 to slot 3 — worth tracing both ways for practice.
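For practice, both versions of the trace can be checked with a short sketch. The builder names are assumptions, the tables store bare keys rather than key–value pairs, and `build_probed` assumes there are fewer keys than slots.

```python
def sum_hash(key: str, m: int) -> int:
    """Sum-of-codes hash used throughout the lesson."""
    return sum(ord(c) for c in key) % m


def build_chained(keys, m: int) -> list:
    """Separate chaining: each slot is a list of the keys that hashed there."""
    chains = [[] for _ in range(m)]
    for key in keys:
        chains[sum_hash(key, m)].append(key)
    return chains


def build_probed(keys, m: int) -> list:
    """Linear probing: each key takes the first free slot from its home slot."""
    table = [None] * m
    for key in keys:
        home = sum_hash(key, m)
        for i in range(m):
            slot = (home + i) % m
            if table[slot] is None:
                table[slot] = key
                break
    return table


animals = ["ant", "bee", "cat", "owl", "eel"]
print(build_chained(animals, 7))
# [[], ['ant'], ['owl', 'eel'], [], ['cat'], [], ['bee']]
print(build_probed(animals, 7))
# [None, 'ant', 'owl', 'eel', 'cat', None, 'bee']
```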
