You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
This lesson covers the hash table — the data structure that achieves constant average-time insertion, search, and deletion by using a hash function to compute the storage index directly from the key, rather than searching for it. The hash table is the implementation behind almost every dictionary, set, cache, and symbol table in modern software, and it is the structure you reach for whenever a problem says "look up by key, and look it up fast". The examinable core is the chain of ideas key → hash function → index → collision → resolution → load factor → rehash, and the ability to trace both linear probing and separate chaining by hand.
This lesson addresses AQA A-Level Computer Science (7517), section 4.2.5 (Hash tables). It covers the role of a hash function, what makes a hash function good (determinism, uniform distribution, speed), the inevitability of collisions and the two examinable resolution strategies — separate chaining and open addressing with linear probing — together with the load factor λ=n/m and the rehashing that keeps it low. It connects forward to §4.10 (databases: hash indexing) and back to §4.2.6 (linked lists, used as the chains) and §4.2.1 (the static array that underlies the table). All wording below is original; AQA's published specification text is not reproduced.
A hash table (also called a hash map) stores key–value pairs in an underlying array. Instead of searching the array for a free slot, it computes where each pair belongs by passing the key through a hash function h that returns an array index. To store a pair you compute h(key) and place the value there; to retrieve it you compute h(key) again and read that slot. Because computing the index is a fixed amount of work and reaching an array slot is O(1) (§4.2.1 address arithmetic), the whole lookup is O(1) on average — no traversal, no comparisons against other keys.
| Term | Meaning |
|---|---|
| Key | The unique identifier used to find an item (e.g. a username, a product code). |
| Value | The data associated with the key (e.g. the user's record). |
| Hash function h | Maps a key to an integer index in the range 0…m−1. |
| Bucket / slot | One position in the underlying array of size m. |
| Load factor λ | The fullness ratio n/m (n items, m slots). |
The defining promise of the hash table is average-case O(1) for insert, search, and delete. That promise is conditional: it holds only while collisions are rare, which is why the load factor and the choice of hash function matter so much.
A hash function takes a key and returns an integer index inside the table's bounds. Three properties make one fit for purpose:
A standard teaching hash for string keys sums the ASCII (Unicode) codes of the characters and reduces the total modulo the table size m. The modulo is essential: it folds an arbitrarily large sum back into a valid index 0…m−1.
h(k)=(∑iord(ki))modm
Worked for a table of size 10:
h("Cat")=(67+97+116)mod10=280mod10=0
h("Dog")=(68+111+103)mod10=282mod10=2
def simple_hash(key: str, table_size: int) -> int:
total = 0
for char in key:
total += ord(char) # add Unicode code point of each character
return total % table_size # fold into a valid index 0 .. table_size-1
print(simple_hash("Cat", 10)) # 0
print(simple_hash("Dog", 10)) # 2
This sum-of-codes hash is easy to trace but distributes poorly: any anagram hashes to the same slot ("Cat" and "Act" both give 0), and short keys cluster at low indices. Production hashes (such as the polynomial rolling hash h=(h×31+ord(c))modm) multiply by a constant so that order affects the result, spreading keys far more evenly. You should be able to name why the naive hash is weak — order-independence and clustering — even if the exam only asks you to compute with it.
A prime table size is preferred for modulo-based hashing because it reduces systematic collisions when keys share common factors with m. If m=100 and many keys are multiples of 10, they collide heavily; a prime m has no such small factors, so the remainder spreads the keys more uniformly. This is a small but examinable design point.
A collision occurs when two different keys hash to the same index. Collisions are not a bug — they are mathematically unavoidable. The set of possible keys (every string, say) is vastly larger than the m slots, so by the pigeonhole principle some keys must share a slot once you store more than a handful. A hash table is therefore defined as much by how it resolves collisions as by its hash function. The two strategies you must know are separate chaining and open addressing (linear probing).
In open addressing, every item lives inside the array itself — there are no external lists. When the home slot h(k) is occupied, the algorithm probes for the next free slot. Linear probing simply tries the following slots in order, wrapping around the end with modulo:
probe(k,i)=(h(k)+i)modmfor i=0,1,2,…
Table size m=7, hash function h(k)=kmod7. Insert the keys 10, 17, 24, 5 in that order:
| Key | h(k) | Probe sequence | Lands in slot | Reason |
|---|---|---|---|---|
| 10 | 3 | slot 3 | 3 | empty — placed |
| 17 | 3 | slot 3 (taken) → 4 | 4 | collision at 3, probe +1 |
| 24 | 3 | 3 (taken) → 4 (taken) → 5 | 5 | collision at 3 and 4, probe +2 |
| 5 | 5 | slot 5 (taken) → 6 | 6 | collision at 5, probe +1 |
The resulting array, drawn as buckets:
flowchart LR
S0["slot 0\n(empty)"]
S1["slot 1\n(empty)"]
S2["slot 2\n(empty)"]
S3["slot 3\nkey 10"]
S4["slot 4\nkey 17"]
S5["slot 5\nkey 24"]
S6["slot 6\nkey 5"]
S0 --- S1 --- S2 --- S3 --- S4 --- S5 --- S6
To search, you repeat the same probe sequence until you find the key or hit an empty slot (which proves the key is absent). This creates a subtlety: you cannot simply blank a slot on deletion. If key 17 were deleted by emptying slot 4, a later search for 24 would probe 3, then hit the now-empty slot 4 and wrongly conclude 24 is absent — even though it sits in slot 5. The fix is a tombstone: a special "deleted" marker that searches skip over but insertions may reuse. This deletion hazard is a classic discrimination point between probing and chaining.
Linear probing suffers from primary clustering: occupied slots tend to form long contiguous runs, because any key hashing anywhere into a cluster extends it. The longer the run, the longer the average probe, so performance degrades sharply as the table fills. Quadratic probing (+1,+4,+9,…) and double hashing mitigate this, but linear probing is the version AQA expects you to trace.
Exam Tip: Linear probing is the most commonly traced method. Always write out the probe formula (h(k)+i)modm, show each slot you test, and remember the wrap-around. If asked about deletion, mention tombstones.
In separate chaining, each array slot holds the head of a linked list (a "chain") of all items that hashed there. A collision simply appends to that slot's list, so the table can never "fill up". Inserting keys 10, 17, 24 (all hashing to slot 3 under kmod7) and 15, 22 (both to slot 1) produces:
flowchart LR
I0["slot 0"] --> NUL0["null"]
I1["slot 1"] --> N15["15 | next"] --> N22["22 | next"] --> NUL1["null"]
I2["slot 2"] --> NUL2["null"]
I3["slot 3"] --> N10["10 | next"] --> N17["17 | next"] --> N24["24 | next"] --> NUL3["null"]
I4["slot 4"] --> NUL4["null"]
To search, hash to the slot then walk its (usually short) chain. The two strategies trade off as follows:
| Feature | Linear probing (open addressing) | Separate chaining |
|---|---|---|
| Where items live | Inside the array | In linked lists hanging off each slot |
| Can the table fill? | Yes — at λ=1 it is full | No — chains grow on the heap |
| Clustering | Suffers primary clustering | None — only the relevant chain grows |
| Deletion | Needs tombstones | Trivial — unlink the node |
| Memory | One contiguous array, no pointers | Array + a pointer per node (overhead) |
| Cache performance | Better (contiguous probing) | Worse (pointer-chasing to scattered nodes) |
| Behaviour past λ=1 | Impossible | Degrades gracefully |
This is a direct application of §4.2.6: the chains are singly linked lists, with all the heap-allocation and O(n)-traversal properties from that lesson.
class HashTable:
def __init__(self, size: int = 11): # prime size reduces collisions
self.__size = size
self.__table = [[] for _ in range(size)] # each slot is a chain (list)
def __hash(self, key: str) -> int:
total = 0
for char in key:
total += ord(char)
return total % self.__size
def put(self, key: str, value):
index = self.__hash(key)
for i, (k, v) in enumerate(self.__table[index]):
if k == key:
self.__table[index][i] = (key, value) # key exists → update
return
self.__table[index].append((key, value)) # new key → append to chain
def get(self, key: str):
index = self.__hash(key)
for k, v in self.__table[index]: # walk the chain
if k == key:
return v
raise KeyError(f"Key not found: {key}")
def delete(self, key: str):
index = self.__hash(key)
for i, (k, v) in enumerate(self.__table[index]):
if k == key:
del self.__table[index][i] # unlink — no tombstone needed
return
raise KeyError(f"Key not found: {key}")
Note that put checks the chain first so that re-inserting an existing key updates rather than duplicates it — the behaviour every dictionary needs.
The load factor measures how full the table is:
λ=mn(n=items stored,m=table size)
Under chaining, λ is the average chain length, so an unsuccessful search costs roughly 1+λ steps. Under open addressing the cost grows much faster, approaching infinity as λ→1. Either way, performance is excellent while λ is small and deteriorates as it climbs:
| Load factor λ | Effect |
|---|---|
| <0.5 | Excellent — collisions rare, lookups near O(1) |
| 0.5−0.75 | Good — collisions manageable |
| >0.75 | Poor — collisions frequent; resize the table |
Worked example. A table with m=11 slots holding n=6 items has λ=6/11≈0.55. Adding three more gives λ=9/11≈0.82, past the usual 0.75 threshold — time to rehash.
Rehashing restores a low load factor: allocate a larger array (commonly the next prime above double the size), then re-insert every existing item by recomputing its hash against the new m. You cannot just copy slots across, because the index of each key depends on m — a key in slot 3 of an 11-slot table may belong in slot 25 of a 53-slot table. A single rehash is O(n), but because the table doubles, rehashes become rarer as it grows and the amortised cost of insertion stays O(1) — exactly the doubling argument used for dynamic arrays in §4.2.1.
| Operation | Average case | Worst case |
|---|---|---|
| Insert | O(1) | O(n) — all keys collide into one chain/cluster |
| Search | O(1) | O(n) |
| Delete | O(1) | O(n) |
The worst case arises when every key hashes to the same slot: chaining degenerates into a single linked list, and probing into one long cluster, so all operations become O(n). This is why a good, uniform hash function plus a controlled load factor are not optional extras — they are what converts the worst case into the average case in practice. Note also that a hash table provides no efficient ordered access: to list keys in sorted order you must extract and sort them at O(nlogn), because the hash scatters keys with no relation to their order.
Putting the pieces together, trace a chained hash table of size m=7 storing animal names with the sum-of-codes hash h(k)=(∑ord(c))mod7. First compute each key's home slot:
| Key | ASCII sum | mod7 | Slot |
|---|---|---|---|
"ant" | 97 + 110 + 116 = 323 | 323 mod 7 = 1 | 1 |
"bee" | 98 + 101 + 101 = 300 | 300 mod 7 = 6 | 6 |
"cat" | 99 + 97 + 116 = 312 | 312 mod 7 = 4 | 4 |
"owl" | 111 + 119 + 108 = 338 | 338 mod 7 = 2 | 2 |
"eel" | 101 + 101 + 108 = 310 | 310 mod 7 = 2 | 2 (collision with "owl") |
"owl" and "eel" both land in slot 2, so under chaining slot 2's list holds both. The table now looks like:
flowchart LR
H0["slot 0"] --> X0["null"]
H1["slot 1"] --> A1["ant"] --> X1["null"]
H2["slot 2"] --> A2["owl"] --> B2["eel"] --> X2["null"]
H3["slot 3"] --> X3["null"]
H4["slot 4"] --> A4["cat"] --> X4["null"]
H5["slot 5"] --> X5["null"]
H6["slot 6"] --> A6["bee"] --> X6["null"]
To get "eel": compute h("eel")=2, go to slot 2, walk its chain — first node is "owl" (no match), second is "eel" (match), return its value. Two comparisons, because of the one collision. To get "dog" (absent): h("dog")=(100+111+103)mod7=314mod7=6; walk slot 6's chain, find only "bee", reach null, conclude absent. With five items in seven slots the load factor is λ=5/7≈0.71, just under the rehash threshold — so far the chains stay short and look-up stays near O(1).
This trace shows the full machinery an exam can ask for: compute the hash with shown working, resolve the collision by chaining, and walk a chain to confirm presence or absence. The same keys under linear probing would instead push "eel" from its taken home slot 2 to slot 3 — worth tracing both ways for practice.
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.