You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
A hash table is the structure that makes the near-instant lookups we take for granted possible — looking a word up in a dictionary app, checking whether a username is already taken, finding a database record by its key. Where a BST narrows a search to O(logn) by halving at each step, a hash table aims higher: it tries to jump straight to the right slot in constant time, by computing where an item belongs from the item itself. The price of that speed is the awkward problem of collisions — two different keys computing the same slot — and most of this lesson is about the two standard ways to resolve them and how to reason about the resulting performance. The hashing idea also appears in the data-representation course (in the encryption lesson, for passwords and integrity); here we use it to build a data structure, and cross-reference that treatment rather than duplicating it.
This lesson addresses the H446 1.4.2 Data Structures content on hash tables:
(Phrasing here paraphrases the specification content; it is not a verbatim quote.)
A hash table (or hash map) stores key–value pairs in an underlying array, using a hash function to compute, from the key alone, the array index at which the value should live. To store the pair (key, value) you compute i=h(key) and place the value at slot i; to look it up later you recompute the same i and read slot i directly. There is no searching — the key tells you where to look. That is the whole idea, and it is why hash tables outperform every structure met so far on lookup:
| Operation | Unsorted array | Sorted array | BST (balanced) | Hash table |
|---|---|---|---|---|
| Search | O(n) | O(logn) | O(logn) | O(1) average |
| Insert | O(1) at end | O(n) (shift) | O(logn) | O(1) average |
| Delete | O(n) | O(n) (shift) | O(logn) | O(1) average |
The word average in that last column is doing a great deal of work, and the rest of the lesson explains exactly when the O(1) promise holds and when it breaks down to O(n).
The conceptual leap worth dwelling on is the difference between searching for an item and computing where it lives. Every structure so far has searched: a linear scan checks items one by one; binary search and a BST repeatedly halve the space; both examine data to home in on the target. A hash table refuses to search at all. It treats the key as the address: feed the key to the hash function and you are told, in one step, the exact slot — no comparisons against other items, no narrowing. This is why hash tables can beat the O(logn) floor that comparison-based structures cannot cross: they are not comparing keys, they are calculating with them. The catch, of course, is that the calculation must squeeze a huge space of possible keys into a small array of slots, and that is what forces collisions and the machinery to handle them.
A hash function h maps a key of any kind (an integer, a string, a record) to an integer index in the range 0 to m−1, where m is the table size. A good hash function has four properties:
| Property | Why it matters |
|---|---|
| Deterministic | the same key must always hash to the same slot, or you could never find what you stored |
| Uniform | keys should spread evenly across all slots, so no slot becomes overloaded |
| Fast | hashing must itself be O(1), or it defeats the purpose |
| Collision-minimising | different keys should usually map to different slots |
The simplest workable hash function for integer keys uses the modulo operator:
h(key)=keymodm
This always yields a value in [0,m−1], so it is a valid index. Taking a table size of m=7:
| Key | h(key)=keymod7 | Slot |
|---|---|---|
| 14 | 14mod7=0 | 0 |
| 21 | 21mod7=0 | 0 — collision with 14 |
| 15 | 15mod7=1 | 1 |
| 10 | 10mod7=3 | 3 |
| 23 | 23mod7=2 | 2 |
Already, with just five keys, 14 and 21 have collided — both want slot 0. This is not bad luck; it is unavoidable, as the next section explains.
String keys are first turned into an integer — commonly by summing the character codes — then reduced modulo m:
h(s)=(∑charcode(c))modm
For s="CAT" with m=11, using C=67,A=65,T=84:
h("CAT")=(67+65+84)mod11=216mod11=7
(Real string hashes are more sophisticated — they weight each character by position so that "CAT" and "ACT" hash differently — but the sum-then-mod idea is enough for H446.)
Exam Tip: You must be able to apply a given hash function to a set of keys and say which slot each lands in. Always show the modulo arithmetic, and flag every collision as it arises — collisions are usually the whole point of the question.
A collision occurs when two distinct keys hash to the same slot. Because a slot holds one value, something must be done. Collisions are not a sign of a bad hash function — they are mathematically guaranteed whenever the number of possible keys exceeds the table size, by the pigeonhole principle: if you place more pigeons than there are pigeonholes, at least one hole must hold two pigeons. Since the universe of possible keys (every integer, every string) is vastly larger than any finite table, some keys must share a slot. The job of collision resolution is therefore not to prevent collisions but to handle them gracefully. There are two standard strategies.
The scale of the mismatch is worth appreciating. A hash table storing usernames might have only a few thousand slots, yet the set of possible usernames — say up to 15 lower-case letters — runs to 2615, an astronomically larger number. The hash function is being asked to compress that enormous space down to a few thousand pigeonholes, so vast numbers of distinct keys are forced to share each slot in principle. A good hash function makes the keys you actually store land in different slots most of the time, but it can never make collisions impossible — which is exactly why every hash table needs a resolution strategy built in, and why "just use a better hash function" is never a complete answer to a collision question.
Chaining (also called closed addressing) keeps every slot pointing to a linked list of all the items that hashed there. On a collision the new item is simply appended to (or prepended to) the list at that slot.
Inserting our earlier keys 14, 21, 15, 10, 23 into a size-7 table by chaining — note that 14 and 21 both hash to 0 and so share a chain:
flowchart LR
S0["Slot 0"] --> A14["14"] --> A21["21"] --> An0["null"]
S1["Slot 1"] --> A15["15"] --> An1["null"]
S2["Slot 2"] --> A23["23"] --> An2["null"]
S3["Slot 3"] --> A10["10"] --> An3["null"]
S4["Slot 4"] --> An4["null"]
S5["Slot 5"] --> An5["null"]
S6["Slot 6"] --> An6["null"]
To search for a key, hash it to find the slot, then walk that slot's chain comparing keys. To delete, hash, find, and unlink from the list — which is genuinely easy, one of chaining's advantages. Chaining is exactly the linked list structure from the earlier lesson, used as the value type of every array slot, so it inherits both the linked list's easy insertion/deletion and its poor cache behaviour (the nodes are scattered through the heap).
| Operation | Average case | Worst case |
|---|---|---|
| Search | O(1+α) | O(n) — all keys in one chain |
| Insert | O(1) | O(1) |
| Delete | O(1+α) | O(n) |
Here α is the load factor (next section). With a good hash function the chains are short and even, so a search examines only about 1+α items on average — effectively constant. In the pathological worst case every key hashes to the same slot, the table degenerates into a single linked list, and search becomes O(n) — the same degeneration seen in an unbalanced BST.
Tracing a chained search makes the cost concrete. In the chained table above (slot 0 holds the chain 14 → 21), searching for 21 hashes to slot 0, then walks that chain: it compares 14 (no match), follows the pointer to 21 (match) — two comparisons. Searching for the absent key 35 (35mod7=0) also lands on slot 0, walks 14 → 21 → null, finds no match, and reports absence after exhausting that one short chain. The number of comparisons a chained lookup makes is simply one plus the length of the chain at the target slot — short when the hash spreads keys evenly, long only when many keys pile into one slot.
Open addressing stores every item directly in the array — there are no separate lists. When the computed slot is occupied, the algorithm probes for another slot according to a fixed rule. The simplest rule is linear probing: try the next slot, then the next, wrapping around modulo m:
probe sequence: h(k), h(k)+1, h(k)+2, …(modm)
Insert 14, 21, 15, 10, 7 into a size-7 table with h(k)=kmod7:
| Key | h(k) | Probing | Final slot |
|---|---|---|---|
| 14 | 0 | slot 0 empty | 0 |
| 21 | 0 | slot 0 holds 14 → try 1 (empty) | 1 |
| 15 | 1 | slot 1 holds 21 → try 2 (empty) | 2 |
| 10 | 3 | slot 3 empty | 3 |
| 7 | 0 | 0,1,2,3 all occupied → try 4 (empty) | 4 |
The resulting array:
| Slot | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Value | 14 | 21 | 15 | 10 | 7 | — | — |
To search, hash to the start slot and probe forward: if a slot holds the key, success; if a slot is empty, the key is absent (it would have been placed there); if it holds a different key, keep probing. The danger is primary clustering: once a run of occupied slots forms, any key hashing anywhere into that run extends it, so clusters grow and lengthen probe sequences — visible above, where 7 had to probe four slots because 14, 21, 15 had clumped together. Deletion is also awkward: you cannot simply blank a slot, because that would prematurely end a later search's probe sequence; instead the slot must be marked with a special "deleted" tombstone that searches skip over but insertions may reuse.
It is worth tracing two lookups against the filled table above (slots 0–4 holding 14, 21, 15, 10, 7) to see both outcomes:
| Search for | Probe steps | Result |
|---|---|---|
| 15 | slot 1 holds 21 (not 15) → slot 2 holds 15 → match | found after 2 probes |
| 28 | 28mod7=0: slot 0 holds 14 → slot 1 holds 21 → slot 2 holds 15 → slot 3 holds 10 → slot 4 holds 7 → slot 5 empty | not found after 6 probes (stops at first empty slot) |
The unsuccessful search for 28 illustrates two important points: a search stops the instant it reaches an empty slot (28 is provably absent, because had it ever been inserted it would have occupied slot 5 or earlier), and clustering has made even a failed lookup expensive — six probes where an empty table would have taken one. This is precisely why the load factor must be kept low: long clusters punish every probe sequence that runs into them.
The load factor α measures how full the table is:
α=mn
where n is the number of items stored and m is the table size. It is the single most important number governing hash-table performance:
| α | Meaning | Effect |
|---|---|---|
| 0.0 | empty | — |
| 0.5 | half full | few collisions, fast, some wasted space |
| 0.75 | three-quarters full | common resize threshold |
| 1.0 | full (open addressing) | many collisions; open addressing cannot exceed this |
A low load factor means few collisions and fast operations but wasted memory; a high load factor packs memory tightly but lengthens chains and probe sequences. With chaining, α can exceed 1 (chains just get longer); with open addressing it cannot exceed 1, since there are only m slots for n items. Most implementations keep α in the 0.5–0.75 range as a deliberate space-versus-time compromise.
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.