UNIT 4: HASH TABLES - COMPREHENSIVE
NOTES
TABLE OF CONTENTS
1. Introduction to Hash Tables
2. Direct Addressing vs Hashing
3. Collision Resolution Techniques
4. Hash Function Design
5. Open Addressing Methods
6. Performance Analysis
7. Practical Considerations
1. INTRODUCTION TO HASH TABLES
1.1 What is a Hash Table?
Hash Table (Hash Map): A data structure that implements an associative
array—a structure that maps keys to values using a hash function.
Key Components: 1. Hash Function (h): Maps keys from a large universe
U to slots in a table T of size m 2. Hash Table T: Array of m slots (0 to
m-1) 3. Collision Resolution: Method to handle multiple keys mapping to
the same slot
Basic Idea:
Universe of Keys U (size possibly very large)
↓ (apply hash function h)
Hash Table T (size m, usually much smaller than |U|)
1.2 Problem Context
Motivation: - Cannot store all keys from universe U (too large) - Need to store
only the actual keys K present - Want O(1) average-case search, insert, delete
times
Trade-off: - Direct Addressing: O(1) worst-case, but requires space O(|U|)
(wasteful) - Hash Tables: O(1) average-case, requires space O(|K|) (efficient)
1.3 Key Definitions
Hash Function: h: U → {0, 1, …, m-1}
Hash Value: h(k) is the slot where key k maps
1
Collision: When two distinct keys k� and k� have h(k�) = h(k�)
Load Factor: � = n/m (ratio of elements to table size)
2. DIRECT ADDRESSING VS HASHING
2.1 Direct Addressing
Concept: Store element with key k directly in slot k
Table Size: |U| (potentially huge)
Operations:
Direct-Address-Search(T, k)
return T[k]
Direct-Address-Insert(T, x)
T[[Link]] = x
Direct-Address-Delete(T, x)
T[[Link]] = NIL
Advantages: - O(1) worst-case for search, insert, delete - Simple to understand
and implement
Disadvantages: - Space requirement: O(|U|) (wasteful if |U| » |K|) - Requires
keys to be small integers - Cannot handle infinite or large key universes
Example: If U = {0, 1, …, 10�}, need array of size 10�, but only 100 keys stored
2.2 Hashing Overview
Concept: Use hash function to map keys to smaller table of size m
Table Size: m (where m « |U|)
Key Insight: - Use function h to compute slot: T[h(k)] - Storage reduced to
O(|K|) � O(n) - Trade worst-case for good average-case performance
Basic Hash Table Operations:
Hash-Search(T, k)
return T[h(k)] ← Simple, but collisions complicate this
Hash-Insert(T, x)
T[h([Link])] = x ← Must handle collisions
Hash-Delete(T, x)
T[h([Link])] = NIL ← Must handle collisions
2
Challenge: Collisions—multiple keys hashing to same slot
3. COLLISION RESOLUTION TECHNIQUES
3.1 Chaining (Separate Chaining)
3.1.1 Concept
• Each slot in hash table points to a linked list (chain)
• All elements hashing to same slot are in that slot’s list
• Each chain stores elements independently
Visualization:
Hash Table T (size m):
T[0] → [k�] → [k�] → NIL
T[1] → NIL
T[2] → [k�] → NIL
T[3] → [k�] → [k�] → [k�] → NIL
...
T[m-1] → [k�] → NIL
3.1.2 Operations Chained-Hash-Search(T, k)
return List-Search(T[h(k)], k)
Chained-Hash-Insert(T, x)
List-Prepend(T[h([Link])], x) ← Insert at front of list
Chained-Hash-Delete(T, x)
List-Delete(T[h([Link])], x) ← Delete element x from its list
3.1.3 Advantages and Disadvantages Advantages: - Simple to imple-
ment - Deletion is straightforward (O(1) if lists are doubly-linked) - Works well
with unknown load factors - Load factor � can be > 1
Disadvantages: - Extra space for pointers - May be slower in practice due to
cache misses (following pointers) - Deletion requires access to element, not just
key
3.1.4 When to Use Doubly-Linked Lists Doubly-Linked: - Deletion:
O(1) (no need to search list) - Requires extra “previous” pointer
Singly-Linked: - Deletion: O(length of list) (must find predecessor) - Saves
one pointer per element
Recommendation: Use doubly-linked if deletion is frequent
3
3.1.5 Expected Chain Length If there are n elements and m slots: - Ex-
pected length of each chain: E[|chain|] = n/m = � (load factor) - Longer
chains → worse performance - Keep � = O(1) for good performance
3.2 Open Addressing
3.2.1 Concept
• All elements stored directly in hash table (no external lists)
• When collision occurs, probe alternative slots
• Must find empty slot to insert element
• Load factor � = n/m < 1 (table cannot fill completely)
Basic Idea:
Hash-Insert(T, k)
i = 0
repeat
q = h(k, i) ← Probe sequence depends on key and probe number
if T[q] == NIL
T[q] = k
return q
else
i = i + 1
until i == m
error "hash table overflow"
3.2.2 Probe Sequence Requirement For every key k, the probe sequence:
h(k, 0), h(k, 1), ..., h(k, m-1)
must be a permutation of {0, 1, …, m-1}
This ensures every slot is eventually examined as the table fills.
3.2.3 Types of Open Addressing Three common methods differ in how
probe sequences are determined:
1. Linear Probing
2. Quadratic Probing
3. Double Hashing
3.3 Linear Probing
3.3.1 Definition
h(k, i) = (h�(k) + i) mod m
Probe sequence: h�(k), h�(k)+1, h�(k)+2, …, wrapping around
Example:
4
If h�(k) = 5 and m = 10:
Probe sequence: 5, 6, 7, 8, 9, 0, 1, 2, 3, 4
3.3.2 Advantages and Disadvantages Advantages: - Simple to imple-
ment - No need for secondary hash function - Good cache locality (sequential
access)
Disadvantages: - Primary Clustering: Tends to create long runs of occu-
pied slots - Search scans entire cluster, even if key is not present - Degrades
performance as table fills
Primary Clustering Example:
After insertions, creates pattern:
[k�][k�][k�][empty]...
Inserting new key hashing to position 1 must scan through all 3 elements
3.3.3 When to Use
• Small tables where clustering is minimal
• When cache locality is important
• Simple applications where implementation ease matters
3.4 Double Hashing
3.4.1 Definition
h(k, i) = (h�(k) + i·h�(k)) mod m
Two auxiliary hash functions: - h�(k): Initial probe position - h�(k): Step
size for subsequent probes
Probe sequence: h�(k), h�(k)+h�(k), h�(k)+2·h�(k), …, all mod m
Example (from Unit 4 PDF):
m = 13, h�(k) = k mod 13, h�(k) = 1 + (k mod 11)
Inserting key 14:
h�(14) = 14 mod 13 = 1
h�(14) = 1 + (14 mod 11) = 1 + 3 = 4
Probe sequence: 1, 5, 9, 0, 4, 8, 12, 3, 7, 11, 2, 6, 10
Position 1: occupied
Position 5: occupied
Position 9: empty → insert key 14 here
3.4.2 Key Requirements For probe sequence to visit all m slots: -
h�(k) must be relatively prime to m
5
Strategies: - m is prime: Choose h�(k) < m (automatically relatively prime)
- m is power of 2: Choose h�(k) to always be odd
3.4.3 Number of Probe Sequences Double Hashing: Produces m differ-
ent probe sequences (each h�(k), h�(k) pair is distinct)
Ideal (Independent Uniform Permutation Hashing): Would produce m!
permutations
Comparison: - m! » m (double hashing limited, but still good) - Better than
linear probing (only m sequences) - Approximates ideal behavior well in practice
3.4.4 Advantages and Disadvantages Advantages: - Many possible
probe sequences (m different ones) - Avoids primary clustering of linear probing
- Good empirical performance - No secondary clustering
Disadvantages: - Requires good secondary hash function h�(k) - More complex
than linear probing - Two hash function computations per probe
3.4.5 When to Use
• General-purpose applications
• When good hash functions are available
• Performance more important than simplicity
• Deletions are infrequent (or handled specially)
3.5 Quadratic Probing
3.5.1 Definition
h(k, i) = (h�(k) + c�·i + c�·i²) mod m
where c�, c� are constants
Probe sequence: h�(k), h�(k)+c�+c�, h�(k)+c�+2c�+4c�, …
3.5.2 Characteristics Primary Clustering: Avoided (unlike linear prob-
ing)
Secondary Clustering: Can still occur if h�(k) produces same hash for two
different keys
Probe Sequence Permutation: May not visit all m slots (depends on m and
constants)
3.5.3 When to Use
• Middle ground between linear and double hashing
• When good secondary hash function not available
• Table size constraints allow careful constant selection
6
4. HASH FUNCTION DESIGN
4.1 Properties of Good Hash Functions
4.1.1 Independent Uniform Hashing Assumption Definition: A hash
function h satisfies independent uniform hashing if:
1. Uniformity: Each key is equally likely to hash to any of the m slots
2. Independence: Hashing of one key is independent of another
Mathematically: For any two distinct keys k�, k�: - Pr[h(k�) = j] = 1/m for
all slots j - Pr[h(k�) = j � h(k�) = q] = 1/m² for all j, q
In Practice: Difficult to achieve perfectly, but we can approximate it
4.1.2 Desirable Properties
1. Efficiency: Computable in O(1) time
2. Uniformity: Distributes keys evenly across slots
3. Independence: Unrelated to patterns in data
4. Determinism: Same key always maps to same slot
5. Avalanche Effect: Small change in key → large change in hash value
4.2 Static Hashing Methods
Static Hashing: Single fixed hash function, randomness comes from data
distribution
4.2.1 Division Method Formula:
h(k) = k mod m
Example: - m = 12, k = 100: h(100) = 100 mod 12 = 4
Advantages: - Very fast (single division operation) - Simple to implement
Disadvantages: - Quality depends on choice of m - May not distribute keys
uniformly
Best Practices: - Choose m to be prime - Avoid powers of 2 (keys with
patterns may cluster) - Avoid powers of 10 (for decimal keys)
When m is Prime: - Better distribution of keys - Unrelated patterns less
likely to cause clustering
4.2.2 Multiplication Method Formula:
h(k) = �m(kA mod 1)�
7
where A is a constant 0 < A < 1, and kA mod 1 is fractional part of kA
Alternative (Multiply-Shift):
h(k) = �(a·k mod 2^w) / 2^(w-l)�
where a is random w-bit constant, extract l highest-order bits
Advantages: - Value of m is not critical (can be any power of 2) - Works
well with various data distributions - Efficient to compute with bit operations
Disadvantages: - Requires careful choice of constant A - More complex than
division method
When to Use: - m is power of 2 (efficient in binary) - Need good performance
across different data distributions - Constant A can be pre-computed
4.2.3 Example Hash Functions For integer keys:
h(k) = k mod m (division)
h(k) = �m(k·� mod 1)� (multiplication with � = golden ratio)
For string keys:
h(s) = (Σ s[i]·2^i) mod m
or
h(s) = (Σ s[i]·p^i) mod m where p is prime
4.3 Random Hashing (Universal Hashing)
4.3.1 Motivation Problem with Static Hashing: - Fixed hash function
may work poorly on certain data - No guarantee of good average-case perfor-
mance - Adversarial input could cause worst-case behavior
Solution: Choose hash function randomly from family of functions
4.3.2 Universal Hash Family Definition: A family H of hash functions is
universal if for any two distinct keys k�, k�:
|{h � H : h(k�) = h(k�)}| � |H|/m
Interpretation: For random h chosen from H, collision probability is at most
1/m
Probability bound: Pr[h(k�) = h(k�)] � 1/m
4.3.3 Benefits
1. Works well for any input distribution
2. Randomization comes from algorithm, not data
3. Guarantees good average-case performance
8
4. No adversarial inputs
4.3.4 Example: Universal Hashing for Integers Choose: - Prime p �
|U| - Random integers a, b with 0 � a < p, 0 � b < p, a � 0 - m = size of hash
table
Hash function:
h(k) = ((a·k + b) mod p) mod m
Property: Family of all such functions is universal (with appropriate parame-
ters)
5. ANALYSIS OF HASH TABLE OPERATIONS
5.1 Chaining Analysis
5.1.1 Unsuccessful Search (11.1) Theorem - Unsuccessful Search
with Chaining:
Under independent uniform hashing, an unsuccessful search takes Θ(1 + �)
time on average.
Proof Sketch: - Element searched for is not in table - Search examines entire
chain at h(k) - Expected chain length = � = n/m - Time: O(1) for hash function
+ O(�) for searching chain - Total: Θ(1 + �)
5.1.2 Successful Search (11.2) Theorem - Successful Search with
Chaining:
Under independent uniform hashing, a successful search takes Θ(1 + �) time
on average.
Proof Sketch: - Search is for element x (definitely in table) - Examines x plus
elements inserted before x - On average, half the chain length before x - But
accounting carefully: still O(1 + �)
Key Insight: Same asymptotic time whether search is successful or unsuccess-
ful!
5.1.3 Insertion and Deletion Insertion: - Time: O(1) worst-case
(prepend to chain) - No search needed if element not already there
Deletion (doubly-linked list): - Time: O(1) worst-case - If given element x
(not just key k) - Must use singly-linked if given only key → O(chain length)
9
5.1.4 Overall Performance If � = n/m = O(1) (number of elements pro-
portional to table size): - Search (successful/unsuccessful): O(1) average-case -
Insert: O(1) worst-case - Delete: O(1) worst-case
Conclusion: All dictionary operations in O(1) average-case time with chain-
ing!
5.2 Open Addressing Analysis
5.2.1 Unsuccessful Search (11.6) Theorem - Unsuccessful Search:
Given an open-address hash table with load factor � = n/m < 1, the expected
number of probes in an unsuccessful search is at most 1/(1-�), assuming inde-
pendent uniform permutation hashing.
Proof Sketch: - Let X = number of probes - Probability first i-1 probes hit
occupied slots, i-th hits empty - Bound: Pr[X � i] � �^(i-1) - Expected value:
E[X] = Σ Pr[X � i] � Σ �^(i-1) = 1/(1-�)
Examples: - � = 0.5: E[X] � 2 (on average 2 probes) - � = 0.9: E[X] � 10 (on
average 10 probes) - � = 1.0: E[X] = ∞ (table full)
5.2.2 Insertion (11.7) Corollary - Insertion:
Inserting an element into an open-address table with load factor � < 1 requires
at most 1/(1-�) probes on average.
Proof: Insertion = unsuccessful search + place in first empty slot
5.2.3 Successful Search (11.8) Theorem - Successful Search:
A successful search in open-address table with load factor � < 1 takes at most
(1/�)·ln(1/(1-�)) probes on average.
Proof Sketch: - Element k was i-th to be inserted (for i = 1..n) - Load fac-
tor when inserted: (i-1)/m - Expected probes for insertion: 1/(1-(i-1)/m) -
Average over all: (1/n)·Σ(i=1 to n) 1/(1-(i-1)/m) - This sum evaluates to
(m/(m-n))·ln(m/(m-n))
Interpretation: As load factor increases, successful search becomes much
slower
Examples: - � = 0.5: E � 1.387 probes - � = 0.9: E � 2.559 probes - � = 1.0: E
= H� (harmonic number, very large)
5.2.4 Summary of Open Addressing
Operation Expected Time
Unsuccessful search O(1/(1-�))
Successful search O((1/�)·ln(1/(1-�)))
10
Operation Expected Time
Insertion O(1/(1-�))
Deletion Complex (need DELETED marker)
Keep Load Factor Small: - � = 0.5 gives O(1) expected time - As � → 1,
times increase significantly - Rehash when � gets too large
5.3 Comparison: Chaining vs Open Addressing
Aspect Chaining Open Addressing
Load factor � Can be > 1 Must be < 1
Search (unsuccessful) Θ(1 + �) O(1/(1-�))
Search (successful) Θ(1 + �) O((1/�)·ln(1/(1-�)))
Insertion O(1) O(1/(1-�))
Deletion O(1) or O(�) Difficult (DELETED needed)
Space overhead Pointers None (in-place)
Cache behavior Poor Good
When � = 0.5 O(1.5) O(1) to O(2)
When � = 0.9 O(1.9) O(10) to O(2.6)
Choose Chaining When: - Load factor varies unpredictably - Deletions are
frequent - Simplicity is important
Choose Open Addressing When: - Load factor stays small - Cache locality
matters - Extra space for pointers unacceptable
6. HANDLING DELETIONS IN OPEN ADDRESSING
6.1 The Deletion Problem
Issue: Deleting element by setting slot to NIL breaks probe sequences
Example:
Insert sequence: 10, 22, 15 all hash to slot 5
Using linear probing:
T[5] = 10
T[6] = 22
T[7] = 15
Deleting 10 (set T[5] = NIL):
T[5] = NIL, T[6] = 22, T[7] = 15
11
Search for 22:
Probe T[5] (NIL) → conclude 22 not in table (WRONG!)
6.2 Solution: DELETED Marker
Approach: - Mark deleted slot with special value DELETED (not NIL) - Insert
treats DELETED same as NIL (can reuse slot) - Search passes over DELETED
(key may be further along)
Modified HASH-SEARCH:
Hash-Search(T, k)
i = 0
repeat
q = h(k, i)
if T[q] = k
return q
if T[q] = NIL ← Only stop at NIL, not DELETED
return NIL
i = i + 1
until i = m
return NIL
Modified HASH-INSERT:
Hash-Insert(T, k)
i = 0
repeat
q = h(k, i)
if T[q] = NIL or T[q] = DELETED ← Treat DELETED as empty
T[q] = k
return q
i = i + 1
until i = m
error "table overflow"
6.3 Drawback of DELETED Marker
Problem: Search times no longer depend on load factor �
• Even with mostly empty table, must search past many DELETED slots
• Over time, accumulation of DELETED increases search cost
• Eventually must rehash entire table
When to Rehash: - When too many DELETED entries accumulate - Practical
rule: When DELETED count > some threshold - Rebuild table with new size
and hash function
12
7. PRACTICAL CONSIDERATIONS
7.1 Load Factor Management
7.1.1 Choosing Table Size For Chaining: - Keep m roughly equal to
expected n - Allows � = n/m � 1 = good performance - Can handle � > 1
without major degradation
For Open Addressing: - Keep m > n, so � < 1 - Typical: m � 2n (� � 0.5) -
Ensures 1/(1-�) � 2
Dynamic Resizing: - When � exceeds threshold (e.g., 0.75) - Create new table
of size 2m - Rehash all elements into new table - O(n) time, amortized O(1) per
insertion
7.1.2 Alpha and Performance
� = 0.25: Very sparse, wasted space
� = 0.50: Good balance (Open: 1/(1-�)=2, Chaining: 1.5)
� = 0.75: Getting dense (Open: 1/(1-�)=4, Chaining: 1.75)
� = 1.00: Full (Open: impossible, Chaining: slow)
7.2 Handling Different Key Types
7.2.1 Integer Keys Direct: h(k) = k mod m
Multiply: h(k) = �m·(k·A mod 1)�
7.2.2 String Keys Polynomial Rolling Hash:
h(s) = (s[0] + s[1]·p + s[2]·p² + ... + s[n-1]·p^(n-1)) mod m
where p is prime (e.g., p=31, p=37)
Benefits: - Works well for variable-length strings - Can be computed efficiently
with rolling hash - Good distribution properties
7.2.3 Composite Keys For objects with multiple fields:
h(object) = (h�(field1) + h�(field2) + h�(field3)) mod m
or
h(object) = (h�(field1) XOR h�(field2) XOR h�(field3)) mod m
7.3 Collision Resolution Trade-offs
7.3.1 Space Considerations Chaining: - Extra space for pointers - ~8
bytes per element on 64-bit system - Acceptable trade-off for simplicity and
flexibility
Open Addressing: - No extra pointers - Uses only array space - Tight coupling
of data and structure
13
7.3.2 Time Considerations Chaining: - Consistent O(1 + �) across opera-
tions - Good for unpredictable load factors - Cache misses from pointer-chasing
Open Addressing: - Better for � < 0.5 - Degrades quickly as � → 1 - Cache-
friendly sequential access
7.3.3 Deletion Considerations Chaining: - Simple O(1) deletion (if lists
doubly-linked) - No special markers needed - List can be singly-linked if deletions
rare
Open Addressing: - Requires DELETED marker (complicates search) - Over
time, accumulates DELETED entries - Eventually must rehash entire table
7.4 Implementation Tips
7.4.1 Universal Hash Functions
For prime p � |U|:
h(k) = ((a·k + b) mod p) mod m
where a, b chosen randomly with 0 � a < p, 0 � b < p, a � 0
Advantage: Theoretically provable good behavior
Disadvantage: Needs large prime p, two modulo operations
7.4.2 Collision Avoidance Best Effort: - Choose hash function carefully
- Size table appropriately (not too small, not too large) - Keep load factor in
good range
Can’t Avoid Entirely: - By pigeonhole principle, collisions unavoidable - Must
handle gracefully with good collision resolution
7.4.3 Testing Hash Functions Uniformity Test: - Insert many random
keys - Check distribution across slots - Variance should be low
Collision Test: - Count actual collisions - Compare with expected (n·(1-(1-
1/m)^n)) - Function is good if actual � expected
8. IMPORTANT THEOREMS & RESULTS
Chaining
(11.1) Unsuccessful search: Θ(1 + �) average time
(11.2) Successful search: Θ(1 + �) average time
14
Open Addressing
(11.6) Unsuccessful search: O(1/(1-�)) average probes
(11.7) Insertion: O(1/(1-�)) average probes
(11.8) Successful search: O((1/�)·ln(1/(1-�))) average probes
Properties
Independent Uniform Hashing: Collision probability = 1/m for any two
distinct keys
Universal Hash Family: Pr[collision for random h] � 1/m
KEY CONCEPTS SUMMARY
Essential Ideas
1. Hash Function: Maps large universe to small table via h: U → {0, 1, …,
m-1}
2. Collision Resolution:
• Chaining: Each slot has linked list
• Open Addressing: Find alternative slots (linear, quadratic, double
hashing)
3. Load Factor: � = n/m determines performance
4. Analysis:
• Chaining: O(1 + �) for all operations
• Open Addressing: O(1/(1-�)) for unsuccessful search
5. Hash Function Design:
• Static: Division, multiplication methods
• Random: Universal hashing for provable performance
6. Practical Issues:
• Dynamic resizing when � exceeds threshold
• Handling deletions (DELETED marker)
• Choosing between chaining and open addressing
PRACTICE PROBLEMS
1. Implement hash table with chaining (doubly-linked list)
2. Implement hash table with linear probing
15
3. Compare performance of different hash functions on various input distri-
butions
4. Implement dynamic resizing (grow table when � > 0.75)
5. Analyze collision patterns for given input and hash function
6. Design hash function for composite keys (e.g., (name, id) pairs)
7. Implement universal hashing and verify 1/m collision probability
8. Trace operations (insert, search, delete) with collisions
END OF UNIT 4 COMPREHENSIVE NOTES
16