Understanding Hashing in Data Structures
Understanding Hashing in Data Structures
Advantage-
• Unlike other searching techniques,
• Hashing is extremely efficient.
• The time taken by it to perform the search does not
depend upon the total number of elements.
• It completes the search with constant time complexity
O(1).
2
Hashing Mechanism-
In hashing,
• An array data structure called as Hash table is used to
store the data items.
• Based on the hash key value, data items are inserted into
the hash table.
3
4
Hash Function-
• Hash function is a function that maps any big number or
string to a small integer value.
• Hash function takes the data item as an input and returns
a small integer value as an output.
• The small integer value is called as a hash value.
• Hash value of the data item is then used as an index for
storing it into the hash table.
6
Hashing: Collision Resolution Schemes
• Collision Resolution Techniques
• Separate Chaining
– Separate Chaining with String Keys
– The class hierarchy of Hash Tables
– Implementation of Separate Chaining
8
Collision Resolution Techniques
9
Collision Resolution Techniques
Which is the Preferred Technique?
10
Collision Resolution Techniques
11
12
13
14
15
16
17
18
Separate Chaining
• The hash table is implemented as an array of linked lists.
• Inserting an item, r, that hashes at index i is simply insertion into the linked list at
position i.
19
Separate Chaining (cont’d)
• Retrieval of an item, r, with hash address, i, is simply retrieval from the linked list
at position i.
• Deletion of an item, r, with hash address, i, is simply deleting r from the linked list
at position i.
• Example: Load the keys 23, 13, 21, 14, 7, 8, and 15 , in this order, in a hash table
of size 7 using separate chaining with the hash function: h(key) = key % 7
h(23) = 23 % 7 = 2
h(13) = 13 % 7 = 6
h(21) = 21 % 7 = 0
h(14) = 14 % 7 = 0 collision
h(7) = 7 % 7 = 0 collision
h(8) = 8 % 7 = 1
h(15) = 15 % 7 = 1 collision
20
Separate Chaining with String Keys
• Recall that search keys can be numbers, strings or some other object.
• A hash function for a string s = c0c1c2…cn-1 can be defined as:
hash = (c0 + c1 + c2 + … + cn-1) % tableSize
this can be implemented as:
public static int hash(String key, int tableSize){
int hashValue = 0;
for (int i = 0; i < [Link](); i++){
hashValue += [Link](i);
}
return hashValue % tableSize;
}
class CommodityItem {
String name; // commodity name
int quantity; // commodity quantity needed
double price; // commodity price
}
21
Separate Chaining with String Keys (cont’d)
• Use the hash function hash to load the following commodity items into a
hash table of size 13 using separate chaining:
onion 1 10.0
tomato 1 8.50
cabbage 3 3.50
carrot 1 5.50
okra 1 6.50
mellon 2 10.0
potato 2 7.50
Banana 3 4.00
olive 2 15.0
salt 2 2.50
cucumber 3 4.50
mushroom 3 5.50
orange 2 3.00
• Solution:
1 onion carrot
2 Item Qty Price h(key)
onion 1 10.0 1
3
tomato 1 8.50 10
4 cabbage 3 3.50 4
cabbage carrot 1 5.50 1
5 okra 1 6.50 0
mellon 2 10.0 10
6 mushroom potato 2 7.50 0
Banana 3 4.0 11
7 olive 2 15.0 10
salt salt 2 2.50 7
8 cucumber 3 4.50 9
mushroom 3 5.50 6
9 cucumber orange 2 3.00 12
24
Implementing Hash Tables: The Hierarchy Tree
Container AbstractContainer
SearchableContainer
HashTable AbstractHashTable
ChainedHashTable
OpenScatterTable
25
Implementation of Separate Chaining
public class ChainedHashTable extends AbstractHashTable {
protected MyLinkedList [ ] array;
public ChainedHashTable(int size) {
array = new MyLinkedList[size];
for(int j = 0; j < size; j++)
array[j] = new MyLinkedList( );
}
public void insert(Object key) {
array[h(key)].append(key); count++;
}
public void withdraw(Object key) {
array[h(key)].extract(key); count--;
}
public Object find(Object key){
int index = h(key);
[Link] e = array[index].getHead( );
while(e != null){
if([Link]([Link]()) return [Link]();
e = [Link]();
}
return null;
}
26
}
Introduction to Open Addressing
• All items are stored in the hash table itself.
• In addition to the cell data (if any), each cell keeps one of the three states: EMPTY,
OCCUPIED, DELETED.
• While inserting, if a collision occurs, alternative cells are tried until an empty cell is found.
• Deletion: (lazy deletion): When a key is deleted the slot is marked as DELETED rather than
EMPTY otherwise subsequent searches that hash at the deleted cell will fail.
• Probe sequence: A probe sequence is the sequence of array indexes that is followed in
searching for an empty cell during an insertion, or in searching for a key during find or delete
operations.
27
Introduction to Open Addressing
28
Introduction to Open Addressing
29
Introduction to Open Addressing
30
Introduction to Open Addressing
31
Introduction to Open Addressing (cont’d)
• The function c(i) is used to resolve collisions.
• To insert item r, we examine array location h0(r) = h(r). If there is a collision, array locations
h1(r), h2(r), ..., hn-1(r) are examined until an empty slot is found.
• Similarly, to find item r, we examine the same sequence of locations in the same order.
• Note: For a given hash function h(key), the only difference in the open addressing collision
resolution techniques (linear probing, quadratic probing and double hashing) is in the
definition of the function c(i).
• Common definitions of c(i) are:
33
Open Addressing Facts
• In general, primes give the best table sizes.
• The search time depends only on the load factor, not on the table size.
• We can use the desired load factor to determine appropriate table size:
34
Open Addressing Facts
35
Open Addressing Facts
36
Open Addressing: Linear Probing
• c(i) is a linear function in i of the form c(i) = a*i.
• Usually c(i) is chosen as:
c(i) = i for i = 0, 1, . . . , tableSize – 1
37
Open Addressing: Linear Probing
38
Open Addressing: Linear Probing
39
Open Addressing: Linear Probing
40
Open Addressing: Linear Probing
41
Open Addressing: Linear Probing
42
Linear Probing (cont’d)
Example: Perform the operations given below, in the given order, on
an initially empty hash table of size 13 using linear probing with
c(i) = i and the hash function: h(key) = key % 13:
43
Linear Probing (cont’d)
a
0 O 26
1 E
2 E
3 E
4 E
5 O 18
6 E
7 E
8 O 47
9 D 35
10 O 9
11 E
12 O 64
44
Disadvantage of Linear Probing: Primary Clustering
• Linear probing is subject to a primary clustering phenomenon.
• Elements tend to cluster around table locations that they originally hash to.
• Primary clusters can combine to form larger clusters. This leads to long probe
sequences and hence deterioration in hash table efficiency.
Example of a primary cluster: Insert keys: 18, 41, 22, 44, 59, 32, 31, 73, in this order, in an
originally empty hash table of size 13, using the hash function h(key) = key % 13 and c(i) = i:
h(18) = 5
h(41) = 2
h(22) = 9
h(44) = 5+1
h(59) = 7
h(32) = 6+1+1
h(31) = 5+1+1+1+1+1
h(73) = 8+1+1+1
45
Open Addressing: Quadratic Probing
• Quadratic probing eliminates primary clusters.
• c(i) is a quadratic function in i of the form c(i) = a*i2 + b*i. Usually c(i) is chosen
as:
c(i) = i2 for i = 0, 1, . . . , tableSize – 1
or
c(i) = i2 for i = 0, 1, . . . , (tableSize – 1) / 2
47
Quadratic Probing (cont’d)
h0(23) = (23 % 7) % 7 = 2 hi(key) = (h(key) i2) % 7 i = 0, 1, 2, 3
h0(13) = (13 % 7) % 7 = 6
h0(21) = (21 % 7) % 7 = 0
h0(14) = (14 % 7) % 7 = 0 collision
0 O 21
h1(14) = (0 + 12) % 7 = 1
h0(7) = (7 % 7) % 7 = 0 collision
h1(7) = (0 + 12) % 7 = 1 collision 1 O 14
h-1(7) = (0 - 12) % 7 = -1
NORMALIZE: (-1 + 7) % 7 = 6 collision 2 O 23
h2(7) = (0 + 22) % 7 = 4
h0(8) = (8 % 7)%7 = 1 collision 3 O 15
h1(8) = (1 + 12) % 7 = 2 collision
h-1(8) = (1 - 12) % 7 = 0 collision 4 O 7
h2(8) = (1 + 22) % 7 = 5
h0(15) = (15 % 7)%7 = 1 collision 5 O 8
2
h1(15) = (1 + 1 ) % 7 = 2 collision
h-1(15) = (1 - 12) % 7 = 0 collision 6 O 13
2
h2(15) = (1 + 2 ) % 7 = 5 collision
h-2(15) = (1 - 22) % 7 = -3
NORMALIZE: (-3 + 7) % 7 = 4 collision 48
h3(15) = (1 + 32)%7 = 3
Secondary Clusters
• Quadratic probing is better than linear probing because it eliminates primary
clustering.
• However, it may result in secondary clustering: if h(k1) = h(k2) the probing
sequences for k1 and k2 are exactly the same. This sequence of locations is called
a secondary cluster.
• Secondary clustering is less harmful than primary clustering because secondary
clusters do not combine to form large clusters.
• Example of Secondary Clustering: Suppose keys k0, k1, k2, k3, and k4 are
inserted in the given order in an originally empty hash table using quadratic
probing with c(i) = i2. Assuming that each of the keys hashes to the same array
index x. A secondary cluster will develop and grow in size:
49
Double Hashing
• To eliminate secondary clustering, synonyms must have different probe sequences.
• Double hashing achieves this by having two hash functions that both depend on the
hash key.
• The function c(i) = i*hp(r) satisfies Property 2 provided hp(r) and tableSize are
relatively prime.
Example: Load the keys 18, 26, 35, 9, 64, 47, 96, 36, and 70 in this order, in an
empty hash table of size 13
(a) using double hashing with the first hash function: h(key) = key % 13 and the
second hash function: hp(key) = 1 + key % 12
(b) using double hashing with the first hash function: h(key) = key % 13 and
the second hash function: hp(key) = 7 - key % 7
Show all computations.
51
Double Hashing (cont’d)
54
Implementation of Open Addressing (Con’t.)
/* finds the index of the first unoccupied slot
in the probe sequence of obj */
protected int findIndexUnoccupied(Comparable obj){
int hashValue = h(obj);
int tableSize = getLength();
int indexDeleted = -1;
for(int i = 0; i < tableSize; i++){
int index = (hashValue + c(i)) % tableSize;
if(array[index].state == OCCUPIED
&& [Link](array[index].object))
throw new IllegalArgumentException(
"Error: Duplicate key");
else if(array[index].state == EMPTY ||
(array[index].state == DELETED &&
[Link](array[index].object)))
return indexDeleted ==-1?index:indexDeleted;
else if(array[index].state == DELETED &&
indexDeleted == -1)
indexDeleted = index;
}
if(indexDeleted != -1) return indexDeleted;
throw new IllegalArgumentException(
"Error: Hash table is full");
}
55
Implementation of Open Addressing (Con’t.)
protected int findObjectIndex(Comparable obj){
int hashValue = h(obj);
int tableSize = getLength();
59
REHASHING
Consider we have to insert the elements 37, 90, 55, 22, 17, 49, and
87. the table size is 10 and will use hash function:
37 % 10 = 7
90 % 10= 0
55 % 10 = 5
22 % 10 = 2
17 % 10 = 7
49 % 10 = 9
60
REHASHING
Now this table is almost full and if we try to insert more elements
collisions will occur and eventually further insertions will fail.
Hence we will rehash by doubling the table size. The old table
size is 10 then we should double this size for new table, that
becomes 20. But 20 is not a prime number, we will prefer to make
the table size as 23. And new hash function will be
62
EXTENDIBLE HASHING
• Extendible Hashing is a dynamic hashing method wherein directories, and
buckets are used to hash data. It is an aggressively flexible method in which the
hash function also experiences dynamic changes.
63
EXTENDIBLE HASHING
Frequently used terms in Extendible Hashing:
• Directories: These containers store pointers to buckets. Each directory is given a unique
id which may change each time when expansion takes place. The hash function returns
this directory id which is used to navigate to the appropriate bucket. Number of
Directories = 2^Global Depth.
• Buckets: They store the hashed keys. Directories point to buckets. A bucket may
contain more than one pointers to it if its local depth is less than the global depth.
• Global Depth: It is associated with the Directories. They denote the number of bits
which are used by the hash function to categorize the keys. Global Depth = Number of
bits in directory id.
• Local Depth: It is the same as that of Global Depth except for the fact that Local Depth
is associated with the buckets and not the directories. Local depth in accordance with
the global depth is used to decide the action that to be performed in case an overflow
occurs. Local Depth is always less than or equal to the Global Depth.
• Bucket Splitting: When the number of elements in a bucket exceeds a particular size,
then the bucket is split into two parts.
• Directory Expansion: Directory Expansion Takes place when a bucket overflows.
Directory Expansion is performed when the local depth of the overflowing bucket is
equal to the global depth.
64
EXTENDIBLE HASHING
Basic Working of Extendible Hashing:
65
EXTENDIBLE HASHING
Basic Working of Extendible Hashing:
Step 1 – Analyze Data Elements: Data elements may exist in various forms eg. Integer,
String, Float, etc.. Currently, let us consider data elements of type integer. eg: 49.
Step 2 – Convert into binary format: Convert the data element in Binary form. For string
elements, consider the ASCII equivalent integer of the starting character and then convert
the integer into binary form. Since we have 49 as our data element, its binary form is
110001.
Step 3 – Check Global Depth of the directory. Suppose the global depth of the Hash-
directory is 3.
Step 4 – Identify the Directory: Consider the ‘Global-Depth’ number of LSBs in the binary
number and match it to the directory id.
Eg. The binary obtained is: 110001 and the global-depth is 3. So, the hash function will
return 3 LSBs of 110001 viz. 001.
Step 5 – Navigation: Now, navigate to the bucket pointed by the directory with directory-id
001.
Step 6 – Insertion and Overflow Check: Insert the element and check if the bucket
overflows. If an overflow is encountered, go to step 7 followed by Step 8, otherwise, go to
step 9.
66
EXTENDIBLE HASHING
Step 7 – Tackling Over Flow Condition during Data Insertion: Many times, while inserting
data in the buckets, it might happen that the Bucket overflows. In such cases, we need to
follow an appropriate procedure to avoid mishandling of data.
First, Check if the local depth is less than or equal to the global depth. Then choose one of
the cases below.
• Case1: If the local depth of the overflowing Bucket is equal to the global depth, then
Directory Expansion, as well as Bucket Split, needs to be performed. Then increment
the global depth and the local depth value by 1. And, assign appropriate pointers.
Directory expansion will double the number of directories present in the hash structure.
• Case2: In case the local depth is less than the global depth, then only Bucket Split takes
place. Then increment only the local depth value by 1. And, assign appropriate
pointers.
67
EXTENDIBLE HASHING
Step 8 – Rehashing of Split Bucket Elements: The Elements present in the overflowing
bucket that is split are rehashed w.r.t the new global depth of the directory.
Step 9 – The element is successfully hashed.
68
EXTENDIBLE HASHING
69
EXTENDIBLE HASHING
70
EXTENDIBLE HASHING
71
EXTENDIBLE HASHING
72
EXTENDIBLE HASHING
73
EXTENDIBLE HASHING
74
EXTENDIBLE HASHING
75
EXTENDIBLE HASHING
76