HASHING
Lecture 6
Recommended Text
• Tomas H. Cormen et al., Introduction to
Algorithms, MIT press
Hashing Data Structure
Hashing is an important Data Structure which is
designed to use a special function called the
Hash function which is used to map a given
value with a particular key for faster access of
elements.
The efficiency of mapping depends of the
efficiency of the hash function used.
Hashing Data Structure
Let a hash function H(x) maps the value x at
the index x%10 in an Array.
For example if the list of values is
[11,12,13,14,15] it will be stored at positions
{1,2,3,4,5} in the array or Hash table
respectively.
Hashing Data Structure
Hashing Data Structure
Suppose we want to design a system for storing
employee records keyed using phone numbers.
And we want following queries to be performed
efficiently:
i. Insert a phone number and corresponding
information.
ii. Search a phone number and fetch the
information.
iii. Delete a phone number and related
information.
Hashing Data Structure
We can think of using the following data
structures to maintain information about
different phone numbers.
i. Array of phone numbers and records.
ii. Linked List of phone numbers and records.
iii. Balanced binary search tree with phone
numbers as keys.
iv. Direct Access Table.
For arrays and linked lists
For arrays and linked lists, we need to search
in a linear fashion, which can be costly in
practice.
If we use arrays and keep the data sorted, then
a phone number can be searched in O(Logn)
time using Binary Search, but insert and delete
operations become costly as we have to
maintain sorted order.
Balanced Binary Search
Tree
With balanced binary search tree, we get
moderate search, insert and delete times.
All of these operations can be guaranteed to be
in O(Logn) time.
Direct Access Table
Another solution that one can think of is to use a
DIRECT ACCESS TABLE where we make a big
array and use phone numbers as index in the
array.
An entry in array is NIL if phone number is not
present, else the array entry stores pointer to
records corresponding to phone number.
Time complexity wise this solution is the best
among all, we can do all operations in O(1)
time.
Direct Access Table
Direct Address Table is a data structure that
has the capability of mapping records to their
corresponding keys using arrays.
In direct address tables, records are placed
using their key values directly as indexes.
They facilitate fast searching, insertion and
deletion operations.
Direct Access Table
Example:
Create an array of size equal to maximum value
plus one (assuming 0 based index) and then use
values as indexes.
For example, in the following diagram key 21 is
used directly as index.
Direct Access Table
Advantages
Searching in O(1) Time: Direct address tables
use arrays which are random access data
structure, so, the key values (which are also the
index of the array) can be easily used to search
the records in O(1) time.
Insertion in O(1) Time: We can easily insert
an element in an array in O(1) time. The same
thing follows in a direct address table also.
Advantages
Deletion in O(1) Time: Deletion of an element
takes O(1) time in an array. Similarly, to delete
an element in a direct address table we need
O(1) time.
Advantages
Deletion in O(1) Time: Deletion of an element
takes O(1) time in an array. Similarly, to delete
an element in a direct address table we need
O(1) time.
Limitations
i. Prior knowledge of maximum key value
ii. Practically useful only if the maximum value
is very less.
iii. It causes wastage of memory space if there
is a significant difference between total
records and maximum value.
Limitations
This solution has many practical limitations.
First problem with this solution is that a huge
extra space is required.
For example if phone number is n digits, we need
O(m * 10n) space for table where m is size of a
pointer to record and n, digits of the telephone
number
Another problem is an integer in a
programming language may not store n digits.
Hashing Data Structure
Due to above limitations Direct Access Table
cannot always be used. Hashing is the solution
that can be used in almost all such situations
and performs extremely well compared to above
data structures like Array, Linked List, Balanced
BST in practice.
With hashing we get O(1) search time on
average (under reasonable assumptions) and
O(n) in worst case.
Hashing Data Structure
Hashing is an improvement over Direct Access
Table.
The idea is to use hash function that converts
a given phone number or any other key to a
smaller number and uses the small number as
index in a table called hash table.
Hashing Data Structure
Application
i. Database Indexing
ii. Caching
iii. Error Checking
iv. Programme Compilation
v. Password Authentication
Hash Function (Algorithm)
A function that converts a given big phone
number to a small practical integer value.
The mapped integer value is used as an index in
hash table. In simple terms, a hash function
maps a big number or string to a small integer
that can be used as index in hash table.
Properties of a Good Hash
Function
i. Efficiently computable.
ii. Should uniformly distribute the keys (Each
table position equally likely for each key)
For example for phone numbers a bad hash
function is to take first three digits.
A better function is consider last three digits.
NB: This may not be the best hash function.
There may be better ways.
Hash Table
An array that stores pointers to records
corresponding to a given phone number.
An entry in hash table is NIL if no existing phone
number has hash function value equal to the
index for the entry.
Collision Handling
Since a hash function gets us a small number
for a big key, there is possibility that two keys
result in same value.
The situation where a newly inserted key maps
to an already occupied slot in hash table is
called collision and must be handled using
some collision handling technique.
Collision Handling
Technique
i. Open Addressing
ii. Chaining
Open Addressing
In open addressing, all elements are stored in
the hash table itself.
Each table entry contains either a record or NIL.
Any point, size of the table must be greater than
or equal to the total number of keys (Note that
we can increase table size by copying old data if
needed)
Open Addressing
When searching for an element, we one by one
examine table slots until the desired element is
found or it is clear that the element is not in the
table.
Open Addressing
Insert(k): Keep probing until an empty slot is
found. Once an empty slot is found, insert k.
Search(k): Keep probing until slot’s key doesn’t
become equal to k or an empty slot is reached
Delete(k): Delete operation is interesting. If we
simply delete a key, then search may fail.
So slots of deleted keys are marked specially as
“deleted”.
Insert can insert an item in a deleted slot, but
the search doesn’t stop at a deleted slot.
Open Addressing (Types)
i. Linear Probing
ii. Quadratic Probing
iii. Double Hashing
Linear Probing
In linear probing, we linearly probe for next slot.
For example, typical gap between two probes is
1.
let hash(x) be the slot index computed using
hash function and S be the table size
Linear Probing
If slot hash(x) % S is full,
then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full,
then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full,
then we try (hash(x) + 3) % S
..................................................
..................................................
Linear Probing
Let us consider a simple hash function as “key
mod 7” and sequence of keys as 50, 700, 76,
85, 92, 73, 101.
Linear Probing
Let us consider a simple hash function as “key
mod 7” and sequence of keys as 50, 700, 76,
85, 92, 73, 101.
Linear Probing
Clustering: The main problem with linear
probing is clustering, many consecutive
elements form groups and it starts taking time
to find a free slot or to search an element.
Quadratic Probing
We look for i2th slot in ith iteration.
let hash(x) be the slot index computed using hash
function.
If slot hash(x) % S is full,
then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full,
then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full,
then we try (hash(x) + 3*3) % S
..................................................
..................................................
Double Hashing
We use another hash function hash2(x) and look for
i*hash2(x) slot in i’th rotation.
let hash(x) be the slot index computed using hash function.
If slot hash(x) % S is full,
then we try (hash(x) + 1*hash2(x)) % S
If (hash(x) + 1*hash2(x)) % S is also full,
then we try (hash(x) + 2*hash2(x)) % S
If (hash(x) + 2*hash2(x)) % S is also full,
then we try (hash(x) + 3*hash2(x)) % S
.............................................
.............................................
Comparison
Linear probing has the best cache
performance but suffers from clustering.
One more advantage of Linear probing is
easy to compute.
Quadratic probing lies between the two in
terms of cache performance and clustering.
Double hashing has poor cache performance
but no clustering.
Double hashing requires more computation
time as two hash functions need to be
computed.
Performance of Open
Addressing
m = Number of slots in the hash table
n = Number of keys to be inserted in the hash table
Load factor α = n/m ( < 1 )
Expected time to search/insert/delete < 1/(1 - α)
So Search, Insert and Delete take (1/(1 - α)) time
Chaining
The idea is to make each cell of hash table point
to a linked list of records that have same hash
function value.
Chaining is simple, but requires additional
memory outside the table.
Chaining
Consider a simple hash function as “key mod 7”
and sequence of keys as 50, 700, 76, 85, 92, 73,
101.
Chaining
Advantages
i. Simple to implement.
ii. Hash table never fills up, we can always add
more elements to chain.
iii. Less sensitive to the hash function or load
factors.
iv. It is mostly used when it is unknown how
many and how frequently keys may be
inserted or deleted.
Disadvantages
i. Cache performance of chaining is not good
as keys are stored using linked list. Open
addressing provides better cache
performance as everything is stored in same
table.
ii. Wastage of Space (Some Parts of hash table
are never used)
iii. If the chain becomes long, then search time
can become O(n) in worst case.
iv. Uses extra space for links.
Performance of Chaining
Performance of hashing can be evaluated under
the assumption that each key is equally likely to
be hashed to any slot of table (simple uniform
hashing).
Performance of Chaining
m = Number of slots in hash table
n = Number of keys to be inserted in hash table
Load factor α = n/m
Expected time to search = O(1 + α)
Expected time to insert/delete = O(1 + α)
Time complexity of search insert and delete is
O(1) if α is O(1)
Chaining Open Addressing
Chaining is Simpler to Open Addressing requires more
implement. computation.
In chaining, Hash table never
In open addressing, table may
fills up, we can always add more
become full.
elements to chain.
Open addressing requires extra
Chaining is Less sensitive to the
care for to avoid clustering and
hash function or load factors.
load factor.
Chaining is mostly used when it
Open addressing is used when
is unknown how many and how
the frequency and number of
frequently keys may be inserted
keys is known.
or deleted.
Chaining Open Addressing
Open addressing provides better
Cache performance of chaining is
cache performance as
not good as keys are stored using
everything is stored in the same
linked list.
table.
Wastage of Space (Some Parts of In Open addressing, a slot can
hash table in chaining are never be used even if an input doesn’t
used). map to it.
Chaining uses extra space for
No links in Open addressing
links.
Quiz
1. Given the following input (4322,
1334, 1471, 9679, 1989, 6171, 6173,
4199) and the hash function x mod
10, show which of the inputs have the
same hash value.
2. Differentiate between Hashing &
Direct Access Table.
3. The keys 22, 8, 12, 3, 13, 15 and 25
are inserted into an initially empty
hash table of length 9 using open
addressing with hash function h(k) =
k mod 10 and quadratic probing.
What is the resultant hash table?
End of Lecture 6