MODULE V
Hashing : Introduction, Static Hashing, Dynamic Hashing
Priority Queues : Single and Double Ended Priority Queues, Leftist Trees
Introduction to Efficient Binary Search Trees : Optimal Binary Search Tree.
HASHING
• With the introduction of the Hash data structure, it is now possible to easily
store data in constant time and retrieve them in constant time as well.
• There are majorly three components of hashing:
[Link]: A Key can be anything string or integer which is fed as input in the
hash function the technique that determines an index or location for storage
of an item in a data structure.
[Link] Function: The hash function receives the input key and returns the
index of an element in an array called a hash table. The index is known as
the hash index.
[Link] Table: Hash table is a data structure that maps keys to values using a
special function called a hash function. Hash stores the data in an associative
manner in an array where each data value has its own unique index.
HASHING
HASHING
• Hashing is a technique used in data structures to map data typically for faster
data retrieval. Hashing enables us to perform dictionary operations like
search insert and delete.
• It involves using a hash function to convert the input data into a hash code or
hash value, which is then used as an index or address in a data structure like
a hash table.
There are two types of hashing
Static
Dynamic
STATIC HASHING
• In static hashing, the hash table size is fixed and does not change during the
lifetime of the data structure.
• In static Hashing the dictionary pairs are stored in a table, ht called the hash
table.
• The hash table is partitioned into b buckets, ht[0],…….ht[b-1]
• Each bucket is capable of holding s dictionary pairs.
• Thus a bucket is said to consist of s slots. usually s=1
• The address or location of a pair whose key is k is determined by hash
function h which maps keys into buckets.
• Thus for any key k, h(k) is an integer in range 0 through b-1
STATIC HASHING
Hash Table (ht) h(k)=0,1, ……..(b-1)
0
1
2
.
Buckets
.
.
b-2
b-1
1 2 ………… s
S slots
Hash Functions
A hash function maps a key into a bucket in the hash table. A function H from
the set K of keys into the set L of memory addresses is called the hash
function
H : K -> L
Desired Properties are
Easy computation
Minimal number of collisions
Uniformly distribute the hash addresses throughout the set L
Hash Functions
• Division
Chose a number m larger than the number of keys in K. The number m is
chosen to be a prime number or a number without small divisors to reduce
collisions. The function is defined as
h(k) = k % m or h(k)= k mod m
Bucket addresses range from 0 to m-1 and the hash table must have m buckets
Example:if m=10 then h(25)=5, h(32)=2
Hash Functions
Mid Square
In this method the square of the key is found and appropriate number of
bits are used from the middle of the square to obtain the bucket address
F(K)=middle(K2)
The number of bits used to obtain bucket address depends on table size.
Example: K=3205 K2 = 10272025 H(K)= 72
Hash Functions
Folding
Partition the keys k into several parts
All parts except for the last one have the same length
The parts are added together to obtain the hash address
• Example k= 12320324111220
• x1=123, x2=203, x3=241, x4=112, x5=20, address=
123+203+241+112+20= 699
Hash Functions
Digit Analysis
Useful in the case of a static file where all the keys in the table are
known in advance
Each key is interpreted using some radix r.
The same radix is used for all the keys in the table
Digits are examined with this radix
Digits having the most skewed distributions are deleted.
Enough digits are deleted so that the remaining digits are small enough
to give and address in the range of hash table
Hash Functions
Digit Analysis
• Each key is interpreted using some radix r:
Let's say our keys are interpreted in base 10 (decimal), so the radix (r) is 10.
• The same radix is used for all the keys in the table:
All keys are in base 10
• Digits are examined with this radix:
Our keys are 4-digit numbers, so we have digits at positions 1, 10, 100, and 1000.
• Digits having the most skewed distributions are deleted:
Let's consider the key distribution: 1234, 5678, 1987, 4321, 8765, 9999.
If we observe that the thousands' place (the leftmost digit) has a skewed distribution, we
might decide to eliminate it during digit analysis.
Hash Functions
Digit Analysis
• Enough digits are deleted so that the remaining digits are small enough to give an
address in the range of the hash table:
Suppose we decide to remove the thousands' place digit. Our keys now become: 234,
678, 987, 321, 765, 999
• Hashing the Remaining Digits:
Now, we need to hash the remaining digits to get a value within the range of our 10-
bucket hash table.
For simplicity, let's perform a modulo operation with 10.
Hash Functions
Converting keys to integers
• Converting each character to a unique integer and summing these
unique integers.
• After that using x%13 to find hash index.
Overflow/Collision
• The hash function may return the same hash value for two or more keys.
• Hash function h maps several different keys into the same bucket
• Two keys, k1 and k2 are synonyms with respect to h
if h(k1) = h(k2)
• When two or more keys have the same hash value, a overflow happens. To
handle this overflow, we use overflow handling techniques.
Overflow Handling Techniques
Two popular ways To handle overflows
Open Addressing(Linear probing, Quadratic probing, Random probing,
Rehashing)
Chaining
OPEN ADDRESSING –LINEAR PROBING
• In linear probing, the hash table is searched sequentially that starts from the
original location of the hash.
• If in case the location that we get is already occupied, then we check for the
next location.
• In linear probing to insert a new data its key is k we search the hash table
buckets in the order
ht [ h(k)+i ] % b
Where ht -- represent hashtable
h(k) – hash function
b – size of hash table
i -- value of i ranges from 0 <= i <= (b-1)
OPEN ADDRESSING –LINEAR PROBING
• Let us consider a simple hash function h(k) as key mod 5 and a sequence of
keys that are to be inserted are 50, 70, 76, 93.
Empty hash table The first key is 50. It will map to slot number 0 because
50%5=0 . So insert it into slot number 0.
OPEN ADDRESSING –LINEAR PROBING
• Let us consider a simple hash function h(k) as key mod 5 and a sequence of
keys that are to be inserted are 50, 70, 76, 93.
Key is 70. It will map to slot number 0 because Key is 76. It will map to slot number 1
70%5=0 but 50 is already at slot number 0 so, because 76%5=1 but 70 is already at slot
search for the next empty slot and insert it. number 1 so, search for the next empty
slot and insert it.
OPEN ADDRESSING –LINEAR PROBING
• Let us consider a simple hash function h(k) as key mod 5 and a sequence of
keys that are to be inserted are 50, 70, 76, 93.
The next key is 93 It will map to slot number 3 because 93%5=3, So insert it into slot number 3.
OPEN ADDRESSING –QUADRATIC PROBING
• In quadratic probing, when there is an overflow a quadratic
polynomial is added to the hash index thus by avoiding clustering.
• In quadratic probing to insert a new data its key is k we search the
hash table buckets in the order
ht [ h(k)+ i2 ] % b
OPEN ADDRESSING –QUADRATIC PROBING
• Example : Let us consider table size = 7, hash function as h(k) = k % 7
and collision resolution strategy is i 2 . Insert = 22, 30, and 50
create table of size 7
OPEN ADDRESSING –QUADRATIC PROBING
• Example : Insert 22 and 30
• Hash(22) = 22 % 7 = 1, Since the cell at index 1 is empty, we can easily insert 22 at slot 1.
• Hash(30) = 30 % 7 = 2, Since the cell at index 2 is empty, we can easily insert 30 at slot 2.
OPEN ADDRESSING –QUADRATIC PROBING
• Example : Insert 50
Hash(50) = 50 % 7 = 1
In our hash table slot 1 is already occupied. So, we will search for slot 1+1 2, i.e. 1+1 = 2,
Again slot 2 is found occupied, so we will search for cell 1+22, i.e.1+4 = 5,
Now, cell 5 is not occupied so we will place 50 in slot 5.
• Rehashing : Alternative method to retard the growth of clusters is to use a
series of hash functions h1, h2, hm.
• Random Probing : It can be done by ht [ h(k)+ s(i)] % b .
where s(i) is a pseudo random number.
CHAINING
• The idea behind separate chaining is to keep a separate linked list called a
chain.
• Chaining is one of the most popular and commonly used techniques in order
to handle collisions.
• The linked list data structure is used to implement this technique. So what
happens is, when multiple elements are hashed into the same slot index, then
these elements are inserted into a singly-linked list which is known as a
chain.
• Here, all those elements that hash into the same slot index are inserted into a
linked list.
CHAINING( Let us consider a simple hash function as “key mod 7” and a sequence of keys as 50, 700,
76, 85, 92, 73, 101)
CHAINING
• While searching if two different elements have the same hash value then
we store both the elements in the same linked list one after the other.
DRAW BACKS OF STATIC HASHING
1. Table size is fixed and hence cannot accommodate data growth.
2. Collisions increases as data size grows.
DYNAMIC HASHING
• Dynamically increases the size of the hash table as collision occurs. There
are two types:
1) Dynamic hashing using directory or (Extendible hashing) : Uses a
directory that grows or shrinks depending on the data distribution. No
overflow buckets
2) Directory less Dynamic hashing or (Linear hashing): No directory. Splits
buckets in linear order, uses overflow buckets.
DYNAMIC HASHING
1) Dynamic hashing using directory or (Extendible hashing)
Uses a directory of pointers to buckets/bins which are collections of
records
The number of buckets are doubled by doubling the directory, and splitting
just the bin that overflowed.
Directory much smaller than file, so doubling it is much cheaper.
Dynamic hashing using
directory
or
(Extendible hashing)
Directory less dynamic hashing
• Here we assume that hash table is as large as possible and so there is no
possibility of increasing size dynamically.
• To avoid initializing such large array, we use 2 variables q and r ,
0<=q<2r to keep track of active buckets.
• At any time only buckets 0 through 2r +q-1 are active.
• The remaining buckets on the chain are called overflow buckets.
• Informally r is the number of bits of h(k) used to index into the hash table
and q is the bucket that will split next.
Directory less dynamic hashing
• Consider the example :
• For r=2, we should find the value of q, condition for q is : 0<=q<2r
• There for : 0<=q<2r => 0<=q<4
q => 0,1,2,3
• Also At any time only buckets 0 through 2r +q-1 are active
• It means
when r=2 and q=0 then 0 to 22 +0-1 buckets are active(ie 0 to 3)
when r=2 and q=1 then 0 to 22 +1-1 buckets are active(ie 0 to 4)
when r=2 and q=2 then 0 to 22 +2-1 buckets are active(ie 0 to 5)
when r=2 and q=3 then 0 to 22 +3-1 buckets are active(ie 0 to 6)
Directory less dynamic hashing