4 of 59

Hashing

Index /Hash value	Actual Data
0	1000
1	201
2	702
3
…
…
99	699

Hash function

Key

Actual Data

Hash value/Hash key/ Index

generated

5 of 59

Hash table

Hash table is a data structure

that stores key-value pairs
and maps the key to a numeric value (hash key) with the help of a hash function
such that insertion, deletion and search operations can be performed with constant time complexity O(1)

All the data values are inserted into the hash table based on the hash key value
Hash key value is used to map the data with index in the hash table
Hash key is generated for every data using some function – hash function
When a hash table is in use, some spots contain valid records, and other spots are "empty“

6 of 59

Applications of Hash table

Database systems

it helps in random access

Symbol tables

Compilers use this table to maintain information about different symbols
It needs to access information frequently –random access done

Data dictionaries

Data dictionary or any other dictionary can be implemented using hash table

Network processing algorithm

Hash tables are fundamental components of several network processing algorithm
Route lookup, packet classification etc.

7 of 59

Bucket

Hash table stores entries (key, value) pairs in buckets
A hash table uses a hash function to compute an index- that refers to the address of the bucket
Ideally, the hash function is written in such a manner that each key is mapped to a unique bucket

8 of 59

Issues in Hashing

Multiple keys can hash to the same slot – collisions are possible .

Design hash functions such that collisions are minimized.
But avoiding collisions is impossible.
Design collision-resolution techniques.

Search will cost Ө(n) time in the worst case

However, all operations can be made to have an expected complexity of Ө (1).

Overflow

When the home bucket for a new (key,value) pair is already full

9 of 59

An example

Student records having Roll numbers

The size is of 100
Key is last 2 digit of your roll number

Students from various colleges

Size is 100
Key is phone number

Can we truncate the phone number?

10 of 59

Hash Function

The process of converting a key to hash key is known as hash function
A hash function is any function that can be used to map a data set of an arbitrary size to a data set of a fixed size

Mapping from U to the slots of a hash table T[0..m–1].
h : U {0,1,…, m–1}

The values returned by the hash functions are called hash keys, hash values, hash codes, hash sums or hashes

hash key = function(key)
h[k] is the hash value of key k

Our aim is to find a good hash function

11 of 59

Properties of a good hash function

Easy to compute

Fast and easy computation

Uniform Distribution

Every hash value has an equal probability
Even if the keys are not uniformly distributed
If a hashing function groups key values together, this is called clustering of the keys.

Fewer collisions
Dispersing the keys in a random way
Should not waste space unnecessarily

For every index, there is at least one key that hashes to it
Load factor lambda λ = (number of keys / TableSize)

12 of 59

Different Hash functions

Direct hashing
Subtraction
Modulo-division method
Digit Extraction method
Multiplication method
Mid-square method
Folding method
Radix Transformation
Rotation method
Pseudo-random Method
Universal method

* The methods marked in Bold are important for exam

13 of 59

Direct Hashing

Address for a key is generated without any algorithmic manipulation.
The hash table must contain an address for every possible key
Roll number to index

01 – 1
68 - 68

Pros:

No calculation and easy to implement
No collision!

Cons

Wastage of space if the keys are not consecutive
Not suitable for big and diverse lists

14 of 59

Subtraction method

Slight variation from the previous method

Little bit calculation required
Suppose, keys are consecutive but not starting from 1
Example: Roll numbers are 1000 to 1100
Subtract 1000 from the key to get the address

15 of 59

Modulo-Division method

Divides the key by the list size and uses the remainder+1 for the address

h(k) = (k % ListSize) +1
Listsize should be prime.. Why?

Because keys are typically not randomly distributed, but usually have some pattern

mostly even
mostly multiples of 10
in general: mostly multiples of some x

If ‘x’ is a factor of ListSize, then only (ListSize/x) slots will ever be used!
Since the only factor of a prime number is itself, this phenomena only hurts in the (rare) case where x = ListSize

16 of 59

Modulo-Division method Example

Suppose we want to maintain data for 300 students
First prime number greater than 300 is 307
Let us choose the list size as 307
Let’s consider some keys and calculate the addresses:

7028870131 - 137 + 1 = 138
7057421821 - 213 + 1 = 214
7387645749 - 205 + 1 = 206
8080105892 – 51 + 1 = 52
8879528852 – 230 +1 = 231

https://miniwebtool.com/modulo-calculator

Check the hash value for your mobile number!

17 of 59

Modulo-Division method- Strings as Keys

If keys are strings, can get an integer by adding up ASCII values of characters in key

for (i=0;i<key.length();i++)

hashVal += key.charAt(i);

Problem 1: What if TableSize is 10,000 and all keys are 8 or less characters long?

Problem 2: What if keys often contain the same characters (“abc”, “bca”, etc.)?

18 of 59

Pros and Cons

Pros:

Fast and easy computation

Cons

‘list_size’ should be used chosen carefully
Problems with non-prime ‘list_size’
Consecutive keys map to consecutive addresses leading to clustering

19 of 59

Extraction / Truncation method

Ignoring part of the key and using the rest as the array index

It may not always be even distribution throughout the table
Let’s consider the key:

8879528852
Use just last three digits as the address / index -852

20 of 59

Multiplication method

Choose a constant – A (0<A<1)
Multiply key (k) by A
Extract the fractional part of k*A -> f

A number between 0 and 1

Multiply f by listsize m
Take floor of the multiplication

The f will be converted to a number between 0 and m-1

h(k) = floor[ m (kA mod 1) ]

21 of 59

Multiplication method - Example

k=123
m=100
A=0.618033
h(123) = 100 (123 * 0.618033 mod 1)

= 100 (76.018059 mod 1)

= 100 (0.018059)

= 1

22 of 59

Pros and Cons of Multiplication method

Pros:

‘m’ is not critical
It can work with any value of A. As per Knuth, the best A = (sqrt(5) – 1)/2 = .61803
If table size is a power of two, then getting the index from the hash could be implemented as bitwise AND operation

computing the table index by the key, with multiplication hashing, is very fast.

Cons

Complex calculation wrt Division method

23 of 59

Mid-square method

The key value is squared

If the key is non-numeric, some type of preprocessing needs to be done to create a numeric value

Middle ‘r’ digits or mid part of the result is used as the index

If the mid part is greater than the list_size, then modulo operation is performed

The whole key is used for hash generation,

leading to greater chance of finding a different address each time per key value
Result is not dominated by the bottom digit or top digit

Difficulty is choosing ‘r’ properly!
In application, powers of two are more efficient for the table size, and the middle of the bit string of the square of the key is used

24 of 59

Mid-square method - example

Suppose, List_size = 100

So, r = 2, as 2 digits are required to map the keys to indices

Suppose key k= 50
k*k = 2500
h(50) = 50
The hash value obtained is 50

Suppose key k= 2199
k*k = 4835601
h(50) = 56
The hash value obtained is 56

25 of 59

Folding method

Partition the key into several pieces and then individual pieces are combined (or folded) in some convenient way (may be summation)
Two types of folding are used

shift folding and
boundary folding

In shift folding, the parts are placed underneath each other and then processed (for example, by adding) –

Using a Mobile number, say 123456789, we can divide it into three parts - 123, 456, and 789
and add them to get 123 + 456 + 789 = 1368 (or 368 ignoring the carry)
This can then be modulo divided by List_Size to get the address

26 of 59

Folding method

With boundary folding, the key is visualized as being written on a piece of paper and folded on the boundaries between the parts

The result is that alternating parts of the key are reversed
So, the Mobile number part would be 123, 654, 789, totaling 123 + 654 + 789 = 1566 (or 566)
Or it can be 321 + 456 + 987 = 1764 (or 764)
As can be seen, in both versions, the key is divided into even length parts of some fixed size, plus any leftover digits
Then these are added together, and the result is divided modulo the table size

Consequently, this is very fast and efficient, especially if bit strings are used instead of numbers
With character strings, one approach is to exclusively-or the individual character together and use the result

27 of 59

Radix Transformation method

With radix transformation, the key is transformed into a different base
For instance, if K is 345 in decimal, its value in base 9 is 423
This result is then divided modulo List_Size and the resulting value becomes the address top which K is hashed
The drawback to this approach is collisions cannot be avoided
For example, if List_Size is 100, then although 345 and 245 in decimal will not collide, 345 and 264 will collide because 264 is 323 in base nine and 345 is 424 in base nine
These two values will collide when divided modulo 100

28 of 59

Rotation method

Rotation method is incorporated in combination with other hashing methods.
It is particularly useful when keys are assigned serially
This way, the keys can be spread more evenly throughout the table, instead of clustering

Original Key	Rotation	Rotated Key
600351	351	135
600352	352	235
600353	353	335

29 of 59

Pseudorandom method

The key is used as the seed in pseudorandom number generator and the resulting random number generated is mapped into the possible addresses using modulo-division
Y = a * key +c
a = 17, c=7, list_size = 307
Key = 121267
Y = (17 * 121267 + 7) % 307 + 1

= (2061539 + 7) %307 +1

= (2061546) %307 + 1

= 41 + 1

= 42

30 of 59

Universal hashing

If the list_size m is much smaller than universe size u, then for any hash function h, there is some large subset of U that has the same hash value.
We need a set of hash functions, from which we can choose any one that works well for S.
If most of the hash functions are better for S, we can choose random hash function
Leading to low number of collisions
Many universal families are known, and their evaluation is often very efficient

31 of 59

Collision Resolution Techniques

32 of 59

Collision

When two different keys produce the same address, collision happens.

h(k1) = h(k2) and k1 ≠ k2

The keys involved in collision are called synonyms

k1 and k2

33 of 59

Avoidance vs. Resolution

Avoiding collision is extremely difficult
It is better to deal with the collision and devise some strategies

Spreading out the records
Use extra memory
Keeping more than one record at a single address

34 of 59

Strategies for Collision Resolution

Chaining

To store colliding keys in a linked list at the same hash table index

Open Addressing

To store colliding keys somewhere else in the table
Trying the next available slot is called -Probing
Types:

Linear Probing
Quadratic Probing
Double Hashing

35 of 59

Chaining

“Chaining" or "separate chaining” or “open hashing”
All keys that map to the same table location are kept in a linked list (this linked list is known as "chain" or "bucket")
Example: � insert 10, 22, 86, 12, 42 � with h(x) = x % 10

0	/
1	/
2	/
3	/
4	/
5	/
6	/
7	/
8	/
9	/

36 of 59

What will happen in worst case?

37 of 59

Load factor?

0
1	/
2
3	/
4	/
5	/
6
7	/
8	/
9	/

38 of 59

Deletion in separate chaining

Find in table
Delete from bucket
Similar run-time as insert
Sensitive to underlying bucket structure

39 of 59

Closed Hashing/Open Addressing

Open Addressing or probing is carried out for insertion into fixed size hash tables
If the index determined by the hash function is already occupied, next empty bucket should be used
There are commonly three techniques for determining the next empty bucket

Linear probing

Without chaining
With chaining and without replacement
With chaining and with replacement

Quadratic probing
Double hashing

40 of 59

Closed Hashing/Open Addressing: Linear Probing

Separate chaining does not use all the space in the table. Why not use it?

Store directly in the array cell
No linked lists or buckets

How to deal with collisions?

If h(key) is already full,

try (h(key) + 1) % TableSize. If full,

try (h(key) + 2) % TableSize. If full,

try (h(key) + 3) % TableSize. If full…

Example: insert 38, 19, 8, 79, 10

0
1
2
3
4
5
6
7
8
9

41 of 59

Delete

Can we simply delete a key?

Search may fail
Deleted keys are marked as ‘deleted’ such that new items can be inserted there but search does not stop

42 of 59

Numerical

Insertion of keys into hash table using linear probing as collision resolution technique

The keys 12, 18, 13, 2, 3, 23, 5 and 15 are inserted into an initially empty hash table of length 10 using open addressing with hash function h(k) = k mod 10 and linear probing. What is the resultant hash table?

43 of 59

Numerical

Given a hash table with keys, verify/find possible sequence of keys leading to hash table –

A hash table of length 10 uses open addressing with hash function h(k)=k mod 10, and linear probing. After inserting 6 values into an empty hash table, the table is as shown below . Which one of the following choices gives a possible order in which the key values could have been inserted in the table?�(A) 46, 42, 34, 52, 23, 33�(B) 34, 42, 23, 52, 33, 46�(C) 46, 34, 42, 23, 52, 33�(D) 42, 46, 33, 23, 34, 52

44 of 59

Linear probing with chaining – without replacement

One extra column ‘chain’ is added to keep track of the synonyms
This column will have the value ‘-1’ at the start
When some key(k2) collides with key (k1), k2 is stored in the next available slot and the slot index is k2 is stored in the chain column of k1
Consider inserting:

3,33,42,63,89,45,93
h(key) = key %10

Index	Key	Chain
0		-1
1		-1
2		-1
3		-1
4		-1
5		-1
6		-1
7		-1
8		-1
9		-1

Index	Key	Chain
0		-1
1		-1
2	42	-1
3	3	4
4	33	5
5	63	7
6	45	-1
7	93	-1
8		-1
9	89	-1

45 of 59

Linear Probing with chaining – with replacement

In the previous example, key 45 got misplaced by probing, 63 is already occupying it!
To overcome this issue, key at index 5 is replaced by 45 and 63 is shifted to another empty location

Index	Key	Chain
0		-1
1		-1
2	42	-1
3	3	4
4	33	6
5	45	-1
6	63	7
7	93	-1
8		-1
9	89	-1

46 of 59

Load factor

Can the load factor ever exceed 1.0?
If the load factor is high

Too many probes -> losing O(1)

For any λ < 1, linear probing will find an empty slo

But clustering can be a problem

It is better to keep the load factor under 0.7
Double the table size and rehash if load factor gets high
Cost of Hash function f(x) must be minimized

47 of 59

Problem with Linear Probing

It turns out linear probing is a bad idea, even �though the probe function is quick to compute �(which is a good thing)

This tends to produce clusters, which lead to �long probe sequences
This is called primary clustering

48 of 59

Quadratic Probing

We can avoid primary clustering by changing the probe function from

i to f(i)
(h(key) + f(i)) % TableSize

For quadratic probing, f(i) = i²:

0^th probe: (h(key) + 0) % TableSize

1^st probe: (h(key) + 1) % TableSize

2^nd probe: (h(key) + 4) % TableSize

3^rd probe: (h(key) + 9) % TableSize

…

i^th probe: (h(key) + i²) % TableSize

Intuition: Probes quickly "leave the neighborhood"

49 of 59

Quadratic Probing - Example

TableSize = 10
insert(89)

89%10 = 9

insert(18)

18%10 = 8

insert(49)

49%10=9 Collision
(49+1)%10 = 0

insert(58)

58%10 = 8 Collision
(58+1)%10 = 9 Collision
(58+4)%10 = 2

insert(79)
79 % 10 = 9 collision!
(79 + 1) % 10 = 0 collision!
(79 + 4) % 10 = 3

0
1
2
3
4
5
6
7
8
9	89

0	49
1
2	58
3	79
4
5
6
7
8	18
9	89

50 of 59

Problem with Quadratic Probing

While linear probing always finds an empty cell, quadratic probing may not find any vacant cell even if there is available slot!
It may lead to secondary clustering
Suppose the table size is 16.

Probe offsets that will be tried:

1 mod 16 = 1
4 mod 16 = 4
9 mod 16 = 9
16mod 16 = 0
25mod 16 = 9
36 mod 16 = 4
49 mod 16 = 1
64 mod 16 = 0
81mod 16 = 1

only four different values!

51 of 59

Double Hashing

No need to use an arithmetic probe function, rather the address is rehashed
If the first hash function yields an address which is already occupied, apply second hash function to get a different address!

52 of 59

Double Hashing - Pseudorandom

If h₁(k2) = h₁(k1), apply h2

(h₁(k2) + i * h₂(k2)) % TABLE_SIZE

First hash function hash1(key) = key % TABLE_SIZE
A popular second hash function: hash2(key) = PRIME – (key % PRIME) where PRIME is a prime smaller than the TABLE_SIZE.

53 of 59

Example…

A good hash function:

It must never evaluate to zero
Must make sure that all cells can be probed

Insert 4371,1323,6173,4199,4344,9679
H1(k) = key%10
Hh2(k) = 7-(key%7)

4371 – h1(4371)=1
1323 – h1(1323)=3
6173 – h1(6173)=3 collision

h2(6173)=7-(6173%7) = 7-6 = 1
h1(6173) + 1.h2(6173) = 3+1 = 4

4199 - h1(4199)=9
4344 - h1(4344) = 4 collision

h2(4344) = 7 – (4344%7) =3
h1(4344) + 1.h2(4344) = 4+3 = 7

9679 – h1(9699) = 9 Collision

h2(9699) = 7-(9699%7) = 2
h1(9699) + 1.h2(9699) = 9+2 = 11 = 11%10=1 –Again Collision!!
h1(9699) + 2. h2(9699) = 9 + 4 = 13 -> 3
h(9699) + 3.h2(9699) = 9 +6 =15 ->5

0
1	4371
2
3	1323
4	6173
5	9699
6
7	4344
8
9	4199

54 of 59

Pros and Cons

No clustering here!
Poor cache performance

55 of 59

Double Hashing – Key Offset

Key offset is another double hashing method that produces different collision paths for different keys.
New address is a function of the old address and the key

Offset = key/listsize
Address = [(offset + old address)%listsize] + 1

56 of 59

Key Offset - Example

Key = 9699, list_size = 13
h1(9699) = 9699%13 = 1

If Collision

Offset = (9699/13) = 746
Address = (746 + 1)%13 + 1 =5+1 =6

57 of 59

Collision Resolution Techniques - Summary

58 of 59

Summary

Hash table is a great data structure for storing unordered data that supports insert, delete & find
Both separate chaining (open) and open addressing (closed) hashing are useful

separate chaining flexible
closed hashing uses less storage, but performs badly with load factors near 1
extendible hashing for very large disk-based data

Hashing pros and cons

+ very fast
+ simple to implement, supports insert, delete, find
- lazy deletion necessary in open addressing, can waste storage
- does not support operations dependent on order: min, max, range

59 of 59

Summary - Definitions

Hash Table- A hash table [hash map] is a data structure that provides virtually direct access to objects based on a key [a unique String or Integer]. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
Synonyms – The set of keys that hash to the same location in the hash table, is called synonyms
Collision – Collision is the event that occurs when a hashing algorithm produces an address for an insertion key and that address is already occupied
Probing – Calculating the address and finding out the exact location is known as probing
Bucket- Bucket is the fast-access location, that is the result of the hash function. Basically, bucket is the place where actual data is kept.
Overflow: An overflow occurs when the home bucket for a new key-value pair is already occupied
Open Hashing: Keys are stored in linked list attached to cells of a hash table
Closed Hashing: All keys are stored in the hash table itself