UNIT-V

Hashing: General Idea

  • Hash Functions
  • Collision Resolution Techniques
    • Separate Chaining
    • Open Addressing
        • Linear Probing
        • Quadratic Probing
        • Double Hashing

Additional concepts:

    • Rehashing, Extendible Hashing
    • Implementation of Dictionaries


Hashing is defined as follows...

“Hashing is the process of indexing and retrieving elements (data) in a data structure so that an element can be found quickly using a hash key.”

  • Here, the hash key is a value that gives the index at which the actual data is likely to be stored in the data structure. Hashing relies on a structure called a hash table.
  • Hash table: a hash table is used to store the data. Every data value is inserted into the hash table based on its hash key value.
  • The hash key value maps each data item to an index (slot) in the hash table, and the hash key is generated for every data item by a hash function.


Hash Table is defined as follows...

“A hash table is just an array that maps a key (data) into the data structure with the help of a hash function, such that insertion, deletion and search operations are performed in average-case constant time (i.e. O(1)).”

[Figure: the basic concept of hashing and the hash table — a hash function maps each key to an index (slot) in the table.]

Example: 01

  • Suppose we have the integer items {26, 70, 18, 31, 54, 93}, followed by 41 and 74. One common method of determining a hash key is the division method of hashing; the formula is:

hash key = key % (number of slots), here with 10 slots.

Data Item   Value % No. of Slots   Hash Value
26          26 % 10 = 6            6
70          70 % 10 = 0            0
18          18 % 10 = 8            8
31          31 % 10 = 1            1
54          54 % 10 = 4            4
93          93 % 10 = 3            3
41          41 % 10 = 1            1 (collision occurs)
74          74 % 10 = 4            4 (collision occurs)
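The division method above can be written directly as a small function. A minimal sketch (Python; the table size of 10 matches this example):

    def hash_key(key, table_size=10):
        """Division-method hash: map an integer key to a slot index."""
        return key % table_size

    items = [26, 70, 18, 31, 54, 93, 41, 74]
    for item in items:
        print(item, "->", hash_key(item))  # 41 collides with 31 (slot 1), 74 with 54 (slot 4)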

Collision Resolution Techniques

When the hash value of a key maps to an already occupied bucket/slot/index of the hash table, it is called a collision.

Separate Chaining

To handle a collision:

  • This technique creates a linked list at the slot for which the collision occurs.
  • The new key is then inserted into that linked list.
  • These linked lists attached to the slots look like chains.
  • That is why this technique is called separate chaining (see the sketch below).
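A minimal sketch of a separate-chaining hash table (Python; the class name and default table size are illustrative):

    class ChainedHashTable:
        def __init__(self, size=10):
            self.size = size
            self.table = [[] for _ in range(size)]   # one chain (list) per slot

        def _hash(self, key):
            return key % self.size

        def insert(self, key):
            self.table[self._hash(key)].append(key)  # append to the slot's chain

        def search(self, key):
            return key in self.table[self._hash(key)]

        def delete(self, key):
            chain = self.table[self._hash(key)]
            if key in chain:
                chain.remove(key)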

 

Advantages of separate chaining

  • It is easy to implement.
  • The hash table never fills up; we can always add more elements to a chain.
  • It is less sensitive to the choice of hash function.

Disadvantages of separate chaining

  • Cache performance is poor, because the keys are stored in linked lists outside the table.
  • There is considerable wastage of memory (some slots are never used).
  • It requires extra space for the links between elements.

Load Factor (λ)

The load factor (λ) is defined as:

Load factor (λ) = n / N

where n = number of elements present in the hash table and N = total size (number of slots) of the hash table.
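For example, if 7 keys are stored in a table with 10 slots, then λ = 7 / 10 = 0.7.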

Example: 01

Using the hash function ‘key mod 7’, insert the following sequence of keys into a hash table of size 7: 50, 700, 76, 85, 92, 73 and 101. Use the separate chaining technique for collision resolution.

Data Item   Value % No. of Slots   Hash Value
50          50 % 7 = 1             1
700         700 % 7 = 0            0
76          76 % 7 = 6             6
85          85 % 7 = 1             1 (collision occurs)
92          92 % 7 = 1             1 (collision occurs)
73          73 % 7 = 3             3
101         101 % 7 = 3            3 (collision occurs)
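The resulting table of chains, built from the values above (slots 2, 4 and 5 stay empty):

    0 -> 700
    1 -> 50 -> 85 -> 92
    2 ->
    3 -> 73 -> 101
    4 ->
    5 ->
    6 -> 76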

Open Addressing

Open addressing is a collision-resolution method used to control collisions in the hash table. No key is stored outside the hash table, so the size of the hash table must always be greater than or equal to the number of keys. It is also called closed hashing.

1. Linear Probing:

  • This technique computes the hash value of a key through the hash function and maps the key to that position in the hash table.

index = key % size

  • If the key hashes to an address that is already occupied (a collision), the next empty position in the table is found by probing:

index = (key + i) % size, where i = 0, 1, 2, …, size-1

  • The hash table is treated as a circular array: if the table size is N, then after position N-1 the search wraps around to position 0 (see the sketch below).
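A minimal sketch of insertion with linear probing (Python; empty slots are represented by None, and the probe count is returned for illustration):

    def linear_probe_insert(table, key):
        """Insert key using linear probing; returns (slot index, number of probes)."""
        size = len(table)
        for i in range(size):
            index = (key + i) % size      # probe sequence: h(key), h(key)+1, h(key)+2, ...
            if table[index] is None:
                table[index] = key
                return index, i + 1
        raise RuntimeError("hash table is full")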

Example: 01

Suppose we have a hash table of size 20 (size = 20). We want to insert some elements in linear probing fashion. The elements/keys are {96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61}.

Data Item   Value % No. of Slots          Hash Value   Probes
96          96 % 20 = 16                  16           1
48          48 % 20 = 8                   8            1
63          63 % 20 = 3                   3            1
29          29 % 20 = 9                   9            1
87          87 % 20 = 7                   7            1
77          77 % 20 = 17                  17           1
48          48 % 20 = 8 (collision),      10           3
            (48+1) % 20 = 9 (collision),
            (48+2) % 20 = 10
65          65 % 20 = 5                   5            1
69          69 % 20 = 9 (collision),      11           3
            (69+1) % 20 = 10 (collision),
            (69+2) % 20 = 11
94          94 % 20 = 14                  14           1
61          61 % 20 = 1                   1            1

Hash Table

  • The order of elements/keys in the hash table (indices 0–19) is:

--, 61, --, 63, --, 65, --, 87, 48, 29, 48, 69, --, --, 94, --, 96, 77, --, --

  • The total number of probes is 15.

2. Quadratic Probing:

  • This technique computes the hash value of a key through the hash function and maps the key to that position in the hash table.

index = key % size

  • If the key hashes to an address that is already occupied (a collision), the next empty position is found by probing with quadratic offsets:

index = (key + i²) % size, where i = 0, 1, 2, …, size-1

  • The hash table is treated as a circular array: if the table size is N, then after position N-1 the search wraps around to position 0 (see the sketch below).
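A minimal sketch of insertion with quadratic probing (Python, same conventions as the linear-probing sketch above):

    def quadratic_probe_insert(table, key):
        """Insert key using quadratic probing: offsets 0, 1, 4, 9, ... from h(key)."""
        size = len(table)
        for i in range(size):
            index = (key + i * i) % size
            if table[index] is None:
                table[index] = key
                return index, i + 1       # slot used and number of probes
        raise RuntimeError("no free slot found along the probe sequence")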

Example: 01

Suppose we have a hash table of size 20 (size = 20). We want to insert some elements/keys in quadratic probing fashion. The elements/keys are {96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61}.

Data Item   Value % No. of Slots          Hash Value   Probes
96          96 % 20 = 16                  16           1
48          48 % 20 = 8                   8            1
63          63 % 20 = 3                   3            1
29          29 % 20 = 9                   9            1
87          87 % 20 = 7                   7            1
77          77 % 20 = 17                  17           1
48          48 % 20 = 8 (collision),      12           3
            (48+1) % 20 = 9 (collision),
            (48+4) % 20 = 12
65          65 % 20 = 5                   5            1
69          69 % 20 = 9 (collision),      10           2
            (69+1) % 20 = 10
94          94 % 20 = 14                  14           1
61          61 % 20 = 1                   1            1

Hash Table

  • The order of elements/keys in the hash table (indices 0–19) is: --, 61, --, 63, --, 65, --, 87, 48, 29, 69, --, 48, --, 94, --, 96, 77, --, --
  • The total number of probes is 14.

3. Double Hashing:

  • Double hashing uses the idea of applying a second hash function to the key when a collision occurs.
  • The first hash function is typically h1(key) = key % size.
  • When a collision occurs, probing is done using:

index = (h1(key) + i * h2(key)) % size

Here h2(key) = PRIME - (key % PRIME), where PRIME is a prime number smaller than the table size.

  • We repeat with increasing i when collisions occur (i takes values from 0 to size-1); see the sketch below.

NOTE: The second hash function is evaluated only when a collision occurs.
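A minimal sketch of insertion with double hashing (Python; PRIME = 7 matches the example that follows):

    def double_hash_insert(table, key, prime=7):
        """Insert key using double hashing with h2(key) = prime - (key % prime)."""
        size = len(table)
        h1 = key % size
        h2 = prime - (key % prime)        # h2 is never 0, so every probe step moves
        for i in range(size):
            index = (h1 + i * h2) % size
            if table[index] is None:
                table[index] = key
                return index, i + 1       # slot used and number of probes
        raise RuntimeError("no free slot found along the probe sequence")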

Example: 01

Suppose we have a hash table of size 10 (size = 10). We want to insert some elements/keys in double hashing fashion. The elements/keys are {3, 22, 27, 40, 42, 11, 19}.

Data Item   h1 = key % 10             h2 = 7 - (key % 7)   Final index = (h1 + i*h2) % 10    Probes
3           3 % 10 = 3                -                    3                                 1
22          22 % 10 = 2               -                    2                                 1
27          27 % 10 = 7               -                    7                                 1
40          40 % 10 = 0               -                    0                                 1
42          42 % 10 = 2 (collision)   7 - (42 % 7) = 7     (2 + 1*7) % 10 = 9                2
11          11 % 10 = 1               -                    1                                 1
19          19 % 10 = 9 (collision)   7 - (19 % 7) = 2     (9 + 1*2) % 10 = 1 (collision),   4
                                                           (9 + 2*2) % 10 = 3 (collision),
                                                           (9 + 3*2) % 10 = 5

The resulting hash table:

Index   Key
0       40
1       11
2       22
3       3
4       --
5       19
6       --
7       27
8       --
9       42

  • The order of elements in the hash table is: 40, 11, 22, 3, --, 19, --, 27, --, 42.
  • The total number of probes is 11 (the sum of the probes column above).

Separate Chaining vs Open Addressing

  • Key storage: In separate chaining, keys are stored both inside and outside the hash table (in the chains). In open addressing, all keys are stored only inside the hash table; no key is present outside it.
  • Capacity: With separate chaining, the number of keys stored can even exceed the size of the hash table. With open addressing, the number of keys can never exceed the size of the hash table.
  • Deletion: Deletion is easier in separate chaining and difficult in open addressing.
  • Extra space: Separate chaining requires extra space for the pointers that store the keys outside the table. Open addressing requires no extra space.
  • Cache performance: Poor in separate chaining, because of the linked lists that store the keys outside the table. Better in open addressing, because no linked lists are used.
  • Space utilization: In separate chaining, some buckets of the hash table are never used, which wastes space. In open addressing, a bucket may be used even if no key maps directly to it (because of probing).

Conclusions

  • Linear probing has the best cache performance but suffers from clustering.
  • Quadratic probing lies between the two in terms of cache performance and clustering.
  • Double hashing has poor cache performance but no clustering.

Additional concepts:

Rehashing

Rehashing is a technique in which the table is resized, i.e., the size of the table is roughly doubled by creating a new table; it is preferable for the new table size to be the next prime number. Rehashing is required in situations such as:

    • When the table is completely full.
    • With linear probing, when the table is about 70% full.
    • When insertions fail due to overflow.

  • Rehashing means hashing again. When the load factor increases and reaches or exceeds 1, the complexity of operations increases.

Load factor (λ) = n / N, where n = number of elements present in the hash table and N = total size of the hash table.

We want λ < 1; when λ = 1 or λ > 1, we need to increase the hash table size, and doing so is known as rehashing.

  • In such situations, we transfer the entries from the old table to the new table by recomputing their positions using the new hash function (a sketch follows).
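A minimal sketch of rehashing for a table that uses linear probing (Python; next_prime is an illustrative helper, and linear_probe_insert is the sketch shown earlier):

    def next_prime(n):
        """Smallest prime >= n (simple trial division; adequate for small tables)."""
        def is_prime(k):
            return k > 1 and all(k % d for d in range(2, int(k ** 0.5) + 1))
        while not is_prime(n):
            n += 1
        return n

    def rehash(old_table):
        """Create a new table of roughly double (prime) size and re-insert every key."""
        new_table = [None] * next_prime(2 * len(old_table))
        for key in old_table:
            if key is not None:
                linear_probe_insert(new_table, key)  # positions are recomputed for the new size
        return new_table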

Example: 1

Suppose we have to insert the elements 37, 90, 55, 22, 17, 49 and 87. The table size is 10, and we will use the hash function key % 10 with linear probing.

Data Item   Value % No. of Slots          Hash Value   Probes
37          37 % 10 = 7                   7            1
90          90 % 10 = 0                   0            1
55          55 % 10 = 5                   5            1
22          22 % 10 = 2                   2            1
17          17 % 10 = 7 (collision),      8            2
            (17+1) % 10 = 8
49          49 % 10 = 9                   9            1
87          87 % 10 = 7 (collision),      1            5
            (87+1) % 10 = 8 (collision),
            (87+2) % 10 = 9 (collision),
            (87+3) % 10 = 0 (collision),
            (87+4) % 10 = 1

The resulting hash table:

Index   Key
0       90
1       87
2       22
3       --
4       --
5       55
6       --
7       37
8       17
9       49

  • Now this table is almost full, and if we try to insert more elements, collisions will occur and eventually further insertions will fail.
  • Hence we rehash by doubling the table size. The old table size is 10, so doubling gives 20; but 20 is not a prime number, so we prefer to make the new table size 23.

The new hash function will be key % 23.

Data Item   Value % No. of Slots   Hash Value   Probes
37          37 % 23 = 14           14           1
90          90 % 23 = 21           21           1
55          55 % 23 = 9            9            1
22          22 % 23 = 22           22           1
17          17 % 23 = 17           17           1
49          49 % 23 = 3            3            1
87          87 % 23 = 18           18           1

The resulting hash table of size 23 (all other slots are empty):

Index   Key
3       49
9       55
14      37
17      17
18      87
21      90
22      22

  • In open addressing or separate chaining, the major problem is that a collision may cause several buckets to be examined before an alternative empty cell is found.
  • When the table gets too full, rehashing must be performed, which is expensive.

Extendible hashing:

  • This is a simple strategy that provides quick access times for insert and search operations on large databases.
  • It is a dynamic hashing method. Directories and buckets are used to hash the data.
  • The cost of expanding and updating is very low.

  • Directories: The directories store the addresses of the buckets as pointers. An id is assigned to each directory, and this id may change each time a directory expansion takes place.
  • Buckets: The buckets are used to store the actual hashed data.


Basic Structure of Extendible Hashing

Frequently used terms in Extendible Hashing:

  • Directories: These containers store pointers to buckets. Each directory is given a unique id, which may change each time an expansion takes place. The hash function returns this directory id, which is used to navigate to the appropriate bucket. Number of directories = 2^(global depth).
  • Buckets: They store the hashed keys. Directories point to buckets. A bucket may have more than one pointer to it if its local depth is less than the global depth.
  • Global Depth: It is associated with the directories. It denotes the number of bits the hash function uses to categorize the keys. Global depth = number of bits in the directory id.
  • Local Depth: It is the same as the global depth except that it is associated with the buckets rather than the directories. The local depth, compared with the global depth, decides what action to perform when an overflow occurs. The local depth is always less than or equal to the global depth.

  • Bucket Splitting: When the number of elements in a bucket exceeds a particular size, the bucket is split into two parts.
  • Directory Expansion: Directory expansion takes place when a bucket overflows and the local depth of the overflowing bucket is equal to the global depth.

Example 1: Let us consider hashing the following elements: 16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26. Bucket size: 3 (assumed). Hash function: if the global depth is X, the hash function returns the X least-significant bits (LSBs) of the key.

Solution: 

First, calculate the binary form of each of the given numbers: 16 = 10000, 4 = 00100, 6 = 00110, 22 = 10110, 24 = 11000, 10 = 01010, 31 = 11111, 7 = 00111, 9 = 01001, 20 = 10100, 26 = 11010.
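The hash step is just extracting the X least-significant bits of the key. A small sketch of this (Python; directory_index is an illustrative name):

    def directory_index(key, global_depth):
        """Return the global_depth least-significant bits of key (the directory id)."""
        return key & ((1 << global_depth) - 1)

    # With global depth 1: 16 (10000) -> 0, 4 (00100) -> 0
    # With global depth 2: 22 (10110) -> 2 (binary 10), 31 (11111) -> 3 (binary 11)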

  • Initially, the global depth and local depth are always 1. Thus, the hashing frame looks like this:

Inserting 16: The binary form of 16 is 10000 and the global depth is 1. The hash function returns the 1 LSB of 10000, which is 0. Hence 16 is placed in the bucket pointed to by directory 0.

Inserting 4 and 6: Both 4 (100) and 6 (110) have 0 in their LSB. Hence, they are hashed as follows:

Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed to by directory 0 is already full. Hence, an overflow occurs.

As directed by Step 7-Case 1, since local depth = global depth, the bucket splits and a directory expansion takes place. The numbers present in the overflowing bucket are rehashed after the split. And, since the global depth is incremented by 1, the global depth is now 2. Hence, 16, 4, 6 and 22 are now rehashed with respect to their 2 LSBs.

[ 16 (10000), 4 (00100), 6 (00110), 22 (10110) ]

Notice that the bucket which did not overflow remains untouched. But since the number of directories has doubled, two directories (01 and 11) now point to the same bucket, because the local depth of that bucket is still 1. Any bucket having a local depth less than the global depth is pointed to by more than one directory.

Inserting 24 and 10: 24 (11000) and 10 (01010) are hashed to the directories with ids 00 and 10. Here, we encounter no overflow condition.

Inserting 31, 7, 9: All of these elements [31 (11111), 7 (00111), 9 (01001)] have either 01 or 11 as their LSBs. Hence, they are mapped to the bucket pointed to by 01 and 11. We do not encounter any overflow condition here.

Inserting 20: Insertion of the data element 20 (10100) again causes an overflow.

20 is inserted into the bucket pointed to by 00. As directed by Step 7-Case 1, since the local depth of the bucket = global depth, a directory expansion (doubling) takes place along with bucket splitting. The elements present in the overflowing bucket are rehashed with the new global depth. Now, the new hash table looks like this:

Inserting 26: The global depth is 3. Hence, the 3 LSBs of 26 (11010) are considered. Therefore 26 best fits in the bucket pointed to by directory 010.

The bucket overflows, and, as directed by Step 7-Case 2, since the local depth of the bucket < global depth (2 < 3), the directories are not doubled; only the bucket is split and its elements are rehashed. Finally, the output of hashing the given list of numbers is obtained.


Implementation of Dictionaries:

In computer science, a dictionary is an abstract data type that represents an ordered or unordered collection of key-value pair elements, where keys are used to search for and locate elements in the collection. In a dictionary ADT, the data to be stored is divided into two parts:

  • Key
  • Value.

Each item stored in a dictionary is represented by a key-value pair. The key is used to access elements in the dictionary; with the key we can retrieve the value, which holds further information about the element.
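A minimal sketch of a dictionary implemented on top of a separate-chaining hash table (Python; the class and method names are illustrative, not a prescribed interface):

    class HashDictionary:
        def __init__(self, size=11):
            self.buckets = [[] for _ in range(size)]   # each bucket holds (key, value) pairs

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def insert(self, key, value):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:                           # keys are unique: overwrite the old value
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))

        def search(self, key):
            for k, v in self._bucket(key):
                if k == key:
                    return v
            return None                                # key not present

        def delete(self, key):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    del bucket[i]
                    return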


Characteristics of Dictionary

  • Key-Value Pairs: Dictionaries store data as key-value pairs where each key is unique and maps to exactly one value.
  • Direct Access: The primary feature of dictionaries is to provide fast access to elements not by their position, as in lists or arrays, but by their keys.
  • Dynamic Size: Like many abstract data types, dictionaries typically allow for dynamic resizing. New key-value pairs can be added, and existing ones can be removed.
  • Ordering: Some dictionaries maintain the order of elements, such as ordered maps or sorted dictionaries. Others, like hash tables, do not maintain any particular order.
  • Key Uniqueness: Each key in a dictionary must be unique, though different keys can map to the same value.


Types of Dictionary

There are two major variations of dictionaries:

Ordered dictionary.

  • In an ordered dictionary, the relative order is determined by comparison on keys.
  • The order should be completely dependent on the key.

Unordered dictionary.

  • In an unordered dictionary, no order relation is assumed on the keys.
  • Only the equality operation can be performed on the keys.


Basic Dictionary Operations

The dictionary ADT provides operations for inserting, deleting and searching records in the collection. Dictionaries typically support operations such as:

  • Insert(x, D) -> inserts element x (key & value) into dictionary D.
    Form: insert(key, value). Example: insert(age, 40)
  • Delete(x, D) -> deletes element x (key & value) from dictionary D.
    Form: delete(key). Example: delete(age)
  • Search(x, D) -> searches dictionary D for the value stored under the key of element x.
    Form: search(key) -> value. Example: search(age) -> 40

  • Member(x, D) -> It returns “true” if x belongs to D else returns “false”.
  • size(D) -> It returns the count of total number of elements in dictionary D.
  • Max(D) -> It returns the maximum element in the dictionary D.
  • Min(D) -> It returns the minimum element in the dictionary D.
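Using the HashDictionary sketch from earlier (the names are illustrative), the basic operations look like this:

    d = HashDictionary()
    d.insert("age", 40)       # Insert(age, 40)
    print(d.search("age"))    # search(age) -> 40
    d.delete("age")           # delete(age)
    print(d.search("age"))    # None: the key has been removed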


Example: Consider an empty unordered dictionary and the following set of operations: