1 of 59

Hashing: An efficient Searching

2 of 59

Searching

  • Linear Search
  • Sentinel Search
  • Binary Search

3 of 59

Can we decrease the time for searching?

  • Hashing is the solution
    • Instead of checking multiple elements, we can directly reach the element!
    • The time required to search an element does not depend on the number of elements
    • The data is searched with constant time complexity – O(1) instead of O(n) in linear search and O(log2n) in binary search
  • By definition, “Hashing is the process of indexing and retrieving data in a data structure to provide a faster way of finding the element using the hash key”
    • A data structure called Hash table is used for this purpose

4 of 59

Hashing

Index /Hash value

Actual Data

0

1000

1

201

2

702

3

99

699

Hash function

Key

Actual Data

Hash value/Hash key/ Index

generated

5 of 59

Hash table

  • Hash table is a data structure
    • that stores key-value pairs
    • and maps the key to a numeric value (hash key) with the help of a hash function
    • such that insertion, deletion and search operations can be performed with constant time complexity O(1)
  • All the data values are inserted into the hash table based on the hash key value
  • Hash key value is used to map the data with index in the hash table
  • Hash key is generated for every data using some function – hash function
  • When a hash table is in use, some spots contain valid records, and other spots are "empty“

6 of 59

Applications of Hash table

  • Database systems
    • it helps in random access
  • Symbol tables
    • Compilers use this table to maintain information about different symbols
    • It needs to access information frequently –random access done
  • Data dictionaries
    • Data dictionary or any other dictionary can be implemented using hash table
  • Network processing algorithm
    • Hash tables are fundamental components of several network processing algorithm
    • Route lookup, packet classification etc.

7 of 59

Bucket

  • Hash table stores entries (key, value) pairs in buckets
  • A hash table uses a hash function to compute an index- that refers to the address of the bucket
  • Ideally, the hash function is written in such a manner that each key is mapped to a unique bucket

8 of 59

Issues in Hashing

  • Multiple keys can hash to the same slot – collisions are possible .
    • Design hash functions such that collisions are minimized.
    • But avoiding collisions is impossible.
    • Design collision-resolution techniques.
  • Search will cost Ө(n) time in the worst case
    • However, all operations can be made to have an expected complexity of Ө (1).
  • Overflow
    • When the home bucket for a new (key,value) pair is already full

9 of 59

An example

Student records having Roll numbers

  • The size is of 100
  • Key is last 2 digit of your roll number

Students from various colleges

  • Size is 100
  • Key is phone number
    • Can we truncate the phone number?

10 of 59

Hash Function

  • The process of converting a key to hash key is known as hash function
  • A hash function is any function that can be used to map a data set of an arbitrary size to a data set of a fixed size
    • Mapping from U to the slots of a hash table T[0..m–1].
    • h : U {0,1,…, m–1}
  • The values returned by the hash functions are called hash keys, hash values, hash codes, hash sums or hashes
    • hash key = function(key)
    • h[k] is the hash value of key k
  • Our aim is to find a good hash function

11 of 59

Properties of a good hash function

  • Easy to compute
    • Fast and easy computation
  • Uniform Distribution
    • Every hash value has an equal probability
    • Even if the keys are not uniformly distributed
    • If a hashing function groups key values together, this is called clustering of the keys.
  • Fewer collisions
  • Dispersing the keys in a random way
  • Should not waste space unnecessarily
    • For every index, there is at least one key that hashes to it
    • Load factor lambda λ = (number of keys / TableSize)

12 of 59

Different Hash functions

  • Direct hashing
  • Subtraction
  • Modulo-division method
  • Digit Extraction method
  • Multiplication method
  • Mid-square method
  • Folding method
  • Radix Transformation
  • Rotation method
  • Pseudo-random Method
  • Universal method

* The methods marked in Bold are important for exam

13 of 59

Direct Hashing

  • Address for a key is generated without any algorithmic manipulation.
  • The hash table must contain an address for every possible key
  • Roll number to index
    • 01 – 1
    • 68 - 68
  • Pros:
    • No calculation and easy to implement
    • No collision!
  • Cons
    • Wastage of space if the keys are not consecutive
    • Not suitable for big and diverse lists

14 of 59

Subtraction method

  • Slight variation from the previous method
    • Little bit calculation required
    • Suppose, keys are consecutive but not starting from 1
    • Example: Roll numbers are 1000 to 1100
    • Subtract 1000 from the key to get the address

15 of 59

Modulo-Division method

  • Divides the key by the list size and uses the remainder+1 for the address
    • h(k) = (k % ListSize) +1
    • Listsize should be prime.. Why?
      • Because keys are typically not randomly distributed, but usually have some pattern
        • mostly even
        • mostly multiples of 10
        • in general: mostly multiples of some x
      • If ‘x’ is a factor of ListSize, then only (ListSize/x) slots will ever be used!
      • Since the only factor of a prime number is itself, this phenomena only hurts in the (rare) case where x = ListSize

16 of 59

Modulo-Division method Example

  • Suppose we want to maintain data for 300 students
  • First prime number greater than 300 is 307
  • Let us choose the list size as 307
  • Let’s consider some keys and calculate the addresses:
    • 7028870131 - 137 + 1 = 138
    • 7057421821 - 213 + 1 = 214
    • 7387645749 - 205 + 1 = 206
    • 8080105892 – 51 + 1 = 52
    • 8879528852 – 230 +1 = 231
  • https://miniwebtool.com/modulo-calculator
    • Check the hash value for your mobile number!

17 of 59

Modulo-Division method- Strings as Keys

  • If keys are strings, can get an integer by adding up ASCII values of characters in key

for (i=0;i<key.length();i++)

hashVal += key.charAt(i);

  • Problem 1: What if TableSize is 10,000 and all keys are 8 or less characters long?

  • Problem 2: What if keys often contain the same characters (“abc”, “bca”, etc.)?

17

18 of 59

Pros and Cons

  • Pros:
    • Fast and easy computation
  • Cons
    • ‘list_size’ should be used chosen carefully
    • Problems with non-prime ‘list_size’
    • Consecutive keys map to consecutive addresses leading to clustering

19 of 59

Extraction / Truncation method

  • Ignoring part of the key and using the rest as the array index
    • It may not always be even distribution throughout the table
    • Let’s consider the key:
      • 8879528852
      • Use just last three digits as the address / index -852

20 of 59

Multiplication method

  • Choose a constant – A (0<A<1)
  • Multiply key (k) by A
  • Extract the fractional part of k*A -> f
    • A number between 0 and 1
  • Multiply f by listsize m
  • Take floor of the multiplication
    • The f will be converted to a number between 0 and m-1

h(k) = floor[ m (kA mod 1) ]

21 of 59

Multiplication method - Example

  • k=123
  • m=100
  • A=0.618033
  • h(123) = 100 (123 * 0.618033 mod 1)

= 100 (76.018059 mod 1)

= 100 (0.018059)

= 1

22 of 59

Pros and Cons of Multiplication method

  • Pros:
    • ‘m’ is not critical
    • It can work with any value of A. As per Knuth, the best A = (sqrt(5) – 1)/2 = .61803
    • If table size is a power of two, then getting the index from the hash could be implemented as bitwise AND operation
      • computing the table index by the key, with multiplication hashing, is very fast.
  • Cons
    • Complex calculation wrt Division method

23 of 59

Mid-square method

  • The key value is squared
    • If the key is non-numeric, some type of preprocessing needs to be done to create a numeric value
  • Middle ‘r’ digits or mid part of the result is used as the index
    • If the mid part is greater than the list_size, then modulo operation is performed
  • The whole key is used for hash generation,
    • leading to greater chance of finding a different address each time per key value
    • Result is not dominated by the bottom digit or top digit
  • Difficulty is choosing ‘r’ properly!
  • In application, powers of two are more efficient for the table size, and the middle of the bit string of the square of the key is used

24 of 59

Mid-square method - example

Suppose, List_size = 100

So, r = 2, as 2 digits are required to map the keys to indices

  • Suppose key k= 50
  • k*k = 2500
  • h(50) = 50
  • The hash value obtained is 50

  • Suppose key k= 2199
  • k*k = 4835601
  • h(50) = 56
  • The hash value obtained is 56

25 of 59

Folding method

  • Partition the key into several pieces and then individual pieces are combined (or folded) in some convenient way (may be summation)
  • Two types of folding are used
    • shift folding and
    • boundary folding
  • In shift folding, the parts are placed underneath each other and then processed (for example, by adding) –
    • Using a Mobile number, say 123456789, we can divide it into three parts - 123, 456, and 789
    • and add them to get 123 + 456 + 789 = 1368 (or 368 ignoring the carry)
    • This can then be modulo divided by List_Size to get the address

26 of 59

Folding method

  • With boundary folding, the key is visualized as being written on a piece of paper and folded on the boundaries between the parts
    • The result is that alternating parts of the key are reversed
    • So, the Mobile number part would be 123, 654, 789, totaling 123 + 654 + 789 = 1566 (or 566)
    • Or it can be 321 + 456 + 987 = 1764 (or 764)
    • As can be seen, in both versions, the key is divided into even length parts of some fixed size, plus any leftover digits
    • Then these are added together, and the result is divided modulo the table size
  • Consequently, this is very fast and efficient, especially if bit strings are used instead of numbers
  • With character strings, one approach is to exclusively-or the individual character together and use the result

27 of 59

Radix Transformation method

  • With radix transformation, the key is transformed into a different base
  • For instance, if K is 345 in decimal, its value in base 9 is 423
  • This result is then divided modulo List_Size and the resulting value becomes the address top which K is hashed
  • The drawback to this approach is collisions cannot be avoided
  • For example, if List_Size is 100, then although 345 and 245 in decimal will not collide, 345 and 264 will collide because 264 is 323 in base nine and 345 is 424 in base nine
  • These two values will collide when divided modulo 100

28 of 59

Rotation method

  • Rotation method is incorporated in combination with other hashing methods.
  • It is particularly useful when keys are assigned serially
  • This way, the keys can be spread more evenly throughout the table, instead of clustering

Original Key

Rotation

Rotated Key

600351

351

135

600352

352

235

600353

353

335

29 of 59

Pseudorandom method

  • The key is used as the seed in pseudorandom number generator and the resulting random number generated is mapped into the possible addresses using modulo-division
  • Y = a * key +c
  • a = 17, c=7, list_size = 307
  • Key = 121267
  • Y = (17 * 121267 + 7) % 307 + 1

= (2061539 + 7) %307 +1

= (2061546) %307 + 1

= 41 + 1

= 42

30 of 59

Universal hashing

  • If the list_size m is much smaller than universe size u, then for any hash function h, there is some large subset of U that has the same hash value.
  • We need a set of hash functions, from which we can choose any one that works well for S.
  • If most of the hash functions are better for S, we can choose random hash function
  • Leading to low number of collisions
  • Many universal families are known, and their evaluation is often very efficient

31 of 59

Collision Resolution Techniques

32 of 59

Collision

  • When two different keys produce the same address, collision happens.
    • h(k1) = h(k2) and k1 ≠ k2
  • The keys involved in collision are called synonyms
    • k1 and k2

33 of 59

Avoidance vs. Resolution

  • Avoiding collision is extremely difficult
  • It is better to deal with the collision and devise some strategies
    • Spreading out the records
    • Use extra memory
    • Keeping more than one record at a single address

34 of 59

Strategies for Collision Resolution

  • Chaining
    • To store colliding keys in a linked list at the same hash table index
  • Open Addressing
    • To store colliding keys somewhere else in the table
    • Trying the next available slot is called -Probing
    • Types:
      • Linear Probing
      • Quadratic Probing
      • Double Hashing

35 of 59

Chaining

  • “Chaining" or "separate chaining” or “open hashing”
  • All keys that map to the same table location are kept in a linked list (this linked list is known as "chain" or "bucket")
  • Example: � insert 10, 22, 86, 12, 42 � with h(x) = x % 10

0

/

1

/

2

/

3

/

4

/

5

/

6

/

7

/

8

/

9

/

10

/

22

/

86

/

12

/

42

/

36 of 59

What will happen in worst case?

  •  

37 of 59

Load factor?

0

1

/

2

3

/

4

/

5

/

6

7

/

8

/

9

/

10

/

42

86

/

12

22

/

 

 

38 of 59

Deletion in separate chaining

  • Find in table
  • Delete from bucket
  • Similar run-time as insert
  • Sensitive to underlying bucket structure

39 of 59

Closed Hashing/Open Addressing

  • Open Addressing or probing is carried out for insertion into fixed size hash tables
  • If the index determined by the hash function is already occupied, next empty bucket should be used
  • There are commonly three techniques for determining the next empty bucket
    • Linear probing
      • Without chaining
      • With chaining and without replacement
      • With chaining and with replacement
    • Quadratic probing
    • Double hashing

40 of 59

Closed Hashing/Open Addressing: Linear Probing

Separate chaining does not use all the space in the table. Why not use it?

  • Store directly in the array cell
  • No linked lists or buckets

How to deal with collisions?

If h(key) is already full,

try (h(key) + 1) % TableSize. If full,

try (h(key) + 2) % TableSize. If full,

try (h(key) + 3) % TableSize. If full…

Example: insert 38, 19, 8, 79, 10

0

1

2

3

4

5

6

7

8

9

41 of 59

Delete

  • Can we simply delete a key?
    • Search may fail
    • Deleted keys are marked as ‘deleted’ such that new items can be inserted there but search does not stop

42 of 59

Numerical

  • Insertion of keys into hash table using linear probing as collision resolution technique
    • The keys 12, 18, 13, 2, 3, 23, 5 and 15 are inserted into an initially empty hash table of length 10 using open addressing with hash function h(k) = k mod 10 and linear probing. What is the resultant hash table?

  • .

43 of 59

Numerical

  • Given a hash table with keys, verify/find possible sequence of keys leading to hash table –
    • A hash table of length 10 uses open addressing with hash function h(k)=k mod 10, and linear probing. After inserting 6 values into an empty hash table, the table is as shown below . Which one of the following choices gives a possible order in which the key values could have been inserted in the table?(A) 46, 42, 34, 52, 23, 33(B) 34, 42, 23, 52, 33, 46(C) 46, 34, 42, 23, 52, 33(D) 42, 46, 33, 23, 34, 52

44 of 59

Linear probing with chaining – without replacement

  • One extra column ‘chain’ is added to keep track of the synonyms
  • This column will have the value ‘-1’ at the start
  • When some key(k2) collides with key (k1), k2 is stored in the next available slot and the slot index is k2 is stored in the chain column of k1
  • Consider inserting:
    • 3,33,42,63,89,45,93
    • h(key) = key %10

Index

Key

Chain

0

-1

1

-1

2

-1

3

-1

4

-1

5

-1

6

-1

7

-1

8

-1

9

-1

Index

Key

Chain

0

-1

1

-1

2

42

-1

3

3

4

4

33

5

5

63

7

6

45

-1

7

93

-1

8

-1

9

89

-1

45 of 59

Linear Probing with chaining – with replacement

  • In the previous example, key 45 got misplaced by probing, 63 is already occupying it!
  • To overcome this issue, key at index 5 is replaced by 45 and 63 is shifted to another empty location

Index

Key

Chain

0

-1

1

-1

2

42

-1

3

3

4

4

33

6

5

45

-1

6

63

7

7

93

-1

8

-1

9

89

-1

46 of 59

Load factor

  • Can the load factor ever exceed 1.0?
  • If the load factor is high
    • Too many probes -> losing O(1)
  • For any λ < 1, linear probing will find an empty slo
    • But clustering can be a problem
  • It is better to keep the load factor under 0.7
  • Double the table size and rehash if load factor gets high
  • Cost of Hash function f(x) must be minimized

47 of 59

Problem with Linear Probing

It turns out linear probing is a bad idea, even �though the probe function is quick to compute �(which is a good thing)

  • This tends to produce clusters, which lead to �long probe sequences
  • This is called primary clustering

48 of 59

Quadratic Probing

  • We can avoid primary clustering by changing the probe function from
    • i to f(i)
    • (h(key) + f(i)) % TableSize

For quadratic probing, f(i) = i2:

0th probe: (h(key) + 0) % TableSize

1st probe: (h(key) + 1) % TableSize

2nd probe: (h(key) + 4) % TableSize

3rd probe: (h(key) + 9) % TableSize

ith probe: (h(key) + i2) % TableSize

Intuition: Probes quickly "leave the neighborhood"

49 of 59

Quadratic Probing - Example

  • TableSize = 10
  • insert(89)
    • 89%10 = 9
  • insert(18)
    • 18%10 = 8
  • insert(49)
    • 49%10=9 Collision
    • (49+1)%10 = 0
  • insert(58)
    • 58%10 = 8 Collision
    • (58+1)%10 = 9 Collision
    • (58+4)%10 = 2
  • insert(79)
  • 79 % 10 = 9 collision!
  • (79 + 1) % 10 = 0 collision!
  • (79 + 4) % 10 = 3

0

1

2

3

4

5

6

7

8

9

89

0

49

1

2

58

3

79

4

5

6

7

8

18

9

89

50 of 59

Problem with Quadratic Probing

  • While linear probing always finds an empty cell, quadratic probing may not find any vacant cell even if there is available slot!
  • It may lead to secondary clustering
  • Suppose the table size is 16.
    • Probe offsets that will be tried:
      • 1 mod 16 = 1
      • 4 mod 16 = 4
      • 9 mod 16 = 9
      • 16mod 16 = 0
      • 25mod 16 = 9
      • 36 mod 16 = 4
      • 49 mod 16 = 1
      • 64 mod 16 = 0
      • 81mod 16 = 1

only four different values!

51 of 59

Double Hashing

  • No need to use an arithmetic probe function, rather the address is rehashed
  • If the first hash function yields an address which is already occupied, apply second hash function to get a different address!

52 of 59

Double Hashing - Pseudorandom

  • If h1(k2) = h1(k1), apply h2
    • (h1(k2) + i * h2(k2)) % TABLE_SIZE
  • First hash function hash1(key) = key % TABLE_SIZE
  • A popular second hash function:  hash2(key) = PRIME – (key % PRIME) where PRIME is a prime smaller than the TABLE_SIZE.

53 of 59

Example…

  • A good hash function:
    • It must never evaluate to zero
    • Must make sure that all cells can be probed
  • Insert 4371,1323,6173,4199,4344,9679
  • H1(k) = key%10
  • Hh2(k) = 7-(key%7)

  • 4371 – h1(4371)=1
  • 1323 – h1(1323)=3
  • 6173 – h1(6173)=3 collision
    • h2(6173)=7-(6173%7) = 7-6 = 1
    • h1(6173) + 1.h2(6173) = 3+1 = 4
  • 4199 - h1(4199)=9
  • 4344 - h1(4344) = 4 collision
    • h2(4344) = 7 – (4344%7) =3
    • h1(4344) + 1.h2(4344) = 4+3 = 7
  • 9679 – h1(9699) = 9 Collision
    • h2(9699) = 7-(9699%7) = 2
    • h1(9699) + 1.h2(9699) = 9+2 = 11 = 11%10=1 –Again Collision!!
    • h1(9699) + 2. h2(9699) = 9 + 4 = 13 -> 3
    • h(9699) + 3.h2(9699) = 9 +6 =15 ->5

0

1

4371

2

3

1323

4

6173

5

9699

6

7

4344

8

9

4199

54 of 59

Pros and Cons

  • No clustering here!
  • Poor cache performance

55 of 59

Double Hashing – Key Offset

  • Key offset is another double hashing method that produces different collision paths for different keys.
  • New address is a function of the old address and the key
    • Offset = key/listsize
    • Address = [(offset + old address)%listsize] + 1

56 of 59

Key Offset - Example

  • Key = 9699, list_size = 13
  • h1(9699) = 9699%13 = 1
    • If Collision
  • Offset = (9699/13) = 746
  • Address = (746 + 1)%13 + 1 =5+1 =6

57 of 59

Collision Resolution Techniques - Summary

58 of 59

Summary

  • Hash table is a great data structure for storing unordered data that supports insert, delete & find
  • Both separate chaining (open) and open addressing (closed) hashing are useful
    • separate chaining flexible
    • closed hashing uses less storage, but performs badly with load factors near 1
    • extendible hashing for very large disk-based data
  • Hashing pros and cons
    • + very fast
    • + simple to implement, supports insert, delete, find
    • - lazy deletion necessary in open addressing, can waste storage
    • - does not support operations dependent on order: min, max, range

59 of 59

Summary - Definitions

  • Hash Table- A hash table [hash map] is a data structure that provides virtually direct access to objects based on a key [a unique String or Integer]. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
  • Synonyms – The set of keys that hash to the same location in the hash table, is called synonyms
  • Collision – Collision is the event that occurs when a hashing algorithm produces an address for an insertion key and that address is already occupied
  • Probing – Calculating the address and finding out the exact location is known as probing
  • Bucket- Bucket is the fast-access location, that is the result of the hash function. Basically, bucket is the place where actual data is kept.
  • Overflow: An overflow occurs when the home bucket for a new key-value pair is already occupied
  • Open Hashing: Keys are stored in linked list attached to cells of a hash table
  • Closed Hashing: All keys are stored in the hash table itself