1 of 26

CHAPTER 9

Hash Tables

By Prof. Rafael Orta, M.S.

NOTE: The material in this presentation is a compilation from C++ for Dummies by Stephen Randy Davis and the book “Data Structures and other objects using C++” by Michael Main and Walter Savitvch, and original material from your professor.

2 of 26

https://quizizz.com/

3 of 26

This class is being recorded for your convenience

4 of 26

  • Pointers
  • Dynamic Arrays

Last week we covered

5 of 26

This week agenda

  • Hash Tables
  • Collision Resolution Techniques.

- Separate Chaining

- Linear Probing

- Quadratic probing

- Double Hashing

  • Common hash functions
  • Direct hashing
  • Hashing Algorithms: Cryptography, Password Hashing

6 of 26

Hash table overview

“A hash table is a data structure that stores unordered items by mapping (or hashing) each item to a location in an array (or vector).”

“A hash table's main advantage is that searching (or inserting / removing) an item may require only O(1), in contrast to O(N) for searching a list or to O(log N) for binary search. “

“In a hash table, an item's key is the value used to map to an index. For all items that might possibly be stored in the hash table, every key is ideally unique, so that the hash table's algorithms can search for a specific item by that key. “

“Each hash table array element is called a bucket. A hash function computes a bucket index from the item's key.”

7 of 26

Hash table overview

8 of 26

Knowledge check

9 of 26

10 of 26

Hash Tables

Empty cells

The approach for a hash table algorithm determining whether a cell is empty depends on the implementation. For example, if items are simply non-negative integers, empty can be represented as -1. More commonly, items are each an object with multiple fields (name, age, etc.), in which case each hash table array element may be a pointer. Using pointers, empty can be represented as null.

Collisions

A collision occurs when an item being inserted into a hash table maps to the same bucket as an existing item in the hash table. Ex: For a hash function of key % 10, 55 would be inserted in bucket 55 % 10 = 5; later inserting 75 would yield a collision because 75 % 10 is also 5. Various techniques are used to handle collisions during insertions, such as chaining or open addressing. Separate Chaining is a collision resolution technique where each bucket has a list of items (so bucket 5's list would become 55, 75). Open addressing is a collision resolution technique where collisions are resolved by looking for an empty bucket elsewhere in the table (so 75 might be stored in bucket 6).

11 of 26

Hash Tables – Separate Chaining

Chaining handles hash table collisions by using a list for each bucket, where each list may store multiple items that map to the same bucket. The insert operation first uses the item's key to determine the bucket, and then inserts the item in that bucket's list. Searching also first determines the bucket, and then searches the bucket's list. Likewise for removes.

12 of 26

13 of 26

Hash Tables – Linear Probing

A hash table with linear probing handles a collision by starting at the key's mapped bucket, and then linearly searches subsequent buckets until an empty bucket is found.

14 of 26

Hash Tables – Linear Probing

Empty bucket types

“Actually, linear probing distinguishes two types of empty buckets. An empty-since-start bucket has been empty since the hash table was created. An empty-after-removal bucket had an item removed that caused the bucket to now be empty. The distinction will be important during searches, since searching only stops for empty-since-start, not for empty-after-removal.”

Removals using linear probing

“Using linear probing, a hash table remove algorithm uses the sought item's key to determine the initial bucket. The algorithm probes each bucket until either a matching item is found, an empty-since-start bucket is found, or all buckets have been probed. If the item is found, the item is removed, and the bucket is marked empty-after-removal. “

15 of 26

Hash Tables – Linear Probing

Searching using linear probing

“In linear probing, a hash table search algorithm uses the sought item's key to determine the initial bucket. The algorithm probes each bucket until either the matching item is found (returning the item), an empty-since-start bucket is found (returning null), or all buckets are probed without a match (returning null). If an empty-after-removal bucket is found, the search algorithm continues to probe the next bucket.”

16 of 26

Hash Tables – Quadratic Probing

Overview and insertion

“A hash table with quadratic probing handles a collision by starting at the key's mapped bucket, and then quadratically searches subsequent buckets until an empty bucket is found. If an item's mapped bucket is H, the formula (𝐻+𝑐1∗𝑖+𝑐2∗𝑖2)mod(𝑡𝑎𝑏𝑙𝑒𝑠𝑖𝑧𝑒)

is used to determine the item's index in the hash table. c1 and c2 are programmer-defined constants for quadratic probing. Inserting a key uses the formula, starting with I = 0, to repeatedly search the hash table until an empty bucket is found. Each time an empty bucket is not found, i is incremented by 1. Iterating through sequential I values to obtain the desired table index is called the probing sequence.”

Search and removal

“The search algorithm uses the probing sequence until the key being searched for is found or an empty-since-start bucket is found. The removal algorithm searches for the key to remove and, if found, marks the bucket as empty-after-removal.”

17 of 26

18 of 26

Hash Tables – Double Hashing

Double hashing

It is an open-addressing collision resolution technique that uses 2 different hash functions to compute bucket indices. Using hash functions h1 and h2, a key's index in the table is computed with the formula (ℎ1(𝑘𝑒𝑦)+𝑖∗ℎ2(𝑘𝑒𝑦))mod(𝑡𝑎𝑏𝑙𝑒𝑠𝑖𝑧𝑒). Inserting a key uses the formula, starting with i = 0, to repeatedly search hash table buckets until an empty bucket is found. Each time an empty bucket is not found, i is incremented by 1. Iterating through sequential i values to obtain the desired table index is called the probing sequence.”

19 of 26

Common Hash Functions

A good hash function minimizes collisions

“A hash table is fast if the hash function minimizes collisions.

A perfect hash function maps items to buckets with no collisions. A perfect hash function can be created if the number of items and all possible item keys are known beforehand. The runtime for insert, search, and remove is O(1) with a perfect hash function.

A good hash function should uniformly distribute items into buckets. With chaining, a good hash function results in short bucket lists and thus fast inserts, searches, and removes. With linear probing, a good hash function will avoid hashing multiple items to consecutive buckets and thus minimize the average linear probing length to achieve fast inserts, searches, and removes. On average, a good hash function will achieve O(1) inserts, searches, and removes, but in the worst-case may require O(N).

A hash function's performance depends on the hash table size and knowledge of the expected keys. Ex: The hash function key % 10 will perform poorly if the expected keys are all multiples of 10, because inserting 10, 20, 30, ..., 90, and 100 will all collide at bucket 0. “

20 of 26

Common Hash Functions

Mid-square hash function

“A mid-square hash squares the key, extracts R digits from the result's middle, and returns the middle digits. Ex: For a hash table with 100 entries and a key of 453, the decimal (base 10) mid-square hash function computes 453 * 453 = 205209 and returns the middle two digits 52. For N buckets, R must be greater than or equal to ⌈log𝑁⌉ to index all buckets. The process of squaring and extracting middle digits reduces the likelihood of keys mapping to just a few buckets.”

Mid-square hash function base 2 implementation

“The mid-square hash function is typically implemented using binary (base 2), and not decimal, because a binary implementation is faster. A decimal implementation requires converting the square of the key to a string, extracting a substring for the middle digits, and converting that substring to an integer. A binary implementation only requires a few shift and bitwise AND operations. A binary mid-square hash function extracts the middle R bits and returns the remainder of the middle bits divided by hash table size N, where R is greater than or equal to ⌈log2𝑁⌉. Ex: For a hash table size of 200, R = 8, then 8 bits are needed for indices 0 to 199. “

21 of 26

Common Hash Functions

Multiplicative string hash function

“A multiplicative string hash repeatedly multiplies the hash value and adds the ASCII (or Unicode) value of each character in the string. A multiplicative hash function for strings starts with a large initial value. For each character, the hash function multiplies the current hash value by a multiplier (often prime) and adds the character's value. Finally, the function returns the remainder of the sum divided by the hash table size N. “

Hash table size = 1000

22 of 26

Common Hash Functions

Direct hashing overview

“A direct hash function uses the item's key as the bucket index. Ex: If the key is 937, the index is 937. A hash table with a direct hash function is called a direct access table. Given a key, a direct access table search algorithm returns the item at index key if the bucket is not empty and returns null (indicating item not found) if empty.”

Limitations of direct hashing

“A direct access table has the advantage of no collisions: Each key is unique (by definition of a key), and each gets a unique bucket, so no collisions can occur. However, a direct access table has two main limitations.

All keys must be non-negative integers, but for some applications keys may be negative.

The hash table's size equals the largest key value plus 1, which may be very large.”

23 of 26

Hashing Algorithms: Cryptography, Password Hashing

Cryptography

“Cryptography is a field of study focused on transmitting data securely. Secure data transmission commonly starts with encryption: alteration of data to hide the original meaning. The counterpart to encryption is decryption: reconstruction of original data from encrypted data.”

24 of 26

Hashing Algorithms: Cryptography, Password Hashing

Hashing functions for data

“A hash function can be used to produce a hash value for data in contexts other than inserting the data into a hash table. Such a function is commonly used for the purpose of verifying data integrity. Ex: A hashing algorithm called MD5 produces a 128-bit hash value for any input data. The hash value cannot be used to reconstruct the original data but can be used to help verify that data isn't corrupt and hasn't been altered.”

Cryptographic hashing

“A cryptographic hash function is a hash function designed specifically for cryptography. Such a function is commonly used for encrypting and decrypting data.

A password hashing function is a cryptographic hashing function that produces a hash value for a password. Databases for online services commonly store a user's password hash as opposed to the actual password. When the user attempts a login, the supplied password is hashed, and the hash is compared against the database's hash value. Because the passwords are not stored, if a database with password hashes is breached, attackers may still have a difficult time determining a user's password.”

25 of 26

Knowledge Check

26 of 26