1 of 81

Introduction to Sorting and Searching�

Mr. Abhijit T. Somnathe

2 of 81

What is Sorting?

The arrangement of data in a preferred order is called sorting.
By sorting data, it is easier to search through it quickly and easily.
Sorting can be done in ascending and descending order.
The simplest example of sorting is a dictionary.

3 of 81

What is Searching?

The process of finding the desired information from the set of items stored in the form of elements in the computer memory is called ‘searching’.
These sets of items are in various forms, such as an array, tree, graph, or linked list.

4 of 81

Searching Methods

Searching in the data structure can be done by applying searching algorithms to check for or extract an element from any form of stored data structure.
These algorithms are classified according to the type of search operation they perform, such as:
Sequential search - The list or array of elements is traversed sequentially while checking every component of the set. For example – Linear Search.

5 of 81

Searching Methods

Interval Search - The interval search includes algorithms that are explicitly designed for searching in sorted data structures.
In terms of efficiency, these algorithms are far better than linear search algorithms. Example - Binary search.

6 of 81

Searching Methods

These methods are evaluated based on the time taken by an algorithm to search an element matching the search item in the data collections and are given by,
The best possible time
The average time
The worst-case time

7 of 81

Linear Search

The linear search algorithm searches all elements in the array sequentially.
Its best execution time is one, whereas the worst execution time is n, where n is the total number of items in the search array.
It is the most simple search algorithm in data structure and checks each item in the set of elements until it matches the search element until the end of data collection.

8 of 81

Linear Search

When data is unsorted, a linear search algorithm is preferred.
Linear search has some complexities as given below:
Space Complexity - Space complexity for linear search is O(n) as it does not use any extra space where n is the number of elements in an array.

9 of 81

Linear Search

Time Complexity –
Best- case complexity = O(1) occurs when the search element is present at the first element in the search array.
Worst- case complexity = O(n) occurs when the search element is not present in the set of elements or array.
Average complexity = O(n) is referred to when the element is present somewhere in the search array.

10 of 81

Linear Search

Let’s take an array of elements as given below:

45, 78, 12, 67, 08, 51, 39, 26

To find ‘51’ in an array of 8 elements given above, a linear search algorithm will check each element sequentially till its pointer points to 51 in the memory space. It takes O(6) time to find 51 in an array. To find 12, in the above array, it takes O(3), whereas, for 26, it requires O(8) time.

11 of 81

Binary Search

This algorithm finds specific items by comparing the middlemost items in the data collection. When a match occurs, it returns the index of the item.
When the middle item is greater than the item, it searches for a central item of the left sub-array.

12 of 81

Binary Search

In contrast, if the middle item is smaller than the search item, it explores the middle of the item in the right sub-array.
It continues searching for an item until it finds it or until the sub-arrays size becomes zero.
Binary search needs sorted order of items. It is faster than a linear search algorithm. It works on the divide and conquers principle.

13 of 81

Binary Search

Run-time complexity = O(log n).
The binary search algorithm has complexities as given below:
Worst-case complexity = O (n log n)
Average complexity = O (n log n)
Best case complexity = O (1)

Let’s take a sorted algorithm of 08 elements: 08, 12, 26, 39, 45, 51, 67, 78

14 of 81

Binary Search

To find 51 in an array of the above elements, The algorithm will divide an array into two arrays, 08, 12, 26, 39 and 45, 51, 67, 78.
As 51 is greater than 39, it will start searching for elements on the array’s right side.
It will further divide the into two such as 45, 51 and 67, 78.

15 of 81

Binary Search

As 51 is smaller than 67, it will start searching left of that sub-array.
That subarray is again divided into two as 45 and 51.
As 51 is the number matching to the search element, it will return its index number of that element in the array.

16 of 81

Binary Search

It will conclude that the search element 51 is located at the 6th position in an array.
Binary search reduces the time to half as the comparison count is reduced significantly than the linear search algorithm.

17 of 81

Interpolation Search

It is an improved variant of the binary search algorithm and works on the search element’s probing position.
Similar to binary search algorithms, it works efficiently only on sorted data collection.
Worst execution time = O(n)

18 of 81

Interpolation Search

When the target element’s location is known in the data collection, an interpolation search is used.
To find a number in the telephone directory, if one wants to search Rohan’s telephone number, instead of using linear or binary search, one can directly probe to memory space storage where names start from ‘R’.

19 of 81

Internal and External Sorting

If the data sorting process takes place entirely within the internal memory Random-Access Memory (RAM), it’s called internal sorting. This is possible when the size of the dataset to be sorted is small enough to be held in RAM.
If the input data is such that it cannot be adjusted in the internal memory entirely at once, it needs to be stored in a hard disk, floppy disk, or any other storage device. This is called external sorting.

20 of 81

Internal Sorting

Following are few algorithms that can be used for internal sort:
Bubble Sort
Insertion Sort
Selection Sort
Quick Sort

21 of 81

Bubble Sort Algorithm

It’s a simple sorting algorithm that repeatedly steps through the list, compares adjacent elements, and swaps them if they’re in the wrong order.
This swapping process continues until we sort the input list.

22 of 81

Bubble Sort Algorithm

23 of 81

Insertion Sort Algorithm

This sorting algorithm works similarly to the way to sort playing cards.
The dataset is virtually split into a sorted and an unsorted part, then the algorithm picks up the elements from the unsorted part and places them at the correct position in the sorted part.

24 of 81

Insertion Sort Algorithm

25 of 81

Selection Sort Algorithm

Selection sort is a simple sorting algorithm.
This sorting algorithm is an in-place comparison-based algorithm in which the list is divided into two parts, the sorted part at the left end and the unsorted part at the right end.
Initially, the sorted part is empty and the unsorted part is the entire list.

26 of 81

Selection Sort Algorithm

The smallest element is selected from the unsorted array and swapped with the leftmost element, and that element becomes a part of the sorted array.
This process continues moving unsorted array boundary by one element to the right.
This algorithm is not suitable for large data sets as its average and worst case complexities are of Ο(n²), where n is the number of items.

27 of 81

Selection Sort Algorithm

Consider the following depicted array as an example.

28 of 81

Selection Sort Algorithm

For the first position in the sorted list, the whole list is scanned sequentially. The first position where 14 is stored presently, we search the whole list and find that 10 is the lowest value.

29 of 81

Selection Sort Algorithm

So we replace 14 with 10. After one iteration 10, which happens to be the minimum value in the list, appears in the first position of the sorted list.

30 of 81

Selection Sort Algorithm

For the second position, where 33 is residing, we start scanning the rest of the list in a linear manner.

31 of 81

Selection Sort Algorithm

We find that 14 is the second lowest value in the list and it should appear at the second place. We swap these values.

32 of 81

Selection Sort Algorithm

After two iterations, two least values are positioned at the beginning in a sorted manner.

The same process is applied to the rest of the items in the array.

33 of 81

Selection Sort Algorithm

Following is a pictorial depiction of the entire sorting process,

34 of 81

Selection Sort Algorithm

35 of 81

Quick Sort Algorithm

This sorting algorithm picks up a pivot element, then partitions the dataset into two sub-arrays, one sub-array is greater than the element and another sub-array is less than the element.
The same process is repeated for sub-arrays till the dataset is sorted.

36 of 81

Quick Sort Algorithm

37 of 81

External Sorting

External sort requires algorithms whose space complexity doesn’t increase with the size of the dataset. While the space complexity of Merge Sort is O(n), we can optimize it to O(1).
Mostly Merge sort algorithm is used as external sorting technique.

38 of 81

Merge Sort Algorithm

Divide the large dataset into smaller subsets of size less than size of the RAM.
The space complexity of this step is O(1) as it doesn’t increase with the size of the dataset.
Use any of the popular internal sorting algorithms to sort the subsets one batch at a time.
The space of complexity of these algorithms is O(n).

39 of 81

Merge Sort Algorithm

The size of these subsets is less than size of RAM, it’ll require the same amount of memory to sort them.
Iterate using pointers to merge sorted subsets.
During this, we compare the values of elements of current pointers to subsets and put the smallest value in the output list.

40 of 81

Merge Sort Algorithm

Then move the pointer to the next item of the subset which has the smallest value.
Since we use pointers, the space complexity of this step is O(1) and it doesn’t increase with the size of the dataset.
We’ll process the following example using this method:

A S O R T I N G A N D M E R G I N G E X A M P L E

41 of 81

Merge Sort Algorithm

We’re assuming that the computer can store 10 elements.
The first step would be to divide the dataset into 3 subsets and they’re loaded in an external disk.
A S O R T I N G A N
D M E R G I N G E X
A M P L E

42 of 81

Merge Sort Algorithm

Then, as a part of step 2, we’d:
Load 1st subset (A S O R T I N G A N) to the computer RAM and sort it to (A A G I N N O R S T). Then store it on an external disk.
Load 2nd subset(D M E R G I N G E X) to the computer RAM and sort it to (D E E G G I M N R X). Then store it on an external disk.
Load 3rd subset(A M P L E) to the computer RAM and sort it to (A E L M P). Then store it on an external disk.

43 of 81

Merge Sort Algorithm

Then, Initialize the 3 pointers (one each input subsets) and 1 for output. Pointer for 1st Subset, 2nd Subset, and 3rd Subset is 1. Also, the pointer for the output dataset is 1.
Load the elements as pointed by pointers to three sorted datasets to Computer RAM. It’ll load max 3 elements as there are 3 subsets. It’ll load (A D A) to Computer RAM.

44 of 81

Merge Sort Algorithm

Find the lowest value from these elements and load it to the output array at the location identified by the output pointer. Since A is lowest in (A D A), load A to output in external disk.
These steps will continue till all the elements from the three subsets are processed. The output will be following after all the elements are processed:

A A A D E E E G G G I I L M M N N N O P R R S T X

45 of 81

Comparison of Sorting Techniques Based on Their Complexity

In a comparison based sorting algorithms, we compare elements of an array with each other to determines which of two elements should occur first in the final sorted list.
All comparison-based sorting algorithms have a complexity lower bound of nlogn.
Here is the comparison of time and space complexities of sorting algorithms.

46 of 81

Comparison of Sorting Techniques Based on Their Complexity

47 of 81

Hashing

Hashing is an important data structure designed to solve the problem of efficiently finding and storing data in an array.
For example, if you have a list of 20000 numbers, and you have given a number to search in that list- you will scan each number in the list until you find a match.

48 of 81

Hashing

It requires a significant amount of your time to search in the entire list and locate that specific number.
This manual process of scanning is not only time-consuming but inefficient too.
With hashing in the data structure, you can narrow down the search and find the number within seconds.

49 of 81

Hashing

Hashing in the data structure is a technique of mapping a large chunk of data into small tables using a hashing function.
It is also known as the message digest function. It is a technique that uniquely identifies a specific item from a collection of similar items.

50 of 81

Hashing

It uses hash tables to store the data in an array format.
Each value in the array has assigned a unique index number.
Hash tables use a technique to generate these unique index numbers for each value stored in an array format.
This technique is called the hash technique.

51 of 81

Hashing

You only need to find the index of the desired item, rather than finding the data.
With indexing, you can quickly scan the entire list and retrieve the item you wish.
Indexing also helps in inserting operations when you need to insert data at a specific location.
No matter how big or small the table is, you can update and retrieve data within seconds.

52 of 81

Hashing

Hashing in a data structure is a two-step process.
The hash function converts the item into a small integer or hash value. This integer is used as an index to store the original data.
It stores the data in a hash table. You can use a hash key to locate data quickly.

53 of 81

Hashing

The following is the real-life example of hashing in the data structure.
A library has an infinite number of books. The librarian assigns a unique number to each book. This unique number helps in identifying the position of the books on the bookshelf.

54 of 81

Hash Function

The hash function in a data structure maps arbitrary size of data to fixed-sized data.
It returns the following values: hash value, hash codes, and hash sums.

hash = hashfunc(key)

index = hash % array_size

55 of 81

Hash Function

The hash function must satisfy the following requirements:
A good hash function is easy to compute.
A good hash function never gets stuck in clustering and distributes keys evenly across the hash table.
A good hash function avoids collision when two elements or items get assigned to the same hash value.

56 of 81

Hash Table

Hashing in data structure uses hash tables to store the key-value pairs.
The hash table then uses the hash function to generate an index.
Hashing uses this unique index to perform insert, update, and search operations.

57 of 81

Hash Table

We generate a hash for the input using the hash function and then store the element using the generated hash as the key in the hash table.

58 of 81

Operation in Hash Function

Hash function uses 3 basic operations.
Insert
Delete
Search

59 of 81

Operation in Hash Function

Insert - T[ h(key) ] = value;
It calculates the hash, uses it as the key and stores the value in hash table.
Delete - T[ h(key) ] = NULL;
It calculates the hash, resets the value stored in the hash table for that key.
Search - return T[ h(key) ];
It calculates the hash, finds and returns the value stored in the hash table for that key.

60 of 81

How to choose a Hash Function

An efficient hash function should be built such that the index value of the added item is distributed equally across the table.
An effective collision resolution technique should be created to generate an alternate index for a key whose hash index corresponds to a previously inserted position in a hash table.
We must select a hash algorithm that is fast to calculate.

61 of 81

Characteristics of a good Hash Function

Uniform Distribution: For distribution throughout the constructed table.
Fast: The generation of hash should be very fast, and should not produce any considerable overhead.

62 of 81

Hashing Collision

Hashing in data structure falls into a collision if two keys are assigned the same index number in the hash table.
The collision creates a problem because each index in a hash table is supposed to store only one value.

63 of 81

Hashing Collision

Hashing in data structure uses several collision resolution techniques to manage the performance of a hash table.
Following are the collision resolution techniques used:
Open Hashing (Separate chaining)
Closed Hashing (Open Addressing) - Liner Probing. Quadratic probing, Double hashing.

64 of 81

Open Hashing (Separate Chaining)

In this technique, a linked list is created from the slot in which collision has occurred, after which the new key is inserted into the linked list.
This linked list of slots looks like a chain, so it is called separate chaining.
It is used more when we do not know how many keys to insert or delete.

65 of 81

Open Hashing (Separate Chaining)

Time complexity
Its worst-case complexity for searching is O(n).
Its worst-case complexity for deletion is O(n).

66 of 81

Open Hashing (Separate Chaining)

Advantages of separate chaining
It is easy to implement.
The hash table never fills full, so we can add more elements to the chain.
It is less sensitive to the function of the hashing.

67 of 81

Open Hashing (Separate Chaining)

Disadvantages of separate chaining
In this, cache performance of chaining is not good.
The memory wastage is too much in this method.
It requires more space for element links.

68 of 81

Closed Hashing (Open Addressing)

Open addressing is collision-resolution method that is used to control the collision in the hashing table.
There is no key stored outside of the hash table. Therefore, the size of the hash table is always greater than or equal to the number of keys.
It is also called closed hashing.

69 of 81

Closed Hashing (Open Addressing)

The following techniques are used in open addressing:
Linear probing
Quadratic probing
Double hashing

70 of 81

Linear Probing

In this, when the collision occurs, we perform a linear probe for the next slot, and this probing is performed until an empty slot is found.
In linear probing, the worst time to search for an element is O(table size).
The linear probing gives the best performance of the cache but its problem is clustering.

71 of 81

Linear Probing

The main advantage of this technique is that it can be easily calculated.

Disadvantages of linear probing
The main problem is clustering.
It takes too much time to find an empty slot.

72 of 81

Linear Probing

Imagine you have been asked to store some items inside a hash table of size 30.
The items are already sorted in a key-value pair format.
The values given are: (3,21) (1,72) (63,36) (5,30) (11,44) (15,33) (18,12) (16,80) (46,99).

73 of 81

Linear Probing

The hash(n) is the index computed using a hash function and T is the table size.
If slot index = ( hash(n) % T) is full, then we look for the next slot index by adding 1 ((hash(n) + 1) % T).
If (hash(n) + 1) % T is also full, then we try (hash(n) + 2) % T.
If (hash(n) + 2) % T is also full, then we try (hash(n) + 3) % T.

74 of 81

Linear Probing

75 of 81

Quadratic Probing

In this, when the collision occurs, we probe for i^2th slot in i^th iteration, and this probing is performed until an empty slot is found.
The cache performance in quadratic probing is lower than the linear probing.
Quadratic probing also reduces the problem of clustering.

76 of 81

Double Hashing

The double hashing technique uses two hash functions.
The second hash function comes into use when the first function causes a collision.
It provides an offset index to store the value.
In this, you use another hash function, and probe for (i * hash 2(x)) in the i^th iteration.

77 of 81

Double Hashing

It takes longer to determine two hash functions.
Double hashing has a high computation cost, but it searches the next free slot faster than the linear probing method.
The double probing gives the very poor the cache performance, but there has no clustering problem in it.

78 of 81

Double Hashing

Imagine you need to store some items inside a hash table of size 20.
The values given are:

(16, 8, 63, 9, 27, 37, 48, 5, 69, 34, 1).

h1(n)=n%20

h2(n)=n%13

n h(n, i) = (h1 (n) + ih2(n)) mod 20

79 of 81

Double Hashing

n	h(n,i) = (h’(n) + i2) %20
16	I = 0, h(n,0) = 16
8	I = 0, h(n,0) = 8
63	I = 0, h(n,0) = 3
9	I = 0, h(n,0) = 9
27	I = 0, h(n,0) = 7
37	I = 0, h(n,0) = 17

80 of 81

Double Hashing

48	I = 0, h(n,0) = 8 I = 0, h(n,1) = 9 I = 0, h(n,2) = 12
5	I = 0, h(n,0) = 5
69	I = 0, h(n,0) = 9 I = 0, h(n,1) = 10
34	I = 0, h(n,0) = 14
1	I = 0, h(n,0) = 1