1 of 81

Introduction to Sorting and Searching

By

Mr. Abhijit T. Somnathe

2 of 81

What is Sorting?

  • The arrangement of data in a preferred order is called sorting.
  • By sorting data, it is easier to search through it quickly and easily.
  • Sorting can be done in ascending and descending order.
  • The simplest example of sorting is a dictionary.

3 of 81

What is Searching?

  • The process of finding the desired information from the set of items stored in the form of elements in the computer memory is called ‘searching’.
  • These sets of items are in various forms, such as an array, tree, graph, or linked list.

4 of 81

Searching Methods

  • Searching in the data structure can be done by applying searching algorithms to check for or extract an element from any form of stored data structure.
  • These algorithms are classified according to the type of search operation they perform, such as:
  • Sequential search - The list or array of elements is traversed sequentially while checking every component of the set. For example – Linear Search.

5 of 81

Searching Methods

  • Interval Search - The interval search includes algorithms that are explicitly designed for searching in sorted data structures.
  • In terms of efficiency, these algorithms are far better than linear search algorithms. Example - Binary search.

6 of 81

Searching Methods

  • These methods are evaluated based on the time taken by an algorithm to search an element matching the search item in the data collections and are given by,
  • The best possible time
  • The average time
  • The worst-case time

7 of 81

Linear Search

  • The linear search algorithm searches all elements in the array sequentially.
  • Its best execution time is one, whereas the worst execution time is n, where n is the total number of items in the search array.
  • It is the most simple search algorithm in data structure and checks each item in the set of elements until it matches the search element until the end of data collection.

8 of 81

Linear Search

  • When data is unsorted, a linear search algorithm is preferred.
  • Linear search has some complexities as given below:
  • Space Complexity - Space complexity for linear search is O(n) as it does not use any extra space where n is the number of elements in an array.

9 of 81

Linear Search

  • Time Complexity –
  • Best- case complexity = O(1) occurs when the search element is present at the first element in the search array.
  • Worst- case complexity = O(n) occurs when the search element is not present in the set of elements or array.
  • Average complexity = O(n) is referred to when the element is present somewhere in the search array.

10 of 81

Linear Search

  • Let’s take an array of elements as given below:

45, 78, 12, 67, 08, 51, 39, 26

  • To find ‘51’ in an array of 8 elements given above, a linear search algorithm will check each element sequentially till its pointer points to 51 in the memory space. It takes O(6) time to find 51 in an array. To find 12, in the above array, it takes O(3), whereas, for 26, it requires O(8) time.

11 of 81

Binary Search

  • This algorithm finds specific items by comparing the middlemost items in the data collection. When a match occurs, it returns the index of the item.
  • When the middle item is greater than the item, it searches for a central item of the left sub-array.

12 of 81

Binary Search

  • In contrast, if the middle item is smaller than the search item, it explores the middle of the item in the right sub-array.
  • It continues searching for an item until it finds it or until the sub-arrays size becomes zero.
  • Binary search needs sorted order of items. It is faster than a linear search algorithm. It works on the divide and conquers principle.

13 of 81

Binary Search

  • Run-time complexity = O(log n).
  • The binary search algorithm has complexities as given below:
  • Worst-case complexity = O (n log n)
  • Average complexity = O (n log n)
  • Best case complexity = O (1)

  • Let’s take a sorted algorithm of 08 elements: 08, 12, 26, 39, 45, 51, 67, 78

14 of 81

Binary Search

  • To find 51 in an array of the above elements, The algorithm will divide an array into two arrays, 08, 12, 26, 39 and 45, 51, 67, 78.
  • As 51 is greater than 39, it will start searching for elements on the array’s right side.
  • It will further divide the into two such as 45, 51 and 67, 78.

15 of 81

Binary Search

  • As 51 is smaller than 67, it will start searching left of that sub-array.
  • That subarray is again divided into two as 45 and 51.
  • As 51 is the number matching to the search element, it will return its index number of that element in the array.

16 of 81

Binary Search

  • It will conclude that the search element 51 is located at the 6th position in an array.
  • Binary search reduces the time to half as the comparison count is reduced significantly than the linear search algorithm.

17 of 81

Interpolation Search

  • It is an improved variant of the binary search algorithm and works on the search element’s probing position.
  • Similar to binary search algorithms, it works efficiently only on sorted data collection.
  • Worst execution time = O(n)

18 of 81

Interpolation Search

  • When the target element’s location is known in the data collection, an interpolation search is used.
  • To find a number in the telephone directory, if one wants to search Rohan’s telephone number, instead of using linear or binary search, one can directly probe to memory space storage where names start from ‘R’.

19 of 81

Internal and External Sorting

  • If the data sorting process takes place entirely within the internal memory Random-Access Memory (RAM), it’s called internal sorting. This is possible when the size of the dataset to be sorted is small enough to be held in RAM.
  • If the input data is such that it cannot be adjusted in the internal memory entirely at once, it needs to be stored in a hard disk, floppy disk, or any other storage device. This is called external sorting.

20 of 81

Internal Sorting

  • Following are few algorithms that can be used for internal sort:
  • Bubble Sort
  • Insertion Sort
  • Selection Sort
  • Quick Sort

21 of 81

Bubble Sort Algorithm

  • It’s a simple sorting algorithm that repeatedly steps through the list, compares adjacent elements, and swaps them if they’re in the wrong order.
  • This swapping process continues until we sort the input list.

22 of 81

Bubble Sort Algorithm

23 of 81

Insertion Sort Algorithm

  • This sorting algorithm works similarly to the way to sort playing cards.
  • The dataset is virtually split into a sorted and an unsorted part, then the algorithm picks up the elements from the unsorted part and places them at the correct position in the sorted part.

24 of 81

Insertion Sort Algorithm

25 of 81

Selection Sort Algorithm

  • Selection sort is a simple sorting algorithm.
  • This sorting algorithm is an in-place comparison-based algorithm in which the list is divided into two parts, the sorted part at the left end and the unsorted part at the right end.
  • Initially, the sorted part is empty and the unsorted part is the entire list.

26 of 81

Selection Sort Algorithm

  • The smallest element is selected from the unsorted array and swapped with the leftmost element, and that element becomes a part of the sorted array.
  • This process continues moving unsorted array boundary by one element to the right.
  • This algorithm is not suitable for large data sets as its average and worst case complexities are of Ο(n2), where n is the number of items.

27 of 81

Selection Sort Algorithm

  • Consider the following depicted array as an example.

28 of 81

Selection Sort Algorithm

  • For the first position in the sorted list, the whole list is scanned sequentially. The first position where 14 is stored presently, we search the whole list and find that 10 is the lowest value.

29 of 81

Selection Sort Algorithm

  • So we replace 14 with 10. After one iteration 10, which happens to be the minimum value in the list, appears in the first position of the sorted list.

30 of 81

Selection Sort Algorithm

  • For the second position, where 33 is residing, we start scanning the rest of the list in a linear manner.

31 of 81

Selection Sort Algorithm

  • We find that 14 is the second lowest value in the list and it should appear at the second place. We swap these values.

32 of 81

Selection Sort Algorithm

  • After two iterations, two least values are positioned at the beginning in a sorted manner.

  • The same process is applied to the rest of the items in the array.

33 of 81

Selection Sort Algorithm

  • Following is a pictorial depiction of the entire sorting process,

34 of 81

Selection Sort Algorithm

35 of 81

Quick Sort Algorithm

  • This sorting algorithm picks up a pivot element, then partitions the dataset into two sub-arrays, one sub-array is greater than the element and another sub-array is less than the element.
  • The same process is repeated for sub-arrays till the dataset is sorted.

36 of 81

Quick Sort Algorithm

37 of 81

External Sorting

  • External sort requires algorithms whose space complexity doesn’t increase with the size of the dataset. While the space complexity of Merge Sort is O(n), we can optimize it to O(1).
  • Mostly Merge sort algorithm is used as external sorting technique.

38 of 81

Merge Sort Algorithm

  • Divide the large dataset into smaller subsets of size less than size of the RAM.
  • The space complexity of this step is O(1) as it doesn’t increase with the size of the dataset.
  • Use any of the popular internal sorting algorithms to sort the subsets one batch at a time.
  • The space of complexity of these algorithms is O(n).

39 of 81

Merge Sort Algorithm

  • The size of these subsets is less than size of RAM, it’ll require the same amount of memory to sort them.
  • Iterate using pointers to merge sorted subsets.
  • During this, we compare the values of elements of current pointers to subsets and put the smallest value in the output list.

40 of 81

Merge Sort Algorithm

  • Then move the pointer to the next item of the subset which has the smallest value.
  • Since we use pointers, the space complexity of this step is O(1) and it doesn’t increase with the size of the dataset.
  • We’ll process the following example using this method:

A S O R T I N G A N D M E R G I N G E X A M P L E

41 of 81

Merge Sort Algorithm

  • We’re assuming that the computer can store 10 elements.
  • The first step would be to divide the dataset into 3 subsets and they’re loaded in an external disk.
  • A S O R T I N G A N
  • D M E R G I N G E X
  • A M P L E

42 of 81

Merge Sort Algorithm

  • Then, as a part of step 2, we’d:
  • Load 1st subset (A S O R T I N G A N) to the computer RAM and sort it to (A A G I N N O R S T). Then store it on an external disk.
  • Load 2nd subset(D M E R G I N G E X) to the computer RAM and sort it to (D E E G G I M N R X). Then store it on an external disk.
  • Load 3rd subset(A M P L E) to the computer RAM and sort it to (A E L M P). Then store it on an external disk.

43 of 81

Merge Sort Algorithm

  • Then, Initialize the 3 pointers (one each input subsets) and 1 for output. Pointer for 1st Subset, 2nd Subset, and 3rd Subset is 1. Also, the pointer for the output dataset is 1.
  • Load the elements as pointed by pointers to three sorted datasets to Computer RAM. It’ll load max 3 elements as there are 3 subsets. It’ll load (A D A) to Computer RAM.

44 of 81

Merge Sort Algorithm

  • Find the lowest value from these elements and load it to the output array at the location identified by the output pointer. Since A is lowest in (A D A), load A to output in external disk.
  • These steps will continue till all the elements from the three subsets are processed. The output will be following after all the elements are processed:

A A A D E E E G G G I I L M M N N N O P R R S T X

45 of 81

Comparison of Sorting Techniques Based on Their Complexity

  • In a comparison based sorting algorithms, we compare elements of an array with each other to determines which of two elements should occur first in the final sorted list.
  • All comparison-based sorting algorithms have a complexity lower bound of nlogn.
  • Here is the comparison of time and space complexities of sorting algorithms.

46 of 81

Comparison of Sorting Techniques Based on Their Complexity

47 of 81

Hashing

  • Hashing is an important data structure designed to solve the problem of efficiently finding and storing data in an array.
  • For example, if you have a list of 20000 numbers, and you have given a number to search in that list- you will scan each number in the list until you find a match.

48 of 81

Hashing

  • It requires a significant amount of your time to search in the entire list and locate that specific number.
  • This manual process of scanning is not only time-consuming but inefficient too.
  • With hashing in the data structure, you can narrow down the search and find the number within seconds.

49 of 81

Hashing

  • Hashing in the data structure is a technique of mapping a large chunk of data into small tables using a hashing function.
  • It is also known as the message digest function. It is a technique that uniquely identifies a specific item from a collection of similar items.

50 of 81

Hashing

  • It uses hash tables to store the data in an array format.
  • Each value in the array has assigned a unique index number.
  • Hash tables use a technique to generate these unique index numbers for each value stored in an array format.
  • This technique is called the hash technique.

51 of 81

Hashing

  • You only need to find the index of the desired item, rather than finding the data.
  • With indexing, you can quickly scan the entire list and retrieve the item you wish.
  • Indexing also helps in inserting operations when you need to insert data at a specific location.
  • No matter how big or small the table is, you can update and retrieve data within seconds.

52 of 81

Hashing

  • Hashing in a data structure is a two-step process.
  • The hash function converts the item into a small integer or hash value. This integer is used as an index to store the original data.
  • It stores the data in a hash table. You can use a hash key to locate data quickly.

53 of 81

Hashing

  • The following is the real-life example of hashing in the data structure.
  • A library has an infinite number of books. The librarian assigns a unique number to each book. This unique number helps in identifying the position of the books on the bookshelf.

54 of 81

Hash Function

  • The hash function in a data structure maps arbitrary size of data to fixed-sized data.
  • It returns the following values: hash value, hash codes, and hash sums.

hash = hashfunc(key)

index = hash % array_size

55 of 81

Hash Function

  • The hash function must satisfy the following requirements:
  • A good hash function is easy to compute.
  • A good hash function never gets stuck in clustering and distributes keys evenly across the hash table.
  • A good hash function avoids collision when two elements or items get assigned to the same hash value.

56 of 81

Hash Table

  • Hashing in data structure uses hash tables to store the key-value pairs.
  • The hash table then uses the hash function to generate an index.
  • Hashing uses this unique index to perform insert, update, and search operations.

57 of 81

Hash Table

  • We generate a hash for the input using the hash function and then store the element using the generated hash as the key in the hash table.

58 of 81

Operation in Hash Function

  • Hash function uses 3 basic operations.
  • Insert
  • Delete
  • Search

59 of 81

Operation in Hash Function

  • Insert - T[ h(key) ] = value;
  • It calculates the hash, uses it as the key and stores the value in hash table.
  • Delete - T[ h(key) ] = NULL;
  • It calculates the hash, resets the value stored in the hash table for that key.
  • Search - return T[ h(key) ];
  • It calculates the hash, finds and returns the value stored in the hash table for that key.

60 of 81

How to choose a Hash Function

  • An efficient hash function should be built such that the index value of the added item is distributed equally across the table.
  • An effective collision resolution technique should be created to generate an alternate index for a key whose hash index corresponds to a previously inserted position in a hash table.
  • We must select a hash algorithm that is fast to calculate.

61 of 81

Characteristics of a good Hash Function

  • Uniform Distribution: For distribution throughout the constructed table.
  • Fast: The generation of hash should be very fast, and should not produce any considerable overhead.

62 of 81

Hashing Collision

  • Hashing in data structure falls into a collision if two keys are assigned the same index number in the hash table.
  • The collision creates a problem because each index in a hash table is supposed to store only one value.

63 of 81

Hashing Collision

  • Hashing in data structure uses several collision resolution techniques to manage the performance of a hash table.
  • Following are the collision resolution techniques used:
  • Open Hashing (Separate chaining)
  • Closed Hashing (Open Addressing) - Liner Probing. Quadratic probing, Double hashing.

64 of 81

Open Hashing (Separate Chaining)

  • In this technique, a linked list is created from the slot in which collision has occurred, after which the new key is inserted into the linked list.
  • This linked list of slots looks like a chain, so it is called separate chaining.
  • It is used more when we do not know how many keys to insert or delete.

65 of 81

Open Hashing (Separate Chaining)

  • Time complexity
  • Its worst-case complexity for searching is O(n).
  • Its worst-case complexity for deletion is O(n).

66 of 81

Open Hashing (Separate Chaining)

  • Advantages of separate chaining
  • It is easy to implement.
  • The hash table never fills full, so we can add more elements to the chain.
  • It is less sensitive to the function of the hashing.

67 of 81

Open Hashing (Separate Chaining)

  • Disadvantages of separate chaining
  • In this, cache performance of chaining is not good.
  • The memory wastage is too much in this method.
  • It requires more space for element links.

68 of 81

Closed Hashing (Open Addressing)

  • Open addressing is collision-resolution method that is used to control the collision in the hashing table.
  • There is no key stored outside of the hash table. Therefore, the size of the hash table is always greater than or equal to the number of keys.
  • It is also called closed hashing.

69 of 81

Closed Hashing (Open Addressing)

  • The following techniques are used in open addressing:
  • Linear probing
  • Quadratic probing
  • Double hashing

70 of 81

Linear Probing

  • In this, when the collision occurs, we perform a linear probe for the next slot, and this probing is performed until an empty slot is found.
  • In linear probing, the worst time to search for an element is O(table size).
  • The linear probing gives the best performance of the cache but its problem is clustering.

71 of 81

Linear Probing

  • The main advantage of this technique is that it can be easily calculated.

  • Disadvantages of linear probing
  • The main problem is clustering.
  • It takes too much time to find an empty slot.

72 of 81

Linear Probing

  • Imagine you have been asked to store some items inside a hash table of size 30.
  • The items are already sorted in a key-value pair format.
  • The values given are: (3,21) (1,72) (63,36) (5,30) (11,44) (15,33) (18,12) (16,80) (46,99).

73 of 81

Linear Probing

  • The hash(n) is the index computed using a hash function and T is the table size.
  • If slot index = ( hash(n) % T) is full, then we look for the next slot index by adding 1 ((hash(n) + 1) % T).
  • If (hash(n) + 1) % T is also full, then we try (hash(n) + 2) % T.
  • If (hash(n) + 2) % T is also full, then we try (hash(n) + 3) % T.

74 of 81

Linear Probing

75 of 81

Quadratic Probing

  • In this, when the collision occurs, we probe for i2th slot in ith iteration, and this probing is performed until an empty slot is found.
  • The cache performance in quadratic probing is lower than the linear probing.
  • Quadratic probing also reduces the problem of clustering.

76 of 81

Double Hashing

  • The double hashing technique uses two hash functions.
  • The second hash function comes into use when the first function causes a collision.
  • It provides an offset index to store the value.
  • In this, you use another hash function, and probe for (i * hash 2(x)) in the ith iteration.

77 of 81

Double Hashing

  • It takes longer to determine two hash functions.
  • Double hashing has a high computation cost, but it searches the next free slot faster than the linear probing method.
  • The double probing gives the very poor the cache performance, but there has no clustering problem in it.

78 of 81

Double Hashing

  • Imagine you need to store some items inside a hash table of size 20.
  • The values given are:

(16, 8, 63, 9, 27, 37, 48, 5, 69, 34, 1).

h1(n)=n%20

h2(n)=n%13

n h(n, i) = (h1 (n) + ih2(n)) mod 20

79 of 81

Double Hashing

n

h(n,i) = (h’(n) + i2) %20

16

I = 0, h(n,0) = 16

8

I = 0, h(n,0) = 8

63

I = 0, h(n,0) = 3

9

I = 0, h(n,0) = 9

27

I = 0, h(n,0) = 7

37

I = 0, h(n,0) = 17

80 of 81

Double Hashing

48

I = 0, h(n,0) = 8

I = 0, h(n,1) = 9

I = 0, h(n,2) = 12

5

I = 0, h(n,0) = 5

69

I = 0, h(n,0) = 9

I = 0, h(n,1) = 10

34

I = 0, h(n,0) = 14

1

I = 0, h(n,0) = 1

81 of 81

ANY DOUBTS?