1 of 54

Indexing

2 of 54

Why Indexing & Hashing?

  • Some Queries return only small portion of records from file or only one record from a file.
  • It is inefficient for the system to read every record.
  • Is there any way to create an index page like our books. Whenever I want to search about transaction processing. I will go to index and search on which page that chapter starts. so, it will be easy for me to search rather than checking all pages.

3 of 54

  • Database grows day by day. You cant save that all data’s (all records) on the same data block because they are having fix size.
  • So, Can I say that my data resides on number of data blocks?
  • At the time of searching database system has to copy data block into main memory and search your queried data resides in that block or not. Likewise It has to copy all data block’s one by one just to retrieve 1 or n records.

4 of 54

Indexing

  • To retrieve data of an employee having e_id=101 . The database system will look up an index to find on which disk block corresponding record resides and then fetch the disk block to get the appropriate record.

  • Indexing mechanisms used to speed up access to desired data.

  • So, how to create indexing?

5 of 54

  • Search Key - attribute to set of attributes used to look up records in a file.
  • An index file consists of records (called index entries or index record) of the form�

  • Index files are typically much smaller than the original file
  • Two basic kinds of indices:
    • Ordered indices: search keys are stored in sorted order
    • Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.

Search key value

Pointer to record

Index entry/index record

6 of 54

  • From above two techniques which one to choose?
  • Each techniques have evaluated on the basis of below factors.
    • Access type
    • Access time
    • Insertion time
    • Deletion time
    • Space overhead

7 of 54

Ordered indices

  • In an ordered index, index entries are stored sorted on the search key value.
  • Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
    • Also called clustering index
    • The search key of a primary index is usually but not necessarily the primary key.
    • Can be sparse

8 of 54

  • Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.

[an index whose search key is different from the sequential order of the file ]

Usually dense

  • Index-sequential file: ordered sequential file with a primary index

9 of 54

Types of ordered indices

  • Dense indices
  • Sparse indices

10 of 54

Dense indices

  • Dense indexthere is an index record for every search key value in the database.
  • This makes searching faster but requires more space to store index records itself.
  • Index record contains search key value and a pointer to the actual record on the disk.

11 of 54

Sparse indices

  • Sparse Index: contains index records for only some search-key values.
    • Applicable when records are sequentially ordered on search-key

To locate a record with search-key value K we:

    • Find index record with largest search-key value < K
    • We start at the record pointed to by that index entry, follow the pointers in the file until we find the desired record

12 of 54

13 of 54

Index update: record insertion

    • Perform a lookup using the key value from inserted record
    • Dense indicesif the search-key value does not appear in the index, insert it.
    • Sparse indicesif index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.
      • If a new block is created, the first search-key value appearing in the new block is inserted into the index.

14 of 54

Index update: record deletion

  • If deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also.
    • Dense indices deletion of search-key: similar to file record deletion.
    • Sparse indices
      • if deleted key value exists in the index, the value is replaced by the next search-key value in the file (in search-key order).
      • If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

15 of 54

Sparse v/s Dense

  • Less space
  • Less maintenance overhead for insertion and deletion
  • Generally slower than dense index for locating records.
  • Good tradeoff: sparse index with an index entry for every block in file, corresponding to least search-key value in the block.

16 of 54

Problems

  • Index records are comprised of search-key value and data pointers.
  • This index itself is stored on the disk along with the actual database files.
  • As the size of database grows so does the size of indices.
  • There is an immense need to keep the index records in the main memory so that the search can speed up.
  • If single level index is used then a large size index cannot be kept in memory as whole and this leads to multiple disk accesses.

17 of 54

Solution: multi level index

Multi-level Index helps breaking down the index into several smaller indices in order to make the outer most level so small that it can be saved in single disk block which can easily be accommodated anywhere in the main memory.

18 of 54

Sequential File

20

10

40

30

60

50

80

70

100

90

Sparse 2nd level

10

30

50

70

90

110

130

150

170

190

210

230

10

90

170

250

330

410

490

570

19 of 54

Another problem

  • Often one wants to find all records whose values in a certain field (which is not the search key of the primary index) satisfy some condition

{ Example 1: In the EMPLOYEE database, records are stored sequentially by EmpNo, we want to find employees working in a particular department.

{ Example 2: as above, but we want to find all employees with a specified salary or range of salary

20 of 54

Secondary indices

  • One can specify a secondary index with an index entry for each search key value;
  • index entry points to a bucket, which contains pointers to all the actual records with that particular search key.

21 of 54

Secondary indices

22 of 54

Primary v/s secondary indices

  • Secondary indexes have to be dense
  • Indexes offer substantial benefits when searching for records
  • When a record file is modified (e.g., a relation), every index on that file must be updated. Updating indexes imposes overhead on database performance.
  • Sequential scan using primary index is efficient, but a
  • sequential scan using a secondary index is expensive (each record access may fetch a new block from disk)

23 of 54

  • A search key containing more than one attribute is referred to as composite search key.

  • Main disadvantage of the index-sequential file organization is that performance degrades as the file grows and File reorganizations.

24 of 54

Dynamic multilevel indexing using B+ tree

  • Disadvantage of index-sequential files: performance degrades as sequential file grows, because many overflow blocks are created. Periodic reorganization of entire file is required.

  • Advantage of B+-Tree index file: automatically reorganizes itself with small, local changes in the case of insertions and deletions. Reorganization of entire file is not required to maintain performance.

  • Disadvantage of B+-Trees: extra insertions and deletion overhead, space overhead.

  • Advantages of B+- Trees outweigh disadvantages, and B+ - Trees are used extensively in all DBMS.

25 of 54

B+ tree node structure

  • Typical node���
    • Ki are the search-key values
    • Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
  • The search-keys in a node are ordered

K1 < K2 < K3 < . . . < Kn–1

26 of 54

Insertion into B+ tree

  • While inserting values into node. if node is full then follow the following two rules.
    1. If node is leaf then break down the node into two partitions.
      1. The first partition should hold ceil of (n-1)/2 key values.
      2. Second partition should hold rest of the key values where n is number of pointers.
      3. Copy of smallest key element from second partition to parent node.
    2. If node is non leaf node then break down the node into two partitions.
      • The first partition should hold ceil of(n/2)-1 key values
      • Second partition can hold rest of the key values where n is number of pointers
      • Move the smallest element from second partition to parent node.

27 of 54

Example

  • Insert 2,5,7,10,13,16,20,22,23,24

Assume n=4 or degree=4;

Insert 2:

Insert 5:

2

2

5

28 of 54

  • Insert 7

Insert 10:

No space how can I insert?

2

5

7

2

5

7

29 of 54

  • An initial node in B+ tree is leaf node.
    1. If node is leaf then break down the node into two partitions.
      1. The first partition should hold ceil of (n-1)/2 key values.
      2. Second partition should hold rest of the key values where n is number of pointers.
      3. Copy of smallest key element from second partition to parent node.

2

5

7

10

7

30 of 54

  • Insert 13

  • Insert 16

2

5

7

10

13

7

31 of 54

7

10

13

16

7

10

13

13

16

2

5

7

10

7

13

32 of 54

  • Insert 20:

  • Insert 22:

13

16

20

2

5

7

10

7

13

13

16

2

5

7

10

7

13

20

20

22

33 of 54

  • Insert 23:

  • Insert 24:

13

16

2

5

7

10

7

13

20

20

22

23

34 of 54

overflow root

20

22

23

23

24

20

22

7

13

20

35 of 54

  • Apply Rule:2

If node is non leaf node then break down the node into two partitions.

      • The first partition should hold ceil of(n/2)-1 key values
      • Second partition can hold rest of the key values where n is number of pointers
      • Move the smallest element from second partition to parent node.

  • Ceil of (n/2)-1=Ceil of(4/2)-1=2-1=1

36 of 54

13

16

2

5

7

10

7

23

24

20

22

20

23

13

37 of 54

B+ tree deletion

  • When number of search key values < (n-1)/2 nothing to do.
  • Leaf node:
    • Redistribute to sibling
      • Right node greater than left node
      • Replace the between value in parent by their smallest value of the right node.
    • Merge
      • Move all values , pointers to the left node
      • Remove the between value in parent

38 of 54

  • Non leaf node
    • Redistribute to sibling
      • Through parent
      • Right node not less than left node.
    • Merge
      • Bring down the parent
      • Move all values , pointers to left node
      • Delete the right node and pointers in parent.

39 of 54

Leaf node deletion

9

10

13

14

16

13

18

Delete 10

9

13

14

16

14

18

40 of 54

Non leaf node deletion

3

8

5

20

3

1

41 of 54

  • Delete 3

5

8

20

1

42 of 54

33

34

38

19

20

22

24

27

29

24

30

39

5

13

17

43 of 54

  • Delete 19
  • Calculate (n-1)/2=(5-1)/2=2
  • After deleting 19, 2 elements will be there so not to do

33

34

38

20

22

24

27

29

24

30

39

44 of 54

33

34

38

22

24

27

29

27

30

39

  • Delete 20 only 1 element will be there which is not half full
  • deleting at leaf node value
  • Apply leaf level deletion rule

45 of 54

  • Delete 24

33

34

38

22

27

29

29

30

39

46 of 54

33

34

38

22

27

29

30

39

47 of 54

33

34

38

22

27

29

30

39

5

13

17

48 of 54

33

34

38

22

27

5

13

17

30

39

49 of 54

  • Similar to B+-tree, but B-tree allows search-key values to appear only once. eliminates redundant storage of search keys.
  • Search keys in nonleaf nodes appear nowhere else in the
  • B-tree an additional pointer field for each search key in a nonleaf node must be included.
  • Generalized B-tree leaf node�

Nonleaf node – pointers Bi are the bucket or file record pointers.

50 of 54

B Tree

51 of 54

B+ Tree

52 of 54

  • Advantages of B-Tree indices:
    • May use less tree nodes than a corresponding B+-Tree.
    • Sometimes possible to find search-key value before reaching leaf node.
  • Disadvantages of B-Tree indices:
    • Only small fraction of all search-key values are found early
    • Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees typically have greater depth than corresponding B+-Tree
    • Insertion and deletion more complicated than in B+-Trees
    • Implementation is harder than B+-Trees.
  • Typically, advantages of B-Trees do not out weigh disadvantages.

53 of 54

Advantages of B+ trees:

  • Because B+ trees don't have data associated with interior nodes, more keys can fit on a page of memory. Therefore, it will require fewer cache misses in order to access data that is on a leaf node.
  • The leaf nodes of B+ trees are linked, so doing a full scan of all objects in a tree requires just one linear pass through all the leaf nodes.
  • a B tree, on the other hand, would require a traversal of every level in the tree. This full-tree traversal will likely involve more cache misses than the linear traversal of B+ leaves.

Advantage of B trees:

  • Because B trees contain data with each key, frequently accessed nodes can lie closer to the root, and therefore can be accessed more quickly.

54 of 54

Other issues in Indexing

  • Covering indices
    • Add extra attributes to index so (some) queries can avoid fetching the actual records
      • Particularly useful for secondary indices
        • Why?
    • Can store extra attributes only at leaf
  • Record relocation and secondary indices
    • If a record moves, all secondary indices that store record pointers have to be updated
    • Node splits in B+-tree file organizations become very expensive
    • Solution: use primary-index search key instead of record pointer in secondary index
      • Extra traversal of primary index to locate record
        • Higher cost for queries, but node splits are cheap
      • Add record-id if primary-index search key is non-unique