2 of 54

Why Indexing & Hashing?

Some Queries return only small portion of records from file or only one record from a file.
It is inefficient for the system to read every record.
Is there any way to create an index page like our books. Whenever I want to search about transaction processing. I will go to index and search on which page that chapter starts. so, it will be easy for me to search rather than checking all pages.

3 of 54

Database grows day by day. You cant save that all data’s (all records) on the same data block because they are having fix size.
So, Can I say that my data resides on number of data blocks?
At the time of searching database system has to copy data block into main memory and search your queried data resides in that block or not. Likewise It has to copy all data block’s one by one just to retrieve 1 or n records.

4 of 54

Indexing

To retrieve data of an employee having e_id=101 . The database system will look up an index to find on which disk block corresponding record resides and then fetch the disk block to get the appropriate record.

Indexing mechanisms used to speed up access to desired data.

So, how to create indexing?

5 of 54

Search Key - attribute to set of attributes used to look up records in a file.
An index file consists of records (called index entries or index record) of the form�

�

Index files are typically much smaller than the original file
Two basic kinds of indices:

Ordered indices: search keys are stored in sorted order
Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.

Search key value

Pointer to record

Index entry/index record

6 of 54

From above two techniques which one to choose?
Each techniques have evaluated on the basis of below factors.

Access type
Access time
Insertion time
Deletion time
Space overhead

7 of 54

Ordered indices

In an ordered index, index entries are stored sorted on the search key value.
Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.

Also called clustering index
The search key of a primary index is usually but not necessarily the primary key.
Can be sparse

8 of 54

Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.

[an index whose search key is different from the sequential order of the file ]

Usually dense

Index-sequential file: ordered sequential file with a primary index

9 of 54

Types of ordered indices

Dense indices
Sparse indices

10 of 54

Dense indices

Dense index —there is an index record for every search key value in the database.
This makes searching faster but requires more space to store index records itself.
Index record contains search key value and a pointer to the actual record on the disk.

11 of 54

Sparse indices

Sparse Index: contains index records for only some search-key values.

Applicable when records are sequentially ordered on search-key

To locate a record with search-key value K we:

Find index record with largest search-key value < K
We start at the record pointed to by that index entry, follow the pointers in the file until we find the desired record

13 of 54

Index update: record insertion

Perform a lookup using the key value from inserted record
Dense indices – if the search-key value does not appear in the index, insert it.
Sparse indices – if index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.

If a new block is created, the first search-key value appearing in the new block is inserted into the index.

14 of 54

Index update: record deletion

If deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also.

Dense indices – deletion of search-key: similar to file record deletion.
Sparse indices –

if deleted key value exists in the index, the value is replaced by the next search-key value in the file (in search-key order).
If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

15 of 54

Sparse v/s Dense

Less space
Less maintenance overhead for insertion and deletion
Generally slower than dense index for locating records.
Good tradeoff: sparse index with an index entry for every block in file, corresponding to least search-key value in the block.

16 of 54

Problems

Index records are comprised of search-key value and data pointers.
This index itself is stored on the disk along with the actual database files.
As the size of database grows so does the size of indices.
There is an immense need to keep the index records in the main memory so that the search can speed up.
If single level index is used then a large size index cannot be kept in memory as whole and this leads to multiple disk accesses.

17 of 54

Solution: multi level index

Multi-level Index helps breaking down the index into several smaller indices in order to make the outer most level so small that it can be saved in single disk block which can easily be accommodated anywhere in the main memory.

18 of 54

Sequential File

100

Sparse 2nd level

110

130

150

170

190

210

230

170

250

330

410

490

570

19 of 54

Another problem

Often one wants to find all records whose values in a certain field (which is not the search key of the primary index) satisfy some condition

{ Example 1: In the EMPLOYEE database, records are stored sequentially by EmpNo, we want to find employees working in a particular department.

{ Example 2: as above, but we want to find all employees with a specified salary or range of salary

20 of 54

Secondary indices

One can specify a secondary index with an index entry for each search key value;
index entry points to a bucket, which contains pointers to all the actual records with that particular search key.

21 of 54

Secondary indices

22 of 54

Primary v/s secondary indices

Secondary indexes have to be dense
Indexes offer substantial benefits when searching for records
When a record file is modified (e.g., a relation), every index on that file must be updated. Updating indexes imposes overhead on database performance.
Sequential scan using primary index is efficient, but a
sequential scan using a secondary index is expensive (each record access may fetch a new block from disk)

23 of 54

A search key containing more than one attribute is referred to as composite search key.

Main disadvantage of the index-sequential file organization is that performance degrades as the file grows and File reorganizations.

24 of 54

Dynamic multilevel indexing using B+ tree

Disadvantage of index-sequential files: performance degrades as sequential file grows, because many overflow blocks are created. Periodic reorganization of entire file is required.

Advantage of B+-Tree index file: automatically reorganizes itself with small, local changes in the case of insertions and deletions. Reorganization of entire file is not required to maintain performance.

Disadvantage of B+-Trees: extra insertions and deletion overhead, space overhead.

Advantages of B+- Trees outweigh disadvantages, and B+ - Trees are used extensively in all DBMS.

25 of 54

B+ tree node structure

Typical node��

K_i are the search-key values
P_i are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).

The search-keys in a node are ordered

K₁< K₂< K₃< . . .< K_n–₁

26 of 54

Insertion into B+ tree

While inserting values into node. if node is full then follow the following two rules.

If node is leaf then break down the node into two partitions.

The first partition should hold ceil of (n-1)/2 key values.
Second partition should hold rest of the key values where n is number of pointers.
Copy of smallest key element from second partition to parent node.

If node is non leaf node then break down the node into two partitions.

The first partition should hold ceil of(n/2)-1 key values
Second partition can hold rest of the key values where n is number of pointers
Move the smallest element from second partition to parent node.

27 of 54

Example

Insert 2,5,7,10,13,16,20,22,23,24

Assume n=4 or degree=4;

Insert 2:

Insert 5:

28 of 54

Insert 7

Insert 10:

No space how can I insert?

29 of 54

An initial node in B+ tree is leaf node.

If node is leaf then break down the node into two partitions.

The first partition should hold ceil of (n-1)/2 key values.
Second partition should hold rest of the key values where n is number of pointers.
Copy of smallest key element from second partition to parent node.

30 of 54

Insert 13

Insert 16

31 of 54

32 of 54

Insert 20:

Insert 22:

33 of 54

Insert 23:

Insert 24:

34 of 54

overflow root

35 of 54

Apply Rule:2

If node is non leaf node then break down the node into two partitions.

The first partition should hold ceil of(n/2)-1 key values
Second partition can hold rest of the key values where n is number of pointers
Move the smallest element from second partition to parent node.

Ceil of (n/2)-1=Ceil of(4/2)-1=2-1=1

36 of 54

37 of 54

B+ tree deletion

When number of search key values < (n-1)/2 nothing to do.
Leaf node:

Redistribute to sibling

Right node greater than left node
Replace the between value in parent by their smallest value of the right node.

Merge

Move all values , pointers to the left node
Remove the between value in parent

38 of 54

Non leaf node

Redistribute to sibling

Through parent
Right node not less than left node.

Merge

Bring down the parent
Move all values , pointers to left node
Delete the right node and pointers in parent.

39 of 54

Leaf node deletion

Delete 10

40 of 54

Non leaf node deletion

41 of 54

Delete 3

42 of 54

43 of 54

Delete 19
Calculate (n-1)/2=(5-1)/2=2
After deleting 19, 2 elements will be there so not to do

44 of 54

Delete 20 only 1 element will be there which is not half full
deleting at leaf node value
Apply leaf level deletion rule

45 of 54

Delete 24

46 of 54

47 of 54

48 of 54

49 of 54

Similar to B+-tree, but B-tree allows search-key values to appear only once. eliminates redundant storage of search keys.
Search keys in nonleaf nodes appear nowhere else in the
B-tree an additional pointer field for each search key in a nonleaf node must be included.
Generalized B-tree leaf node��

Nonleaf node – pointers Bi are the bucket or file record pointers.

52 of 54

Advantages of B-Tree indices:

May use less tree nodes than a corresponding B⁺-Tree.
Sometimes possible to find search-key value before reaching leaf node.

Disadvantages of B-Tree indices:

Only small fraction of all search-key values are found early
Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees typically have greater depth than corresponding B⁺-Tree
Insertion and deletion more complicated than in B⁺-Trees
Implementation is harder than B⁺-Trees.

Typically, advantages of B-Trees do not out weigh disadvantages.

53 of 54

Advantages of B+ trees:

Because B+ trees don't have data associated with interior nodes, more keys can fit on a page of memory. Therefore, it will require fewer cache misses in order to access data that is on a leaf node.
The leaf nodes of B+ trees are linked, so doing a full scan of all objects in a tree requires just one linear pass through all the leaf nodes.
a B tree, on the other hand, would require a traversal of every level in the tree. This full-tree traversal will likely involve more cache misses than the linear traversal of B+ leaves.

Advantage of B trees:

Because B trees contain data with each key, frequently accessed nodes can lie closer to the root, and therefore can be accessed more quickly.

54 of 54

Other issues in Indexing

Covering indices

Add extra attributes to index so (some) queries can avoid fetching the actual records

Particularly useful for secondary indices

Why?

Can store extra attributes only at leaf

Record relocation and secondary indices

If a record moves, all secondary indices that store record pointers have to be updated
Node splits in B⁺-tree file organizations become very expensive
Solution: use primary-index search key instead of record pointer in secondary index

Extra traversal of primary index to locate record

Higher cost for queries, but node splits are cheap

Add record-id if primary-index search key is non-unique

1 of 54