1 of 22

Lucene Indexing

2 of 22

What is a Search Index

  • A search index is a variant of a database
  • Similarities with a traditional RDBMS
    • Needs fast lookup by key
    • Bulk of the data resides on secondary storage
    • Minimize secondary-storage accesses
  • Differences (additions) compared to an RDBMS
    • No definitive score; returns only the top-k ranked hits
    • Relaxed transactional requirements
    • Two-level processing required (details later)

3 of 22

Requirements of a Search Index

  • Minimize disk access at the expense of CPU
  • Indexing
    • Support for Dynamic Indexing
      • Relatively fast indexing without compromising search time.
      • Simultaneous search/indexing
    • Index Compression
      • Speeds up overall process by resulting in less I/O
  • Searching
    • Needs an inverted index for efficient search
    • Merge search results for different query terms

4 of 22

Indexing Strategies

  • Primarily requires a sorted list of term postings
  • Possible data structures
    • B-tree based – O(log_b N)
      • Requires random access, hence frequent disk seeks
    • Merge based – O(log_k N)
      • Sort in memory, then merge the sorted runs/segments
        • Create new segments for newly added documents
        • Merge existing segments
    • Trie based – O(term length)
      • Requires random access, frequent disk seeks

5 of 22

Lucene Indexing Fundamentals

  • An index contains a collection of documents
  • A document is a collection of fields
  • A field is a named collection of terms
  • A term is a paired string <field,term-string>
  • Inverted Index for efficient term-based search

6 of 22

Some Indexing Terms

  • Fields (independent search spaces)
    • Unlike CPL, fields are defined at run-time
    • Field types: stored, indexed, tokenized
      • A field can be both stored and indexed
      • An indexed field can be tokenized
  • Segments (sub indexes)
    • Independent searchable index
    • Lucene index evolves with segments getting added/merged
    • Search results from segments are merged

7 of 22

Merge Algorithm

  • Maintain a stack of segment indexes
  • Create a segment/index for every document added
  • Push new indexes onto the stack
  • for (size = 1; size < INF; size *= b) {
    • if (there exist b indexes of size docs on top of the stack) {
      • Pop them off the stack
      • Merge them into a single index
      • Push the merged index onto the stack
    • } else {
      • break
    • } }
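The loop above can be sketched as a small simulation of the stack-based logarithmic merge. This is a model of the policy only (segment sizes as doc counts), not Lucene's actual merge code:

```python
# A minimal sketch of the stack-based logarithmic merge policy:
# every added doc becomes a 1-doc segment; whenever b equal-sized
# segments sit on top of the stack, they are merged into one
# segment b times as large.

def add_document(stack, b=2):
    """Push a 1-doc segment, then merge while b equal-sized segments top the stack."""
    stack.append(1)
    while len(stack) >= b and len(set(stack[-b:])) == 1:
        size = stack[-1]
        del stack[-b:]          # pop the b equal-sized segments
        stack.append(size * b)  # push the merged segment

stack = []
for _ in range(10):
    add_document(stack)
print(stack)  # segment sizes follow the base-b representation of N
```

With b = 2 and 10 documents the stack ends as [8, 2]: segment sizes mirror the binary representation of the doc count, which is why only O(log N) segments exist at any time.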

8 of 22

Lucene Indexing Algorithm

  • K-way merge – process at disk transfer rate
  • Average b*log(N) indexes
    • N = 1M, b = 2 gives 20 indexes
    • Fast to update and not too slow to search
  • Optimization
    • Small segments are kept in RAM, saving I/O calls

9 of 22

Search Algorithm

  • Uses Inverted Index
    • Data structure: term -> postings (doc entries)
    • Use skip lists to speed up search
  • Merge postings from different query terms
    • Efficient relevancy score calculation
    • Use Vector Space Model for similarity calculation
    • Parallel merge algorithm

10 of 22

A Lucene Segment

[Diagram: layout of the segments file – Format, Version, NameCounter, SegCount, then <SegName, SegSize> repeated SegCount times]

11 of 22

Per Index Files

  • Segments File (“segments” file)
    • Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount
  • Lock Files
    • Several files are used to indicate that another process is using an index. These are stored in java.io.tmpdir
  • Deletable Files
    • Only used on Win32, where a file may not be deleted while it is still open
  • Compound files (“.cfs” file, just a container)
    • Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount

12 of 22

Per Segment Files

  • Each segment has the following :
    • Field names (e.g. Title, content etc.)
    • Stored Field Values ( e.g. Urls )
    • Term dictionary
      • A dictionary containing all of the terms used in all of the indexed fields of all of the documents. It also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.
    • Term Frequency data.
      • For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document.

13 of 22

File Format (contd.)

  • Term Proximity data
    • For each term in the dictionary, the positions that the term occurs in each document.
  • Term Vectors
    • For each field in each document, the term vector (sometimes called document vector) is stored. A term vector consists of term text and term frequency.
  • Normalization factors
    • For each field in each document, a value is stored that is multiplied into the score for hits on that field.
  • Deleted Documents

14 of 22

Per Segment Files (prefix = segment name)

  • Fields
      • Field Info ( .fnm file)
      • Stored Fields (.fdx and .fdt files)
  • Term Dictionary
    • Term Infos (.tis file)
      • TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
    • Term Info Index (.tii file)
      • This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access into the .tis file.
      • TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices

15 of 22

Per Segment Files (contd..)

  • Frequencies (.frq or the doc postings file)
    • Contains lists of documents which contain each term, along with the frequency of the term in that document.
    • Referenced from the .tis file via FreqDelta (delta encoded)
  • Positions (.prx file)
    • Contains list of term positions within the documents
    • Used for Span Query, Phrase Query searches
  • Normalization (.f[0-9]* files)
    • A norm file (.f<n>) for each indexed field, with one byte per document

16 of 22

Segment layout

[Diagram: the .tii index (held in memory) stores every IndexInterval-th term (Ti, Tj, …) with an IndexDelta locating it in the on-disk .tis file; a lookup does one random seek into .tis, then contiguous reads over entries <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>; FreqDelta locates the term's posting list (doc-ids, term frequencies TF1..TFd, skip data SD1..SDd/sk) in the .frq file and ProxDelta its positions in the .prx file; these postings are then used for merging]

17 of 22

Index Compression

  • Variable byte encoding
    • Needs very fast decoding (byte-aligned)
    • Given a binary representation
      • Split it into 7-bit groups (each byte holds a value 0-127)
      • The 8th (high) bit of each byte marks whether another byte follows; decode by appending 7-bit groups until it is clear
  • Front encoding of terms
      • Sorted terms commonly share a long common prefix
        • 8,automata,8,automate,9,automatic,10,automation
        • 0,automata,7,e,7,ic,7,ion (huge saving)
      • Works only if terms are sorted lexicographically

18 of 22

Search Algorithm

  • Posting Lists (from .frq file)
    • For every term Ti, we have a posting list
      • Ti -> (Doc-id,Freq(d,t)) *
      • Ti's are sorted lexicographically; each posting list is sorted by doc-id
    • Postings are delta encoded (for index compression)
  • Based on Vector Space Model
    • Conjunctive Search Algorithm (AND queries)
    • Disjunctive Search Algorithm (OR queries)

19 of 22

Vector Space Model

  • Terms constitute the space (or the axes)
  • Document/query are represented as a vector
  • Use cosine as a similarity measure (for docs)
  • Query is a short doc
  • TF-IDF weighting scheme
    • weight(t,d) = tf(t,d) * log(N/df(t)), where N = total docs and df(t) = docs containing t
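The model above can be sketched on a hypothetical toy corpus. The corpus, token lists, and helper names are illustrative only, not Lucene's scoring implementation:

```python
# Vector-space scoring sketch: documents and the query are TF-IDF
# vectors over the term axes, compared by cosine similarity.
import math
from collections import Counter

docs = [["lucene", "index", "search"],
        ["lucene", "lucene", "merge"],
        ["database", "index"]]
N = len(docs)
df = Counter(t for d in docs for t in set(d))  # document frequency per term

def tfidf(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf if t in df}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

query = tfidf(["lucene", "search"])  # the query is treated as a short doc
scores = sorted(((cosine(query, tfidf(d)), i) for i, d in enumerate(docs)),
                reverse=True)
```

Here the first document ranks highest, since it matches both query terms and "search" carries a higher IDF weight than the common term "lucene".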

20 of 22

Search Algorithm (contd.)

  • Conjunctive Search Algorithm
    • The AND merge
      • Walk through 2 postings simultaneously, O(m+n)
      • Crucial – postings sorted by doc-id
    • Use skip pointers to speed up merge
  • Disjunctive search Algorithm
      • All postings should be merged
      • Minimize per postings processing
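The AND merge above can be sketched as a two-pointer walk over two doc-id-sorted postings lists in O(m + n). This is a generic sketch; Lucene additionally follows skip pointers to advance in larger strides:

```python
# Conjunctive (AND) merge of two postings lists, each sorted by doc-id.
def and_merge(p1, p2):
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])   # doc contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller doc-id
        else:
            j += 1
    return result

print(and_merge([1, 3, 5, 8, 12], [2, 3, 8, 13]))  # [3, 8]
```

The doc-id sort order is what makes the single forward pass correct: a doc-id can only match if both pointers reach it, so the smaller side can always be advanced safely.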

21 of 22

Others..

  • Positional Queries
    • Use positional information for doc/term tuple
  • Range Queries
    • Date Range queries
  • Stemming
    • Porter Stemmer Algorithm
  • Configurable Similarity algorithms

22 of 22

Open ended questions

  • Incremental Update to docs
    • Relevant to our focused crawl (for av files)
    • Hard problem to solve
  • Deficiencies in VSM
    • Need exact query terms
    • Use auto-classification/clustering
  • Q/A