1 of 22

Lucene Indexing

2 of 22

What is a Search Index

  • A search index is a variant of a database
  • Similarities with a traditional RDBMS
    • Needs fast lookup by key
    • Bulk of the data resides on secondary storage
    • Minimize secondary-storage accesses
  • Differences (additions) compared to an RDBMS
    • No definitive score; returns only the top-k ranked hits
    • Relaxed transactional requirements
    • Two-level processing required (details later)

3 of 22

Requirements of a Search Index

  • Minimize disk access at the expense of CPU
  • Indexing
    • Support for Dynamic Indexing
      • Relatively fast indexing without compromising search time.
      • Simultaneous search/indexing
    • Index Compression
      • Speeds up overall process by resulting in less I/O
  • Searching
    • Needs an inverted index for efficient search
    • Merge search results for different query terms

4 of 22

Indexing Strategies

  • Primarily requires a sorted list of term postings
  • Possible data structures
    • B-tree based – O(log_b N)
      • Requires random access, hence frequent disk seeks
    • Merge based – O(log_k N)
      • Sort in memory, then merge the sorted runs/segments
        • Create new segments for newly added documents
        • Merge existing segments
    • Trie based – O(term length)
      • Requires random access, frequent disk seeks

5 of 22

Lucene Indexing Fundamentals

  • An index contains a collection of documents
  • A document is a collection of fields
  • A field is a named collection of terms
  • A term is a paired string <field,term-string>
  • Inverted Index for efficient term-based search

6 of 22

Some Indexing Terms

  • Fields (independent search spaces)
    • Unlike CPL, fields are defined at run-time
    • Field types: stored, indexed, tokenized
      • A field can be both stored and indexed
      • An indexed field can be tokenized
  • Segments (sub indexes)
    • Independent searchable index
    • Lucene index evolves with segments getting added/merged
    • Search results from segments are merged

7 of 22

Merge Algorithm

  • Maintain a stack of segment indexes
  • Create a segment/index for every document added
  • Push new indexes onto the stack
  • for (size = 1; size < INF; size *= b) {
    • if (there exist b indexes of size docs on top of the stack) {
      • Pop them off the stack
      • Merge them into a single index
      • Push the merged index onto the stack
    • } else {
      • break
    • } }
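The loop above can be sketched as a small simulation of the stack-based logarithmic merge. This is a model of the policy only (segment sizes as doc counts), not Lucene's actual merge code:

```python
# A minimal sketch of the stack-based logarithmic merge policy:
# every added doc becomes a 1-doc segment; whenever b equal-sized
# segments sit on top of the stack, they are merged into one
# segment b times as large.

def add_document(stack, b=2):
    """Push a 1-doc segment, then merge while b equal-sized segments top the stack."""
    stack.append(1)
    while len(stack) >= b and len(set(stack[-b:])) == 1:
        size = stack[-1]
        del stack[-b:]          # pop the b equal-sized segments
        stack.append(size * b)  # push the merged segment

stack = []
for _ in range(10):
    add_document(stack)
print(stack)  # segment sizes follow the base-b representation of N
```

With b = 2 and 10 documents the stack ends as [8, 2]: segment sizes mirror the binary representation of the doc count, which is why only O(log N) segments exist at any time.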

8 of 22

Lucene Indexing Algorithm

  • K-way merge – process at disk transfer rate
  • Average b*log(N) indexes
    • N = 1M, b = 2 gives 20 indexes
    • Fast to update and not too slow to search
  • Optimization
    • Small segments are kept in RAM, saving I/O calls

9 of 22

Search Algorithm

  • Uses Inverted Index
    • Data structure: term -> postings (doc entries)
    • Use skip lists to speed up search
  • Merge postings from different query terms
    • Efficient relevancy score calculation
    • Use Vector Space Model for similarity calculation
    • Parallel merge algorithm

10 of 22

A Lucene Segment

[Diagram: layout of the segments file – Format, Version, NameCounter, SegCount, then <SegName, SegSize> repeated SegCount times]

11 of 22

Per Index Files

  • Segments File (“segments” file)
    • Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount
  • Lock Files
    • Several files are used to indicate that another process is using an index. These are stored in java.io.tmpdir
  • Deletable Files
    • Only used on Win32, where a file may not be deleted while it is still open
  • Compound files (“.cfs” file, just a container)
    • Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount

12 of 22

Per Segment Files

  • Each segment has the following :
    • Field names (e.g. Title, content etc.)
    • Stored Field Values ( e.g. Urls )
    • Term dictionary
      • A dictionary containing all of the terms used in all of the indexed fields of all of the documents. It also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.
    • Term Frequency data.
      • For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document.

13 of 22

File Format (contd.)

  • Term Proximity data
    • For each term in the dictionary, the positions that the term occurs in each document.
  • Term Vectors
    • For each field in each document, the term vector (sometimes called document vector) is stored. A term vector consists of term text and term frequency.
  • Normalization factors
    • For each field in each document, a value is stored that is multiplied into the score for hits on that field.
  • Deleted Documents

14 of 22

Per Segment Files (prefix = segment name)

  • Fields
      • Field Info ( .fnm file)
      • Stored Fields (.fdx and .fdt files)
  • Term Dictionary
    • Term Infos (.tis file)
      • TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
    • Term Info Index (.tii file)
      • This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access into the .tis file.
      • TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices

15 of 22

Per Segment Files (contd..)

  • Frequencies (.frq or the doc postings file)
    • Contains lists of documents which contain each term, along with the frequency of the term in that document.
    • Referenced from the .tis file via FreqDelta (delta encoded)
  • Positions (.prx file)
    • Contains list of term positions within the documents
    • Used for Span Query, Phrase Query searches
  • Normalization (.f[0-9]* files)
    • A norm file (.f<n>) for each indexed field, with one byte per document

16 of 22

Segment layout

[Diagram: the .tii index (held in memory) stores every IndexInterval-th term (Ti, Tj, …) with an IndexDelta locating it in the on-disk .tis file; a lookup does one random seek into .tis, then contiguous reads over entries <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>; FreqDelta locates the term's posting list (doc-ids, term frequencies TF1..TFd, skip data SD1..SDd/sk) in the .frq file and ProxDelta its positions in the .prx file; these postings are then used for merging]

17 of 22

Index Compression

  • Variable byte encoding
    • Needs very fast decoding (byte-aligned)
    • Given a binary representation
      • Split it into 7-bit groups (each byte holds a value 0-127)
      • The 8th (high) bit of each byte marks whether another byte follows; decode by appending 7-bit groups until it is clear
  • Front encoding of terms
      • Sorted terms commonly share a long common prefix
        • 8,automata,8,automate,9,automatic,10,automation
        • 0,automata,7,e,7,ic,7,ion (huge saving)
      • Works only if terms are sorted lexicographically

18 of 22

Search Algorithm

  • Posting Lists (from .frq file)
    • For every term Ti, we have a posting list
      • Ti -> (Doc-id,Freq(d,t)) *
      • Ti's are sorted lexicographically; each posting list is sorted by doc-id
    • Postings are delta encoded (for index compression)
  • Based on Vector Space Model
    • Conjunctive Search Algorithm (AND queries)
    • Disjunctive Search Algorithm (OR queries)

19 of 22

Vector Space Model

  • Terms constitute the space (or the axes)
  • Document/query are represented as a vector
  • Use cosine as a similarity measure (for docs)
  • Query is a short doc
  • TF-IDF weighting scheme
    • weight(t,d) = tf(t,d) * log(N/df(t)), where N = total docs and df(t) = docs containing t
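The model above can be sketched on a hypothetical toy corpus. The corpus, token lists, and helper names are illustrative only, not Lucene's scoring implementation:

```python
# Vector-space scoring sketch: documents and the query are TF-IDF
# vectors over the term axes, compared by cosine similarity.
import math
from collections import Counter

docs = [["lucene", "index", "search"],
        ["lucene", "lucene", "merge"],
        ["database", "index"]]
N = len(docs)
df = Counter(t for d in docs for t in set(d))  # document frequency per term

def tfidf(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf if t in df}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

query = tfidf(["lucene", "search"])  # the query is treated as a short doc
scores = sorted(((cosine(query, tfidf(d)), i) for i, d in enumerate(docs)),
                reverse=True)
```

Here the first document ranks highest, since it matches both query terms and "search" carries a higher IDF weight than the common term "lucene".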

20 of 22

Search Algorithm (contd.)

  • Conjunctive Search Algorithm
    • The AND merge
      • Walk through 2 postings simultaneously, O(m+n)
      • Crucial – postings sorted by doc-id
    • Use skip pointers to speed up merge
  • Disjunctive search Algorithm
      • All postings should be merged
      • Minimize per postings processing
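The AND merge above can be sketched as a two-pointer walk over two doc-id-sorted postings lists in O(m + n). This is a generic sketch; Lucene additionally follows skip pointers to advance in larger strides:

```python
# Conjunctive (AND) merge of two postings lists, each sorted by doc-id.
def and_merge(p1, p2):
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])   # doc contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller doc-id
        else:
            j += 1
    return result

print(and_merge([1, 3, 5, 8, 12], [2, 3, 8, 13]))  # [3, 8]
```

The doc-id sort order is what makes the single forward pass correct: a doc-id can only match if both pointers reach it, so the smaller side can always be advanced safely.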

21 of 22

Others..

  • Positional Queries
    • Use positional information for doc/term tuple
  • Range Queries
    • Date Range queries
  • Stemming
    • Porter Stemmer Algorithm
  • Configurable Similarity algorithms

22 of 22

Open ended questions

  • Incremental Update to docs
    • Relevant to our focused crawl (for av files)
    • Hard problem to solve
  • Deficiencies in VSM
    • Need exact query terms
    • Use auto-classification/clustering
  • Q/A