1 of 72


CS 589 Lecture 4

Monday 6:30-9:00

Kidde 228

All zoom links in Canvas

Most slides adapted from Stanford CS276

Inverted Index

photo: https://www.scubedstudios.com/information-retrieval/

2 of 72

Review of Lecture 1-2: Retrieval models

  • Vector space model
    • Use the cosine score of TF-IDF to retrieve documents

  • BM25:
    • Probability-based retrieval model, simulating the 2-Poisson model with a parameter-free approximation

  • LM-based retrieval model
    • Probability-based retrieval model; rank by p(q|d)


3 of 72

Lecture 4: Information retrieval infrastructure


4 of 72

How to make retrieval models efficient?


How does Google return results so quickly?

How does Google know cs 589 refers to a course?

How does Google know stevens = SIT?

5 of 72

Pop quiz (IR Evaluation)


IR Evaluation demo: https://drive.google.com/file/d/1-yu6UljxfnSGSTsWnXNEITal-OSu-5Qd/view?usp=sharing

6 of 72

Lecture 4: Inverted index

  • Key data structure underlying all modern IR systems
    • From systems that run on a single machine
    • To massive systems for the biggest commercial search engines

  • Exploiting the sparsity of the term-document matrix

  • The inverted index applies generally across retrieval models
    • TF-IDF, BM25, LM-based retrieval model


7 of 72

Lecture 4: Motivating example (Boolean retrieval system)

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?


8 of 72

Lecture 4: Motivating example (Boolean retrieval system)

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

  • grep -l -r “brutus” ./* for documents containing brutus, intersect with those containing caesar, then remove all documents containing calpurnia
    • Slow when the data size is large
    • Cannot be extended to more complex retrieval models (e.g., TF-IDF)


9 of 72

Lecture 4: Motivating example

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Brutus AND Caesar AND NOT Calpurnia


10 of 72

Lecture 4: Motivating example

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Brutus AND Caesar AND NOT Calpurnia
    • Brutus:            110100
    • AND Caesar:        110111
    • AND NOT Calpurnia: ~010000 = 101111
    • Result:            100100


11 of 72

Lecture 4: Motivating example

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

  • Antony and Cleopatra, Act III, Scene ii

Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,

When Antony found Julius Caesar dead,

He cried almost to roaring; and he wept

When at Philippi he found Brutus slain.

  • Hamlet, Act III, Scene ii

Lord Polonius: I did enact Julius Caesar I was killed i’ the

Capitol; Brutus killed me.


12 of 72

Lecture 4: Motivating example

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?


Can you think of similar queries?

13 of 72

Challenge in scalability

  • Consider N = 1 million documents, each with about 1000 words
    • Avg 6 bytes/word including spaces/punctuation ⇒ 6 GB of data

  • Say there are M = 500K distinct terms among these.

  • 500K x 1M matrix has half-a-trillion 0’s and 1’s
    • But it has no more than one billion 1’s (why?)

  • How do we improve the space cost for retrieval?
    • Can we record only the 1’s?


14 of 72

Reducing space with inverted index

  • For each term t, we must store a linked list of all documents that contain t.
    • Identify each doc by a docID, a document serial number

Brutus → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174

Calpurnia → 2 → 31 → 54 → 101

Caesar → 1 → 2 → 4 → 5 → 6 → 16 → 57 → 132

Can we use a fixed-size array?

15 of 72

Inverted index

  • We need size-varying postings lists
    • In memory, can use linked lists or variable length arrays
      • Some tradeoffs in memory requirements/ease of insertion (what’s the insertion complexity?)

Dictionary → Postings (each entry is a posting; lists sorted by docID, more later on why):

Brutus → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174

Calpurnia → 2 → 31 → 54 → 101

Caesar → 1 → 2 → 4 → 5 → 6 → 16 → 57 → 132
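The insertion-complexity question above can be sketched in Python (hypothetical docIDs): a sorted variable-length array gives O(log n) search but O(n) insertion, since later entries must shift right; a linked list gives O(1) insertion once the position is found, but O(n) search and worse cache locality.

```python
import bisect

# Postings kept as a sorted variable-length array (Python list).
postings = [1, 2, 4, 11, 31, 45, 173, 174]

# O(log n) to find the insertion point, O(n) to shift elements right.
bisect.insort(postings, 57)
print(postings)  # [1, 2, 4, 11, 31, 45, 57, 173, 174]
```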

16 of 72

Building inverted index

Documents to be indexed: “Friends, Romans, countrymen.”

→ Tokenizer → token stream: Friends, Romans, Countrymen

→ Linguistic modules → modified tokens: friend, roman, countryman

→ Indexer → inverted index:

friend → 2 → 4

roman → 1 → 2

countryman → 13 → 16

17 of 72

Text preprocessing for building inverted index

  • Tokenization
    • Cut character sequence into word tokens
      • Deal with “John’s”, a state-of-the-art solution
  • Normalization
    • Map text and query term to same form
      • You want U.S.A. and USA to match
  • Stemming
    • We may wish different forms of a root to match
      • authorize, authorization
  • Stop words
    • We may omit very common words (or not)
      • the, a, to, of


18 of 72

Indexer step: Tokenize sequence

  • Sequence of (Modified token, Document ID) pairs.


I did enact Julius

Caesar I was killed

i’ the Capitol;

Brutus killed me.

Doc 1

So let it be with

Caesar. The noble

Brutus hath told you

Caesar was ambitious

Doc 2

19 of 72

Indexer step: Sorting index terms

  • First sort by term, then by docID


Question: how to efficiently sort by one element then by another?
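One answer to the question above, sketched in Python: sorting (term, docID) tuples is lexicographic, so a single stable sort orders by term first and docID second.

```python
# (term, docID) pairs emitted by the parser (toy data)
pairs = [("caesar", 2), ("brutus", 1), ("caesar", 1), ("brutus", 2)]

# Tuple comparison is lexicographic: term first, then docID.
pairs.sort()
print(pairs)  # [('brutus', 1), ('brutus', 2), ('caesar', 1), ('caesar', 2)]
```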

20 of 72

Indexer step: Dictionary and postings

  • Multiple term entries in a single document are merged

  • Split into Dictionary and Postings

  • Doc. frequency information is added.


21 of 72

Where do we pay in storage?

(figure: the dictionary stores terms and counts; the postings store lists of docIDs)

IR system implementation

  • How do we index efficiently?
  • How much storage do we need?

22 of 72

Query processing for inverted index

  • Suppose we have constructed the inverted index, what query can we answer?

  • Consider processing the query: Brutus AND Caesar
    • Locate Brutus in the Dictionary;
      • Retrieve its postings.
    • Locate Caesar in the Dictionary;
      • Retrieve its postings.
    • “Merge” the two postings (intersect the document sets):

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128

Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34

23 of 72

Merging two posting lists (AND & OR)

  • Walk through the two posting lists simultaneously, in time linear in the total number of postings entries (using two pointers, without skipping O(x + y))

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128

Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
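The linear merge above can be sketched as the standard two-pointer intersection, using the Brutus/Caesar lists from the slide:

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(x + y)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

A union (OR) merge walks the lists the same way but emits every docID seen, also in O(x + y).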

24 of 72

Merging two posting lists: skipping lists (AND only)

  • Speed up the merge with skip pointers placed every k entries

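A sketch of the skip-pointer merge, assuming skips are placed every ~√n entries (one common heuristic; the slide's k is a tunable parameter):

```python
import math

def intersect_with_skips(p1, p2):
    """AND-merge two sorted postings lists, following skip pointers
    placed every ~sqrt(len) entries when they do not overshoot."""
    s1 = max(1, int(math.sqrt(len(p1))))
    s2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                while i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                    i += s1          # take skips while they don't overshoot
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                while j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                    j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 32, 64, 128],
                           [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]
```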

25 of 72

Handling OR / NOT

  • The Boolean retrieval model lets us ask any query that is a Boolean expression:
    • Boolean Queries are queries using AND, OR and NOT to join query terms
      • Views each document as a set of words
      • Is precise: document matches condition or not.
    • Perhaps the simplest model to build an IR system on
  • Primary commercial retrieval tool for 3 decades.
  • Many search systems you still use are Boolean:
    • Email, library catalog, macOS Spotlight


26 of 72

Boolean queries: handling larger size lists

  • Exercise: Adapt the merge for the queries:

Brutus AND NOT Caesar

Brutus OR NOT Caesar

  • Can we still run through the merge in time O(x+y)? What can we achieve?


27 of 72

What about arbitrary Boolean formula?

(Brutus OR Caesar) AND NOT

(Antony OR Cleopatra)

  • Can we always merge in “linear” time?

  • Consider a query that is an AND of n terms, what is the best way of processing?

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128

Caesar → 1 → 2 → 3 → 5 → 8 → 16 → 21 → 34

Calpurnia → 13 → 16

Query: Brutus AND Calpurnia AND Caesar

28 of 72

What about arbitrary Boolean formula?

  • Process in order of increasing freq:
    • start with smallest set, then keep cutting further.

Query: (Calpurnia AND Brutus) AND Caesar

This is why we kept document freq. in the dictionary.

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128

Caesar → 1 → 2 → 3 → 5 → 8 → 16 → 21 → 34

Calpurnia → 13 → 16
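The increasing-frequency heuristic above can be sketched as follows, using the postings from the slide:

```python
def and_query(terms, index):
    """AND of n terms: start from the rarest term, keep cutting."""
    terms = sorted(terms, key=lambda t: len(index[t]))  # increasing doc freq
    result = index[terms[0]]
    for t in terms[1:]:
        docs = set(index[t])
        result = [d for d in result if d in docs]
        if not result:          # intermediate result is empty: stop early
            break
    return result

index = {
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "calpurnia": [13, 16],
}
print(and_query(["brutus", "calpurnia", "caesar"], index))  # [16]
```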

29 of 72

Query processing exercise

  • Recommend a query processing order for

(tangerine OR trees) AND

(marmalade OR skies) AND

(kaleidoscope OR eyes)

  • Which terms should we process first?


30 of 72

Phrase indexing

  • We want to be able to answer queries such as “Stevens Institute of Technology” – as a phrase

  • Sentence 1 = “I went to Stevens Institute of Technology”

  • Sentence 2 = “The Technology Institute that Steve Harvey went to.”

  • For this, it no longer suffices to store only <term : docs> entries


31 of 72

A first attempt: bi-gram indexing

  • Index every consecutive pair of terms in the text as a phrase

  • For example, the text “Friends, Romans, Countrymen” would generate the bigrams
    • friends romans
    • romans countrymen

  • Each of these bigrams is now a dictionary term

  • Two-word phrase query-processing is now immediate.


32 of 72

Issues with bi-gram indexing

  • Index blowup due to bigger dictionary
    • Infeasible for more than bigrams, big even for them

  • Bigram indexes are not the standard solution (for all bigrams) but can be part of a compound strategy


33 of 72

Solution 2: Positional indexing

  • In the postings, store, for each term the position(s) in which tokens of it appear:


<be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434;
  5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

34 of 72

Solution 2: Positional indexing

  • For phrase queries, we use a merge algorithm recursively at the document level

  • But we now need to deal with more than just equality


<be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434;
  5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

35 of 72

Proximity search

  • Extract inverted index entries for each distinct term: to, be, or, not.
  • Merge their doc:position lists to enumerate all positions with “to be or not to be”.
    • to:
      • 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
    • be:
      • 1:17,19; 4:17,191,291,430,434; 5:14,19,101; …

  • Proximity query: to /5 be /5 or /5 not
    • Here, /k means “within k words of”.


36 of 72

Proximity search

d1: To be or not to be, that is the question.

d2: may not be a question, but to my or his understanding, ...

to: 1:[0,4], 2: [6]

be: 1:[1,5], 2: [2]

or: 1:[2], 2:[8]

not: 1:[3], 2:[1]

that: 1: [6], …
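Using positional postings shaped like the ones above (term → {docID: positions}), a phrase match can be sketched as a document-level intersect followed by a position-offset check; the data is taken from the d1/d2 example:

```python
def phrase_match(pos_index, phrase):
    """Return docIDs where the phrase's terms occur at consecutive positions."""
    terms = phrase.split()
    common = set(pos_index[terms[0]])          # docs containing every term
    for t in terms[1:]:
        common &= set(pos_index[t])
    hits = []
    for d in sorted(common):
        starts = set(pos_index[terms[0]][d])   # candidate start positions
        for k, t in enumerate(terms[1:], start=1):
            starts &= {p - k for p in pos_index[t][d]}  # term k must sit k ahead
        if starts:
            hits.append(d)
    return hits

pos_index = {
    "to":  {1: [0, 4], 2: [6]},
    "be":  {1: [1, 5], 2: [2]},
    "or":  {1: [2], 2: [8]},
    "not": {1: [3], 2: [1]},
}
print(phrase_match(pos_index, "to be or not to be"))  # [1]
```

A /k proximity merge is the same idea, but tests |p1 − p2| ≤ k instead of an exact offset.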


37 of 72

Positional indexing size

  • A positional index expands postings storage substantially
    • Even though indices can be compressed

  • Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system

  • A positional index is 2–4 times as large as a non-positional index

  • Positional index size 35–50% of volume of original text


38 of 72

Index compression (for non-positional indexing)

  • Compressing the posting pointer table
  • Compressing the posting lists
  • Speeding up the dictionary search with trie


39 of 72

Compressing the postings pointer table

  • Most of the space in the table is wasted
    • Most words < 20 bytes
    • Table storage = 28N


40 of 72

Compressing the postings pointer table

  • Concatenate the dictionary as one string
    • Table storage 28N -> 11N

  • How to further improve the storage space?
    • Instead of storing absolute term pointers, store the gaps


41 of 72

Compressing the postings pointer table

  • Save more space by skipping (k-1) pointers for every k pointers
    • Recover the skipped pointers by adding the length of words

  • Table storage is further reduced to 8N + 3N × (3+k)/(3k). When k=4, the storage required is 8N + 3N × 7/12 = 9.75N < 11N

  • Trade-off between saving space (skipping more) vs. saving time (skipping less)


42 of 72

Compressing the posting lists

  • Observations of posting files
    • Instead of storing docIDs, store the gaps between consecutive docIDs
    • Brutus gaps: 2, 4, 8, 3, 4, 5, 15
    • Binary seq: 10, 100, 1000, 11, 100, 101, 1111

  • Prefix encoding
    • Binary encoding such that the sequence can be uniquely decoded (why do we need this uniqueness?)
    • e.g., Huffman encoding
    • Unary encoding: {2:110,4:11110, …}
    • A uniquely decodable seq: 110111101111111101110…


43 of 72

Compressing the posting lists


44 of 72

Compressing the posting lists

  • Unary encoding is too long!

  • Gamma code of 13: 1110,101
    • Unary code for {length − 1} followed by 0: len(1101) − 1 = 3 → 1110
    • Offset (the binary value minus its leading 1): 13 = 1101 → 101

  • What is the gamma code of 5? 5 = 101 → 110,01

  • We can prove gamma code is uniquely decodable!

  • Gamma code compression rate: 11.7%
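The gamma code construction above can be sketched directly; decoding shows why the prefix-free property makes a concatenated sequence uniquely decodable:

```python
def gamma_encode(n):
    """Elias gamma: unary(length - 1) terminated by 0, then the offset."""
    b = bin(n)[2:]                       # e.g., 13 -> '1101'
    return "1" * (len(b) - 1) + "0" + b[1:]

def gamma_decode(bits):
    """Decode a concatenation of gamma codes (prefix-free, so unambiguous)."""
    out, i = [], 0
    while i < len(bits):
        l = 0
        while bits[i] == "1":            # read the unary length prefix
            l += 1; i += 1
        i += 1                           # skip the terminating 0
        out.append(int("1" + bits[i:i + l], 2))  # restore the leading 1
        i += l
    return out

print(gamma_encode(13))                               # '1110101'
print(gamma_decode(gamma_encode(13) + gamma_encode(5)))  # [13, 5]
```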


45 of 72

Speeding up the dictionary search


46 of 72

Speeding up the dictionary search with prefix tree

  • Time complexity for searching/insertion:
    • BST: O(m * log n), m is the maximum word length, n is the number of words in the vocabulary
    • Prefix-tree: O(m), m is the maximum word length

(figure: a prefix tree storing car, cat, do, don)
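A minimal prefix-tree sketch over the slide's words, showing why both search and insertion cost O(m) independent of vocabulary size:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False

def insert(root, word):
    """O(m) insertion, m = word length."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def search(root, word):
    """O(m) lookup, independent of the number of words stored."""
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_word

root = TrieNode()
for w in ["car", "cat", "do", "don"]:
    insert(root, w)
print(search(root, "cat"), search(root, "ca"))  # True False
```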

47 of 72

Implementing TF-IDF Inverted Index

  • Resources: blog post, Colab notebook, GitHub repo


48 of 72

Vector Database

  • Represent each query and document by vector

  • Hierarchical Navigable Small World (HNSW) graphs


49 of 72

Probabilistic Skip List


HNSW is an extension of the probabilistic skip list

50 of 72

Nearest Neighbor Search

  • Repeatedly move to a closer point, until we cannot get any closer
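The greedy routine above can be sketched on a toy proximity graph (the coordinates and edges below are hypothetical illustration data):

```python
import math

def greedy_search(graph, coords, query, start):
    """Greedy routing: hop to the neighbor closest to the query; stop when
    no neighbor improves (which may be only a local minimum)."""
    def dist(node, point):
        return math.dist(coords[node], point)
    current = start
    while True:
        nearest = min(graph[current], key=lambda v: dist(v, query),
                      default=current)
        if dist(nearest, query) >= dist(current, query):
            return current            # no neighbor is closer: stop
        current = nearest

# toy graph: node -> neighbors, node -> 2-D coordinates
coords = {0: (0, 0), 1: (2, 0), 2: (4, 1), 3: (5, 5)}
graph  = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search(graph, coords, (4.2, 0.8), start=0))  # 2
```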


51 of 72

KNN

  • Greedy search can get stuck at a local minimum


52 of 72

How to solve the local-optimum problem:

  • Delaunay graph: make nearest-neighbor links mutual (rather than letting a few hubs dominate)
  • Multi-search: restart the greedy search from several entry points


53 of 72

How to solve the local-optimum problem:


54 of 72

Delaunay graph: make sure every node is connected


55 of 72


56 of 72

Hierarchical Navigable Small World


57 of 72

Navigable Small World

  • Consecutive insertions of elements in random order

  • Construction phase can be efficiently parallelized

  • Suffers from polylogarithmic search complexity; on low-dimensional datasets NSW loses to tree-based algorithms by orders of magnitude


58 of 72

Hierarchical Navigable Small World (HNSW) graphs


source: https://www.pinecone.io/learn/series/faiss/hnsw/

greedy-routing search process

High-degree vertices have many links, whereas low-degree vertices have very few links.

59 of 72

HNSW: Using Multiple NSW


source: https://www.pinecone.io/learn/series/faiss/hnsw/

The search process through the multi-layer structure of an HNSW graph.

Explanation of the number of links assigned to each vertex and the effect of M, M_max, and M_max0.

60 of 72

How do we build layers?


61 of 72

How do we build layers?


62 of 72

HNSW vs baselines


63 of 72

Lecture 4 Summary

  • Motivation: solving the challenge in scalability
    • The term-document matrix is mostly 0’s; record only the 1’s to save storage

  • Query time improvement:
    • Linear time merge algorithm
    • Reducing time cost by skipping
    • Process query in order of increasing frequency

  • Index time improvement:
    • Compressing the posting pointer table, 28N -> 11N -> 9.75N
    • Compressing the posting list: storing gaps, prefix encoding
    • Speeding up dictionary search with prefix tree


64 of 72

Handling web scale indexing

  • Web-scale indexing must use clusters of servers
    • Google had 1 million servers in 2011

  • Fault tolerance of a massive data center
    • If a non-fault-tolerant system has 1000 nodes, each with 99.9% uptime, then one or more servers is down 63% of the time (1 − 0.999^1000 ≈ 0.63)

  • Solution
    • Maintain a “master” server
    • Break indexing into parallel tasks
    • Assign each task to an idle machine


65 of 72

Map-reduce


  • Master assigns a split to an idle machine
  • Parser emits (term, docID) pairs
  • Partitions are merged in the inverter
  • Complete the index
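The flow above can be sketched as a single-process MapReduce analogue (hypothetical toy documents): map emits (term, docID) pairs, shuffle groups them by term, reduce sorts each group into a postings list.

```python
from collections import defaultdict

docs = {1: "i did enact julius caesar", 2: "so let it be with caesar"}

# Map phase: the parser emits (term, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items()
         for term in text.split()]

# Shuffle phase: group pairs by term.
groups = defaultdict(list)
for term, doc_id in pairs:
    groups[term].append(doc_id)

# Reduce phase (inverter): sorted, deduplicated postings per term.
index = {t: sorted(set(ids)) for t, ids in groups.items()}
print(index["caesar"])  # [1, 2]
```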

66 of 72

Examples of map-reduce


67 of 72

MapReduce: Industry practice

  • Term partition vs. document partition
    • Term-partitioned: one machine handles a subrange of terms
    • Document-partitioned: one machine handles a subrange of documents

  • Most industry search engines use a document-partitioned index
    • Better load balancing (why?)


68 of 72

Logarithmic dynamic indexing

https://nlp.stanford.edu/IR-book/html/htmledition/dynamic-indexing-1.html

(figure: logarithmic merging, in which a small in-memory index Z0 is repeatedly merged into a series of on-disk indexes I0, I1, I2, I3, each holding twice as many postings as the previous)

69 of 72

Real time search of Twitter

  • Requires real-time search
    • Low latency, high throughput query evaluation
    • High ingestion rate and immediate data availability
    • Concurrent reads and writes of the index

  • Solution: using segments
    • Each segment consists of 2^32 tweets (in memory)
    • New posts are appended to the posting lists
    • Only one segment can be written to at each time


70 of 72

Search engine tools

  • Apache Lucene
    • Free and open-source search engine library
    • First developed in 1999

  • ElasticSearch
    • A search engine based on Lucene


71 of 72

ElasticSearch

  • Accessed via a REST API


72 of 72

Homework 2: Using ElasticSearch to build a search engine

  • Build an inverted index

  • Evaluate three search algorithms’ performance
    • TF-IDF
    • BM25
    • Dirichlet-LM
