1 of 72


CS 589 Lecture 4

Monday 6:30-9:00

Kidde 228

All zoom links in Canvas

Most slides adapted from Stanford CS276

Inverted Index

photo: https://www.scubedstudios.com/information-retrieval/

2 of 72

Review of Lecture 1-2: Retrieval models

  • Vector space model
    • Use the cosine score of TF-IDF to retrieve documents

  • BM25:
    • Probability-based retrieval model, simulating the 2-Poisson model with a parameter-free approximation

  • LM-based retrieval model
    • Probability-based retrieval model; rank by p(q|d)


3 of 72

Lecture 4: Information retrieval infrastructure


4 of 72

How to make retrieval models efficient?


How does Google return results so quickly?

How does Google know cs 589 refers to a course?

How does Google know stevens = SIT?

5 of 72

Pop quiz (IR Evaluation)


IR Evaluation demo: https://drive.google.com/file/d/1-yu6UljxfnSGSTsWnXNEITal-OSu-5Qd/view?usp=sharing

6 of 72

Lecture 4: Inverted index

  • Key data structure underlying all modern IR systems
    • From systems that run on a single machine
    • To massive systems for the biggest commercial search engines

  • Exploiting the sparsity of the term-document matrix

  • The inverted index applies generally across retrieval models
    • TF-IDF, BM25, LM-based retrieval model


7 of 72

Lecture 4: Motivating example (Boolean retrieval system)

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?


8 of 72

Lecture 4: Motivating example (Boolean retrieval system)

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

  • grep -l -r “brutus” ./* for documents containing brutus, intersect with those containing caesar, then remove all documents containing calpurnia
    • Slow when the data size is large
    • Cannot be extended to more complex retrieval models (e.g., TF-IDF)


9 of 72

Lecture 4: Motivating example

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Brutus AND Caesar AND NOT Calpurnia


10 of 72

Lecture 4: Motivating example

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Brutus AND Caesar AND NOT Calpurnia
    • Brutus:            110100
    • AND Caesar:        110111
    • AND NOT Calpurnia: ~010000 = 101111
    • Result:            100100


11 of 72

Lecture 4: Motivating example

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

  • Antony and Cleopatra, Act III, Scene ii

Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,

When Antony found Julius Caesar dead,

He cried almost to roaring; and he wept

When at Philippi he found Brutus slain.

  • Hamlet, Act III, Scene ii

Lord Polonius: I did enact Julius Caesar I was killed i’ the

Capitol; Brutus killed me.


12 of 72

Lecture 4: Motivating example

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?


Can you think of similar queries?

13 of 72

Challenge in scalability

  • Consider N = 1 million documents, each with about 1000 words
    • Avg 6 bytes/word including spaces/punctuation ⇒ 6 GB of data

  • Say there are M = 500K distinct terms among these.

  • 500K x 1M matrix has half-a-trillion 0’s and 1’s
    • But it has no more than one billion 1’s (why?)

  • How do we improve the space cost for retrieval?
    • Can we record only the 1’s?


14 of 72

Reducing space with inverted index

  • For each term t, we must store a linked list of all documents that contain t.
    • Identify each doc by a docID, a document serial number

Brutus → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174

Calpurnia → 2 → 31 → 54 → 101

Caesar → 1 → 2 → 4 → 5 → 6 → 16 → 57 → 132

Can we use a fixed-size array?

15 of 72

Inverted index

  • We need size-varying postings lists
    • In memory, can use linked lists or variable length arrays
      • Some tradeoffs in memory requirements/ease of insertion (what’s the insertion complexity?)

Dictionary → Postings (each entry is a posting; lists sorted by docID, more later on why):

Brutus → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174

Calpurnia → 2 → 31 → 54 → 101

Caesar → 1 → 2 → 4 → 5 → 6 → 16 → 57 → 132
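The insertion-complexity question above can be sketched in Python (hypothetical docIDs): a sorted variable-length array gives O(log n) search but O(n) insertion, since later entries must shift right; a linked list gives O(1) insertion once the position is found, but O(n) search and worse cache locality.

```python
import bisect

# Postings kept as a sorted variable-length array (Python list).
postings = [1, 2, 4, 11, 31, 45, 173, 174]

# O(log n) to find the insertion point, O(n) to shift elements right.
bisect.insort(postings, 57)
print(postings)  # [1, 2, 4, 11, 31, 45, 57, 173, 174]
```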

16 of 72

Building inverted index

Documents to be indexed: “Friends, Romans, countrymen.”

→ Tokenizer → token stream: Friends, Romans, Countrymen

→ Linguistic modules → modified tokens: friend, roman, countryman

→ Indexer → inverted index:

friend → 2 → 4

roman → 1 → 2

countryman → 13 → 16

17 of 72

Text preprocessing for building inverted index

  • Tokenization
    • Cut character sequence into word tokens
      • Deal with “John’s”, a state-of-the-art solution
  • Normalization
    • Map text and query term to same form
      • You want U.S.A. and USA to match
  • Stemming
    • We may wish different forms of a root to match
      • authorize, authorization
  • Stop words
    • We may omit very common words (or not)
      • the, a, to, of


18 of 72

Indexer step: Tokenize sequence

  • Sequence of (Modified token, Document ID) pairs.


I did enact Julius

Caesar I was killed

i’ the Capitol;

Brutus killed me.

Doc 1

So let it be with

Caesar. The noble

Brutus hath told you

Caesar was ambitious

Doc 2

19 of 72

Indexer step: Sorting index terms

  • First sort by term, then by docID


Question: how to efficiently sort by one element then by another?
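One answer to the question above, sketched in Python: sorting (term, docID) tuples is lexicographic, so a single stable sort orders by term first and docID second.

```python
# (term, docID) pairs emitted by the parser (toy data)
pairs = [("caesar", 2), ("brutus", 1), ("caesar", 1), ("brutus", 2)]

# Tuple comparison is lexicographic: term first, then docID.
pairs.sort()
print(pairs)  # [('brutus', 1), ('brutus', 2), ('caesar', 1), ('caesar', 2)]
```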

20 of 72

Indexer step: Dictionary and postings

  • Multiple term entries in a single document are merged

  • Split into Dictionary and Postings

  • Doc. frequency information is added.


21 of 72

Where do we pay in storage?

(figure: the dictionary stores terms and counts; the postings store lists of docIDs)

IR system implementation

  • How do we index efficiently?
  • How much storage do we need?

22 of 72

Query processing for inverted index

  • Suppose we have constructed the inverted index, what query can we answer?

  • Consider processing the query: Brutus AND Caesar
    • Locate Brutus in the Dictionary;
      • Retrieve its postings.
    • Locate Caesar in the Dictionary;
      • Retrieve its postings.
    • “Merge” the two postings (intersect the document sets):

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128

Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34

23 of 72

Merging two posting lists (AND & OR)

  • Walk through the two posting lists simultaneously, in time linear in the total number of postings entries (using two pointers, without skipping O(x + y))

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128

Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
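The linear merge above can be sketched as the standard two-pointer intersection, using the Brutus/Caesar lists from the slide:

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(x + y)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

A union (OR) merge walks the lists the same way but emits every docID seen, also in O(x + y).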

24 of 72

Merging two posting lists: skipping lists (AND only)

  • Speed up the merge with skip pointers placed every k entries

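A sketch of the skip-pointer merge, assuming skips are placed every ~√n entries (one common heuristic; the slide's k is a tunable parameter):

```python
import math

def intersect_with_skips(p1, p2):
    """AND-merge two sorted postings lists, following skip pointers
    placed every ~sqrt(len) entries when they do not overshoot."""
    s1 = max(1, int(math.sqrt(len(p1))))
    s2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                while i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                    i += s1          # take skips while they don't overshoot
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                while j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                    j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 32, 64, 128],
                           [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]
```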

25 of 72

Handling OR / NOT

  • The Boolean retrieval model lets us ask any query that is a Boolean expression:
    • Boolean Queries are queries using AND, OR and NOT to join query terms
      • Views each document as a set of words
      • Is precise: document matches condition or not.
    • Perhaps the simplest model to build an IR system on
  • Primary commercial retrieval tool for 3 decades.
  • Many search systems you still use are Boolean:
    • Email, library catalog, macOS Spotlight


26 of 72

Boolean queries: handling larger size lists

  • Exercise: Adapt the merge for the queries:

Brutus AND NOT Caesar

Brutus OR NOT Caesar

  • Can we still run through the merge in time O(x+y)? What can we achieve?


27 of 72

What about arbitrary Boolean formula?

(Brutus OR Caesar) AND NOT

(Antony OR Cleopatra)

  • Can we always merge in “linear” time?

  • Consider a query that is an AND of n terms, what is the best way of processing?

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128

Caesar → 1 → 2 → 3 → 5 → 8 → 16 → 21 → 34

Calpurnia → 13 → 16

Query: Brutus AND Calpurnia AND Caesar

28 of 72

What about arbitrary Boolean formula?

  • Process in order of increasing freq:
    • start with smallest set, then keep cutting further.

Query: (Calpurnia AND Brutus) AND Caesar

This is why we kept document freq. in the dictionary.

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128

Caesar → 1 → 2 → 3 → 5 → 8 → 16 → 21 → 34

Calpurnia → 13 → 16
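The increasing-frequency heuristic above can be sketched as follows, using the postings from the slide:

```python
def and_query(terms, index):
    """AND of n terms: start from the rarest term, keep cutting."""
    terms = sorted(terms, key=lambda t: len(index[t]))  # increasing doc freq
    result = index[terms[0]]
    for t in terms[1:]:
        docs = set(index[t])
        result = [d for d in result if d in docs]
        if not result:          # intermediate result is empty: stop early
            break
    return result

index = {
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "calpurnia": [13, 16],
}
print(and_query(["brutus", "calpurnia", "caesar"], index))  # [16]
```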

29 of 72

Query processing exercise

  • Recommend a query processing order for

(tangerine OR trees) AND

(marmalade OR skies) AND

(kaleidoscope OR eyes)

  • Which terms should we process first?


30 of 72

Phrase indexing

  • We want to be able to answer queries such as “Stevens Institute of Technology” – as a phrase

  • Sentence 1 = “I went to Stevens Institute of Technology”

  • Sentence 2 = “The Technology Institute that Steve Harvey went to.”

  • For this, it no longer suffices to store only <term : docs> entries


31 of 72

A first attempt: bi-gram indexing

  • Index every consecutive pair of terms in the text as a phrase

  • For example, the text “Friends, Romans, Countrymen” would generate the bigrams
    • friends romans
    • romans countrymen

  • Each of these bigrams is now a dictionary term

  • Two-word phrase query-processing is now immediate.


32 of 72

Issues with bi-gram indexing

  • Index blowup due to bigger dictionary
    • Infeasible for more than bigrams, big even for them

  • Bigram indexes are not the standard solution (for all bigrams) but can be part of a compound strategy


33 of 72

Solution 2: Positional indexing

  • In the postings, store, for each term the position(s) in which tokens of it appear:


<be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434;
  5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

34 of 72

Solution 2: Positional indexing

  • For phrase queries, we use a merge algorithm recursively at the document level

  • But we now need to deal with more than just equality


<be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434;
  5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

35 of 72

Proximity search

  • Extract inverted index entries for each distinct term: to, be, or, not.
  • Merge their doc:position lists to enumerate all positions with “to be or not to be”.
    • to:
      • 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
    • be:
      • 1:17,19; 4:17,191,291,430,434; 5:14,19,101; …

  • Proximity query: to /5 be /5 or /5 not
    • Here, /k means “within k words of”.


36 of 72

Proximity search

d1: To be or not to be, that is the question.

d2: may not be a question, but to my or his understanding, ...

to: 1:[0,4], 2: [6]

be: 1:[1,5], 2: [2]

or: 1:[2], 2:[8]

not: 1:[3], 2:[1]

that: 1: [6], …
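Using positional postings shaped like the ones above (term → {docID: positions}), a phrase match can be sketched as a document-level intersect followed by a position-offset check; the data is taken from the d1/d2 example:

```python
def phrase_match(pos_index, phrase):
    """Return docIDs where the phrase's terms occur at consecutive positions."""
    terms = phrase.split()
    common = set(pos_index[terms[0]])          # docs containing every term
    for t in terms[1:]:
        common &= set(pos_index[t])
    hits = []
    for d in sorted(common):
        starts = set(pos_index[terms[0]][d])   # candidate start positions
        for k, t in enumerate(terms[1:], start=1):
            starts &= {p - k for p in pos_index[t][d]}  # term k must sit k ahead
        if starts:
            hits.append(d)
    return hits

pos_index = {
    "to":  {1: [0, 4], 2: [6]},
    "be":  {1: [1, 5], 2: [2]},
    "or":  {1: [2], 2: [8]},
    "not": {1: [3], 2: [1]},
}
print(phrase_match(pos_index, "to be or not to be"))  # [1]
```

A /k proximity merge is the same idea, but tests |p1 − p2| ≤ k instead of an exact offset.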


37 of 72

Positional indexing size

  • A positional index expands postings storage substantially
    • Even though indices can be compressed

  • Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system

  • A positional index is 2–4 times as large as a non-positional index

  • Positional index size 35–50% of volume of original text


38 of 72

Index compression (for non-positional indexing)

  • Compressing the posting pointer table
  • Compressing the posting lists
  • Speeding up the dictionary search with trie


39 of 72

Compressing the postings pointer table

  • Most of the space in the table is wasted
    • Most words < 20 bytes
    • Table storage = 28N


40 of 72

Compressing the postings pointer table

  • Concatenate the dictionary as one string
    • Table storage 28N -> 11N

  • How to further improve the storage space?
    • Instead of storing absolute term pointers, store the gaps


41 of 72

Compressing the postings pointer table

  • Save more space by skipping (k-1) pointers for every k pointers
    • Recover the skipped pointers by adding the length of words

  • Table storage is further reduced to 8N + 3N × (3+k)/(3k). When k=4, the storage required is 8N + 3N × 7/12 = 9.75N < 11N

  • Trade-off between saving space (skipping more) vs. saving time (skipping less)


42 of 72

Compressing the posting lists

  • Observations of posting files
    • Instead of storing docIDs, store the gaps between consecutive docIDs
    • Brutus gaps: 2, 4, 8, 3, 4, 5, 15
    • Binary seq: 10, 100, 1000, 11, 100, 101, 1111

  • Prefix encoding
    • Binary encoding such that the sequence can be uniquely decoded (why do we need this uniqueness?)
    • e.g., Huffman encoding
    • Unary encoding: {2:110,4:11110, …}
    • A uniquely decodable seq: 110111101111111101110…


43 of 72

Compressing the posting lists


44 of 72

Compressing the posting lists

  • Unary encoding is too long!

  • Gamma code of 13: 1110,101
    • Unary code for {length − 1} followed by 0: len(1101) − 1 = 3 → 1110
    • Offset (the binary value minus its leading 1): 13 = 1101 → 101

  • What is the gamma code of 5? 5 = 101 → 110,01

  • We can prove gamma code is uniquely decodable!

  • Gamma code compression rate: 11.7%
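The gamma code construction above can be sketched directly; decoding shows why the prefix-free property makes a concatenated sequence uniquely decodable:

```python
def gamma_encode(n):
    """Elias gamma: unary(length - 1) terminated by 0, then the offset."""
    b = bin(n)[2:]                       # e.g., 13 -> '1101'
    return "1" * (len(b) - 1) + "0" + b[1:]

def gamma_decode(bits):
    """Decode a concatenation of gamma codes (prefix-free, so unambiguous)."""
    out, i = [], 0
    while i < len(bits):
        l = 0
        while bits[i] == "1":            # read the unary length prefix
            l += 1; i += 1
        i += 1                           # skip the terminating 0
        out.append(int("1" + bits[i:i + l], 2))  # restore the leading 1
        i += l
    return out

print(gamma_encode(13))                               # '1110101'
print(gamma_decode(gamma_encode(13) + gamma_encode(5)))  # [13, 5]
```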


45 of 72

Speeding up the dictionary search


46 of 72

Speeding up the dictionary search with prefix tree

  • Time complexity for searching/insertion:
    • BST: O(m * log n), m is the maximum word length, n is the number of words in the vocabulary
    • Prefix-tree: O(m), m is the maximum word length

(figure: a prefix tree storing car, cat, do, don)
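A minimal prefix-tree sketch over the slide's words, showing why both search and insertion cost O(m) independent of vocabulary size:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False

def insert(root, word):
    """O(m) insertion, m = word length."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def search(root, word):
    """O(m) lookup, independent of the number of words stored."""
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_word

root = TrieNode()
for w in ["car", "cat", "do", "don"]:
    insert(root, w)
print(search(root, "cat"), search(root, "ca"))  # True False
```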

47 of 72

Implementing TF-IDF Inverted Index

  • Resources: blog post, Colab notebook, GitHub repo


48 of 72

Vector Database

  • Represent each query and document by vector

  • Hierarchical Navigable Small World (HNSW) graphs


49 of 72

Probabilistic Skip List


HNSW is an extension of the probabilistic skip list

50 of 72

Nearest Neighbor Search

  • Repeatedly move to a closer point, until we cannot get any closer
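The greedy routine above can be sketched on a toy proximity graph (the coordinates and edges below are hypothetical illustration data):

```python
import math

def greedy_search(graph, coords, query, start):
    """Greedy routing: hop to the neighbor closest to the query; stop when
    no neighbor improves (which may be only a local minimum)."""
    def dist(node, point):
        return math.dist(coords[node], point)
    current = start
    while True:
        nearest = min(graph[current], key=lambda v: dist(v, query),
                      default=current)
        if dist(nearest, query) >= dist(current, query):
            return current            # no neighbor is closer: stop
        current = nearest

# toy graph: node -> neighbors, node -> 2-D coordinates
coords = {0: (0, 0), 1: (2, 0), 2: (4, 1), 3: (5, 5)}
graph  = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search(graph, coords, (4.2, 0.8), start=0))  # 2
```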


51 of 72

KNN

  • Greedy search can get stuck at a local minimum


52 of 72

How to solve the local-optimum problem:

  • Delaunay graph: make nearest-neighbor links mutual (rather than letting a few hubs dominate)
  • Multi-search: restart the greedy search from several entry points


53 of 72

How to solve the local-optimum problem:


54 of 72

Delaunay graph: make sure every node is connected


55 of 72


56 of 72

Hierarchical Navigable Small World


57 of 72

Navigable Small World

  • Consecutive insertions of elements in random order

  • Construction phase can be efficiently parallelized

  • Suffers from polylogarithmic search complexity; on low-dimensional datasets NSW loses to tree-based algorithms by orders of magnitude


58 of 72

Hierarchical Navigable Small World (HNSW) graphs


source: https://www.pinecone.io/learn/series/faiss/hnsw/

greedy-routing search process

High-degree vertices have many links, whereas low-degree vertices have very few links.

59 of 72

HNSW: Using Multiple NSW


source: https://www.pinecone.io/learn/series/faiss/hnsw/

The search process through the multi-layer structure of an HNSW graph.

Explanation of the number of links assigned to each vertex and the effect of M, M_max, and M_max0.

60 of 72

How do we build layers?


61 of 72

How do we build layers?


62 of 72

HNSW vs baselines


63 of 72

Lecture 4 Summary

  • Motivation: solving the challenge in scalability
    • The term-document matrix is mostly 0’s; record only the 1’s to save storage

  • Query time improvement:
    • Linear time merge algorithm
    • Reducing time cost by skipping
    • Process query in order of increasing frequency

  • Index time improvement:
    • Compressing the posting pointer table, 28N -> 11N -> 9.75N
    • Compressing the posting list: storing gaps, prefix encoding
    • Speeding up dictionary search with prefix tree


64 of 72

Handling web scale indexing

  • Web-scale indexing must use clusters of servers
    • Google had 1 million servers in 2011

  • Fault tolerance of a massive data center
    • If a non-fault-tolerant system has 1000 nodes, each with 99.9% uptime, then one or more servers is down 63% of the time (1 − 0.999^1000 ≈ 0.63)

  • Solution
    • Maintain a “master” server
    • Break indexing into parallel tasks
    • Assign each task to an idle machine


65 of 72

Map-reduce


  • Master assigns a split to an idle machine
  • Parser emits (term, docID) pairs
  • Partitions are merged in the inverter
  • Complete the index
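The flow above can be sketched as a single-process MapReduce analogue (hypothetical toy documents): map emits (term, docID) pairs, shuffle groups them by term, reduce sorts each group into a postings list.

```python
from collections import defaultdict

docs = {1: "i did enact julius caesar", 2: "so let it be with caesar"}

# Map phase: the parser emits (term, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items()
         for term in text.split()]

# Shuffle phase: group pairs by term.
groups = defaultdict(list)
for term, doc_id in pairs:
    groups[term].append(doc_id)

# Reduce phase (inverter): sorted, deduplicated postings per term.
index = {t: sorted(set(ids)) for t, ids in groups.items()}
print(index["caesar"])  # [1, 2]
```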

66 of 72

Examples of map-reduce


67 of 72

MapReduce: Industry practice

  • Term partition vs. document partition
    • Term-partitioned: one machine handles a subrange of terms
    • Document-partitioned: one machine handles a subrange of documents

  • Most industry search engines use a document-partitioned index
    • Better load balancing (why?)


68 of 72

Logarithmic dynamic indexing

https://nlp.stanford.edu/IR-book/html/htmledition/dynamic-indexing-1.html

(figure: logarithmic merging, in which a small in-memory index Z0 is repeatedly merged into a series of on-disk indexes I0, I1, I2, I3, each holding twice as many postings as the previous)

69 of 72

Real time search of Twitter

  • Requires real-time search
    • Low latency, high throughput query evaluation
    • High ingestion rate and immediate data availability
    • Concurrent reads and writes of the index

  • Solution: using segments
    • Each segment consists of 2^32 tweets (in memory)
    • New posts are appended to the posting lists
    • Only one segment can be written to at each time


70 of 72

Search engine tools

  • Apache Lucene
    • Free and open-source search engine library
    • First developed in 1999

  • ElasticSearch
    • A search engine based on Lucene


71 of 72

ElasticSearch

  • Accessed via a REST API


72 of 72

Homework 2: Using ElasticSearch to build a search engine

  • Build an inverted index

  • Evaluate three search algorithms’ performance
    • TF-IDF
    • BM25
    • Dirichlet-LM
