CS 589 Lecture 4
Monday 6:30-9:00
Kidde 228
All zoom links in Canvas
Most slides adapted from Stanford CS276
Inverted Index
photo: https://www.scubedstudios.com/information-retrieval/
Review of Lectures 1-2: Retrieval models
Lecture 4: Information retrieval infrastructure
How to make retrieval models efficient?
How does Google return results so quickly?
How does Google know cs 589 refers to a course?
How does Google know stevens = SIT?
Pop quiz (IR Evaluation)
IR Evaluation demo: https://drive.google.com/file/d/1-yu6UljxfnSGSTsWnXNEITal-OSu-5Qd/view?usp=sharing
Lecture 4: Inverted index
Lecture 4: Motivating example (Boolean retrieval system)
Lecture 4: Motivating example
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.
Can you think of similar queries?
The challenge of scalability
Reducing space with inverted index
Brutus → 1, 2, 4, 11, 31, 45, 173, 174
Calpurnia → 2, 31, 54, 101
Caesar → 1, 2, 4, 5, 6, 16, 57, 132
Can we use fixed-sized arrays?
Inverted index
Dictionary (the terms) and postings (the lists of docIDs); each docID in a list is called a posting.
Postings are sorted by docID (more later on why).
Brutus → 1, 2, 4, 11, 31, 45, 173, 174
Calpurnia → 2, 31, 54, 101
Caesar → 1, 2, 4, 5, 6, 16, 57, 132
Building inverted index
Documents to be indexed: "Friends, Romans, countrymen."
→ Tokenizer → token stream: Friends, Romans, Countrymen
→ Linguistic modules → modified tokens: friend, roman, countryman
→ Indexer → inverted index: friend → 2, 4; roman → 1, 2; countryman → 13, 16
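The pipeline above can be sketched in Python — a minimal sketch, with whitespace tokenization and lowercasing standing in for the tokenizer and linguistic modules, over toy documents:

```python
from collections import defaultdict

def build_index(docs):
    """Build a term -> sorted docID postings list from {docID: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Tokenizer stand-in: split on whitespace, strip punctuation.
        tokens = [t.strip(".,;:!?") for t in text.split()]
        # Linguistic-module stand-in: lowercase normalization.
        for term in (t.lower() for t in tokens if t):
            index[term].add(doc_id)
    # Postings sorted by docID (needed by the merge algorithms later).
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical tiny collection for illustration.
docs = {1: "Friends, Romans, countrymen.", 2: "Romans went home."}
index = build_index(docs)
# index["romans"] == [1, 2]
```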
Text preprocessing for building inverted index
Indexer step: Tokenize sequence
Doc 1: "I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me."
Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."
Indexer step: Sorting index terms
Question: how to efficiently sort by one element then by another?
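One answer to the question above, in Python: tuples compare element-wise, so a single sort keyed on (term, docID) orders by term first and docID second:

```python
# (term, docID) pairs emitted by the parser, in document order.
pairs = [("caesar", 2), ("brutus", 1), ("caesar", 1), ("brutus", 2)]

# Sort by term first, then by docID: tuples compare element-wise,
# so one stable sort with a tuple key does both levels at once.
pairs.sort(key=lambda p: (p[0], p[1]))
# pairs == [("brutus", 1), ("brutus", 2), ("caesar", 1), ("caesar", 2)]
```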
Indexer step: Dictionary and postings
Where do we pay in storage?
Terms and counts (the dictionary) and lists of docIDs (the postings).
IR system implementation
Query processing for inverted index
Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34
Merging two posting lists (AND & OR)
Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34
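The AND merge can be sketched as a linear two-pointer walk over the two docID-sorted lists (a minimal sketch; the list values follow the example):

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
# intersect(brutus, caesar) -> [2, 8]
```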
Merging two posting lists: skip lists (AND only)
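A sketch of the skip-pointer variant of the AND merge. Real skip lists attach pointers to selected postings; this sketch simulates them with a fixed stride of √(list length), which is an assumption made for illustration:

```python
import math

def intersect_with_skips(p1, p2):
    """AND-merge of two sorted postings lists, with skip pointers
    simulated as a fixed stride of sqrt(list length)."""
    s1 = max(1, int(math.sqrt(len(p1))))
    s2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take skips only while they do not overshoot the other value.
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                while i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                    i += s1
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                while j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                    j += s2
            else:
                j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
# intersect_with_skips(brutus, caesar) -> [2, 8]
```

Skips help only for AND: for OR we must touch every posting anyway, which is why skip pointers are an AND-only optimization.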
Handling OR / NOT
Boolean queries: handling larger-sized lists
Brutus AND NOT Caesar
Brutus OR NOT Caesar
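A query like Brutus AND NOT Caesar can still be answered in one linear merge, without ever materializing the huge NOT Caesar list (a minimal sketch over the example posting lists):

```python
def and_not(p1, p2):
    """p1 AND NOT p2 over docID-sorted postings lists, in linear time."""
    answer, i, j = [], 0, 0
    while i < len(p1):
        if j >= len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # docID in p1 and certainly not in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1  # excluded by the NOT
            j += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
# and_not(brutus, caesar) -> [4, 16, 32, 64, 128]
```

Brutus OR NOT Caesar has no such trick: its answer set is roughly "everything except part of Caesar's list", so it is inherently expensive.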
What about arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT
(Antony OR Cleopatra)
Brutus → 1, 2, 3, 5, 8, 16, 21, 34
Caesar → 2, 4, 8, 16, 32, 64, 128
Calpurnia → 13, 16
Query: Brutus AND Calpurnia AND Caesar
What about arbitrary Boolean formula?
Query: (Calpurnia AND Brutus) AND Caesar
This is why we kept the document frequency in the dictionary.
Brutus → 1, 2, 3, 5, 8, 16, 21, 34
Caesar → 2, 4, 8, 16, 32, 64, 128
Calpurnia → 13, 16
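The document frequencies stored in the dictionary let us order a multi-term AND rarest-term-first, so intermediate results stay small (a minimal sketch over the example lists):

```python
def intersect_many(postings_lists):
    """AND several terms, processing the rarest (smallest df) list first."""
    ordered = sorted(postings_lists, key=len)  # ascending document frequency
    result = ordered[0]
    for plist in ordered[1:]:
        members = set(plist)
        result = [d for d in result if d in members]
        if not result:
            break  # empty intersection: stop early
    return result

brutus = [1, 2, 3, 5, 8, 16, 21, 34]
caesar = [2, 4, 8, 16, 32, 64, 128]
calpurnia = [13, 16]
# Calpurnia (df = 2) is intersected first, then Caesar, then Brutus.
# intersect_many([brutus, caesar, calpurnia]) -> [16]
```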
Query processing exercise
(tangerine OR trees) AND
(marmalade OR skies) AND
(kaleidoscope OR eyes)
Phrase indexing
A first attempt: bi-gram indexing
Issues with bi-gram indexing
Solution 2: Positional indexing
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
Which of docs 1,2,4,5
could contain “to be
or not to be”?
Proximity search
d1: To be or not to be, that is the question.
d2: may not be a question, but to my or his understanding, ...
to: 1:[0,4], 2: [6]
be: 1:[1,5], 2: [2]
or: 1:[2], 2:[8]
not: 1:[3], 2:[1]
that: 1: [6], …
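With the positional lists above, a proximity operator like to /2 or ("within 2 words") reduces to comparing positions; this naive sketch checks all position pairs (real systems walk both sorted lists instead):

```python
def within_k(pos1, pos2, k):
    """True if some occurrence of term1 lies within k words of term2.
    Naive O(len(pos1) * len(pos2)) check, fine for a sketch."""
    return any(abs(a - b) <= k for a in pos1 for b in pos2)

# Positional postings taken from the two example documents.
positions = {
    "to": {1: [0, 4], 2: [6]},
    "or": {1: [2],    2: [8]},
}

# Query "to /2 or": docs where 'to' and 'or' are at most 2 words apart.
hits = [d for d in (1, 2) if within_k(positions["to"][d], positions["or"][d], 2)]
# doc 1: |4 - 2| = 2, doc 2: |6 - 8| = 2, so both match
```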
Positional indexing size
Index compression (for non-positional indexing)
Compressing the postings pointer table
Compressing the posting lists
Speeding up the dictionary search
Compressing the postings pointer table
Per-term entry: frequency (4 bytes), postings pointer (4 bytes), term pointer (3 bytes).
Compressing the postings pointer table
Per-term entry: frequency (4 bytes), postings pointer (4 bytes), term length (1 byte), term pointer (3 bytes).
The term pointers store absolute pointer addresses; the 1-byte field stores the length of the word (e.g. of "systile").
Compressing the posting lists
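Posting lists compress well once docIDs are replaced by gaps and the gaps are variable-byte encoded; a minimal sketch:

```python
def vb_encode(n):
    """Variable-byte encode one non-negative integer: 7 payload bits per
    byte, with the high bit set only on the final byte of the number."""
    out = [n % 128]
    n //= 128
    while n > 0:
        out.insert(0, n % 128)
        n //= 128
    out[-1] += 128  # termination bit on the last byte
    return bytes(out)

def encode_postings(docids):
    """Store gaps between consecutive docIDs; small gaps need fewer bytes."""
    gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]
    return b"".join(vb_encode(g) for g in gaps)

# [824, 829, 215406] -> gaps [824, 5, 214577]: the middle gap costs 1 byte
encoded = encode_postings([824, 829, 215406])
```

Frequent terms have dense posting lists, hence small gaps, hence short codes — exactly where compression matters most.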
Speeding up the dictionary search
Speeding up the dictionary search with prefix tree
Terms stored in the prefix tree: car, cat, do, don
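A nested-dict prefix tree (trie) over those terms; a prefix lookup walks the path, then collects every completion below it. A minimal sketch — the "$" end-of-term marker is an assumption of this implementation:

```python
def build_trie(terms):
    """Nested-dict trie; '$' marks the end of a complete term."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def prefix_terms(root, prefix):
    """All indexed terms that start with `prefix`."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []  # no term has this prefix
        node = node[ch]
    results = []
    def collect(n, path):
        if "$" in n:
            results.append(prefix + path)
        for ch, child in n.items():
            if ch != "$":
                collect(child, path + ch)
    collect(node, "")
    return results

trie = build_trie(["car", "cat", "do", "don"])
# prefix_terms(trie, "ca") -> ["car", "cat"]
```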
Vector Database
Probabilistic Skip List
HNSW is an extension of the probabilistic skip list
Nearest Neighbor Search
KNN
How to solve the local-optimum problem:
Delaunay graph: make nearest neighbors mutual (rather than letting a few hubs dominate)
multi-search: run the greedy search from several entry points
How to solve the local-optimum problem:
Delaunay graph: make sure every node is connected
Hierarchical Navigable Small World
Navigable Small World
Hierarchical Navigable Small World (HNSW) graphs
source: https://www.pinecone.io/learn/series/faiss/hnsw/
greedy-routing search process
High-degree vertices have many links, whereas low-degree vertices have very few links.
HNSW: Using Multiple NSW
source: https://www.pinecone.io/learn/series/faiss/hnsw/
The search process through the multi-layer structure of an HNSW graph.
Explanation of the number of links assigned to each vertex and the effect of M, M_max, and M_max0.
How do we build layers?
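HNSW, following the probabilistic skip list, draws each node's top layer from an exponentially decaying distribution: level = floor(-ln(U) · mL) with mL = 1/ln(M). A minimal sketch — M = 16 is an illustrative choice, not a value from the slides:

```python
import math
import random

def assign_level(m_l=1 / math.log(16), rng=random.random):
    """Draw a node's top layer: floor(-ln(U) * mL), U ~ Uniform(0, 1).
    With mL = 1/ln(M) (M = 16 here), a node reaches layer l with
    probability about (1/M)^l, so each layer up is ~16x sparser."""
    u = 1.0 - rng()  # in (0, 1]; avoids log(0)
    return int(-math.log(u) * m_l)

random.seed(0)
levels = [assign_level() for _ in range(100_000)]
# Layer 0 holds every node; higher layers thin out geometrically.
```

This geometric thinning is exactly what makes the top layers cheap, coarse entry points for the greedy descent.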
HNSW vs baselines
Lecture 4 Summary
Handling web-scale indexing
MapReduce
The master assigns a split to an idle machine.
Each parser emits (term, docID) pairs.
Partitions are merged in the inverter.
The inverter completes the index.
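The steps above can be sketched as a single-process simulation — a hypothetical tiny corpus, with the classic a-m / n-z term partition standing in for the partition function:

```python
from collections import defaultdict
from itertools import chain

def parse(split):
    """Map phase: a parser reads its split and emits (term, docID) pairs."""
    return [(term, doc_id)
            for doc_id, text in split
            for term in text.lower().split()]

def invert(pairs):
    """Reduce phase: an inverter turns one term-partition into postings."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {t: sorted(ds) for t, ds in postings.items()}

# Hypothetical tiny corpus; the master would assign one split per machine.
splits = [[(1, "caesar was ambitious")],
          [(2, "brutus killed caesar")]]
mapped = list(chain.from_iterable(parse(s) for s in splits))

# Partition the term space (a-m vs n-z) so inverters can run in parallel.
part_am = invert(p for p in mapped if p[0][0] <= "m")
part_nz = invert(p for p in mapped if p[0][0] > "m")
# part_am["caesar"] == [1, 2]
```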
Examples of MapReduce
MapReduce: Industry practice
Logarithmic dynamic indexing
https://nlp.stanford.edu/IR-book/html/htmledition/dynamic-indexing-1.html
[Figure: logarithmic merging — an in-memory index Z0 of size n is repeatedly merged into on-disk indexes I0, I1, I2, I3, ... of sizes n, 2n, 4n, 8n; whenever a level is already occupied, the two same-size indexes are merged and promoted to the next level.]
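Logarithmic merging can be sketched with toy "indexes" as sorted docID lists — a simplification, since real indexes hold full postings rather than bare IDs:

```python
def logarithmic_add(doc, z0, indexes, n):
    """Add a document to the in-memory index Z0; when Z0 reaches size n,
    cascade-merge it into on-disk generations I0 (size n), I1 (2n),
    I2 (4n), ... Each occupied level is merged and vacated, like
    carry propagation in binary addition."""
    z0.append(doc)
    if len(z0) < n:
        return
    merged = sorted(z0)
    z0.clear()
    level = 0
    while level < len(indexes) and indexes[level]:
        merged = sorted(indexes[level] + merged)  # merge same-size indexes
        indexes[level] = []
        level += 1
    if level == len(indexes):
        indexes.append([])
    indexes[level] = merged

# 8 documents with n = 2: I0 and I1 end up empty, I2 holds all 8 docs.
z0, indexes = [], []
for doc in range(1, 9):
    logarithmic_add(doc, z0, indexes, 2)
```

Each document is merged O(log(T/n)) times, versus O(T/n) for naive re-merging into one big index — the trade-off being that a query must now consult every non-empty generation.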
Real-time search at Twitter
Search engine tools
ElasticSearch
Homework 2: Using ElasticSearch to build a search engine