Introducing Information Retrieval
and Web Search
Introduction to
Information Retrieval
Introduction to Information Retrieval
Information Retrieval
2
Introduction to Information Retrieval
Unstructured (text) vs. structured (database) data in the mid-nineties
3
Introduction to Information Retrieval
Unstructured (text) vs. structured (database) data today
4
Introduction to Information Retrieval
Basic assumptions of Information Retrieval
5
Sec. 1.1
Introduction to Information Retrieval
The classic search model
how trap mice alive
Collection
User task
Info need�
Query�
Results�
Search
engine�
Query�refinement
Get rid of mice in a politically correct way
Info about removing mice
without killing them
Misconception?
Misformulation?
Search
Introduction to Information Retrieval
How good are the retrieved docs?
7
Sec. 1.1
Introduction to Information Retrieval
Introducing Information Retrieval
and Web Search
Introduction to
Information Retrieval
Introduction to Information Retrieval
Term-document incidence matrices
Introduction to
Information Retrieval
Introduction to Information Retrieval
Unstructured data in 1620
10
Sec. 1.1
Introduction to Information Retrieval
Term-document incidence matrices
1 if play contains word, 0 otherwise
Brutus AND Caesar BUT NOT Calpurnia
Sec. 1.1
Introduction to Information Retrieval
Incidence vectors
12
Sec. 1.1
Introduction to Information Retrieval
Answers to query
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Lord Polonius: I did enact Julius Caesar I was killed i’ the
Capitol; Brutus killed me.
13
Sec. 1.1
Introduction to Information Retrieval
Bigger collections
14
Sec. 1.1
Introduction to Information Retrieval
Can’t build the matrix
15
Why?
Sec. 1.1
Introduction to Information Retrieval
Term-document incidence matrices
Introduction to
Information Retrieval
Introduction to Information Retrieval
The Inverted Index
The key data structure underlying modern IR
Introduction to
Information Retrieval
Introduction to Information Retrieval
Inverted index
18
What happens if the word Caesar is added to document 14?
Sec. 1.2
Brutus
Calpurnia
Caesar
1
2
4
5
6
16
57
132
1
2
4
11
31
45
173
2
31
174
54
101
Introduction to Information Retrieval
Inverted index
19
Dictionary
Postings
Sorted by docID (more later on why).
Posting
Sec. 1.2
Brutus
Calpurnia
Caesar
1
2
4
5
6
16
57
132
1
2
4
11
31
45
173
2
31
174
54
101
Introduction to Information Retrieval
Inverted index construction
Tokenizer
Token stream
Friends
Romans
Countrymen
Linguistic modules
Modified tokens
friend
roman
countryman
Indexer
Inverted index
friend
roman
countryman
2
4
2
13
16
1
Documents to
be indexed
Friends, Romans, countrymen.
Sec. 1.2
Introduction to Information Retrieval
Inverted index construction
Tokenizer
Token stream
Friends
Romans
Countrymen
Linguistic modules
Modified tokens
friend
roman
countryman
Indexer
Inverted index
friend
roman
countryman
2
4
2
13
16
1
More on
these later.
Documents to
be indexed
Friends, Romans, countrymen.
Sec. 1.2
Introduction to Information Retrieval
Initial stages of text processing
Introduction to Information Retrieval
Indexer steps: Token sequence
I did enact Julius
Caesar I was killed
i’ the Capitol;
Brutus killed me.
Doc 1
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Doc 2
Sec. 1.2
Introduction to Information Retrieval
Indexer steps: Sort
Core indexing step
Sec. 1.2
Introduction to Information Retrieval
Indexer steps: Dictionary & Postings
Why frequency?
Will discuss later.
Sec. 1.2
Introduction to Information Retrieval
Where do we pay in storage?
26
Pointers
Terms and counts
IR system implementation
Sec. 1.2
Lists of docIDs
Introduction to Information Retrieval
The Inverted Index
The key data structure underlying modern IR
Introduction to
Information Retrieval
Introduction to Information Retrieval
Query processing with an inverted index
Introduction to
Information Retrieval
Introduction to Information Retrieval
The index we just built
29
Our focus
Sec. 1.3
Introduction to Information Retrieval
Query processing: AND
Brutus AND Caesar
30
128
34
2
4
8
16
32
64
1
2
3
5
8
13
21
Brutus
Caesar
Sec. 1.3
Introduction to Information Retrieval
The merge
31
34
128
2
4
8
16
32
64
1
2
3
5
8
13
21
Brutus
Caesar
If the list lengths are x and y, the merge takes O(x+y)
operations.
Crucial: postings sorted by docID.
Sec. 1.3
Introduction to Information Retrieval
The merge
32
34
128
2
4
8
16
32
64
1
2
3
5
8
13
21
128
34
2
4
8
16
32
64
1
2
3
5
8
13
21
Brutus
Caesar
2
8
If the list lengths are x and y, the merge takes O(x+y)
operations.
Crucial: postings sorted by docID.
Sec. 1.3
Introduction to Information Retrieval
Intersecting two postings lists�(a “merge” algorithm)
33
Introduction to Information Retrieval
Query processing with an inverted index
Introduction to
Information Retrieval
Introduction to Information Retrieval
The Boolean Retrieval Model
& Extended Boolean Models
Introduction to
Information Retrieval
Introduction to Information Retrieval
Boolean queries: Exact match
36
Sec. 1.3
Introduction to Information Retrieval
Example: WestLaw http://www.westlaw.com/
37
Sec. 1.4
Introduction to Information Retrieval
Example: WestLaw http://www.westlaw.com/
Sec. 1.4
Introduction to Information Retrieval
Boolean queries: �More general merges
Brutus AND NOT Caesar
Brutus OR NOT Caesar
39
Sec. 1.3
Introduction to Information Retrieval
Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT
(Antony OR Cleopatra)
40
Sec. 1.3
Introduction to Information Retrieval
Query optimization
Brutus
Caesar
Calpurnia
1
2
3
5
8
16
21
34
2
4
8
16
32
64
128
13
16
Query: Brutus AND Calpurnia AND Caesar
41
Sec. 1.3
Introduction to Information Retrieval
Query optimization example
42
This is why we kept
document freq. in dictionary
Execute the query as (Calpurnia AND Brutus) AND Caesar.
Sec. 1.3
Brutus
Caesar
Calpurnia
1
2
3
5
8
16
21
34
2
4
8
16
32
64
128
13
16
Introduction to Information Retrieval
More general optimization
43
Sec. 1.3
Introduction to Information Retrieval
Exercise
44
(tangerine OR trees) AND
(marmalade OR skies) AND
(kaleidoscope OR eyes)
Introduction to Information Retrieval
Query processing exercises
45
Introduction to Information Retrieval
Exercise
46
Introduction to Information Retrieval
The Boolean Retrieval Model
& Extended Boolean Models
Introduction to
Information Retrieval
Introduction to Information Retrieval
Phrase queries and positional indexes
Introduction to
Information Retrieval
Introduction to Information Retrieval
Phrase queries
<term : docs> entries
Sec. 2.4
Introduction to Information Retrieval
A first attempt: Biword indexes
Sec. 2.4.1
Introduction to Information Retrieval
Longer phrase queries
stanford university AND university palo AND palo alto
Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.
Can have false positives!
Sec. 2.4.1
Introduction to Information Retrieval
Extended biwords
N X X N
Sec. 2.4.1
Introduction to Information Retrieval
Issues for biword indexes
Sec. 2.4.1
Introduction to Information Retrieval
Solution 2: Positional indexes
<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Sec. 2.4.2
Introduction to Information Retrieval
Positional index example
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
Which of docs 1,2,4,5
could contain “to be
or not to be”?
Sec. 2.4.2
Introduction to Information Retrieval
Processing a phrase query
Sec. 2.4.2
Introduction to Information Retrieval
Proximity queries
Sec. 2.4.2
Introduction to Information Retrieval
Positional index size
Sec. 2.4.2
Introduction to Information Retrieval
Positional index size
Why?
100
1
100,000
1
1
1000
Positional postings
Postings
Document size
Sec. 2.4.2
Introduction to Information Retrieval
Rules of thumb
Sec. 2.4.2
Introduction to Information Retrieval
Combination schemes
Sec. 2.4.3
Introduction to Information Retrieval
Phrase queries and positional indexes
Introduction to
Information Retrieval
Introduction to Information Retrieval