Boolean and Vector Space Retrieval Models
Elmurod Kuriyozov
Credit for some of the slides in this lecture goes to Prof. Ray Mooney at UT Austin
Retrieval Models
2
Classes of Retrieval Models
3
Types of Retrieval Models:�Exact Match vs. Best Match Retrieval
Exact Match (Boolean models)
Best Match (Vector Space models, Probabilistic models)
4
Exact Match vs. Best Match Retrieval
5
Common Preprocessing Steps
6
Boolean Model
7
Search with Boolean query
8
Search with Boolean query
9
128
34
2
4
8
16
32
64
1
2
3
5
8
13
21
Term1
Term2
scan the postings
Trick for speed-up: when performing multi-way join, starts from lowest frequency term to highest frequency ones
Boolean Retrieval Model
10
Boolean Models − Problems
11
Statistical Models
Q = < database 0.5; text 0.8; information 0.2 >
Q = < database; text; information >
12
Statistical Retrieval
13
Vector space model
14
Vector Space Retrieval Model: Introduction
Term1 Term2 Term3 Term4 … Termn
Doc1 1 0 0 1 … 1
Doc2 0 1 1 0 … 0
Doc3 1 0 1 0 … 0
: : : : : : :
Query 0 0 1 0 … 1
15
Vector Space Representation: Linear Algebra
16
Basis vectors
for 2 dimensions
Basis vectors
for 3 dimensions
VS Model: an illustration
17
Sports
Education
Finance
D4
D2
D1
D5
D3
Query
What the VS model doesn’t say
18
Graphic Representation
Example:
D1 = 2T1 , 3T2 , 5T3
D2 = 3T1 , 7T2 , T3
Q = 0T1 , 0T2 , 2T3
19
T3
T1
T2
D1 = (2T1, 3T2 , 5T3 )
D2 = (3T1 , 7T2 , T3 )
Q = (0T1 , 0T2 , 2T3 )
7
3
2
5
Document Collection
20
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
How to assign weights?
21
Term Weights: Term Frequency
fij = frequency of term i in document j
tfij = fij / maxi{fij}
22
Term Weights: Inverse Document Frequency
23
Total number of docs in collection
Non-linear scaling
Why document frequency
24
Word | ttf | df |
try | 10422 | 8760 |
insurance | 10440 | 3997 |
Table 1. Example total term frequency v.s. document frequency in Reuters-RCV1 collection.
TF-IDF weighting
25
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
26
Query Vector
27
Similarity Measure
28
How to define a good similarity measure?
29
Sports
Education
Finance
D4
D2
D1
D5
D3
Query
TF-IDF space
How to define a good similarity measure?
30
Cosine similarity
31
Sports
Finance
D1
D2
Query
TF-IDF space
Cosine Similarity Measure
32
D1 = 2T1 , 3T2 , 5T3 CosSim(D1 , Q) = 10 / √(4+9+25)(0+0+4) = 0.81
D2 = 3T1 , 7T2 , 1T3 CosSim(D2 , Q) = 2 / √(9+49+1)(0+0+4) = 0.13
Q = 0T1 , 0T2 , 2T3
θ2
t3
t1
t2
D1
D2
Q
θ1
CosSim(dj, q) =
Comments on Vector Space Models
33
Problems with Vector Space Model
34