1 of 34

Boolean and Vector Space Retrieval Models

Elmurod Kuriyozov

Credit for some of the slides in this lecture goes to Prof. Ray Mooney at UT Austin

2 of 34

Retrieval Models

A retrieval model specifies the details of:

Document representation
Query representation
Retrieval function

Determines a notion of relevance.
Notion of relevance can be binary or continuous (i.e. ranked retrieval).

3 of 34

Classes of Retrieval Models

Boolean models
Vector space models (statistical/algebraic)

Latent Semantic Indexing

Probabilistic models

Basic probabilistic model
Bayesian inference networks
Language models

4 of 34

Types of Retrieval Models:�Exact Match vs. Best Match Retrieval

Exact Match (Boolean models)

Query specifies precise retrieval criteria
Every document either matches or fails to match query
Result is a set of documents

Usually in no particular order

Best Match (Vector Space models, Probabilistic models)

Query describes retrieval criteria for desired documents
Every document matches a query to some degree
Result is a ranked list of documents, “best” first

5 of 34

Exact Match vs. Best Match Retrieval

Best-match models are usually more accurate / effective

Good documents appear at the top of the rankings
Good documents often don’t exactly match the query

Query was too strict

“Multimedia AND NOT (Apple OR IBM OR Microsoft)”

Document didn’t match user expectations

“multi database retrieval” vs “multi db retrieval”

Exact match still prevalent in some markets

Large installed base that doesn’t want to migrate to new software
Efficiency: Exact match is very fast, good for high volume tasks
Sufficient for some tasks
Web “advanced search”

6 of 34

Common Preprocessing Steps

Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.).
Break into tokens (keywords) on whitespace.
Stem tokens to “root” words

computational 🡪 comput

Remove common stopwords (e.g. a, the, it, etc.).
Detect common phrases (possibly using a domain specific dictionary).
Build inverted index (keyword 🡪 list of docs containing it).

7 of 34

Boolean Model

A document is represented as a set of keywords.
Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope.

[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton]

Output: Document is relevant or not. No partial matches or ranking.

8 of 34

Search with Boolean query

Boolean query

E.g., “obama” AND “healthcare” NOT “news”

Procedures

Lookup query term in the dictionary
Retrieve the posting lists
Operation

AND: intersect the posting lists
OR: union the posting list
NOT: diff the posting list

9 of 34

Search with Boolean query

Example: AND operation

128

Term1

Term2

scan the postings

Trick for speed-up: when performing multi-way join, starts from lowest frequency term to highest frequency ones

10 of 34

Boolean Retrieval Model

Popular retrieval model because:

Easy to understand for simple queries.
Clean formalism.

Boolean models can be extended to include ranking.
Reasonably efficient implementations possible for normal queries.

11 of 34

Boolean Models − Problems

The query is unlikely precise. Studies show that people are not good at creating Boolean queries

People overestimate the quality of the queries they create
Queries are too strict: Few relevant documents found
Queries are too loose: Too many documents found (few relevant)
It is hard to find the right position between these two extremes (hard for users to specify constraints)

Even if it is accurate

Not all users would like to use such queries
All relevant documents are not equally important

No one would go through all the matched results

Difficult to perform relevance feedback.

If a document is identified by the user as relevant or irrelevant, how should the query be modified?

12 of 34

Statistical Models

A document is typically represented by a bag of words (unordered words with frequencies).
Bag = set that allows multiple occurrences of the same element.
User specifies a set of desired terms with optional weights:

Weighted query terms:

Q = < database 0.5; text 0.8; information 0.2 >

Unweighted query terms:

Q = < database; text; information >

13 of 34

Statistical Retrieval

Retrieval based on similarity between query and documents.
Output documents are ranked according to similarity to query.
Similarity based on occurrence frequencies of keywords in query and document.
Automatic relevance feedback can be supported:

Relevant documents “added” to query.
Irrelevant documents “subtracted” from query.

Assumptions

Query and documents are represented in the same form

A query can be regarded as a “document”

Relevance(d,q) ∝ similarity(d,q)

14 of 34

Vector space model

Represent both doc and query by concept vectors

Each concept defines one dimension
K concepts define a high-dimensional space
Element of vector corresponds to concept weight

E.g., d=(x₁,…,x_k), x_i is “importance” of concept i

Measure relevance

Distance between the query vector and document vector in this concept space

15 of 34

Vector Space Retrieval Model: Introduction

How are documents represented in the binary vector model?

Term₁ Term₂ Term₃Term₄ … Term_n

Doc₁ 1 0 0 1 … 1

Doc₂ 0 1 1 0 … 0

Doc₃ 1 0 1 0 … 0

: : : : : : :

A document is represented as a vector of binary values

One dimension per term in the corpus vocabulary

An unstructured query can also be represented as a vector

Query 0 0 1 0 … 1

16 of 34

Vector Space Representation: Linear Algebra

Formally, a vector space is defined by a set of linearly independent basis vectors.
Basis vectors:

correspond to the dimensions or directions in the vector space;
determine what can be described in the vector space; and
must be orthogonal, or linearly independent, i.e. a value along one dimension implies nothing about a value along another.

Basis vectors

for 2 dimensions

Basis vectors

for 3 dimensions

17 of 34

VS Model: an illustration

Which document is closer to the query?

Sports

Education

Finance

D₄

D₂

D₁

D₅

D₃

Query

18 of 34

What the VS model doesn’t say

How to define/select the “basic concept”

Concepts are assumed to be orthogonal

How to assign weights

Weight in query indicates importance of the concept
Weight in doc indicates how well the concept characterizes the doc

How to define the similarity/distance measure

19 of 34

Graphic Representation

Example:

D₁ = 2T₁ , 3T₂ , 5T₃

D₂ = 3T₁ , 7T₂ , T₃

Q = 0T₁ , 0T₂ , 2T₃

T₃

T₁

T₂

D₁ = (2T₁, 3T₂ , 5T₃ )

D₂= (3T₁ , 7T₂ , T₃ )

Q = (0T₁ , 0T₂ , 2T₃ )

Is D₁ or D₂ more similar to Q?
How to measure the degree of similarity? Distance? Angle? Projection?

20 of 34

Document Collection

A collection of n documents can be represented in the vector space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

T₁ T₂ …. T_t

D₁ w₁₁ w₂₁ … w_t1

D₂ w₁₂ w₂₂ … w_t2

: : : :

D_n w_1n w_2n … w_tn

21 of 34

How to assign weights?

Important!
Why?

Query side: not all terms are equally important
Doc side: some terms carry more information about the content

How?

Two basic heuristics

TF (Term Frequency) = Within-doc-frequency
IDF (Inverse Document Frequency)

22 of 34

Term Weights: Term Frequency

More frequent terms in a document are more important, i.e. more indicative of the topic.

f_ij= frequency of term i in document j

Raw TF is inaccurate

Document length variation
“Repeated occurrences” are less informative than the “first occurrence”
Relevance does not increase proportionally with number of term occurrence

May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:

tf_ij=f_ij/ max_i{f_ij}

23 of 34

Term Weights: Inverse Document Frequency

Total number of docs in collection

Non-linear scaling

24 of 34

Why document frequency

Word	ttf	df
try	10422	8760
insurance	10440	3997

Table 1. Example total term frequency v.s. document frequency in Reuters-RCV1 collection.

25 of 34

TF-IDF weighting

26 of 34

Computing TF-IDF -- An Example

Given a document containing terms with given frequencies:

A(3), B(2), C(1)

Assume collection contains 10,000 documents and

document frequencies of these terms are:

A(50), B(1300), C(250)

Then:

A: tf = 3/3; idf = log₂(10000/50) = 7.6; tf-idf = 7.6

B: tf = 2/3; idf = log₂(10000/1300) = 2.9; tf-idf = 2.0

C: tf = 1/3; idf = log₂(10000/250) = 5.3; tf-idf = 1.8

27 of 34

Query Vector

Query vector is typically treated as a document and also tf-idf weighted.
Alternative is for the user to supply weights for the given query terms.

28 of 34

Similarity Measure

A similarity measure is a function that computes the degree of similarity between two vectors.

Using a similarity measure between the query and each document:

It is possible to rank the retrieved documents in the order of presumed relevance.
It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

29 of 34

How to define a good similarity measure?

Euclidean distance?

Sports

Education

Finance

D₄

D₂

D₁

D₅

D₃

Query

TF-IDF space

30 of 34

How to define a good similarity measure?

31 of 34

Cosine similarity

Sports

Finance

D₁

D₂

Query

TF-IDF space

32 of 34

Cosine Similarity Measure

Cosine similarity measures the cosine of the angle between two vectors.
Inner product normalized by the vector lengths.

D₁ = 2T₁ , 3T₂ , 5T₃CosSim(D₁ , Q) = 10 / √(4+9+25)(0+0+4) = 0.81

D₂ = 3T₁ , 7T₂ , 1T₃CosSim(D₂ , Q) = 2 / √(9+49+1)(0+0+4) = 0.13

Q = 0T₁ , 0T₂ , 2T₃

θ₂

t₃

t₁

t₂

D₁

D₂

θ₁

CosSim(d_j, q) =

33 of 34

Comments on Vector Space Models

Empirically effective! (Top TREC performance)
Intuitive
Easy to implement
Well-studied/Mostly evaluated
The Smart system

Developed at Cornell: 1960-1999
Still widely used

Warning: Many variants of TF-IDF!

34 of 34

Problems with Vector Space Model

Missing semantic information (e.g. word sense).
Missing syntactic information (e.g. phrase structure, word order, proximity information).
Assumption of term independence (e.g. ignores synonomy).
Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).

Given a two-term query “A B”, may prefer a document containing A frequently but not B, over a document that contains both A and B, but both less frequently.

Assume query and document to be the same
Lack of “predictive adequacy”

Arbitrary term weighting
Arbitrary similarity measure

Lots of parameter tuning!