1 of 34

Boolean and Vector Space Retrieval Models

Elmurod Kuriyozov

Credit for some of the slides in this lecture goes to Prof. Ray Mooney at UT Austin

2 of 34

Retrieval Models

  • A retrieval model specifies the details of:
    • Document representation
    • Query representation
    • Retrieval function
  • Determines a notion of relevance.
  • Notion of relevance can be binary or continuous (i.e. ranked retrieval).

2

3 of 34

Classes of Retrieval Models

  • Boolean models
  • Vector space models (statistical/algebraic)
    • Latent Semantic Indexing
  • Probabilistic models
    • Basic probabilistic model
    • Bayesian inference networks
    • Language models

3

4 of 34

Types of Retrieval Models:�Exact Match vs. Best Match Retrieval

Exact Match (Boolean models)

  • Query specifies precise retrieval criteria
  • Every document either matches or fails to match query
  • Result is a set of documents
    • Usually in no particular order

Best Match (Vector Space models, Probabilistic models)

  • Query describes retrieval criteria for desired documents
  • Every document matches a query to some degree
  • Result is a ranked list of documents, “best” first

4

5 of 34

Exact Match vs. Best Match Retrieval

  • Best-match models are usually more accurate / effective
    • Good documents appear at the top of the rankings
    • Good documents often don’t exactly match the query
      • Query was too strict
        • Multimedia AND NOT (Apple OR IBM OR Microsoft)”
      • Document didn’t match user expectations
        • multi database retrieval” vs “multi db retrieval”
  • Exact match still prevalent in some markets
    • Large installed base that doesn’t want to migrate to new software
    • Efficiency: Exact match is very fast, good for high volume tasks
    • Sufficient for some tasks
    • Web “advanced search”

5

6 of 34

Common Preprocessing Steps

  • Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.).
  • Break into tokens (keywords) on whitespace.
  • Stem tokens to “root” words
    • computational 🡪 comput
  • Remove common stopwords (e.g. a, the, it, etc.).
  • Detect common phrases (possibly using a domain specific dictionary).
  • Build inverted index (keyword 🡪 list of docs containing it).

6

7 of 34

Boolean Model

  • A document is represented as a set of keywords.
  • Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope.
    • [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton]
  • Output: Document is relevant or not. No partial matches or ranking.

7

8 of 34

Search with Boolean query

  • Boolean query
    • E.g., “obama” AND “healthcare” NOT “news”
  • Procedures
    • Lookup query term in the dictionary
    • Retrieve the posting lists
    • Operation
      • AND: intersect the posting lists
      • OR: union the posting list
      • NOT: diff the posting list

8

9 of 34

Search with Boolean query

  • Example: AND operation

9

128

34

2

4

8

16

32

64

1

2

3

5

8

13

21

Term1

Term2

scan the postings

 

Trick for speed-up: when performing multi-way join, starts from lowest frequency term to highest frequency ones

10 of 34

Boolean Retrieval Model

  • Popular retrieval model because:
    • Easy to understand for simple queries.
    • Clean formalism.
  • Boolean models can be extended to include ranking.
  • Reasonably efficient implementations possible for normal queries.

10

11 of 34

Boolean Models − Problems

  • The query is unlikely precise. Studies show that people are not good at creating Boolean queries
    • People overestimate the quality of the queries they create
    • Queries are too strict: Few relevant documents found
    • Queries are too loose: Too many documents found (few relevant)
    • It is hard to find the right position between these two extremes (hard for users to specify constraints)
  • Even if it is accurate
    • Not all users would like to use such queries
    • All relevant documents are not equally important
      • No one would go through all the matched results
  • Difficult to perform relevance feedback.
    • If a document is identified by the user as relevant or irrelevant, how should the query be modified?

11

12 of 34

Statistical Models

  • A document is typically represented by a bag of words (unordered words with frequencies).
  • Bag = set that allows multiple occurrences of the same element.
  • User specifies a set of desired terms with optional weights:
    • Weighted query terms:

Q = < database 0.5; text 0.8; information 0.2 >

    • Unweighted query terms:

Q = < database; text; information >

12

13 of 34

Statistical Retrieval

  • Retrieval based on similarity between query and documents.
  • Output documents are ranked according to similarity to query.
  • Similarity based on occurrence frequencies of keywords in query and document.
  • Automatic relevance feedback can be supported:
    • Relevant documents “added” to query.
    • Irrelevant documents “subtracted” from query.
  • Assumptions
    • Query and documents are represented in the same form
      • A query can be regarded as a “document”
    • Relevance(d,q) ∝ similarity(d,q)

13

14 of 34

Vector space model

  • Represent both doc and query by concept vectors
    • Each concept defines one dimension
    • K concepts define a high-dimensional space
    • Element of vector corresponds to concept weight
      • E.g., d=(x1,…,xk), xi is “importance” of concept i
  • Measure relevance
    • Distance between the query vector and document vector in this concept space

14

15 of 34

Vector Space Retrieval Model: Introduction

  • How are documents represented in the binary vector model?

Term1 Term2 Term3 Term4 … Termn

Doc1 1 0 0 1 … 1

Doc2 0 1 1 0 … 0

Doc3 1 0 1 0 … 0

: : : : : : :

  • A document is represented as a vector of binary values
    • One dimension per term in the corpus vocabulary
  • An unstructured query can also be represented as a vector

Query 0 0 1 0 … 1

15

16 of 34

Vector Space Representation: Linear Algebra

  • Formally, a vector space is defined by a set of linearly independent basis vectors.
  • Basis vectors:
    • correspond to the dimensions or directions in the vector space;
    • determine what can be described in the vector space; and
    • must be orthogonal, or linearly independent, i.e. a value along one dimension implies nothing about a value along another.

16

Basis vectors

for 2 dimensions

Basis vectors

for 3 dimensions

17 of 34

VS Model: an illustration

  • Which document is closer to the query?

17

Sports

Education

Finance

D4

D2

D1

D5

D3

Query

18 of 34

What the VS model doesn’t say

  • How to define/select the “basic concept”
    • Concepts are assumed to be orthogonal
  • How to assign weights
    • Weight in query indicates importance of the concept
    • Weight in doc indicates how well the concept characterizes the doc
  • How to define the similarity/distance measure

18

19 of 34

Graphic Representation

Example:

D1 = 2T1 , 3T2 , 5T3

D2 = 3T1 , 7T2 , T3

Q = 0T1 , 0T2 , 2T3

19

T3

T1

T2

D1 = (2T1, 3T2 , 5T3 )

D2 = (3T1 , 7T2 , T3 )

Q = (0T1 , 0T2 , 2T3 )

7

3

2

5

  • Is D1 or D2 more similar to Q?
  • How to measure the degree of similarity? Distance? Angle? Projection?

20 of 34

Document Collection

  • A collection of n documents can be represented in the vector space model by a term-document matrix.
  • An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

20

T1 T2 …. Tt

D1 w11 w21 … wt1

D2 w12 w22 … wt2

: : : :

: : : :

Dn w1n w2n … wtn

21 of 34

How to assign weights?

  • Important!
  • Why?
    • Query side: not all terms are equally important
    • Doc side: some terms carry more information about the content
  • How?
    • Two basic heuristics
      • TF (Term Frequency) = Within-doc-frequency
      • IDF (Inverse Document Frequency)

21

22 of 34

Term Weights: Term Frequency

  • More frequent terms in a document are more important, i.e. more indicative of the topic.

fij = frequency of term i in document j

  • Raw TF is inaccurate
    • Document length variation
    • “Repeated occurrences” are less informative than the “first occurrence”
    • Relevance does not increase proportionally with number of term occurrence
  • May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:

tfij = fij / maxi{fij}

22

23 of 34

Term Weights: Inverse Document Frequency

  •  

23

Total number of docs in collection

 

Non-linear scaling

24 of 34

Why document frequency

  •  

24

Word

ttf

df

try

10422

8760

insurance

10440

3997

Table 1. Example total term frequency v.s. document frequency in Reuters-RCV1 collection.

25 of 34

TF-IDF weighting

  •  

25

26 of 34

Computing TF-IDF -- An Example

Given a document containing terms with given frequencies:

A(3), B(2), C(1)

Assume collection contains 10,000 documents and

document frequencies of these terms are:

A(50), B(1300), C(250)

Then:

A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6

B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0

C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8

26

27 of 34

Query Vector

  • Query vector is typically treated as a document and also tf-idf weighted.
  • Alternative is for the user to supply weights for the given query terms.

27

28 of 34

Similarity Measure

  • A similarity measure is a function that computes the degree of similarity between two vectors.

  • Using a similarity measure between the query and each document:
    • It is possible to rank the retrieved documents in the order of presumed relevance.
    • It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

28

29 of 34

How to define a good similarity measure?

  • Euclidean distance?

29

Sports

Education

Finance

D4

D2

D1

D5

D3

Query

TF-IDF space

30 of 34

How to define a good similarity measure?

  •  

30

31 of 34

Cosine similarity

  •  

31

Sports

Finance

D1

D2

Query

TF-IDF space

32 of 34

Cosine Similarity Measure

  • Cosine similarity measures the cosine of the angle between two vectors.
  • Inner product normalized by the vector lengths.

32

D1 = 2T1 , 3T2 , 5T3 CosSim(D1 , Q) = 10 / √(4+9+25)(0+0+4) = 0.81

D2 = 3T1 , 7T2 , 1T3 CosSim(D2 , Q) = 2 / √(9+49+1)(0+0+4) = 0.13

Q = 0T1 , 0T2 , 2T3

θ2

t3

t1

t2

D1

D2

Q

θ1

CosSim(dj, q) =

33 of 34

Comments on Vector Space Models

  • Empirically effective! (Top TREC performance)
  • Intuitive
  • Easy to implement
  • Well-studied/Mostly evaluated
  • The Smart system
    • Developed at Cornell: 1960-1999
    • Still widely used
  • Warning: Many variants of TF-IDF!

33

34 of 34

Problems with Vector Space Model

  • Missing semantic information (e.g. word sense).
  • Missing syntactic information (e.g. phrase structure, word order, proximity information).
  • Assumption of term independence (e.g. ignores synonomy).
  • Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
    • Given a two-term query “A B”, may prefer a document containing A frequently but not B, over a document that contains both A and B, but both less frequently.
  • Assume query and document to be the same
  • Lack of “predictive adequacy”
    • Arbitrary term weighting
    • Arbitrary similarity measure
  • Lots of parameter tuning!

34