
CS458 Natural Language Processing

Lecture 15

Information Retrieval

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

IR

Information Retrieval (IR)

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

These days we frequently think first of web search, but there are many other cases:

  1. E-mail search
  2. Searching your laptop
  3. Corporate knowledge bases
  4. Legal information retrieval


Text: Unstructured vs Structured (Mid-90s)


Text: Unstructured vs Structured (2000s)


Classic Search Model

[Diagram: a user's information need is expressed as a query, which a search engine runs against a document repository to produce results; unsatisfactory results lead to query reformulation and a new search.]

Information Retrieval: Why?

Information Retrieval: Why not RDBMS?


Information Retrieval: Transform


IR Components


Information Retrieval: Core Concepts


Information Retrieval: Applications


IR Applications: Text Categorization


IR Applications: Document Clustering


IR Applications: Content-Based Filtering


IR Applications: Collaborative Filtering


IR Applications: Information Extraction


IR Applications: Question Answering


IR Applications: Web Search


IR Applications: Federated Search


IR Applications: Expertise Search


IR Applications: Citation/Link Analysis


IR Applications: Multimedia Retrieval


IR Applications: Information Visualization


Boolean IR


Term-Document Incidence Matrices


Unstructured Data in 1620

  • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

  • One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?

  • Why is that not the answer?
    1. Slow (for large corpora)
    2. NOT Calpurnia is non-trivial
    3. Other operations (e.g., find the word Romans near countrymen) not feasible
    4. Ranked retrieval (best documents to return)


Incidence Vectors

Each term gets a 0/1 incidence vector over the plays: 1 if the play contains the word, 0 otherwise.

Query: Brutus AND Caesar BUT NOT Calpurnia


Information Retrieval

So we have a 0/1 vector for each term.

To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND them:

    110100  (Brutus)
AND 110111  (Caesar)
AND 101111  (NOT Calpurnia)
  = 100100
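
A minimal Python sketch of this bitwise computation, assuming the textbook ordering of the six plays (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth), with each incidence vector packed into an integer:

    # Answer "Brutus AND Caesar AND NOT Calpurnia" with bitwise operations.
    PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
             "Hamlet", "Othello", "Macbeth"]

    brutus    = 0b110100
    caesar    = 0b110111
    calpurnia = 0b010000

    mask = (1 << len(PLAYS)) - 1                     # six plays -> six bits
    result = brutus & caesar & (~calpurnia & mask)   # 0b100100

    # Bit i (counting from the left) corresponds to PLAYS[i].
    matches = [p for i, p in enumerate(PLAYS)
               if (result >> (len(PLAYS) - 1 - i)) & 1]
    print(matches)   # ['Antony and Cleopatra', 'Hamlet']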


Answers to query

Antony and Cleopatra, Act III, Scene ii

Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,

When Antony found Julius Caesar dead,

He cried almost to roaring; and he wept

When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii

Lord Polonius: I did enact Julius Caesar: I was killed i’ the

Capitol; Brutus killed me.


Bigger Collections

Consider N = 1 million documents, each with about 1000 words.

Avg 6 bytes/word including spaces/punctuation:

10^6 documents x 10^3 words/doc x 6 bytes/word = 6 x 10^9 bytes = 6 GB of data in the documents.

Say there are M = 500K distinct terms among these.


Can’t build the matrix

500K x 1M matrix has half-a-trillion 0’s and 1’s.

But it has no more than one billion 1’s: each document contains at most ~1,000 distinct terms, so at most 10^6 x 10^3 = 10^9 entries are 1.

The matrix is extremely sparse.

What’s a better representation?

We only record the 1 positions.

Why?


Inverted Index


For each term t, we must store a list of all documents that contain t.

Identify each doc by a docID, a document serial number.

  • Can we use fixed-size arrays for this?

  • What happens if the word Caesar is added to document 14?


We need variable-size postings lists

On disk, a contiguous run of postings is normal and best

In memory, can use linked lists or variable length arrays

Some tradeoffs in size/ease of insertion


Inverted Index Construction


Initial Stages of Text Processing

  • Tokenization
    • Cut character sequence into word tokens
    • Deal with “John’s”, “a state-of-the-art solution”
  • Normalization
    • Map text and query term to same form
    • You want U.S.A. and USA to match
  • Stemming
    • We may wish different forms of a root to match
    • authorize, authorization
  • Stop words
    • We may omit very common words (or not)
    • the, a, to, of
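
A toy Python sketch of these four stages (the regex tokenizer, normalization rule, and suffix-stripping stemmer here are deliberate simplifications, not the methods a production system would use):

    import re

    STOP_WORDS = {"the", "a", "to", "of"}

    def preprocess(text):
        """Toy pipeline: tokenize -> normalize -> stem -> drop stop words."""
        # Tokenization: cut the character sequence into word tokens, keeping
        # internal periods/apostrophes so "U.S.A." and "John's" survive.
        tokens = re.findall(r"[A-Za-z][A-Za-z.']*", text)
        # Normalization: lowercase and drop periods, so U.S.A. and USA match.
        tokens = [t.lower().replace(".", "") for t in tokens]
        # Stemming: crude suffix stripping (real systems use e.g. Porter's
        # stemmer), so authorize and authorization share the form "author".
        def stem(t):
            for suffix in ("'s", "ization", "ation", "ize", "ing", "s"):
                if t.endswith(suffix) and len(t) > len(suffix) + 2:
                    return t[: -len(suffix)]
            return t
        tokens = [stem(t) for t in tokens]
        # Stop words: omit very common words (or not, depending on the system).
        return [t for t in tokens if t not in STOP_WORDS]

    print(preprocess("John's U.S.A. authorization to authorize indexing"))
    # ['john', 'usa', 'author', 'author', 'index']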


Indexing: Token Sequence


Indexing: Sorting

  • Sort by terms
    • At least conceptually
    • And then docID


Indexing: Creating Dictionary & Postings

  • Multiple term entries in a single document are merged.
    • Split into Dictionary and Postings
    • Doc. frequency information is added.
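
Put together, the token-sequence, sorting, and merging steps look roughly like this in Python (a sketch with toy whitespace tokenization, not a full indexer):

    def build_index(docs):
        """Inverted index: term -> (document frequency, postings list).
        docs maps docID -> text."""
        # 1. Token sequence: emit (term, docID) pairs.
        pairs = []
        for doc_id, text in docs.items():
            for term in text.lower().split():        # toy tokenization
                pairs.append((term, doc_id))
        # 2. Sort by term, then by docID.
        pairs.sort()
        # 3. Merge multiple entries of a term in one document; build postings.
        index = {}
        for term, doc_id in pairs:
            postings = index.setdefault(term, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        # 4. Attach document frequency (the length of each postings list).
        return {t: (len(p), p) for t, p in index.items()}

    docs = {1: "I did enact Julius Caesar", 2: "so let it be with Caesar"}
    print(build_index(docs)["caesar"])   # (2, [1, 2])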


Indexing: Costs


Faster Postings List Intersection via Skip Pointers


If the list lengths are m and n, the intersection takes O(m + n) operations. Can we do better than this?

One way is to use a skip list, augmenting the postings lists with skip pointers.
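
In place of the algorithm figure, here is a minimal Python sketch of the standard skip-pointer intersection (cf. Manning et al.'s IIR, Fig. 2.10); placing skips every sqrt(L) positions is a common heuristic, assumed here:

    def advance(plist, i, target, span):
        # From a skip-pointer position, follow skips while the skip target
        # does not overshoot; otherwise just step forward by one.
        if i % span == 0 and i + span < len(plist) and plist[i + span] <= target:
            while i % span == 0 and i + span < len(plist) and plist[i + span] <= target:
                i += span
            return i
        return i + 1

    def intersect_with_skips(p1, p2):
        """Intersect two sorted postings lists of docIDs."""
        s1 = max(1, int(len(p1) ** 0.5))   # skip span for p1
        s2 = max(1, int(len(p2) ** 0.5))   # skip span for p2
        i = j = 0
        answer = []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                i = advance(p1, i, p2[j], s1)
            else:
                j = advance(p2, j, p1[i], s2)
        return answer

    # The postings lists from the exercise below (span 4 for 16 entries):
    long_list = [4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 122, 157, 180]
    print(intersect_with_skips(long_list, [47]))   # [47]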


We have a two-word query. For one term, the postings list consists of the following 16 entries:

4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 122, 157, 180

and for the other, it is the one-entry postings list:

47

Work out how many comparisons would be done to intersect the two postings lists with the following two strategies.


Applying MERGE to the standard postings lists, a comparison is made at each step until one of the lists is exhausted, i.e., until we reach 47 in the longer list, at which point the one-entry list is exhausted and no more processing is needed.

Number of comparisons = 11

Using skip pointers of length 4 on the longer list (and length 1 on the shorter), the comparisons made are: 4 & 47, 14 & 47, 22 & 47, 120 & 47, 81 & 47, and 47 & 47.

Number of comparisons = 6


IR Summary


IR System

  • Document refers to whatever unit of text the system indexes and retrieves (web pages, scientific papers, news articles, or even shorter passages like paragraphs).
  • Collection refers to the set of documents being used to satisfy user requests.
  • Term refers to a word in a collection, but it may also include phrases.
  • Query represents a user’s information need expressed as a set of terms.



IR Components

  • Data Collection and Preprocessing
  • Indexing
  • Query Processing
  • Presentation of Results
  • User Interaction and Feedback
  • Iterative Querying
  • Continuous Learning and Adaptation


IR Issue: Lexical Gap


IR Models

  • Boolean Retrieval
  • Vector Space Model (VSM) Retrieval
  • Language Model (LM) based Probabilistic Retrieval
  • Translation Model


IR Scoring

Score document d by the cosine of its vector d with the query vector q:
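
    \text{score}(q, d) = \cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{|\mathbf{q}|\,|\mathbf{d}|}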

Another way to think of the cosine computation is as the dot product of unit vectors; we first normalize both the query and document vector to unit vectors, by dividing by their lengths, and then take the dot product:
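
    \text{score}(q, d) = \frac{\mathbf{q}}{|\mathbf{q}|} \cdot \frac{\mathbf{d}}{|\mathbf{d}|}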


We can spell this out, using the tf-idf values and writing the dot product as a sum of products:
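
    \text{score}(q, d) = \sum_{t \in q} \frac{\text{tf-idf}(t, q)}{\sqrt{\sum_{q_i \in q} \text{tf-idf}^2(q_i, q)}} \cdot \frac{\text{tf-idf}(t, d)}{\sqrt{\sum_{d_i \in d} \text{tf-idf}^2(d_i, d)}}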


Document 1 has a higher cosine with the query (0.747) than Document 2 has with the query (0.0779), and so the tf-idf cosine model would rank Document 1 above Document 2. This ranking is intuitive given the vector space model, since Document 1 has both terms including two instances of sweet, while Document 2 is missing one of the terms.
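
A minimal Python sketch of this kind of tf-idf cosine scoring (tf = log10(1 + count) and idf = log10(N / df), as in Jurafsky & Martin; the toy query and documents below are illustrative, not the slide's exact example):

    import math
    from collections import Counter

    def tfidf(tokens, df, n_docs):
        # tf = log10(1 + count), idf = log10(N / df); terms must occur in df.
        counts = Counter(tokens)
        return {t: math.log10(1 + c) * math.log10(n_docs / df[t])
                for t, c in counts.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = lambda vec: math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return dot / (norm(u) * norm(v))

    docs = {1: "sweet sweet love song".split(),
            2: "bitter song of war".split()}
    df = Counter(t for toks in docs.values() for t in set(toks))
    n = len(docs)
    query = "sweet love".split()
    scores = {d: cosine(tfidf(query, df, n), tfidf(toks, df, n))
              for d, toks in docs.items()}
    print(scores)   # document 1 outranks document 2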


IR System Performance


IR Metrics

  • Average precision

  • Precision at k

  • R-precision

  • Mean average precision

  • Discounted cumulative gain
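
Minimal Python sketches of two of these (ranking is the system's ranked list of docIDs, rel the set of relevant ones; mean average precision is just average precision averaged over queries):

    def precision_at_k(ranking, rel, k):
        """Fraction of the top-k ranked documents that are relevant."""
        return sum(1 for d in ranking[:k] if d in rel) / k

    def average_precision(ranking, rel):
        """Mean of precision@k at each rank k where a relevant doc appears;
        relevant documents never retrieved contribute zero."""
        hits, total = 0, 0.0
        for k, d in enumerate(ranking, start=1):
            if d in rel:
                hits += 1
                total += hits / k
        return total / len(rel) if rel else 0.0

    ranking = ["d3", "d1", "d7", "d2", "d9"]
    relevant = {"d1", "d2", "d4"}
    print(precision_at_k(ranking, relevant, 3))   # 1/3 ~ 0.33
    print(average_precision(ranking, relevant))   # (1/2 + 2/4) / 3 = 1/3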


Thank You