CS458 Natural Language Processing
Lecture 15
Information Retrieval
Krishnendu Ghosh
Department of Computer Science & Engineering
Indian Institute of Information Technology Dharwad
IR
Information Retrieval (IR)
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
These days we frequently think first of web search, but there are many other cases, such as e-mail search, desktop search, and corporate knowledge bases.
Text: Unstructured vs Structured (Mid-90s)
Text: Unstructured vs Structured (2000s)
Classic Search Model
[Diagram: the user's query is sent to the search engine, which matches it against the document repository and returns results; if the results do not satisfy the information need, the user performs query reformulation and searches again.]
Information Retrieval: Why?
Information Retrieval: Why not RDBMS?
Information Retrieval: Transform
IR Components
Information Retrieval: Core Concepts
Information Retrieval: Applications
IR Applications: Text Categorization
IR Applications: Document Clustering
IR Applications: Content Based Filtering
IR Applications: Collaborative Filtering
IR Applications: Information Extraction
IR Applications: Question Answering
IR Applications: Web Search
IR Applications: Federated Search
IR Applications: Expertise Search
IR Applications: Citation/Link Analysis
IR Applications: Multimedia Retrieval
IR Applications: Information Visualization
Boolean IR
Term-Document Incidence Matrices
Unstructured Data in 1620
Incidence Vectors
1 if play contains word, 0 otherwise
Brutus AND Caesar BUT NOT Calpurnia
Information Retrieval
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then do a bitwise AND.
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.
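The bitwise-AND query above can be sketched with Python integers used as bit vectors. The incidence bits below are illustrative toy values, not the actual Shakespeare matrix:

```python
# Hypothetical incidence rows over 6 plays: bit i of each integer
# says whether the term occurs in play i. Toy values for illustration.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

n_plays = 6
mask = (1 << n_plays) - 1  # keep the complement within 6 bits

# Brutus AND Caesar BUT NOT Calpurnia:
# complement Calpurnia's vector, then bitwise-AND all three.
answer = brutus & caesar & (~calpurnia & mask)

# Recover the matching play indices from the result vector.
matching_plays = [i for i in range(n_plays) if answer >> i & 1]
print(matching_plays)  # [2, 5]
```

With these toy vectors, plays 2 and 5 contain Brutus and Caesar but not Calpurnia.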
Bigger Collections
Consider N = 1 million documents, each with about 1000 words.
Avg 6 bytes/word including spaces/punctuation
6GB of data in the documents.
Say there are M = 500K distinct terms among these.
Can’t build the matrix
500K x 1M matrix has half-a-trillion 0’s and 1’s.
But it has no more than one billion 1’s.
The matrix is extremely sparse.
What’s a better representation?
We only record the 1 positions.
Why?
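The sparsity argument above is simple arithmetic:

```python
N_docs = 1_000_000        # documents in the collection
M_terms = 500_000         # distinct terms
words_per_doc = 1_000     # average document length

cells = M_terms * N_docs            # entries in the full 0/1 matrix
max_ones = N_docs * words_per_doc   # at most one 1 per (doc, token)

print(cells)             # 500000000000 -> half a trillion cells
print(max_ones)          # 1000000000   -> at most one billion 1's
print(max_ones / cells)  # 0.002        -> at least 99.8% of entries are 0
```

Storing only the 1 positions (the inverted index) avoids materializing the other 99.8%+ of the matrix.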
Inverted Index
For each term t, we must store a list of all documents that contain t.
Identify each doc by a docID, a document serial number.
Inverted Index
We need variable-size postings lists
On disk, a continuous run of postings is normal and best
In memory, can use linked lists or variable length arrays
Some tradeoffs in size/ease of insertion
Inverted Index Construction
Initial Stages of Text Processing
Indexing: Token Sequence
Indexing: Sorting
Indexing: Creating Dictionary & Postings
Indexing: Costs
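The token-sequence, sorting, and dictionary-and-postings stages above can be sketched as follows (the toy documents are hypothetical, not from the lecture):

```python
from collections import defaultdict

# Toy collection: docID -> text (illustrative examples).
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# Stage 1: token sequence -- (term, docID) pairs in document order.
pairs = [(tok, doc_id) for doc_id, text in docs.items()
         for tok in text.split()]

# Stage 2: sort by term, then docID.
pairs.sort()

# Stage 3: dictionary and postings -- merge duplicates so each term
# maps to a sorted, deduplicated postings list of docIDs.
index = defaultdict(list)
for term, doc_id in pairs:
    if not index[term] or index[term][-1] != doc_id:
        index[term].append(doc_id)

print(index["home"])  # [1, 2, 3]
print(index["july"])  # [2, 3]
```

A real indexer would also keep the document frequency with each dictionary entry and write the postings to disk as contiguous runs.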
Faster Postings List Intersection via Skip Pointers
Algorithm: Skip Pointers
Faster Postings List Intersection via Skip Pointers
If the list lengths are m and n, the intersection takes O(m + n) operations. Can we do better than this?
One way to do this is to use a skip list by augmenting postings lists with skip pointers:
Faster Postings List Intersection via Skip Pointers
We have a two-word query. For one term, the postings list consists of the following 16 entries:
4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 122, 157, 180
and for the other, it is the one-entry postings list:
47
Work out how many comparisons would be done to intersect the two postings lists with the following two strategies.
Faster Postings List Intersection via Skip Pointers
Applying MERGE to the standard postings lists, comparisons are made until one of the lists is exhausted, i.e., until we reach 47 in the longer list; at that point the one-entry list is exhausted and no further processing is needed.
Number of comparisons = 11
Using skip pointers of length 4 on the longer list and of length 1 on the shorter list, the following comparisons are made:
1. 4 & 47
2. 14 & 47
3. 22 & 47
4. 120 & 47
5. 81 & 47
6. 47 & 47
Number of comparisons = 6
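The standard MERGE strategy from the exercise can be sketched as follows; counting one docID-vs-docID comparison per loop step reproduces the 11 comparisons computed above:

```python
def intersect(p1, p2):
    """Merge-intersect two sorted postings lists, counting comparisons."""
    i = j = comparisons = 0
    result = []
    while i < len(p1) and j < len(p2):
        comparisons += 1  # one docID pair is examined per step
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the list with the smaller docID
        else:
            j += 1
    return result, comparisons

long_list = [4, 6, 10, 12, 14, 16, 18, 20, 22, 32,
             47, 81, 120, 122, 157, 180]
short_list = [47]
print(intersect(long_list, short_list))  # ([47], 11)
```

A skip-pointer version would additionally check, before each linear step, whether the skip target is still no larger than the other list's current docID, jumping ahead when it is.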
IR Summary
IR System
IR Components
IR Issue: Lexical Gap
IR Models
Boolean Retrieval
Vector Space Model (VSM) Retrieval
Language Model (LM) based Probabilistic Retrieval
Translation Model
IR Scoring
Score document d by the cosine of its vector d with the query vector q:
score(q, d) = cos(q, d) = (q · d) / (|q| |d|)
Another way to think of the cosine computation is as the dot product of unit vectors: we first normalize both the query and document vectors to unit length by dividing each by its length, and then take the dot product:
score(q, d) = (q / |q|) · (d / |d|)
IR Scoring
We can spell this out using the tf-idf values, writing the dot product as a sum of products over terms t:
score(q, d) = Σ_t [ tf-idf(t, q) / |q| ] × [ tf-idf(t, d) / |d| ]
IR Scoring
Document 1 has a higher cosine with the query (0.747) than Document 2 (0.0779), so the tf-idf cosine model ranks Document 1 above Document 2. This ranking is intuitive in the vector space model, since Document 1 contains both query terms, including two instances of sweet, while Document 2 is missing one of the terms.
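A minimal sketch of cosine ranking over raw term-frequency vectors (a toy query and documents, not the lecture's example; idf weighting is omitted for brevity):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Toy example: doc1 contains both query terms ('sweet' twice),
# doc2 is missing 'love'.
query = Counter("sweet love".split())
doc1 = Counter("sweet sweet love".split())
doc2 = Counter("sweet sorrow".split())

s1, s2 = cosine(query, doc1), cosine(query, doc2)
print(s1 > s2)  # True: doc1 shares both query terms, so it ranks higher
```

As in the lecture's example, the document containing all query terms (with a repeated term) receives the higher cosine score.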
IR Scoring
IR System Performance
IR Metrics
IR System Performance
Thank You