CS 589 Text Mining and Information Retrieval
Mon 6:30-9:00
All zoom links in Canvas
photo: https://www.scubedstudios.com/information-retrieval/
Instructor: Susan Liu (OH: Mon 3:30-5:30, Discord by default)
CA: Gaoyi Wu (OH: Tue 10:00-12:00, Discord by default)
CS589 Lecture 1: Overview
Where to find everything
Prerequisite
Grading calculator
Late policy
Plagiarism policy
Midterm/Final Exam: in class
If you are interested in research with me
Ask all questions in Discord
Canvas & GradeScope
Pop quiz on Python
Feedback from previous years
Lecture 1: Basics about information retrieval
Information Retrieval (navigational)
query: “first president of the united states”
Information need: name of the first US president
Information Retrieval (exploratory)
student: “I need a book on American history for my thesis research!”
librarian: "What's the topic of your thesis?"
student: "civil war"
Information need: to learn about the history of the Civil War
Information Retrieval (conversational agent)
Information Retrieval (Database)
Information Retrieval (Vector Database)
Information Retrieval (RAG)
Information Retrieval
the way from Times Square to Gateway South
the US News ranking of SIT
where to go for Thanksgiving break
navigational
exploratory
Information Retrieval: Google
How does Google return results so quickly?
How does Google know cs 589 refers to a course?
How does Google know stevens = SIT?
Information Retrieval Techniques
“Because the systems that are accessible today are so easy to use, it is tempting to think the technology behind them is similarly straightforward to build. This review has shown that the route to creating successful IR systems required much innovation and thought over a long period of time.”
— The history of Information Retrieval Research, Mark Sanderson and Bruce Croft
Information Retrieval Techniques
"cs 589 stevens"
retrieval models (Lectures 1, 2, 5)
Information Retrieval Techniques
Evaluation (Lecture 3)
"cs 589 stevens"
Information Retrieval Techniques
Inverted index (Lecture 4)
50 billion pages on the web
"cs 589 stevens"
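Lecture 4 covers the inverted index in depth; as a preview, a minimal sketch of the idea, using a tiny made-up corpus (documents and tokenization are toy assumptions):

```python
from collections import defaultdict

# Toy corpus (hypothetical documents, for illustration only).
docs = {
    0: "cs 589 text mining and information retrieval",
    1: "stevens institute of technology course catalog",
    2: "cs 589 stevens course page",
}

# Build the inverted index: term -> sorted postings list of doc ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)
postings = {term: sorted(ids) for term, ids in index.items()}

# Answering "cs 589 stevens" means intersecting three short postings
# lists instead of scanning every page on the web.
result = set(postings["cs"]) & set(postings["589"]) & set(postings["stevens"])
print(sorted(result))  # -> [2]: only doc 2 contains all three terms
```

The key point is that query time scales with the length of the postings lists for the query terms, not with the size of the whole collection.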
A Brief History of IR
300 BC: Callimachus: the first library catalog of books
1880s: punch cards, searching at 600 cards/min
1950s: building IR systems on computers; word-based indexing
1958: Cranfield evaluation methodology
1960s: relevance feedback
1970s: TF-IDF; probability ranking principle
1980s: latent semantic indexing
1990 - now: TREC; learning to rank; web search; supporting natural language queries
A Brief History of IR
…
1997: LSTM
2013: word2vec
2017: Transformers
2020: GPT-3
The Boolean retrieval system
The Cranfield experiment (1958)
keyword system
The Cranfield experiment (1958)
computer science
artificial intelligence
bioinformatics
query = “subject = AI & subject = bioinformatics”
system 1: the Boolean retrieval system
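The Boolean query above can be sketched as set operations over subject tags; a minimal illustration (the catalog entries here are hypothetical):

```python
# Each document is tagged with a set of subject keywords,
# mirroring the slide's catalog example (labels are made up).
catalog = {
    "doc1": {"computer science", "artificial intelligence"},
    "doc2": {"artificial intelligence", "bioinformatics"},
    "doc3": {"bioinformatics"},
}

def boolean_and(subjects, catalog):
    """Return documents whose subject set contains ALL query subjects."""
    return sorted(d for d, subj in catalog.items() if subjects <= subj)

# query = "subject = AI & subject = bioinformatics"
hits = boolean_and({"artificial intelligence", "bioinformatics"}, catalog)
print(hits)  # -> ['doc2']: the only document matching both subjects
```

Boolean retrieval either matches or it doesn't: there is no notion of one matching document being more relevant than another, which is exactly the limitation ranked retrieval models address.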
On the history of IR: https://journals.sagepub.com/doi/abs/10.1177/0165551507086989?journalCode=jisb
The Cranfield experiment (1958)
system 2: indexing documents by lists of words
query = “the artificial intelligence”
bag-of-words representation
Document-term matrix
|       | intelligence | book | the | cat | artificial | dog | business |
| Doc1  | 0 | 1 | 3 | 1 | 0 | 1 | 0 |
| Doc2  | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| query | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
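A document-term matrix like the one above can be built in a few lines; a sketch with a toy two-document corpus (the texts are invented so that the counts line up with the table):

```python
# Build a document-term count matrix from raw text (bag of words).
docs = [
    "the book the cat the dog",  # Doc1
    "intelligence business",     # Doc2
]
query = "the artificial intelligence"

# Vocabulary = every word seen in the documents or the query.
vocab = sorted({w for d in docs + [query] for w in d.split()})

def tf_vector(text):
    """Term-frequency vector of `text` over the shared vocabulary."""
    words = text.split()
    return [words.count(term) for term in vocab]

matrix = [tf_vector(d) for d in docs]
print(vocab)
print(matrix)
print(tf_vector(query))
```

Word order is discarded: "the cat the dog" and "the dog the cat" produce identical vectors, which is what "bag of words" means.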
How to build an accurate retrieval model based on word indexing?
How does Google return results so quickly?
How does Google know cs 589 refers to a course?
How does Google know stevens = SIT?
Word indexing: vector-space model
Term frequency
[0, 1, 3, 1, 0, 1, 0, 0]
[1, 0, 1, 0, 0, 0, 1, 0]
[0, 0, 1, 0, 1, 0, 0, 1]
Document-term matrix
|       | intelligence | book | the | cat | artificial | dog | business | world |
| Doc1  | 0 | 1 | 3 | 1 | 0 | 1 | 0 | 0 |
| Doc2  | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| Doc3  | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| query | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
Vector space model
Document-term matrix
|       | intelligence | book | the | cat | artificial | dog | business | world |
| Doc1  | 0 | 1 | 3 | 1 | 0 | 1 | 0 | 0 |
| Doc2  | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| Doc3  | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| query | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
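In the vector space model, documents are ranked by cosine similarity to the query vector; a sketch using the TF vectors from the matrix above (column order: intelligence, book, the, cat, artificial, dog, business, world):

```python
import math

# TF vectors copied from the document-term matrix above.
doc_vectors = {
    "Doc1": [0, 1, 3, 1, 0, 1, 0, 0],
    "Doc2": [1, 0, 1, 0, 0, 0, 1, 0],
    "Doc3": [0, 0, 1, 0, 1, 0, 0, 1],
}
query = [1, 1, 1, 0, 1, 0, 0, 0]

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

scores = {d: cosine(query, v) for d, v in doc_vectors.items()}
print(scores)
```

Notably, all three documents come out with the same score (about 0.577) on this example, even though they match very different query words: raw term frequency alone cannot tell an "artificial intelligence" match apart from a match on "the" and "book", which motivates the IDF weighting introduced next.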
TF-only representations are inaccurate
Zipf’s law distribution of words
Stop words in English
Inverse-document frequency
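Inverse-document frequency downweights words that appear in nearly every document (stop words like "the") and boosts rare, informative words. A minimal sketch of one common variant, idf(t) = log(N / df(t)), on a toy collection (other variants add smoothing terms):

```python
import math

# Toy collection of three documents, each represented as its word set.
docs = [
    {"the", "cat", "book", "dog"},
    {"the", "intelligence", "business"},
    {"the", "artificial", "world"},
]
N = len(docs)

def idf(term):
    """log(N / df): 0 for a term in every doc, large for a rare term."""
    df = sum(term in d for d in docs)  # document frequency
    return math.log(N / df)

print(idf("the"))         # in all 3 docs -> log(3/3) = 0.0
print(idf("artificial"))  # in 1 doc -> log(3/1), about 1.099
```

Multiplying TF by IDF gives the TF-IDF weight: a term contributes a lot only if it is frequent in this document and rare in the collection.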
Vector-space model is not perfect
Informative words in long documents are diluted
vector = [..., 1/sqrt(21), …]
vector = [..., 1/sqrt(4), …]
Retrieving short documents vs. retrieving long documents
Short document: the, a, business, intelligence
Long document: the, a, business, AI, bot, ML, learn, NLP
Desiderata for a good retrieval model for word-indexing
Informative words in long documents are diluted
d2 is not twice as relevant as d1
Rule 3: Term-frequency reweighting
Log-scale normalization
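Log-scale TF reweighting makes term counts sub-linear, so a document repeating a query word 20 times is not scored 20 times higher than one mentioning it once. A sketch of one common variant, tf' = 1 + log(tf) for tf > 0:

```python
import math

def log_tf(tf):
    """Sub-linear TF scaling: 1 + ln(tf) for tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

# 20 raw occurrences end up worth only about 4x a single occurrence.
for tf in [0, 1, 2, 10, 20]:
    print(tf, round(log_tf(tf), 3))
```

Other sub-linear choices (e.g., the saturating tf / (tf + k) form used in BM25) serve the same purpose: diminishing returns for repeated occurrences.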
Rule 1: Finding appropriate norm
query = "news about presidential campaign"

Case 1 (over-penalization): d1 (100 words) contains "news of presidential campaign" and "presidential candidate"; d2 (5,000 words) mentions "campaign", "news", and "presidential" scattered among thousands of other words; true relevance: d1 > d2.

Case 2 (under-penalization): d1 contains "news of presidential campaign" and "presidential candidate" followed by long stretches of unrelated text; d2 compactly contains "presidential campaign news" and "presidential candidate"; true relevance: d2 > d1.

Need normalization between 1 and |d|
Document length pivoting [Singhal 1995]
normalization = denominator
[Plot: TF-IDF retrieval score vs. true probability of relevance as a function of document length]
Pivoted Document Length Normalization: http://singhal.info/pivoted-dln.pdf
Document length pivoting
standard formulation for doc length normalization
fixing pivot to avgdl
How large should the denominator be?
[Plot: the length-normalization factor as a function of dl; it equals 1 - b at dl = 0 and 1 at dl = avgdl]
Document length pivoting
https://en.wikipedia.org/wiki/Okapi_BM25
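Putting the pieces together (saturating TF, smoothed IDF, pivoted length normalization) gives Okapi BM25. A minimal sketch of the standard formula on a toy corpus (documents and whitespace tokenization are illustrative assumptions):

```python
import math

docs = [
    "cs 589 information retrieval at stevens",
    "stevens institute of technology",
    "introduction to information retrieval and text mining",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N

def idf(term):
    """BM25's smoothed IDF; stays non-negative for all df."""
    df = sum(term in d for d in tokenized)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query, doc, k1=1.2, b=0.75):
    score = 0.0
    dl = len(doc)
    for term in query.split():
        tf = doc.count(term)
        # Saturating TF divided by the pivoted length normalizer.
        denom = tf + k1 * (1 - b + b * dl / avgdl)
        score += idf(term) * tf * (k1 + 1) / denom
    return score

scores = [bm25("information retrieval stevens", d) for d in tokenized]
best = max(range(N), key=lambda i: scores[i])
print(best, [round(s, 3) for s in scores])  # doc 0 matches all three terms
```

k1 controls how quickly repeated occurrences saturate and b controls the strength of length normalization; k1 in [1.2, 2.0] and b = 0.75 are the commonly cited defaults.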
Summary
More on retrieval model design heuristics
IR != web search
Syllabus