CS 589 Text Mining and Information Retrieval
Tue 6:30-9:00
All zoom links in Canvas
photo: https://www.scubedstudios.com/information-retrieval/
Instructor: Susan Liu
OH: Tue 2:30-4:30, Discord
CA: Gaoyi Wu
OH: Wed 9:00am-11:00am, GS347 (Discord by default)
CS589 Lecture 1: Overview
Where to find everything
Prerequisite
Grading calculator
Late policy
Plagiarism policy
Curve and fail
Feedback from previous years
Midterm/Final Exam: in class
If you are interested in research with me
Ask all questions in Discord
Canvas & Gradescope
Accommodation: 1.5x time
Pop quiz on Python
Lecture 1: Basics about information retrieval
Information Retrieval (navigational)
query: “first president of the united states”
Information need: name of the first US president
Information Retrieval (exploratory)
student: “I need a book on American history for my thesis research!”
librarian: "What's the topic of your thesis?"
student: "civil war"
Information need: to study the history of the Civil War
Information Retrieval (conversational agent)
Information Retrieval (Database)
Information Retrieval (Vector Database)
Information Retrieval (RAG)
Information Retrieval
the way from Times Square to Gateway South
the US News ranking of SIT
where to go for Thanksgiving break
navigational
exploratory
Information Retrieval: Google
How does Google return results so quickly?
How does Google know cs 589 refers to a course?
How does Google know stevens = SIT?
Information Retrieval Techniques
“Because the systems that are accessible today are so easy to use, it is tempting to think the technology behind them is similarly straightforward to build. This review has shown that the route to creating successful IR systems required much innovation and thought over a long period of time.”
— The History of Information Retrieval Research, Mark Sanderson and Bruce Croft
Information Retrieval Techniques
"cs 589 stevens"
retrieval models (Lecture 1,2,5)
Information Retrieval Techniques
Evaluation (Lecture 3)
"cs 589 stevens"
Information Retrieval Techniques
Inverted index (Lecture 4)
50 billion pages on the web
"cs 589 stevens"
A Brief History of IR
300 BC: Callimachus, the first library catalog of books
1880s: punch cards, searching at 600 cards/min
1950s: building IR systems on computers
1958: Cranfield evaluation methodology; word-based indexing
1960s: relevance feedback
1970s: TF-IDF; probability ranking principle
1980s: latent semantic indexing
1990 - now: TREC; learning to rank; web search; supporting natural language queries
A Brief History of IR
…
1997: LSTM
2013: word2vec
2017: Transformers
2020: GPT-3
The Boolean retrieval system
The Cranfield experiment (1958)
keyword system
The Cranfield experiment (1958)
computer science
artificial intelligence
bioinformatics
query = “subject = AI & subject = bioinformatics”
system 1: the Boolean retrieval system
On the history of IR: https://journals.sagepub.com/doi/abs/10.1177/0165551507086989?journalCode=jisb
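A minimal sketch of how a Boolean system answers such a query, assuming each document is indexed by a set of subject keywords (the collection and document ids below are hypothetical, for illustration only):

```python
# Boolean retrieval: each document is indexed by a set of subject
# keywords; a conjunctive query returns documents carrying ALL of them.
# Hypothetical toy collection:
docs = {
    "doc1": {"computer science", "artificial intelligence"},
    "doc2": {"artificial intelligence", "bioinformatics"},
    "doc3": {"bioinformatics"},
}

def boolean_and(index, required_subjects):
    """Return ids of documents indexed with every required subject."""
    return sorted(doc_id for doc_id, subjects in index.items()
                  if required_subjects <= subjects)

# query = "subject = AI & subject = bioinformatics"
print(boolean_and(docs, {"artificial intelligence", "bioinformatics"}))  # ['doc2']
```

The set-containment test `required_subjects <= subjects` is exactly the Boolean AND; OR and NOT would be set union and difference over the matching id sets.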
The Cranfield experiment (1958)
system 2: indexing documents by lists of words
query = “the artificial intelligence”
bag-of-words representation
Document-term matrix
|       | intelligence | book | the | cat | artificial | dog | business |
| Doc1  | 0 | 1 | 3 | 1 | 0 | 1 | 0 |
| Doc2  | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| query | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
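Building such a document-term matrix is mechanical; a minimal sketch over a hypothetical toy corpus (the vocabulary and counts below are illustrative and do not match the slide's matrix):

```python
from collections import Counter

# Bag-of-words: each document becomes a vector of term counts over a
# shared vocabulary. Toy corpus (hypothetical); the last string doubles
# as a query, which is represented exactly like a document.
docs = ["the cat the dog the book",
        "intelligence business",
        "the artificial intelligence"]

vocab = sorted({w for d in docs for w in d.split()})

def term_frequency_vector(text, vocab):
    """Count each vocabulary word's occurrences in the text."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

matrix = [term_frequency_vector(d, vocab) for d in docs]
```

Word order is discarded entirely: only the counts per vocabulary slot survive, which is what "bag of words" means.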
How to build an accurate retrieval model based on word indexing?
How does Google return results so quickly?
How does Google know cs 589 refers to a course?
How does Google know stevens = SIT?
Word indexing: vector-space model
Term frequency
[0, 1, 3, 1, 0, 1, 0, 0]
[1, 0, 1, 0, 0, 0, 1, 0]
[0, 0, 1, 0, 1, 0, 0, 1]
Document-term matrix
|       | intelligence | book | the | cat | artificial | dog | business | world |
| Doc1  | 0 | 1 | 3 | 1 | 0 | 1 | 0 | 0 |
| Doc2  | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| Doc3  | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| query | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
Vector-space model
Document-term matrix
|       | intelligence | book | the | cat | artificial | dog | business | world |
| Doc1  | 0 | 1 | 3 | 1 | 0 | 1 | 0 | 0 |
| Doc2  | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| Doc3  | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| query | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
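The vector-space model ranks documents by the cosine of the angle between each document vector and the query vector. A sketch using the rows of the matrix above:

```python
import math

# Cosine similarity over raw term-frequency vectors, with the columns
# intelligence, book, the, cat, artificial, dog, business, world.
doc1  = [0, 1, 3, 1, 0, 1, 0, 0]
doc2  = [1, 0, 1, 0, 0, 0, 1, 0]
doc3  = [0, 0, 1, 0, 1, 0, 0, 1]
query = [1, 1, 1, 0, 1, 0, 0, 0]

def cosine(u, v):
    """Dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

for name, d in [("doc1", doc1), ("doc2", doc2), ("doc3", doc3)]:
    print(name, round(cosine(query, d), 3))
```

On this toy matrix all three cosines come out equal (1/√3 ≈ 0.577): a match on the uninformative word "the" counts as much as a match on "artificial" or "intelligence", which is precisely the weakness that IDF weighting addresses.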
TF-only representations are inaccurate
Zipf’s law distribution of words
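Zipf's law says the r-th most frequent word occurs with frequency roughly proportional to 1/r. A sketch of that prediction (the top count below is a hypothetical figure, not measured from a corpus):

```python
# Zipf's law: count(rank r) ~ count(rank 1) / r**s, with s ~ 1 for
# natural language. Given the most frequent word's count, we can predict
# the counts at the following ranks.
def zipf_expected_count(top_count, rank, s=1.0):
    """Predicted count of the word at a given rank under Zipf's law."""
    return top_count / rank ** s

top = 1_000_000  # hypothetical count of the most frequent word, e.g. "the"
print([round(zipf_expected_count(top, r)) for r in (1, 2, 3, 10)])
# rank 2 gets about half of rank 1, rank 3 about a third, and so on
```

The heavy head of this distribution is why a handful of stop words dominates raw term-frequency vectors.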
Stop words in English
Inverse-document frequency
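A sketch of IDF using the common idf(t) = log(N / df(t)) form (smoothed variants exist; the toy collection below is hypothetical):

```python
import math

# Inverse-document frequency: rare words get high weight; a word that
# appears in every document gets weight 0.
docs = [{"the", "cat", "book", "dog"},
        {"the", "intelligence", "business"},
        {"the", "artificial", "world"}]

def idf(term, docs):
    """idf(t) = log(N / df(t)), with 0 for unseen terms."""
    df = sum(term in d for d in docs)
    return math.log(len(docs) / df) if df else 0.0

print(round(idf("the", docs), 3))         # in every document -> 0.0
print(round(idf("artificial", docs), 3))  # in one of three -> log(3) ~ 1.099
```

Multiplying TF by IDF is what lets "artificial" outvote "the" in the query-document score.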
Vector-space model is not perfect
Informative words in long documents are diluted
vector = [..., 1/sqrt(21), …]
vector = [..., 1/sqrt(4), …]
Retrieving short documents vs. retrieving long documents
Short document: the, a, business, intelligence
Long document: the, a, business, AI, bot, ML, learn, NLP
Desiderata for a good retrieval model for word-indexing
Informative words in long documents are diluted
d2 is not twice as relevant as d1
Rule 3: Term-frequency reweighting
Log scale normalization
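A sketch of the log-scale reweighting, assuming the common wf = 1 + log(tf) form (other damping functions exist):

```python
import math

# Sublinear term-frequency scaling: a document with tf = 20 should not
# count twice as much as one with tf = 10.
def log_tf(tf):
    """wf = 1 + log(tf) for tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

for tf in (1, 10, 20):
    print(tf, round(log_tf(tf), 3))
```

tf = 20 now scores about 4.0 versus about 3.3 for tf = 10, rather than twice as much.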
Rule 1: Finding an appropriate norm
query = "news about presidential campaign"

Under-penalization (no length normalization):
d1 (100 words): ...news of presidential campaign... ...presidential candidate...
d2 (5,000 words): ...campaign... ...campaign... ...news... ...news... ...presidential... ...presidential...
true relevance: d1 > d2

Over-penalization (normalizing by the full length |d|):
d1 (short): ...news of presidential campaign... ...presidential candidate...
d2 (long): ...presidential campaign news... ...presidential candidate... ...plus much more text...
true relevance: d2 > d1

Need normalization between 1 and |d|
Document length pivoting [Singhal 1995]
normalization = denominator
Document length pivoting [Singhal 1995]
[Figure: TF-IDF score vs. true relevance across document lengths]
Pivoted Document Length Normalization: http://singhal.info/pivoted-dln.pdf
Document length pivoting
Document length pivoting
standard formulation for doc length normalization
fixing pivot to avgdl
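The standard formulation with the pivot fixed to avgdl can be sketched as follows (b = 0.75 is only an illustrative choice of the slope parameter):

```python
# Pivoted document-length normalization with the pivot at the average
# document length: norm(d) = (1 - b) + b * dl / avgdl.
# At dl = avgdl the normalizer is exactly 1; b in [0, 1] sets the slope.
def pivoted_norm(dl, avgdl, b=0.75):
    return (1.0 - b) + b * dl / avgdl

avgdl = 500  # hypothetical average document length
print(pivoted_norm(500, avgdl))   # 1.0 at the pivot
print(pivoted_norm(100, avgdl))   # < 1: short documents boosted
print(pivoted_norm(5000, avgdl))  # > 1: long documents penalized, but less than dl/avgdl
```

Dividing the TF-IDF score by this quantity interpolates between no normalization (b = 0) and full linear normalization (b = 1), the "between 1 and |d|" requirement from the previous slide.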
How large should the denominator be?
[Plot: normalization vs. document length dl; the line equals 1 at dl = avgdl, with intercept 1-b]
Document length pivoting
https://en.wikipedia.org/wiki/Okapi_BM25
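A compact sketch of the BM25 scoring function from the linked article (k1 = 1.2 and b = 0.75 are commonly cited defaults; the toy documents and query are hypothetical):

```python
import math

# BM25: score(q, d) = sum over query terms t of
#   idf(t) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * |d| / avgdl))
# using the smoothed idf variant idf(t) = log(1 + (N - df + 0.5) / (df + 0.5)).
def bm25_score(query_terms, doc_terms, all_docs, k1=1.2, b=0.75):
    avgdl = sum(len(d) for d in all_docs) / len(all_docs)
    n = len(all_docs)
    score = 0.0
    for t in query_terms:
        df = sum(t in d for d in all_docs)
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(t)
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

# Hypothetical toy collection and query:
docs = [["cs", "589", "stevens", "text", "mining"],
        ["stevens", "campus", "map"],
        ["text", "mining", "course"]]
query = ["cs", "589", "stevens"]
scores = [bm25_score(query, d, docs) for d in docs]
```

Note how BM25 combines all three rules from the lecture: the IDF factor, the saturating (sublinear) TF term, and the pivoted length normalization inside the denominator.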
Summary
More on retrieval model design heuristics
IR != web search
Syllabus