1 of 67


CS 589 Text Mining and Information Retrieval

Tue 6:30-9:00

All zoom links in Canvas

photo: https://www.scubedstudios.com/information-retrieval/

2 of 67


Instructor: Susan Liu
  OH: Tue 2:30-4:30, Discord

CA: Gaoyi Wu
  OH: Wed 9:00am-11:00am, GS347 (Discord by default)

3 of 67

CS589 Lecture 1: Overview

  • Logistics
    • Prerequisites
    • Final grade calculation
    • Q & A

  • Lecture 1
    • What is information retrieval?
    • History of IR
    • Early IR models
    • TF-IDF, vector space model

4 of 67

Where to find everything


5 of 67

Prerequisite

  • Recommended: A good knowledge of statistics and probability

  • Required: Being comfortable with Python programming (pop quiz today)

  • Undergrad: CS 115, CS 284, MA 222 (Probability & Statistics), MA 232 (Linear Algebra)

  • Contact the instructor if you aren’t sure

6 of 67

Grade calculation

  • 4 assignments - 40% (Python, Colab, ElasticSearch)

  • Midterm - 25%

  • Pop quiz (Attendance) - 5%

  • Final Exam - 30%

  • Answering questions in class - bonus

7 of 67

Late policy

  • Submit within 24 hours of the deadline - 90%
  • Within 48 hours - 70%
  • Within 5 days - 50%
  • Later than 5 days - 0
  • Code that does not compile - 0

8 of 67

Plagiarism policy

  • We have a very powerful plagiarism detection system; do not take the risk

  • Cheating case in CS284
    • A student put all his homework in a public GitHub repo
    • In the end, we found that 8+ students had copied his code

9 of 67

Curve and fail

  • Tips for avoiding failing the class:
    • Keep an eye on your grade
    • Make sure you submit the homework on time
    • Do not plagiarize from other people, do not use AI to generate the code, and do not collaborate on homework
    • Attend every lecture on time; if you are late and miss the sign-in, let me know

10 of 67

Feedback from previous years

  • Failures (< 70%)
  • F25: 3 out of 30
  • F24: 0%
  • S24: 0%
  • F23: 5% (changed to final exam)
  • S23: 9%
  • F22: 9.5% (final project plagiarism detected)

11 of 67

Midterm/Final Exam

  • In class

12 of 67

If you are interested in research with me

  • NLP for software engineering

  • NLP for security

  • Check out my research at https://liususan091219.github.io/ and email me

13 of 67

Ask all questions in Discord

14 of 67

Canvas & GradeScope

  • All homework (coding) is graded in Canvas
  • All exams are graded on paper; rubrics will be released, and you can check your papers during office hours

15 of 67

Accommodation: 1.5x time

16 of 67

Pop quiz on Python


17 of 67


Lecture 1: Basics about information retrieval

Tue 6:30-9:00

All zoom links in Canvas

photo: https://www.scubedstudios.com/information-retrieval/

18 of 67

Information Retrieval (navigational)

query: “first president of the united states”


Information need: name of the first US president

19 of 67

Information Retrieval (exploratory)

student: “I need a book on American history for my thesis research!”

librarian: "What's the topic of your thesis?"

student: "civil war"


Information need: to study the history of the Civil War

20 of 67

Information Retrieval (conversational agent)


21 of 67

Information Retrieval (conversational agent)


22 of 67

Information Retrieval (Database)


23 of 67

Information retrieval (Vector Database)


24 of 67

Information Retrieval (RAG)


25 of 67

Information Retrieval

the way from Times Square to Gateway South

the US News ranking of SIT

where to go for Thanksgiving break


navigational

exploratory

26 of 67

Information Retrieval: Google


How does Google return results so quickly?

How does Google know cs 589 refers to a course?

How does Google know stevens = SIT?

27 of 67

Information Retrieval Techniques

Because the systems that are accessible today are so easy to use, it is tempting to think the technology behind them is similarly straightforward to build. This review has shown that the route to creating successful IR systems required much innovation and thought over a long period of time.

The history of Information Retrieval Research, Mark Sanderson and Bruce Croft

28 of 67

Information Retrieval Techniques


"cs 589 stevens"

retrieval models (Lectures 1, 2, 5)

29 of 67

Information Retrieval Techniques


Evaluation (Lecture 3)

"cs 589 stevens"

30 of 67

Information Retrieval Techniques


Inverted index (Lecture 4)

50 billion pages on the web

"cs 589 stevens"

31 of 67

A Brief History of IR

  • 300 BC: Callimachus compiles the first library catalog of books
  • 1880s: punch cards, searching at 600 cards/min
  • 1950s: building IR systems on computers
  • 1958: Cranfield evaluation methodology; word-based indexing
  • 1960s: relevance feedback
  • 1970s: TF-IDF; probability ranking principle
  • 1980s: latent semantic indexing
  • 1990 - now: web search; TREC; learning to rank; supporting natural language queries

32 of 67

A Brief History of IR

  • 1997: LSTM
  • 2015: word2vec
  • 2017: Transformers
  • 2020: GPT-3

33 of 67

The Boolean retrieval system


34 of 67

The Boolean retrieval system

  • e.g., SELECT * FROM table_computer WHERE price < $500 AND brand = “Dell”
  • The primary commercial retrieval system for three decades
  • Many systems today still use Boolean retrieval, e.g., eCommerce search (a small sketch follows this list)

  • Advantage: returns exactly what you want

  • Disadvantages:
    • queries can only be specified over pre-defined categories
    • works primarily for products with attributes
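A minimal sketch of Boolean (AND) retrieval over a toy inverted index; the product data and clause names below are made up for illustration, not taken from the slide.

    # Each clause maps to the set of document ids that satisfy it;
    # Boolean AND is just set intersection over those posting sets.
    postings = {
        ("brand", "Dell"): {1, 4, 7},
        ("price", "<500"): {2, 4, 9},
    }

    def boolean_and(*sets):
        # A document must appear in every clause's posting set.
        result = sets[0]
        for s in sets[1:]:
            result = result & s
        return result

    # SELECT * FROM table_computer WHERE price < $500 AND brand = "Dell"
    print(boolean_and(postings[("brand", "Dell")], postings[("price", "<500")]))  # {4}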


35 of 67

The Cranfield experiment (1958)

  • Simple systems based on keywords appeared to work just as well as complex classificatory schemes (Wikipedia)

  • Keyword systems are much easier to implement than the other systems

keyword system

36 of 67

The Cranfield experiment (1958)

  • Imagine you need to help users search for literature in a digital library. How would you design such a system?

system 1: the Boolean retrieval system
  • pre-defined subject categories: computer science, artificial intelligence, bioinformatics
  • query = “subject = AI & subject = bioinformatics”

On the history of IR: https://journals.sagepub.com/doi/abs/10.1177/0165551507086989?journalCode=jisb

37 of 67

The Cranfield experiment (1958)

  • Imagine you need to help users search for literature in a digital library. How would you design such a system?

system 2: indexing documents by lists of words (bags-of-words representation)

query = “the artificial intelligence”

Document-term matrix

          intelligence  book  the  cat  artificial  dog  business
  Doc1         0          1    3    1       0        1      0
  Doc2         1          0    0    0       0        0      1
  query        1          0    1    0       1        0      0
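A minimal sketch of how the document-term matrix above could be built; the vocabulary order mirrors the matrix on this slide, and tokenization is simplified to whitespace splitting.

    from collections import Counter

    vocab = ["intelligence", "book", "the", "cat", "artificial", "dog", "business"]

    def to_vector(text, vocab):
        # Bag of words: count how often each vocabulary word occurs in the text.
        counts = Counter(text.lower().split())
        return [counts[w] for w in vocab]

    doc1 = "the cat the dog the book"
    query = "the artificial intelligence"
    print(to_vector(doc1, vocab))   # [0, 1, 3, 1, 0, 1, 0]
    print(to_vector(query, vocab))  # [1, 0, 1, 0, 1, 0, 0]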

38 of 67

How to build an accurate retrieval model based on word indexing?


How does Google return results so quickly?

How does Google know cs 589 refers to a course?

How does Google know stevens = SIT?

39 of 67

Word indexing: vector-space model

  • Represent each document/query as a vector

  • The similarity is the cosine score between the two vectors (a small sketch follows)
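A minimal sketch of the cosine score between term-frequency vectors, using the vectors that appear on the next slides; real systems typically apply IDF weighting first.

    import math

    def cosine(u, v):
        # cos(u, v) = (u . v) / (|u| * |v|)
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q  = [1, 1, 1, 0, 1, 0, 0, 0]   # "the artificial intelligence book"
    d1 = [0, 1, 3, 1, 0, 1, 0, 0]   # "the cat, the dog, the book"
    print(round(cosine(q, d1), 3))  # 0.577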


40 of 67

Term frequency

  • d1= “the cat, the dog, the book”

  • d2 = “the business intelligence”

  • d3 = “the artificial world”


  • query = “the artificial intelligence book”

41 of 67

Term frequency

  • d1= “the cat, the dog, the book”

[0, 1, 3, 1, 0, 1, 0, 0]

  • d2 = “the business intelligence”

[1, 0, 1, 0, 0, 0, 1, 0]

  • d3 = “the artificial world”

[0, 0, 1, 0, 1, 0, 0, 1]


  • query = “the artificial intelligence book”
  • q = [1, 1, 1, 0, 1, 0, 0, 0]

Document-term matrix

          intelligence  book  the  cat  artificial  dog  business  world
  Doc1         0          1    3    1       0        1      0        0
  Doc2         1          0    1    0       0        0      1        0
  Doc3         0          0    1    0       1        0      0        1
  query        1          1    1    0       1        0      0        0

42 of 67

Vector space model

  • d1 = [0, 1, 3, 1, 0, 1, 0, 0]
  • d2 = [1, 0, 1, 0, 0, 0, 1, 0]
  • d3 = [0, 0, 1, 0, 1, 0, 0, 1]


  • To answer the query:
    • “the artificial intelligence book”
  • q = [1, 1, 1, 0, 1, 0, 0, 0]

Document-term matrix

          intelligence  book  the  cat  artificial  dog  business  world
  Doc1         0          1    3    1       0        1      0        0
  Doc2         1          0    1    0       0        0      1        0
  Doc3         0          0    1    0       1        0      0        1
  query        1          1    1    0       1        0      0        0
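A worked computation (not shown on the original slide, but it follows directly from the counts above):

    cos(q, d1) = (1·1 + 1·3) / (sqrt(4) · sqrt(12)) = 4 / (4·sqrt(3)) ≈ 0.577   (matches on “book” and “the”)
    cos(q, d2) = (1·1 + 1·1) / (sqrt(4) · sqrt(3))  = 2 / (2·sqrt(3)) ≈ 0.577   (matches on “intelligence” and “the”)
    cos(q, d3) = (1·1 + 1·1) / (sqrt(4) · sqrt(3))                    ≈ 0.577   (matches on “the” and “artificial”)

All three documents tie under raw term frequencies, largely because the match on “the” counts as much as the content words; the next slides address this.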

43 of 67

TF-only representations are inaccurate

  • Documents are dominated by words such as “the” and “a”

  • These words do not carry much meaning, nor do they discriminate between documents

  • q = “the artificial intelligence book”
  • d1 = “the cat, the dog, the book”
  • d2 = “the business intelligence”
  • d3 = “the artificial world”

44 of 67

Zipf’s law distribution of words


45 of 67

Stop words in English

  • Documents are dominated by words such as “the” “a”

  • These words do not discriminate between documents


46 of 67

Inverse-document frequency

  • Inverse document frequency: penalize a word’s TF based on its document frequency (a small sketch follows the example below)

  • q = “the artificial intelligence book”
  • d1 = “the cat, the dog, the book”
  • d2 = “the business intelligence”
  • d3 = “the artificial world”
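A minimal sketch of IDF weighting for the example documents above, assuming the common variant idf(w) = log(N / df(w)); smoothed variants also exist.

    import math

    docs = [
        "the cat the dog the book",
        "the business intelligence",
        "the artificial world",
    ]
    N = len(docs)
    vocab = {w for d in docs for w in d.split()}
    df = {w: sum(w in d.split() for d in docs) for w in vocab}   # document frequency
    idf = {w: math.log(N / df[w]) for w in vocab}

    print(round(idf["the"], 2))         # 0.0 -> appears in every document, fully discounted
    print(round(idf["artificial"], 2))  # 1.1 -> appears in only one document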

47 of 67

Vector-space model is not perfect

  • Shorter documents concentrate their weight on a few important words

  • Informative words in long documents are diluted


48 of 67

Informative words in long documents are diluted

vector = [..., 1/sqrt(21), …]

vector = [..., 1/sqrt(4), …]

  • q = “artificial intelligence”
  • d1 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”
  • d2 = “the journal of Artificial intelligence”

49 of 67

Retrieving short documents vs. retrieving long documents

  • The difference between retrieving short documents and long documents
    • Longer documents cover more topics, so the query may match a small subset of the vocabulary
    • Longer documents need to be considered differently

Short document: the, a, business, intelligence

Long document: the, a, business, AI, bot, ML, learn, NLP

50 of 67

Desiderata for a good retrieval model for word-indexing


51 of 67

Desiderata for a good retrieval model for word-indexing

  • Rule 1 (document length pivoting): score(w, d) should be properly normalized
    • norm = 1: too little
    • norm = vector-space/cosine: too much


52 of 67

Desiderata for a good retrieval model for word-indexing

  • Rule 2 (idf): if a word w appears everywhere, its importance should be discounted


53 of 67

Informative words in long documents are diluted

  • Rule 3 (tf): tf’s importance should be sublinear

  • q = “artificial intelligence”
  • d1 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”
  • d2 = “Artificial intelligence was founded as an academic discipline in 1955, artificial intelligence”

d2 is not twice as relevant as d1

54 of 67

Rule 3: Term frequency reweighting

  • Term frequency reweighting: penalizing a word’s TF based on the TF itself

  • If a word appears in the same document multiple times, its importance should not grow linearly

Log-scale normalization

55 of 67

Term-frequency reweighting

  • Logarithmic normalization (a small sketch follows)

Log-scale normalization
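A minimal sketch of one common logarithmic TF scaling; the exact formula from the slide’s figure is not preserved here, so this assumes the standard 1 + log(tf) form.

    import math

    def sublinear_tf(tf):
        # wf(t, d) = 1 + log(tf) if tf > 0, else 0: repeated occurrences add less and less.
        return 1 + math.log(tf) if tf > 0 else 0.0

    for tf in [1, 2, 5, 10]:
        print(tf, round(sublinear_tf(tf), 2))   # 1.0, 1.69, 2.61, 3.3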

56 of 67

Rule 1: Finding appropriate norm

query = "news about presidential campaign"

  • d1 (100 words): “...news of presidential campaign... ...presidential candidate...”
  • d2 (5,000 words): scattered mentions of “news”, “presidential”, and “campaign” throughout
  • true relevance: d1 > d2

  • d1: “...news of presidential campaign... ...presidential candidate...”
  • d2 (long): “...presidential campaign news... ...presidential candidate...” plus much other text
  • true relevance: d2 > d1

Too little normalization (norm = 1) under-penalizes long documents; full cosine normalization over-penalizes them. We need a normalization between 1 and |d|.

57 of 67

Document length pivoting [Singhal 1995]


normalization = denominator

58 of 67

Document length pivoting [Singhal 1995]

  • For each query q and each document d, compute their relevance score score(q, d)
  • Manually evaluate the relevance between q and d
  • cosine score is biased

(plot: the TF-IDF/cosine score curve vs. the true relevance curve, as a function of document length)

Pivoted Document Length Normalization: http://singhal.info/pivoted-dln.pdf

59 of 67

Document length pivoting

  • Rotate the relevance score curve so that it aligns as closely as possible with the relevance judgment curve

Pivoted Document Length Normalization: http://singhal.info/pivoted-dln.pdf

60 of 67

Document length pivoting

  • Rotate the relevance score curve so that it aligns as closely as possible with the relevance judgment curve

  • standard formulation for document length normalization: norm = (1 − slope) · pivot + slope · |d|
  • fixing the pivot to avgdl (and rescaling): norm = 1 − b + b · |d| / avgdl

Pivoted Document Length Normalization: http://singhal.info/pivoted-dln.pdf

61 of 67

How large should the denominator be?

(plot: the length normalizer as a function of document length dl; it equals 1 − b at dl = 0 and 1 at dl = avgdl — a small sketch follows)
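A minimal sketch of the pivoted length normalizer described by the plot above, assuming the common form 1 − b + b · dl / avgdl with b in [0, 1].

    def length_norm(dl, avgdl, b=0.75):
        # Equals 1 - b at dl = 0 and exactly 1 at dl = avgdl.
        return 1 - b + b * dl / avgdl

    print(length_norm(0, 1000))     # 0.25
    print(length_norm(1000, 1000))  # 1.0
    print(length_norm(5000, 1000))  # 4.0 -> long documents get a larger denominator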

62 of 67

Document length pivoting


https://en.wikipedia.org/wiki/Okapi_BM25
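A minimal sketch of the Okapi BM25 scoring function linked above; k1 and b are the usual free parameters, and the simple idf = log(N / df) variant is used here instead of the smoothed form on the Wikipedia page.

    import math

    def bm25_term(tf, df, dl, avgdl, N, k1=1.5, b=0.75):
        idf = math.log(N / df)
        norm = 1 - b + b * dl / avgdl            # pivoted document length normalization
        return idf * tf * (k1 + 1) / (tf + k1 * norm)

    def bm25(query_terms, doc_tf, df, dl, avgdl, N):
        # Sum the per-term contributions for query terms that occur in the collection.
        return sum(bm25_term(doc_tf.get(t, 0), df[t], dl, avgdl, N)
                   for t in query_terms if t in df)

    # toy example: "artificial" occurs twice in a 120-word document; avgdl = 100, N = 1000, df = 10
    print(round(bm25(["artificial"], {"artificial": 2}, {"artificial": 10}, 120, 100, 1000), 2))  # 6.18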

63 of 67

Summary

  • What is information retrieval?

  • History of IR

  • Boolean retrieval system, Cranfield experiment

  • TF-IDF, vector space model, document length pivoting

64 of 67

More on retrieval model design heuristics

  • Axiomatic thinking in information retrieval [Fang et al., SIGIR 2004]


65 of 67

IR != web search

  • The other side of information retrieval techniques
    • Recommender systems (users who bought this also bought…)
    • Online advertising


66 of 67

IR != web search

  • Factoid question answering systems


67 of 67

Syllabus

  • Vector space model, TF-IDF
  • Probability ranking principle, BM25
  • IR evaluation, query completion
  • Inverted index, ES, PageRank, HITS
  • Relevance feedback, PRF
  • EM algorithm
  • RNN/LSTM
  • Transformer/Bert
  • Frontier topic: recommender system
  • Frontier topic: opinion analysis/mining
  • Frontier topic: NMT, program synthesis
  • Neural IR