1 of 67


CS 589 Text Mining and Information Retrieval

Tue 6:30-9:00

All zoom links in Canvas

photo: https://www.scubedstudios.com/information-retrieval/

2 of 67


Instructor: Susan Liu
  OH: Tue 2:30-4:30, Discord

CA: Gaoyi Wu
  OH: Wed 9:00am-11:00am, GS347 (Discord by default)

3 of 67

CS589 Lecture 1: Overview

  • Logistics
    • Prerequisites
    • Final grade calculation
    • Q & A

  • Lecture 1
    • What is information retrieval?
    • History of IR
    • Early IR models
    • TF-IDF, vector space model

4 of 67

Where to find everything


5 of 67

Prerequisite

  • Recommended: A good knowledge of statistics and probability

  • Required: Being comfortable with Python programming (pop quiz today)

  • Undergrad: CS 115, CS 284, MA 222 (Probability & Statistics), MA 232 (Linear Algebra)

  • Contact the instructor if you aren’t sure

6 of 67

Grade calculation

  • 4 assignments - 40% (Python, Colab, ElasticSearch)

  • Midterm - 25%

  • Pop quiz (Attendance) - 5%

  • Final Exam - 30%

  • Answering questions in class - bonus

7 of 67

Late policy

  • Submit within 24 hours of the deadline - 90%
  • Within 48 hours - 70%
  • Within 5 days - 50%
  • Later than 5 days - 0
  • Code that does not compile - 0

8 of 67

Plagiarism policy

  • We have a very powerful plagiarism detection system; do not take the risk

  • Cheating case in CS284
    • A student put all his homework in a public GitHub repo
    • In the end, we found that 8+ students had copied his code

9 of 67

Curve and fail

  • Tips for avoiding failing the class:
    • Keep an eye on your grade
    • Make sure you submit the homework on time
    • Do not plagiarize from other people, do not use AI to generate the code, and do not collaborate on homework
    • Attend every lecture on time; if you are late and miss the sign-in, let me know

10 of 67

Feedback from previous years

  • Failures (< 70%)
  • F25: 3 out of 30
  • F24: 0%
  • S24: 0%
  • F23: 5% (changed to final exam)
  • S23: 9%
  • F22: 9.5% (final project plagiarism detected)

11 of 67

Midterm/Final Exam

  • In class

12 of 67

If you are interested in research with me

  • NLP for software engineering

  • NLP for security

  • Check out my research at https://liususan091219.github.io/ and email me

13 of 67

Ask all questions in Discord

14 of 67

Canvas & GradeScope

  • All homework (coding) is graded in Canvas
  • All exams are graded on paper; rubrics will be released, and you can check your papers during office hours

15 of 67

Accommodation: 1.5x time

16 of 67

Pop quiz on Python


17 of 67


Lecture 1: Basics about information retrieval

Tue 6:30-9:00

All zoom links in Canvas

photo: https://www.scubedstudios.com/information-retrieval/

18 of 67

Information Retrieval (navigational)

query: “first president of the united states”


Information need: name of the first US president

19 of 67

Information Retrieval (exploratory)

student: “I need a book on American history for my thesis research!”

librarian: "What's the topic of your thesis?"

student: "civil war"


Information need: to study the history of the Civil War

20 of 67

Information Retrieval (conversational agent)


21 of 67

Information Retrieval (conversational agent)


22 of 67

Information Retrieval (Database)


23 of 67

Information retrieval (Vector Database)


24 of 67

Information Retrieval (RAG)


25 of 67

Information Retrieval

the way from Times Square to Gateway South

the US News ranking of SIT

where to go for Thanksgiving break


navigational

exploratory

26 of 67

Information Retrieval: Google


How does Google return results so quickly?

How does Google know cs 589 refers to a course?

How does Google know stevens = SIT?

27 of 67

Information Retrieval Techniques

Because the systems that are accessible today are so easy to use, it is tempting to think the technology behind them is similarly straightforward to build. This review has shown that the route to creating successful IR systems required much innovation and thought over a long period of time.

The history of Information Retrieval Research, Mark Sanderson and Bruce Croft

28 of 67

Information Retrieval Techniques


"cs 589 stevens"

retrieval models (Lectures 1, 2, 5)

29 of 67

Information Retrieval Techniques


Evaluation (Lecture 3)

"cs 589 stevens"

30 of 67

Information Retrieval Techniques


Inverted index (Lecture 4)

50 billion pages on the web

"cs 589 stevens"

31 of 67

A Brief History of IR

  • 300 BC: Callimachus compiles the first library catalog of books
  • 1880s: punch cards, searching at 600 cards/min
  • 1950s: building IR systems on computers
  • 1958: Cranfield evaluation methodology; word-based indexing
  • 1960s: relevance feedback
  • 1970s: TF-IDF; probability ranking principle
  • 1980s: latent semantic indexing
  • 1990 - now: web search; TREC; learning to rank; supporting natural language queries

32 of 67

A Brief History of IR

  • 1997: LSTM
  • 2015: word2vec
  • 2017: Transformers
  • 2020: GPT-3

33 of 67

The Boolean retrieval system


34 of 67

The Boolean retrieval system

  • e.g., SELECT * FROM table_computer WHERE price < $500 AND brand = “Dell”
  • The primary commercial retrieval system for three decades
  • Many systems today still use Boolean retrieval, e.g., eCommerce search (a small sketch follows this list)

  • Advantage: returns exactly what you want

  • Disadvantages:
    • queries can only be specified over pre-defined categories
    • works primarily for products with attributes
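A minimal sketch of Boolean (AND) retrieval over a toy inverted index; the product data and clause names below are made up for illustration, not taken from the slide.

    # Each clause maps to the set of document ids that satisfy it;
    # Boolean AND is just set intersection over those posting sets.
    postings = {
        ("brand", "Dell"): {1, 4, 7},
        ("price", "<500"): {2, 4, 9},
    }

    def boolean_and(*sets):
        # A document must appear in every clause's posting set.
        result = sets[0]
        for s in sets[1:]:
            result = result & s
        return result

    # SELECT * FROM table_computer WHERE price < $500 AND brand = "Dell"
    print(boolean_and(postings[("brand", "Dell")], postings[("price", "<500")]))  # {4}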


35 of 67

The Cranfield experiment (1958)

  • Simple systems based on keywords appeared to work just as well as complex classificatory schemes (Wikipedia)

  • Keyword systems are much easier to implement than the other systems

keyword system

36 of 67

The Cranfield experiment (1958)

  • Imagine you need to help users search for literature in a digital library. How would you design such a system?

system 1: the Boolean retrieval system
  • pre-defined subject categories: computer science, artificial intelligence, bioinformatics
  • query = “subject = AI & subject = bioinformatics”

On the history of IR: https://journals.sagepub.com/doi/abs/10.1177/0165551507086989?journalCode=jisb

37 of 67

The Cranfield experiment (1958)

  • Imagine you need to help users search for literature in a digital library. How would you design such a system?

system 2: indexing documents by lists of words (bags-of-words representation)

query = “the artificial intelligence”

Document-term matrix

          intelligence  book  the  cat  artificial  dog  business
  Doc1         0          1    3    1       0        1      0
  Doc2         1          0    0    0       0        0      1
  query        1          0    1    0       1        0      0
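A minimal sketch of how the document-term matrix above could be built; the vocabulary order mirrors the matrix on this slide, and tokenization is simplified to whitespace splitting.

    from collections import Counter

    vocab = ["intelligence", "book", "the", "cat", "artificial", "dog", "business"]

    def to_vector(text, vocab):
        # Bag of words: count how often each vocabulary word occurs in the text.
        counts = Counter(text.lower().split())
        return [counts[w] for w in vocab]

    doc1 = "the cat the dog the book"
    query = "the artificial intelligence"
    print(to_vector(doc1, vocab))   # [0, 1, 3, 1, 0, 1, 0]
    print(to_vector(query, vocab))  # [1, 0, 1, 0, 1, 0, 0]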

38 of 67

How to build an accurate retrieval model based on word indexing?


How does Google return results so quickly?

How does Google know cs 589 refers to a course?

How does Google know stevens = SIT?

39 of 67

Word indexing: vector-space model

  • Represent each document/query as a vector

  • The similarity is the cosine score between the two vectors (a small sketch follows)
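A minimal sketch of the cosine score between term-frequency vectors, using the vectors that appear on the next slides; real systems typically apply IDF weighting first.

    import math

    def cosine(u, v):
        # cos(u, v) = (u . v) / (|u| * |v|)
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q  = [1, 1, 1, 0, 1, 0, 0, 0]   # "the artificial intelligence book"
    d1 = [0, 1, 3, 1, 0, 1, 0, 0]   # "the cat, the dog, the book"
    print(round(cosine(q, d1), 3))  # 0.577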


40 of 67

Term frequency

  • d1= “the cat, the dog, the book”

  • d2 = “the business intelligence”

  • d3 = “the artificial world”


  • query = “the artificial intelligence book”

41 of 67

Term frequency

  • d1= “the cat, the dog, the book”

[0, 1, 3, 1, 0, 1, 0, 0]

  • d2 = “the business intelligence”

[1, 0, 1, 0, 0, 0, 1, 0]

  • d3 = “the artificial world”

[0, 0, 1, 0, 1, 0, 0, 1]


  • query = “the artificial intelligence book”
  • q = [1, 1, 1, 0, 1, 0, 0, 0]

Document-term matrix

          intelligence  book  the  cat  artificial  dog  business  world
  Doc1         0          1    3    1       0        1      0        0
  Doc2         1          0    1    0       0        0      1        0
  Doc3         0          0    1    0       1        0      0        1
  query        1          1    1    0       1        0      0        0

42 of 67

Vector space model

  • d1 = [0, 1, 3, 1, 0, 1, 0, 0]
  • d2 = [1, 0, 1, 0, 0, 0, 1, 0]
  • d3 = [0, 0, 1, 0, 1, 0, 0, 1]


  • To answer the query:
    • “the artificial intelligence book”
  • q = [1, 1, 1, 0, 1, 0, 0, 0]

Document-term matrix

          intelligence  book  the  cat  artificial  dog  business  world
  Doc1         0          1    3    1       0        1      0        0
  Doc2         1          0    1    0       0        0      1        0
  Doc3         0          0    1    0       1        0      0        1
  query        1          1    1    0       1        0      0        0
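A worked computation (not shown on the original slide, but it follows directly from the counts above):

    cos(q, d1) = (1·1 + 1·3) / (sqrt(4) · sqrt(12)) = 4 / (4·sqrt(3)) ≈ 0.577   (matches on “book” and “the”)
    cos(q, d2) = (1·1 + 1·1) / (sqrt(4) · sqrt(3))  = 2 / (2·sqrt(3)) ≈ 0.577   (matches on “intelligence” and “the”)
    cos(q, d3) = (1·1 + 1·1) / (sqrt(4) · sqrt(3))                    ≈ 0.577   (matches on “the” and “artificial”)

All three documents tie under raw term frequencies, largely because the match on “the” counts as much as the content words; the next slides address this.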

43 of 67

TF-only representations are inaccurate

  • Documents are dominated by words such as “the” and “a”

  • These words do not carry much meaning, nor do they discriminate between documents

  • q = “the artificial intelligence book”
  • d1 = “the cat, the dog, the book”
  • d2 = “the business intelligence”
  • d3 = “the artificial world”

44 of 67

Zipf’s law distribution of words


45 of 67

Stop words in English

  • Documents are dominated by words such as “the” “a”

  • These words do not discriminate between documents


46 of 67

Inverse-document frequency

  • Inverse document frequency: penalize a word’s TF based on its document frequency (a small sketch follows the example below)

  • q = “the artificial intelligence book”
  • d1 = “the cat, the dog, the book”
  • d2 = “the business intelligence”
  • d3 = “the artificial world”
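A minimal sketch of IDF weighting for the example documents above, assuming the common variant idf(w) = log(N / df(w)); smoothed variants also exist.

    import math

    docs = [
        "the cat the dog the book",
        "the business intelligence",
        "the artificial world",
    ]
    N = len(docs)
    vocab = {w for d in docs for w in d.split()}
    df = {w: sum(w in d.split() for d in docs) for w in vocab}   # document frequency
    idf = {w: math.log(N / df[w]) for w in vocab}

    print(round(idf["the"], 2))         # 0.0 -> appears in every document, fully discounted
    print(round(idf["artificial"], 2))  # 1.1 -> appears in only one document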

47 of 67

Vector-space model is not perfect

  • Shorter documents concentrate their weight on a few important words

  • Informative words in long documents are diluted


48 of 67

Informative words in long documents are diluted

vector = [..., 1/sqrt(21), …]

vector = [..., 1/sqrt(4), …]

  • q = “artificial intelligence”
  • d1 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”
  • d2 = “the journal of Artificial intelligence”

49 of 67

Retrieving short documents vs. retrieving long documents

  • The difference between retrieving short documents and long documents
    • Longer documents cover more topics, so the query may match a small subset of the vocabulary
    • Longer documents need to be considered differently

Short document: the, a, business, intelligence

Long document: the, a, business, AI, bot, ML, learn, NLP

50 of 67

Desiderata for a good retrieval model for word-indexing


51 of 67

Desiderata for a good retrieval model for word-indexing

  • Rule 1 (document length pivoting): score(w, d) should be properly normalized
    • norm = 1: too little
    • norm = vector-space/cosine: too much


52 of 67

Desiderata for a good retrieval model for word-indexing

  • Rule 2 (idf): if a word w appears everywhere, its importance should be discounted


53 of 67

Informative words in long documents are diluted

  • Rule 3 (tf): tf’s importance should be sublinear

  • q = “artificial intelligence”
  • d1 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”
  • d2 = “Artificial intelligence was founded as an academic discipline in 1955, artificial intelligence”

d2 is not twice as relevant as d1

54 of 67

Rule 3: Term frequency reweighting

  • Term frequency reweighting: penalizing a word’s TF based on the TF itself

  • If a word appears in the same document multiple times, its importance should not grow linearly

Log-scale normalization

55 of 67

Term-frequency reweighting

  • Logarithmic normalization (a small sketch follows)

Log-scale normalization
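A minimal sketch of one common logarithmic TF scaling; the exact formula from the slide’s figure is not preserved here, so this assumes the standard 1 + log(tf) form.

    import math

    def sublinear_tf(tf):
        # wf(t, d) = 1 + log(tf) if tf > 0, else 0: repeated occurrences add less and less.
        return 1 + math.log(tf) if tf > 0 else 0.0

    for tf in [1, 2, 5, 10]:
        print(tf, round(sublinear_tf(tf), 2))   # 1.0, 1.69, 2.61, 3.3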

56 of 67

Rule 1: Finding appropriate norm

query = "news about presidential campaign"

  • d1 (100 words): “...news of presidential campaign... ...presidential candidate...”
  • d2 (5,000 words): scattered mentions of “news”, “presidential”, and “campaign” throughout
  • true relevance: d1 > d2

  • d1: “...news of presidential campaign... ...presidential candidate...”
  • d2 (long): “...presidential campaign news... ...presidential candidate...” plus much other text
  • true relevance: d2 > d1

Too little normalization (norm = 1) under-penalizes long documents; full cosine normalization over-penalizes them. We need a normalization between 1 and |d|.

57 of 67

Document length pivoting [Singhal 1995]


normalization = denominator

58 of 67

Document length pivoting [Singhal 1995]

  • For each query q and each document d, compute their relevance score score(q, d)
  • Manually evaluate the relevance between q and d
  • cosine score is biased

(plot: the TF-IDF/cosine score curve vs. the true relevance curve, as a function of document length)

Pivoted Document Length Normalization: http://singhal.info/pivoted-dln.pdf

59 of 67

Document length pivoting

  • Rotate the relevance score curve so that it aligns as closely as possible with the relevance judgment curve

Pivoted Document Length Normalization: http://singhal.info/pivoted-dln.pdf

60 of 67

Document length pivoting

  • Rotate the relevance score curve so that it aligns as closely as possible with the relevance judgment curve

  • standard formulation for document length normalization: norm = (1 − slope) · pivot + slope · |d|
  • fixing the pivot to avgdl (and rescaling): norm = 1 − b + b · |d| / avgdl

Pivoted Document Length Normalization: http://singhal.info/pivoted-dln.pdf

61 of 67

How large should the denominator be?

(plot: the length normalizer as a function of document length dl; it equals 1 − b at dl = 0 and 1 at dl = avgdl — a small sketch follows)
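A minimal sketch of the pivoted length normalizer described by the plot above, assuming the common form 1 − b + b · dl / avgdl with b in [0, 1].

    def length_norm(dl, avgdl, b=0.75):
        # Equals 1 - b at dl = 0 and exactly 1 at dl = avgdl.
        return 1 - b + b * dl / avgdl

    print(length_norm(0, 1000))     # 0.25
    print(length_norm(1000, 1000))  # 1.0
    print(length_norm(5000, 1000))  # 4.0 -> long documents get a larger denominator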

62 of 67

Document length pivoting


https://en.wikipedia.org/wiki/Okapi_BM25
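A minimal sketch of the Okapi BM25 scoring function linked above; k1 and b are the usual free parameters, and the simple idf = log(N / df) variant is used here instead of the smoothed form on the Wikipedia page.

    import math

    def bm25_term(tf, df, dl, avgdl, N, k1=1.5, b=0.75):
        idf = math.log(N / df)
        norm = 1 - b + b * dl / avgdl            # pivoted document length normalization
        return idf * tf * (k1 + 1) / (tf + k1 * norm)

    def bm25(query_terms, doc_tf, df, dl, avgdl, N):
        # Sum the per-term contributions for query terms that occur in the collection.
        return sum(bm25_term(doc_tf.get(t, 0), df[t], dl, avgdl, N)
                   for t in query_terms if t in df)

    # toy example: "artificial" occurs twice in a 120-word document; avgdl = 100, N = 1000, df = 10
    print(round(bm25(["artificial"], {"artificial": 2}, {"artificial": 10}, 120, 100, 1000), 2))  # 6.18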

63 of 67

Summary

  • What is information retrieval?

  • History of IR

  • Boolean retrieval system, Cranfield experiment

  • TF-IDF, vector space model, document length pivoting

64 of 67

More on retrieval model design heuristics

  • Axiomatic thinking in information retrieval [Fang et al., SIGIR 2004]


65 of 67

IR != web search

  • The other side of information retrieval techniques
    • Recommender systems (users who bought this also bought…)
    • Online advertising


66 of 67

IR != web search

  • Factoid question answering systems


67 of 67

Syllabus

  • Vector space model, TF-IDF
  • Probability ranking principle, BM25
  • IR evaluation, query completion
  • Inverted index, ES, PageRank, HITS
  • Relevance feedback, PRF
  • EM algorithm
  • RNN/LSTM
  • Transformer/Bert
  • Frontier topic: recommender system
  • Frontier topic: opinion analysis/mining
  • Frontier topic: NMT, program synthesis
  • Neural IR