1 of 65


CS 589 Text Mining and Information Retrieval

Mon 6:30-9:00

All zoom links in Canvas

photo: https://www.scubedstudios.com/information-retrieval/

2 of 65

Instructor: Susan Liu
  OH: Mon 3:30-5:30 (Discord by default)

CA: Gaoyi Wu
  OH: Tue 10:00-12:00 (Discord by default)

3 of 65

CS589 Lecture 1: Overview

  • Logistics
    • Prerequisites
    • Final grade calculation
    • Q & A

  • Lecture 1
    • What is information retrieval?
    • History of IR
    • Earlier IR models
    • TF-IDF, vector space model

4 of 65

Where to find everything


5 of 65

Prerequisite

  • Recommended: a good knowledge of statistics and probability

  • Required: being comfortable with Python programming (pop quiz today)

  • Undergrad: CS 115, CS 284, MA 222 (Probability & Statistics), MA 232 (Linear Algebra)

  • Contact the instructor if you aren’t sure

6 of 65

Grade calculation

  • 4 assignments - 40% (Python, Colab, ElasticSearch)

  • Midterm - 25%

  • Pop quiz (Attendance) - 5%

  • Final Exam - 30%

  • Answering questions in class - bonus

7 of 65

Late policy

  • Submit within 24 hours of the deadline: 90%
  • Within 48 hours: 70%
  • Within 5 days: 50%
  • Later than 5 days: 0
  • Code that does not compile: 0

8 of 65

Plagiarism policy

  • We have a very powerful plagiarism detection system; do not take the risk

  • Cheating case in CS 284
    • A student put all his homework in a public GitHub repo
    • In the end, we found that 8+ students had copied his code

9 of 65

Midterm/Final Exam: in class

10 of 65

If you are interested in research with me

  • NLP for software engineering

  • NLP for security

  • Check out my research (https://liususan091219.github.io/) and email me

11 of 65

Ask all questions in Discord

12 of 65

Canvas & GradeScope

  • All homework (coding) is graded in Canvas
  • All exams are graded on paper; rubrics will be released, and you can check your exam papers during office hours

13 of 65

Pop quiz on Python


14 of 65

Feedback from previous years

  • Failure rates (< 70%)
    • F24: 0%
    • S24: 0%
    • F23: 5% (changed to final exam)
    • S23: 9%
    • F22: 9.5% (final project plagiarism detected)

15 of 65


Lecture 1: Basics about information retrieval


16 of 65

Information Retrieval (navigational)

query: “first president of the united states”


Information need: name of the first US president

17 of 65

Information Retrieval (exploratory)

student: “I need a book on American history for my thesis research!”

librarian: “What’s the topic of your thesis?”

student: “civil war”

Information need: to study the history of the Civil War

18 of 65

Information Retrieval (conversational agent)


19 of 65

Information Retrieval (conversational agent)


20 of 65

Information Retrieval (Database)


21 of 65

Information retrieval (Vector Database)


22 of 65

Information Retrieval (RAG)


23 of 65

Information Retrieval

  • the way from Times Square to Gateway South (navigational)
  • the US News ranking of SIT (navigational)
  • where to go for Thanksgiving break (exploratory)

24 of 65

Information Retrieval: Google


How does Google return results so quickly?

How does Google know cs 589 refers to a course?

How does Google know stevens = SIT?

25 of 65

Information Retrieval Techniques

Because the systems that are accessible today are so easy to use, it is tempting to think the technology behind them is similarly straightforward to build. This review has shown that the route to creating successful IR systems required much innovation and thought over a long period of time.

Mark Sanderson and Bruce Croft, “The History of Information Retrieval Research”

26 of 65

Information Retrieval Techniques

"cs 589 stevens"

retrieval models (Lectures 1, 2, 5)

27 of 65

Information Retrieval Techniques


Evaluation (Lecture 3)

"cs 589 stevens"

28 of 65

Information Retrieval Techniques


Inverted index (Lecture 4)

50 billion pages on the web

"cs 589 stevens"

29 of 65

A Brief History of IR

  • 300 BC: Callimachus: the first library catalog of books
  • 1880s: punch cards, searching at 600 cards/min
  • 1950s-1958: Cranfield evaluation methodology; word-based indexing
  • 1960s: building IR systems on computers; relevance feedback
  • 1970s: TF-IDF; probability ranking principle
  • 1980s: latent semantic indexing
  • 1990 - now: TREC; learning to rank; web search; supporting natural language queries
30 of 65

A Brief History of IR

  • 1997: LSTM
  • 2013: word2vec
  • 2017: Transformers
  • 2020: GPT-3

31 of 65

The Boolean retrieval system


32 of 65

The Boolean retrieval system

  • e.g., SELECT * FROM table_computer WHERE price < $500 AND brand = “Dell”
  • The primary commercial retrieval system for three decades
  • Many systems today still use Boolean retrieval, e.g., eCommerce search

  • Advantage: returns exactly what you want

  • Disadvantage:
    • can only specify queries based on the pre-defined categories
    • works primarily for products with attributes (a minimal sketch follows below)
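For illustration, a minimal sketch of Boolean (AND) retrieval over an inverted index; the toy catalog, attribute names, and document IDs are invented for this example:

    # Toy catalog; the documents, attributes, and IDs are hypothetical.
    from collections import defaultdict

    docs = {
        1: {"brand": "Dell", "price": 450},
        2: {"brand": "HP",   "price": 480},
        3: {"brand": "Dell", "price": 700},
    }

    # Inverted index: one posting set of document IDs per (attribute, value) predicate.
    index = defaultdict(set)
    for doc_id, attrs in docs.items():
        index[("brand", attrs["brand"])].add(doc_id)
        index[("price<500", attrs["price"] < 500)].add(doc_id)

    # price < $500 AND brand = "Dell"  ->  intersection of the two posting sets
    print(index[("brand", "Dell")] & index[("price<500", True)])  # {1}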

33 of 65

The Cranfield experiment (1958)

  • Simple systems based on keywords appeared to work just as well as complex classificatory schemes (Wikipedia)

  • Keyword systems are much easier to implement than the other systems

[figure: a keyword-based indexing system]

34 of 65

The Cranfield experiment (1958)

  • Imagine you need to help users search for literature in a digital library. How would you design such a system?

system 1: the Boolean retrieval system
  • pre-defined subject categories: computer science, artificial intelligence, bioinformatics
  • query = “subject = AI & subject = bioinformatics”

On the history of IR: https://journals.sagepub.com/doi/abs/10.1177/0165551507086989?journalCode=jisb

35 of 65

The Cranfield experiment (1958)

  • Imagine you need to help users search for literature in a digital library. How would you design such a system?

system 2: indexing documents by lists of words (bag-of-words representation)

query = “the artificial intelligence”

Document-term matrix:

          intelligence  book  the  cat  artificial  dog  business
    Doc1       0          1    3    1       0        1      0
    Doc2       1          0    0    0       0        0      1
    query      1          0    1    0       1        0      0

36 of 65

How to build an accurate retrieval model based on word indexing?


How does Google return results so quickly?

How does Google know cs 589 refers to a course?

How does Google know stevens = SIT?

37 of 65

Word indexing: vector-space model

  • Represent each document/query as a vector

  • The similarity = the cosine score between the two vectors (see the formula below)
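For reference, the standard cosine score between a query vector q and a document vector d:

$$\cos(q, d) = \frac{q \cdot d}{\|q\|\,\|d\|} = \frac{\sum_{w} q_w\, d_w}{\sqrt{\sum_{w} q_w^2}\,\sqrt{\sum_{w} d_w^2}}$$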

38 of 65

Term frequency

  • d1= “the cat, the dog, the book”

  • d2 = “the business intelligence”

  • d3 = “the artificial world”


  • query = “the artificial intelligence book”

39 of 65

Term frequency

  • d1 = “the cat, the dog, the book”

[0, 1, 3, 1, 0, 1, 0, 0]

  • d2 = “the business intelligence”

[1, 0, 1, 0, 0, 0, 1, 0]

  • d3 = “the artificial world”

[0, 0, 1, 0, 1, 0, 0, 1]

  • query = “the artificial intelligence book”
  • q = [1, 1, 1, 0, 1, 0, 0, 0]

Document-term matrix:

          intelligence  book  the  cat  artificial  dog  business  world
    Doc1       0          1    3    1       0        1      0        0
    Doc2       1          0    1    0       0        0      1        0
    Doc3       0          0    1    0       1        0      0        1
    query      1          1    1    0       1        0      0        0
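A minimal sketch of building this document-term matrix in Python, assuming a simple lowercasing, punctuation-stripping tokenizer (the helper names are ours):

    import re
    from collections import Counter

    texts = {
        "d1": "the cat, the dog, the book",
        "d2": "the business intelligence",
        "d3": "the artificial world",
        "q":  "the artificial intelligence book",
    }
    vocab = ["intelligence", "book", "the", "cat", "artificial", "dog", "business", "world"]

    def tokenize(text):
        # lowercase and keep alphanumeric tokens only
        return re.findall(r"[a-z0-9]+", text.lower())

    counts = {name: Counter(tokenize(t)) for name, t in texts.items()}
    for name, c in counts.items():
        print(name, [c[w] for w in vocab])
    # d1 [0, 1, 3, 1, 0, 1, 0, 0]
    # d2 [1, 0, 1, 0, 0, 0, 1, 0]
    # d3 [0, 0, 1, 0, 1, 0, 0, 1]
    # q  [1, 1, 1, 0, 1, 0, 0, 0]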

40 of 65

Vector space model

  • d1 = [0, 1, 3, 1, 0, 1, 0, 0]
  • d2 = [1, 0, 1, 0, 0, 0, 1, 0]
  • d3 = [0, 0, 1, 0, 1, 0, 0, 1]

  • To answer the query:
    • “the artificial intelligence book”
  • q = [1, 1, 1, 0, 1, 0, 0, 0]

Document-term matrix:

          intelligence  book  the  cat  artificial  dog  business  world
    Doc1       0          1    3    1       0        1      0        0
    Doc2       1          0    1    0       0        0      1        0
    Doc3       0          0    1    0       1        0      0        1
    query      1          1    1    0       1        0      0        0
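A sketch of scoring these raw-TF vectors against q with the cosine formula (plain Python, no external libraries):

    import math

    d1 = [0, 1, 3, 1, 0, 1, 0, 0]
    d2 = [1, 0, 1, 0, 0, 0, 1, 0]
    d3 = [0, 0, 1, 0, 1, 0, 0, 1]
    q  = [1, 1, 1, 0, 1, 0, 0, 0]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
        print(name, round(cosine(q, d), 3))  # all three print 0.577

All three documents tie at about 0.577: raw term frequency cannot separate them, which previews the problem raised on the next slide.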

41 of 65

TF-only representations are inaccurate

  • Documents are dominated by words such as “the” and “a”

  • These words carry little meaning, nor do they discriminate between documents

  • q = “the artificial intelligence book”
  • d1 = “the cat, the dog, the book”
  • d2 = “the business intelligence”
  • d3 = “the artificial world”

42 of 65

Zipf’s law distribution of words

[figure: word frequency vs. frequency rank follows a Zipf (power-law) curve: the frequency of the r-th most frequent word is roughly proportional to 1/r, so a handful of words like “the” dominate the corpus]

43 of 65

Stop words in English

  • Documents are dominated by words such as “the” and “a”

  • These words do not discriminate between documents

44 of 65

Inverse-document frequency

  • Inverse document frequency: penalizing a word’s TF based on its document frequency (a sketch follows below)

  • q = “the artificial intelligence book”
  • d1 = “the cat, the dog, the book”
  • d2 = “the business intelligence”
  • d3 = “the artificial world”
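A sketch of IDF on these example documents, using the common idf(w) = log(N / df(w)) form (other variants add smoothing constants; the tokenizer is the same simple one as before):

    import math, re

    docs = ["the cat, the dog, the book",
            "the business intelligence",
            "the artificial world"]
    N = len(docs)

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    for w in ["the", "artificial", "intelligence", "book"]:
        df = sum(1 for d in docs if w in tokenize(d))
        print(w, "df =", df, "idf =", round(math.log(N / df), 2))
    # "the" occurs in all 3 documents, so idf("the") = log(3/3) = 0: it is
    # discounted to nothing, while "artificial" gets idf = log(3) ≈ 1.1.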

45 of 65

Vector-space model is not perfect

  • Shorter documents concentrate their weight on a few important words

  • Informative words in long documents are diluted

46 of 65

Informative words in long documents are diluted

  • q = “artificial intelligence”
  • d1 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism” → vector = [..., 1/sqrt(21), …]
  • d2 = “the journal of Artificial intelligence” → vector = [..., 1/sqrt(4), …]

47 of 65

Retrieving short documents vs. retrieving long documents

  • The difference between retrieving short documents and long documents
    • Longer documents cover more topics, so the query may match only a small subset of the vocabulary
    • Longer documents need to be normalized differently

Short document: the, a, business, intelligence
Long document: the, a, business, AI, bot, ML, learn, NLP

48 of 65

Desiderata for a good retrieval model for word-indexing


49 of 65

Desiderata for a good retrieval model for word-indexing

  • Rule 1 (document length pivoting): score(w, d) should be properly normalized
    • norm = 1 (no normalization): too little
    • norm = vector-space/cosine norm: too much

50 of 65

Desiderata for a good retrieval model for word-indexing

  • Rule 2 (idf): if a word w appears everywhere, its importance should be discounted


51 of 65

Informative words in long documents are diluted

  • Rule 3 (tf): tf’s importance should be sublinear

  • q = “artificial intelligence”
  • d1 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”
  • d2 = “Artificial intelligence was founded as an academic discipline in 1955, artificial intelligence”

d2 is not twice as relevant as d1

52 of 65

Rule 3: Term frequency reweighing

  • Term frequency reweighing: penalizing a word’s TF based on the TF itself

  • If a word appears in the same document multiple times, its importance should not grow linearly

Log-scale normalization (one common form is sketched below)
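One common log-scale variant (this specific form is our illustration; several equivalent ones exist in the literature):

$$\mathrm{tf}'(w, d) = \begin{cases} 1 + \log c(w, d), & c(w, d) > 0 \\ 0, & \text{otherwise} \end{cases}$$

where c(w, d) is the raw count of word w in document d; doubling the count then adds a constant rather than doubling the weight.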

53 of 65

Term-frequency reweighing

  • Logarithmic normalization


54 of 65

Rule 1: Finding appropriate norm

query = “news about presidential campaign”

  • Under-penalization (no normalization): d1 (100 words) contains “news of presidential campaign” and “presidential candidate”; d2 (5,000 words) scatters “news”, “presidential”, and “campaign” across many unrelated passages. True relevance: d1 > d2, but without length normalization d2’s larger number of raw matches wins.

  • Over-penalization (full cosine normalization): when the longer document genuinely covers the query better (true relevance: d2 > d1), cosine normalization pushes its score below the short document’s.

  • Need normalization between 1 and |d|.

55 of 65

Document length pivoting [Singhal 1995]

[figure: TF-IDF scoring formula; the length normalization is the denominator]

56 of 65

Document length pivoting [Singhal 1995]

  • For each query q and each document d, compute their relevance score score(q, d)
  • Manually evaluate the true relevance between q and d
  • Plotting both against document length shows that the cosine score is biased against long documents

[figure: TF-IDF score curve vs. true relevance curve, plotted against document length]

Pivoted Document Length Normalization: http://singhal.info/pivoted-dln.pdf

57 of 65

Document length pivoting

  • Rotate the relevance score curve so that it most closely aligns with the relevance judgment curve

Pivoted Document Length Normalization: http://singhal.info/pivoted-dln.pdf

58 of 65

Document length pivoting

  • Rotate the relevance score curve so that it most closely aligns with the relevance judgment curve

standard formulation for doc length normalization; fixing the pivot to avgdl (see the formula below)

Pivoted Document Length Normalization: http://singhal.info/pivoted-dln.pdf
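The standard pivoted form, with the pivot fixed to the average document length (avgdl) and a slope parameter b ∈ [0, 1]:

$$\mathrm{norm}(d) = (1 - b) + b \cdot \frac{|d|}{\mathrm{avgdl}}$$

b = 0 gives no length normalization (too little); b = 1 normalizes fully by relative length; values in between interpolate.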

59 of 65

How large should the denominator be?

[figure: normalization as a function of document length dl: a line with intercept 1 - b that passes through 1 at dl = avgdl]

60 of 65

Document length pivoting


https://en.wikipedia.org/wiki/Okapi_BM25
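For reference, the BM25 scoring function from the linked article, which combines all three rules (IDF, sublinear TF saturation, and pivoted length normalization); k1 and b are tunable parameters:

$$\mathrm{score}(d, q) = \sum_{w \in q} \mathrm{idf}(w) \cdot \frac{c(w, d)\,(k_1 + 1)}{c(w, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}$$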

61 of 65

Summary

  • What is information retrieval?

  • History of IR

  • Boolean retrieval system, Cranfield experiment

  • TF-IDF, vector space model, document length pivoting

62 of 65

More on retrieval model design heuristics

  • Axiomatic thinking in information retrieval [Fang et al., SIGIR 2004]


63 of 65

IR != web search

  • The other side of information retrieval techniques
    • Recommender systems (users who bought this also bought…)
    • Online advertising


64 of 65

IR != web search

  • Factoid question answering systems


65 of 65

Syllabus

  • Vector space model, TF-IDF
  • Probability ranking principle, BM25
  • IR evaluation, query completion
  • Inverted index, ElasticSearch, PageRank, HITS
  • Relevance feedback, PRF
  • EM algorithm
  • RNN/LSTM
  • Transformer/BERT
  • Frontier topic: recommender system
  • Frontier topic: opinion analysis/mining
  • Frontier topic: NMT, program synthesis
  • Neural IR