Chapter 2

Term Weighting and Similarity Measures

03/01/20

Gashaw D.

Terms

  • Terms are usually stems. Terms can be also phrases, such as “Computer Science”, “World Wide Web”, etc.
  • Documents and queries are represented as vectors or “bags of words” (BOW).
    • Each vector holds a place for every term in the collection.
    • Position 1 corresponds to term 1, position 2 to term 2, …, position n to term n.
    • A weight of w = 0 indicates that the term is absent from the document.

Documents are represented by binary-weighted or non-binary-weighted vectors of terms.


Document Collection

  • A collection of n documents can be represented in the vector space model by a term-document matrix.
  • An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

       T1    T2    …    Tt
D1     w11   w21   …    wt1
D2     w12   w22   …    wt2
:      :     :          :
Dn     w1n   w2n   …    wtn
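As a sketch, the term-document matrix above can be built directly from raw term counts; the two tiny documents (D1, D2) below are invented for illustration.

```python
from collections import Counter

# Toy collection: each document is a "bag of words".
docs = {
    "D1": "information retrieval systems retrieve information",
    "D2": "database systems store data",
}

# Vocabulary: every distinct term in the collection, one vector position each.
vocab = sorted({t for text in docs.values() for t in text.split()})

# Term-document matrix: one row per term, one column per document.
# Here the weight is the raw count; zero means the term does not occur.
matrix = {
    term: [Counter(text.split())[term] for text in docs.values()]
    for term in vocab
}

print(matrix["information"])  # counts of "information" in D1 and D2
```

Each row of `matrix` is the vector for one term across all documents; reading column-wise gives the document vectors.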


Term Vector Space

A term vector space is an n-dimensional space, where n is the number of distinct terms/tokens used to index a set of documents.

Vector

Document i, di, is represented by a vector. Its magnitude in dimension j is wij, where:

wij > 0 if term j occurs in document i

wij = 0 otherwise

wij is the weight of term j in document i.


Binary Weights

  • Only the presence (1) or absence (0) of a term is included in the vector

  • Binary formula gives every word that appears in a document equal relevance.

  • It can be useful when frequency is not important.

  • Binary weights formula:

    wij = 1 if fij > 0; wij = 0 otherwise


Why use term weighting?

  • Binary weights are too limiting.
    • Terms are either present or absent.
    • They do not allow ordering documents according to their level of relevance for a given query.

  • Non-binary weights allow us to model partial matching.
    • Partial matching allows retrieval of documents that approximate the query.
    • Term weighting improves the quality of the answer set.
    • Term weighting enables ranking of retrieved documents, so that the best-matching documents are ordered at the top, as they are more relevant than the others.


Term Weighting: Term Frequency (TF)

  • TF (term frequency): the number of times a term occurs in a document.

fij = frequency of term i in document j

  • A term that appears many times within a document is likely to be more important than a term that appears only once.
  • If used alone, it favors common words and long documents.
    • It gives too much credit to words that appear more frequently.

  • We may want to normalize term frequency by the most frequent term in each document:

tfij = fij / maxi{fij}
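The max-normalized tf formula above can be sketched as follows; the raw counts are made up for illustration.

```python
# Max-normalized term frequency: tf_ij = f_ij / max f_ij, where the max is
# taken over all terms in the same document.
def normalized_tf(counts):
    """Map raw term counts in one document to tf values in (0, 1]."""
    max_f = max(counts.values())
    return {term: f / max_f for term, f in counts.items()}

counts = {"cow": 3, "brown": 1, "the": 6}  # invented counts for one document
tf = normalized_tf(counts)
print(tf)  # {'cow': 0.5, 'brown': 0.166..., 'the': 1.0}
```

Note that the most frequent term always receives tf = 1, which dampens the long-document effect discussed on the next slide.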


Document Normalization

  • Long documents have an unfair advantage:
    • They use a lot of terms
      • So they get more matches than short documents
    • And they use the same words repeatedly
      • So they have much higher term frequencies

  • Normalization seeks to remove these effects:
    • Related somehow to maximum term frequency.
    • But also sensitive to the number of terms.

  • If we don’t normalize, short documents may not be recognized as relevant.


Problems with term frequency

  • Need a mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance/meaning determination
  • Scale down the term weight of terms with high collection frequency
    • Reduce the tf weight of a term by a factor that grows with the collection frequency
  • More common for this purpose is document frequency
    • how many documents in the collection contain the term

The example shows that collection frequency and document frequency behave differently.

Document Frequency

  • It is defined to be the number of documents in the collection that contain a term

DF = document frequency

    • Count the frequency considering the whole collection of documents.
    • The less frequently a term appears in the whole collection, the more discriminating it is.

df i = document frequency of term i

= number of documents containing term i


Inverse Document Frequency (IDF)

  • IDF measures the rarity of a term in the collection; it is a measure of the general importance of the term.
    • Inverts the document frequency.
  • It diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.
    • Gives full weight to terms that occur in one document only.
    • Gives lowest weight to terms that occur in all documents.
    • Terms that appear in many different documents are less indicative of overall topic.

idfi = inverse document frequency of term i,

= log (N/ df i) (N: total number of documents)


Inverse Document Frequency

  • IDF provides high values for rare words and low values for common words.
  • IDF is an indication of a term’s discrimination power.
    • Log used to dampen the effect relative to tf.
    • Note the difference between document frequency and collection (corpus) frequency.
  • E.g.: given a collection of 1000 documents and the document frequency of each word, compute the IDF for each word:

Word     N      DF     IDF
the      1000   1000   0
some     1000   100    3.322
car      1000   10     6.644
merge    1000   1      9.966
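The IDF column above can be reproduced with a few lines of Python (idf = log2(N/df), as in the slide):

```python
import math

# Inverse document frequency with log base 2, matching the table above.
def idf(n_docs, df):
    return math.log2(n_docs / df)

N = 1000
for word, df in [("the", 1000), ("some", 100), ("car", 10), ("merge", 1)]:
    print(f"{word}\t{idf(N, df):.3f}")
```

A term occurring in every document ("the") gets idf = 0, and the rarest term ("merge") gets the highest value.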


TF*IDF Weighting

  • The most used term-weighting is tf*idf weighting scheme:

wij = tfij × idfi = tfij × log2(N / dfi)

  • A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
    • The tf-idf value for a term will always be greater than or equal to zero.

  • Experimentally, tf*idf has been found to work well.
    • It is often used in the vector space model together with cosine similarity to determine the similarity between two documents.


TF*IDF weighting

  • When does TF*IDF register a high weight? When a term t occurs many times within a small number of documents.
    • The highest tf*idf values go to terms with a high term frequency (in the given document) and a low document frequency (in the whole collection of documents);
    • the weights hence tend to filter out common terms,
    • lending high discriminating power to those terms.
  • A lower TF*IDF results when the term occurs fewer times in a document, or occurs in many documents,
    • offering a less pronounced relevance signal.
  • The lowest TF*IDF results when the term occurs in virtually all documents.


Computing TF-IDF: An Example

  • Assume collection contains 10,000 documents and statistical analysis shows that document frequencies (DF) of three terms are: A(50), B(1300), C(250). And also term frequencies (TF) of these terms are: A(3), B(2), C(1). Compute TF*IDF for each term?

A: tf = 3/3=1.00; idf = log2(10000/50) = 7.644; tf*idf = 7.644

B: tf = 2/3=0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962

C: tf = 1/3=0.33; idf = log2(10000/250) = 5.322; tf*idf = 1.774
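The three computations above can be checked in code; `tf_idf` below assumes the max-tf normalization and log base 2 used in the example.

```python
import math

# Worked example: N = 10,000 documents, max raw tf in the document = 3.
def tf_idf(f, max_f, n_docs, df):
    tf = f / max_f                      # max-normalized term frequency
    return tf * math.log2(n_docs / df)  # tf * idf with log base 2

N, MAX_F = 10_000, 3
for term, f, df in [("A", 3, 50), ("B", 2, 1300), ("C", 1, 250)]:
    print(f"{term}: tf*idf = {tf_idf(f, MAX_F, N, df):.3f}")
```

Term A scores highest: it is both frequent in the document and rare in the collection.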

  • Query vector is typically treated as a document and also tf-idf weighted.


More Example

  • Consider a document containing 100 words wherein the word cow appears 3 times. Now, assume we have 10 million documents and cow appears in one thousand of these.

    • The term frequency (TF) for cow :

3/100 = 0.03

    • The inverse document frequency is

log2(10,000,000 / 1,000) = log2(10,000) = 13.288

    • The TF*IDF score is the product of these frequencies: 0.03 * 13.288 = 0.399


Exercise


  • Let C = number of times a given word appears in a document;
  • TW = total number of words in a document;
  • TD = total number of documents in a corpus, and
  • DF = total number of documents containing a given word;
  • compute TF, IDF and TF*IDF score for each term

Word       C   TW   TD   DF   TF      IDF     TFIDF
airplane   5   46   3    1    0.109   1.585   0.172
blue       1   46   3    1
chair      7   46   3    3
computer   3   46   3    1
forest     2   46   3    1
justice    7   46   3    3
love       2   46   3    1
might      2   46   3    1
perl       5   46   3    2
rose       6   46   3    3
shoe       4   46   3    1
thesis     2   46   3    2
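A hedged sketch for filling in the rest of the table, assuming TF = C/TW and IDF = log2(TD/DF); log base 2 is an assumption, but it matches the completed airplane row (0.109, 1.585, 0.172).

```python
import math

# (word, C, DF) triples copied from the exercise table; TW and TD are shared.
rows = [("airplane", 5, 1), ("blue", 1, 1), ("chair", 7, 3), ("computer", 3, 1),
        ("forest", 2, 1), ("justice", 7, 3), ("love", 2, 1), ("might", 2, 1),
        ("perl", 5, 2), ("rose", 6, 3), ("shoe", 4, 1), ("thesis", 2, 2)]
TW, TD = 46, 3

for word, c, df in rows:
    tf = c / TW                 # TF = C / TW
    idf = math.log2(TD / df)    # IDF = log2(TD / DF)
    print(f"{word:10s} TF={tf:.3f} IDF={idf:.3f} TFIDF={tf * idf:.3f}")
```

Terms that occur in all three documents (chair, justice, rose) receive IDF = 0, so their TF*IDF is 0 regardless of how often they occur.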


Concluding remarks

  • Suppose that, from a set of English documents, we wish to determine which ones are the most relevant to the query "the brown cow."
  • A simple way to start is by eliminating documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents.
  • To further distinguish them, we might count the number of times each term occurs in each document and sum the counts;
    • the number of times a term occurs in a document is called its TF. However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more often, without giving enough weight to the more meaningful terms "brown" and "cow".
    • Also, the term "the" is not a good keyword for distinguishing relevant from non-relevant documents, while terms like "brown" and "cow" that occur rarely are good keywords for distinguishing the relevant documents from the non-relevant ones.


Concluding remarks

  • Hence IDF is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.

This leads to the use of TF*IDF as a better weighting technique.

  • On top of that, we apply similarity measures to calculate the distance between document i and query j.

    • There are a number of similarity measures; the most common ones are
      • Euclidean distance, inner or dot product, cosine similarity, Dice similarity, Jaccard similarity, etc.


Similarity Measures

  • A similarity measure is a function which computes the degree of similarity between a pair of vectors or documents

– since queries and documents are both vectors, a similarity measure can represent the similarity between two documents, two queries, or one document and one query

  • There are a large number of similarity measures proposed in the literature, because the best similarity measure doesn't exist (yet!)


Similarity Measure

  • We now have vectors for all documents in the collection and a vector for the query; how do we compute similarity?

  • A similarity measure is a function that computes the degree of similarity or distance between document vector and query vector.

  • Using a similarity measure between the query and each document:
    • It is possible to rank the retrieved documents in the order of presumed relevance.
    • It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

[Figure: documents D1 and D2 and query Q plotted in a three-term space (t1, t2, t3); θ1 and θ2 are the angles between Q and each document vector.]


Intuition

Postulate: Documents that are “close together” in the vector space talk about the same things and are more similar than others.

[Figure: documents d1–d5 plotted in a three-term space (t1, t2, t3); the angles θ and φ separate the document vectors.]


Term Similarity: Basic Concept

Two documents are similar if they contain some of the same terms.

Possible measures of similarity might take into consideration:

(a) The lengths of the documents

(b) The number of terms in common

(c) Whether the terms are common or unusual

(d) How many times each term appears


Similarity Measure

Desiderata for proximity

  1. If d1 is near d2, then d2 is near d1.
  2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
  3. No document is closer to d than d itself.
    • Sometimes it is a good idea to determine the maximum possible similarity as the “distance” between a document d and itself

  • A similarity measure attempts to compute the distance between a document vector dj and a query vector q.
    • The assumption here is that documents whose vectors are close to the query vector are more relevant to the query than documents whose vectors are far from the query vector.


Similarity Measure: Techniques

  • Euclidean distance
    • The most common distance measure.
    • Euclidean distance examines the root of the squared differences between the coordinates of a pair of document and query vectors.
  • Dot product
    • The dot product is also known as the scalar product or inner product.
    • It is defined as the sum of the products of the corresponding components of the query and document vectors.
  • Cosine similarity (or normalized inner product)
    • It projects document and query vectors into a term space and calculates the cosine of the angle between them.


Euclidean Distance:

Euclidean distance is considered the traditional metric for geometric problems.

It can be simply explained as the ordinary straight-line distance between two points. It is one of the most widely used measures in cluster analysis; the k-means algorithm, for example, relies on it. Mathematically, it computes the root of the squared differences between the coordinates of two objects.


Euclidean distance

  • The distance between vectors for the document dj and query q can be computed as:

dist(dj, q) = |dj − q| = √( Σi (wij − wiq)² )

where wij is the weight of term i in document j and wiq is the weight of term i in the query; the smaller the distance, the more similar the vectors.

    • Example: Determine the Euclidean distance between the document vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0). A 0 means the corresponding term is not found in the document or query.
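A minimal sketch of the distance computation for the example above:

```python
import math

# Euclidean distance between a document vector and a query vector;
# a smaller distance means a more similar pair.
def euclidean(d, q):
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

d1 = (0, 3, 2, 1, 10)  # document vector from the example
q = (2, 7, 1, 0, 0)    # query vector from the example
print(round(euclidean(d1, q), 3))  # 11.045
```

The squared differences are 4 + 16 + 1 + 1 + 100 = 122, so the distance is √122 ≈ 11.045.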


Inner Product

  • Similarity between vectors for the document dj and query q can be computed as the vector inner product:

sim(dj, q) = dj · q = Σi (wij × wiq)

where wij is the weight of term i in document j and wiq is the weight of term i in the query q

  • For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).
  • For weighted term vectors, it is the sum of the products of the weights of the matched terms.


Properties of Inner Product

  • Favors long documents with a large number of unique terms.
    • Again, the issue of normalization
  • Measures how many terms matched but not how many terms are not matched.


Inner Product -- Examples

  • Binary weights:
    • Size of vector = size of vocabulary = 7

         Retrieval  Database  Term  Computer  Text  Manage  Data
    D        1         1       1       0       1      1      0
    Q        1         0       1       0       0      1      1

sim(D, Q) = 3

  • Term weighted:

         Retrieval  Database  Architecture
    D1       2          3          5
    D2       3          7          1
    Q        1          0          2

sim(D1, Q) = 2*1 + 3*0 + 5*2 = 12

sim(D2, Q) = 3*1 + 7*0 + 1*2 = 5
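Both examples can be verified with a one-line inner product:

```python
# Inner product: for binary vectors it counts matched query terms;
# for weighted vectors it sums the products of matched weights.
def inner_product(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example (vocabulary of 7 terms)
D = (1, 1, 1, 0, 1, 1, 0)
Q = (1, 0, 1, 0, 0, 1, 1)
print(inner_product(D, Q))  # 3

# Weighted example (Retrieval, Database, Architecture)
D1, D2, Qw = (2, 3, 5), (3, 7, 1), (1, 0, 2)
print(inner_product(D1, Qw))  # 12
print(inner_product(D2, Qw))  # 5
```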


Inner Product: Example 1

[Figure: a weighted term-document matrix over documents d1–d7 and index terms k1–k3; the values are not recoverable from the slide.]


Inner Product: Exercise

[Figure: a weighted term-document matrix over documents d1–d7 and index terms k1–k3; the values are not recoverable from the slide.]


Cosine similarity

  • Measures the similarity between d1 and d2 as the cosine of the angle θ between them:

CosSim(d1, d2) = (d1 · d2) / (|d1| × |d2|) = Σi (wi1 × wi2) / ( √(Σi wi1²) × √(Σi wi2²) )

  • The denominator involves the lengths of the vectors.
  • So the cosine measure is also known as the normalized inner product.


Example: Computing Cosine Similarity

  • Let’s say we have query vector Q = (0.4, 0.8) and document D1 = (0.2, 0.7). Compute their similarity using cosine:

CosSim(Q, D1) = (0.4*0.2 + 0.8*0.7) / ( √(0.4² + 0.8²) × √(0.2² + 0.7²) ) = 0.64 / (0.894 × 0.728) ≈ 0.98


Cosine similarity

Cos(x, y) = x . y / ||x|| * ||y||

where,

  • x . y = dot product of the vectors ‘x’ and ‘y’.
  • ||x|| and ||y|| = lengths (Euclidean norms) of the two vectors ‘x’ and ‘y’.
  • ||x|| * ||y|| = product of the lengths of the two vectors ‘x’ and ‘y’.


Example: Consider an example to find the similarity between two vectors, ‘x’ and ‘y’, using cosine similarity.

The ‘x’ vector has values, x = { 3, 2, 0, 5 }
The ‘y’ vector has values, y = { 1, 0, 0, 0 }


The formula for calculating the cosine similarity is : 

Cos(x, y) = x . y / ||x|| * ||y||

x . y = 3*1 + 2*0 + 0*0 + 5*0 = 3

||x|| = √( 3² + 2² + 0² + 5² ) = √38 = 6.16

||y|| = √( 1² + 0² + 0² + 0² ) = 1

Cos(x, y) = 3 / (6.16 * 1) = 0.49

The dissimilarity between the two vectors ‘x’ and ‘y’ is given by –

Dis(x, y) = 1 - Cos(x, y) = 1 - 0.49 = 0.51

  • The cosine similarity between two vectors is measured in ‘θ’.
  • If θ = 0°, the ‘x’ and ‘y’ vectors overlap, thus proving they are similar.
  • If θ = 90°, the ‘x’ and ‘y’ vectors are dissimilar.
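The computation above can be checked in Python:

```python
import math

# Cosine similarity of two vectors; dissimilarity = 1 - cosine.
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (3, 2, 0, 5)
y = (1, 0, 0, 0)
sim = cosine(x, y)
print(round(sim, 2))      # 0.49
print(round(1 - sim, 2))  # 0.51
```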

 


Example: Computing Cosine Similarity


  • Let’s say we have two documents in our corpus, D1 = (0.8, 0.3) and D2 = (0.2, 0.7). Given query vector Q = (0.4, 0.8), determine which document is the most relevant to the query.
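A short sketch that answers the question by ranking both documents with cosine similarity:

```python
import math

# Cosine similarity between two 2-D vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
scores = {"D1": cosine(Q, D1), "D2": cosine(Q, D2)}
best = max(scores, key=scores.get)
print(scores)  # D1 ≈ 0.73, D2 ≈ 0.98
print(best)    # D2
```

D2 points in nearly the same direction as Q, so it is ranked first even though D1 and D2 have similar lengths.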


Calculate Ranking

Similarity of the query to the documents in the example: CosSim(Q, D1) = 0.56 / (0.894 × 0.854) ≈ 0.73 and CosSim(Q, D2) = 0.64 / (0.894 × 0.728) ≈ 0.98, so D2 is ranked above D1.


Example

  • Given three documents D1, D2 and D3 with the corresponding TF*IDF weights below, which documents are more similar under the three measures?

Terms       D1      D2      D3
affection   0.996   0.993   0.847
jealous     0.087   0.120   0.466
gossip      0.017   0.000   0.254
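A sketch that compares the three document pairs under the three measures introduced earlier (Euclidean distance, inner product, cosine similarity), using the weights from the table:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Component order: (affection, jealous, gossip)
D1 = (0.996, 0.087, 0.017)
D2 = (0.993, 0.120, 0.000)
D3 = (0.847, 0.466, 0.254)

pairs = {"D1-D2": (D1, D2), "D1-D3": (D1, D3), "D2-D3": (D2, D3)}
for name, (a, b) in pairs.items():
    print(f"{name}: dist={euclidean(a, b):.3f} "
          f"dot={dot(a, b):.3f} cos={cosine(a, b):.3f}")
```

All three measures agree that D1 and D2 are the most similar pair (smallest distance, largest inner product and cosine).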


Cosine Similarity vs. Inner Product

  • Cosine similarity measures the cosine of the angle between two vectors.
  • It is the inner product normalized by the vector lengths.

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81

D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13

Q = 0T1 + 0T2 + 2T3

D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.

CosSim(dj, q) = (dj · q) / (|dj| × |q|) = Σi (wij × wiq) / ( √(Σi wij²) × √(Σi wiq²) )

InnerProduct(dj, q) = dj · q = Σi (wij × wiq)
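The numbers above can be verified in a few lines:

```python
import math

# Vectors from the comparison: D1 = 2T1+3T2+5T3, D2 = 3T1+7T2+1T3, Q = 2T3.
def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos_sim(a, b):
    return inner(a, b) / math.sqrt(inner(a, a) * inner(b, b))

D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))  # 0.81 0.13
print(inner(D1, Q), inner(D2, Q))                          # 10 2
```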


Exercises

  • A database collection consists of 1 million documents, of which 200,000 contain the term holiday while 250,000 contain the term season. A document repeats holiday 7 times and season 5 times. It is known that holiday is repeated more than any other term in the document. Calculate the weight of both terms in this document using three different term-weighting methods. Try with:

(i) normalized and unnormalized TF;

(ii) TF*IDF based on normalized and unnormalized TF.
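A hedged worked sketch for the exercise, assuming log base 2 (as in the chapter's earlier examples) and max-tf normalization with max f = 7 (since holiday is the most repeated term):

```python
import math

N = 1_000_000                 # documents in the collection
terms = {"holiday": (7, 200_000), "season": (5, 250_000)}  # (f, df)
MAX_F = 7                     # holiday's count is the document maximum

for term, (f, df) in terms.items():
    tf_unnorm = f
    tf_norm = f / MAX_F
    idf = math.log2(N / df)
    print(f"{term}: tf={tf_unnorm}, tf_norm={tf_norm:.3f}, "
          f"tf*idf={tf_unnorm * idf:.3f}, tf_norm*idf={tf_norm * idf:.3f}")
```

With this choice of logarithm, holiday gets idf = log2(5) ≈ 2.32 and season gets idf = log2(4) = 2; a different log base would rescale but not reorder the weights.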


Measuring Distances in Vector Spaces

What Is Distance?

Both cosine similarity and Euclidean distance are methods for measuring the proximity between vectors in a vector space. It’s therefore important that we define what we mean by the distance between two vectors because, as we’ll soon see, this isn’t entirely obvious.


Term Operations:

There are various operations that are performed on terms in an IR System.

They are:

  • Stemming
  • Weighting
  • Thesaurus
  • Stoplist
  • Truncation

  1. Stemming: this operation conflates related word forms in an automated way, typically by reducing words to a common root.
  2. Weighting: this operation assigns numerical values to the indexing or query terms, taking into consideration information about the statistical distribution of the terms.
  3. Thesaurus: this operation relates words that have the same (or similar) meanings to each other.
  4. Stop list: this operation deals with words that have little indexing value. It removes potential indexing terms by checking for their presence in the stoplist.
  5. Truncation: this operation matches multiple words by placing wildcard characters in a term, so that the truncated term matches several word forms.


1. What is the idf of a term that occurs in every document?

If the term ti appears in every document of the corpus, idfi is equal to 0. The fewer documents the term ti appears in, the higher the idfi value. The measure called term frequency–inverse document frequency (tf-idf) is defined as tfij * idfi (Salton and McGill, 1986).

2. Can the TF-IDF weight of a term in a document exceed 1?

You may notice that the product of TF and IDF can be above 1. A final normalization step can then rescale the values so that TF-IDF always lies between 0 and 1.


idf (Inverse Document Frequency): it is used to determine how important a word is across all the documents.

How do you calculate IDF?

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).


Exercise 6.8 Why is the idf of a term always finite? If we assume that DF(T) ≠ 0 (i.e. that the respective query term T appears at least once), we can give an upper bound for the IDF (for DF(T) = 1) as well as a lower bound (for DF(T) = N, see below). Because we are dealing with a fixed, finite document set, there is also a finite set of possible values for the IDF, all lying between these two extremes.

Exercise 6.9 What is the idf of a term that occurs in every document? Compare this with the use of stop word lists. It is 0. For a word that occurs in every document, putting it on the stop list has the same effect as idf weighting: the word is ignored.

Exercise 6.10 Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in Figure 6.9. Compute the tf–idf weights for the terms car , auto , insurance , and best , for each document, using the idf values from Figure 6.8 (806,791)



IDF, or inverse document frequency, is a measure of how much information the word provides, that is, whether the term is common or rare across all documents (src: Wikipedia). To calculate idf, use:

idf(t) = log( N / df(t) )

In case the term t isn't present in the corpus, to avoid division by zero the denominator is adjusted to:

idf(t) = log( N / (1 + df(t)) )

So, for a term that occurs in every document, the adjusted formula gives idf = log( N / (N + 1) ), which is very close to 0 for large N.


IDF is used to find the relevance of a given term across all the documents and not just a single document.

Because TF increases proportionally to the number of times the word appears in a document, it needs to be offset by the frequency of the word in the corpus - this is done by IDF.

When the number of documents is very large, rarely occurring terms would be overlooked if IDF were not used. Since the numerator of the IDF is the total number of documents N, we obtain an inverse of the document frequency, which boosts values that would otherwise be too small to work with.

In one sentence, IDF balances the weights of terms that occur too frequently, which would otherwise distort the statistics.

TF-IDF in itself helps us find the similarity between the documents and subsequently classify them.


Index Terms Selection:

  • Tokenize: identify words in a document, so that each document is represented by a list of keywords or attributes.
  • Stop words: removal of high-frequency words; a stop list of words is used for comparing against the input text.
  • Stemming and normalization: reduce words with similar meaning to their stem/root word. Suffix stripping is the common method.
  • Weighting terms: different index terms have varying importance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document. There are different index-term weighting statistics (TF, DF, CF) from which the TF*IDF weight can be calculated during searching.
