Chapter 2
Term Weighting and Similarity Measures
03/01/20
Gashaw D.
Terms
Documents are represented by binary-weighted or non-binary-weighted vectors of terms.
w = 0 if a term is absent.
Document Collection
 | T1 | T2 | ... | Tt |
D1 | w11 | w21 | ... | wt1 |
D2 | w12 | w22 | ... | wt2 |
: | : | : | ... | : |
Dn | w1n | w2n | ... | wtn |
Term Vector Space
n-dimensional space, where n is the number of different terms/tokens used to index a set of documents.
Vector
Document i, di, is represented by a vector whose component in dimension j is wij,
where: wij > 0 if term j occurs in document i
wij = 0 otherwise
wij is the weight of term j in document i.
Binary Weights
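The slide's table appears to be missing from this extraction. As a minimal sketch of binary weighting — 1 if a term occurs in the document, 0 otherwise — consider the following, where the three example documents are hypothetical and not taken from the slides:

```python
# Binary weighting: w_ij = 1 if term j occurs in document i, else 0.
# The three documents below are hypothetical illustrations.
docs = ["information retrieval system",
        "retrieval of text data",
        "data mining system"]

# Vocabulary = all distinct terms across the collection, sorted for stable columns.
vocab = sorted({t for d in docs for t in d.split()})

# One binary row vector per document.
binary = [[1 if term in d.split() else 0 for term in vocab] for d in docs]

for d, row in zip(docs, binary):
    print(row, "<-", d)
```

Each row is the document's vector in the term vector space described above, with every component either 0 or 1.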
Why use term weighting?
Term Weighting: Term Frequency (TF)
fij = frequency of term i in document j
tfij = fij / maxi{fij}  (raw frequency normalized by the count of the most frequent term in document j)
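The normalized TF formula above can be sketched in a few lines of Python; the sample sentence is only an illustration:

```python
from collections import Counter

def term_frequencies(doc_tokens):
    """tf_ij = f_ij / max{f_ij}: raw count divided by the most frequent term's count."""
    counts = Counter(doc_tokens)
    max_f = max(counts.values())
    return {term: f / max_f for term, f in counts.items()}

tf = term_frequencies("to be or not to be".split())
print(tf)  # 'to' and 'be' get 1.0; 'or' and 'not' get 0.5
```

The most frequent term in a document always receives tf = 1, so the weights are comparable across documents of different lengths.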
Document Normalization
Problems with term frequency
The example shows that collection frequency and document frequency behave differently.
Document Frequency
DF = document frequency
dfi = document frequency of term i
    = number of documents containing term i
Inverse Document Frequency (IDF)
idfi = inverse document frequency of term i
     = log2 (N / dfi)   (N: total number of documents)
Inverse Document Frequency
Word | N | DF | IDF |
the | 1000 | 1000 | 0 |
some | 1000 | 100 | 3.322 |
car | 1000 | 10 | 6.644 |
merge | 1000 | 1 | 9.966 |
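The table's IDF values use a base-2 logarithm, which a short script can reproduce:

```python
import math

N = 1000  # total number of documents, as in the table above
idf = {word: math.log2(N / df)
       for word, df in [("the", 1000), ("some", 100), ("car", 10), ("merge", 1)]}

for word, value in idf.items():
    print(f"{word:6s} idf = {value:.3f}")
```

A term appearing in every document ("the") gets idf = 0, while the rarest term ("merge") gets the highest weight.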
TF*IDF Weighting
wij = tfij · idfi = tfij · log2 (N / dfi)
Computing TF-IDF: An Example
A: tf = 3/3=1.00; idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3=0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
C: tf = 1/3=0.33; idf = log2(10000/250) = 5.322; tf*idf = 1.774
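The three cases above can be reproduced with the normalized-TF and log2-IDF formulas, assuming (as the example implies) N = 10,000 documents and a maximum raw term frequency of 3 in the document:

```python
import math

N = 10_000          # total documents assumed in the example
max_f = 3           # the most frequent term in the document occurs 3 times
cases = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}  # case -> (raw freq, df)

weights = {}
for name, (f, df) in cases.items():
    tf = f / max_f
    idf = math.log2(N / df)
    weights[name] = tf * idf
    print(f"{name}: tf={tf:.2f}, idf={idf:.3f}, tf*idf={tf * idf:.3f}")
```

Note that term A, though no more frequent in the document than it could be, dominates because it is rare in the collection.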
More Example
tf = 3/100 = 0.03
idf = log2(10,000,000 / 1,000) = 13.288
tf·idf = 0.03 × 13.288 ≈ 0.399
Exercise
Complete the table, where C = term count in the collection, TW = total words in the collection, TD = total number of documents, and DF = document frequency.
Word | C | TW | TD | DF | TF | IDF | TFIDF |
airplane | 5 | 46 | 3 | 1 | 0.109 | 1.585 | 0.172 |
blue | 1 | 46 | 3 | 1 | | | |
chair | 7 | 46 | 3 | 3 | | | |
computer | 3 | 46 | 3 | 1 | | | |
forest | 2 | 46 | 3 | 1 | | | |
justice | 7 | 46 | 3 | 3 | | | |
love | 2 | 46 | 3 | 1 | | | |
might | 2 | 46 | 3 | 1 | | | |
perl | 5 | 46 | 3 | 2 | | | |
rose | 6 | 46 | 3 | 3 | | | |
shoe | 4 | 46 | 3 | 1 | | | |
thesis | 2 | 46 | 3 | 2 | | | |
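The remaining rows follow the same arithmetic as the filled-in airplane row (TF = C/TW, IDF = log2(TD/DF)); a short script fills them in:

```python
import math

TW, TD = 46, 3  # total words in the collection, total number of documents
counts = {"airplane": (5, 1), "blue": (1, 1), "chair": (7, 3), "computer": (3, 1),
          "forest": (2, 1), "justice": (7, 3), "love": (2, 1), "might": (2, 1),
          "perl": (5, 2), "rose": (6, 3), "shoe": (4, 1), "thesis": (2, 2)}

table = {}
for word, (c, df) in counts.items():
    tf = c / TW                  # term count over total words
    idf = math.log2(TD / df)     # log2 of total docs over document frequency
    table[word] = (tf, idf, tf * idf)
    print(f"{word:9s} tf={tf:.3f} idf={idf:.3f} tf*idf={tf * idf:.3f}")
```

Terms that occur in all three documents (chair, justice, rose) get idf = log2(3/3) = 0 and therefore a zero TF*IDF weight.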
Concluding remarks
This leads to the use of TF*IDF as a better weighting technique.
Similarity Measures
Since queries and documents are both vectors, a similarity measure can represent the similarity between two documents, two queries, or a document and a query.
Similarity Measure
[Figure: documents D1 and D2 and query Q as vectors in a space of terms t1, t2, t3, with angles θ1 and θ2 between Q and each document.]
Intuition
Postulate: Documents that are "close together" in the vector space talk about the same things and are more similar to each other than to other documents.
[Figure: documents d1–d5 as vectors over terms t1, t2, t3; the angles θ and φ between vectors illustrate proximity in the space.]
Term Similarity: Basic Concept
Two documents are similar if they contain some of the same terms.
Possible measures of similarity might take into consideration:
(a) The lengths of the documents
(b) The number of terms in common
(c) Whether the terms are common or unusual
(d) How many times each term appears
Similarity Measure
Desiderata for proximity
Similarity Measure: Techniques
Euclidean Distance:
Euclidean distance is the traditional metric for geometric problems.
It is simply the ordinary straight-line distance between two points and is one of the most widely used measures in cluster analysis; k-means is one algorithm that relies on it. Mathematically, it computes the root of the squared differences between the coordinates of two objects.
Euclidean distance
dist(dj, q) = |dj − q| = √( Σi (wij − wiq)² )
where wij is the weight of term i in document j and wiq is the weight of term i in the query
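A minimal sketch of the distance formula; the weight vectors used here are hypothetical:

```python
import math

def euclidean_distance(d, q):
    """Square root of the summed squared differences between term weights."""
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

# Hypothetical weight vectors over the same three index terms.
dj = [2, 3, 5]
q = [0, 0, 2]
print(euclidean_distance(dj, q))  # √(4 + 9 + 9) = √22 ≈ 4.69
```

Unlike the similarity measures below, smaller values mean closer vectors, so rankings by Euclidean distance run in the opposite direction.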
Inner Product
sim(dj, q) = dj • q = Σi (wij · wiq)
where wij is the weight of term i in document j and wiq is the weight of term i in the query q
Properties of Inner Product
Inner Product -- Examples
Binary:
 | Retrieval | Database | Term | Computer | Text | Manage | Data |
D | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
Q | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
sim(D, Q) = 1·1 + 1·0 + 1·1 + 0·0 + 1·0 + 1·1 + 0·1 = 3
Weighted:
 | Retrieval | Database | Architecture |
D1 | 2 | 3 | 5 |
D2 | 3 | 7 | 1 |
Q | 1 | 0 | 2 |
sim(D1, Q) = 2·1 + 3·0 + 5·2 = 12
sim(D2, Q) = 3·1 + 7·0 + 1·2 = 5
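Both examples reduce to a dot product over the term columns, which a few lines verify:

```python
def inner_product(d, q):
    """Sum of pairwise products of term weights (dot product)."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example (terms: Retrieval, Database, Term, Computer, Text, Manage, Data)
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))  # 3

# Weighted example (terms: Retrieval, Database, Architecture)
D1, D2, Qw = [2, 3, 5], [3, 7, 1], [1, 0, 2]
print(inner_product(D1, Qw), inner_product(D2, Qw))  # 12 5
```

With binary vectors the inner product simply counts shared terms; with weighted vectors it favors documents whose heavy terms match the query's.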
Inner Product: Example 1
[Figure: documents d1–d7 linked to index terms k1, k2, k3; compute inner products from the weights shown.]
Inner Product: Exercise
[Figure: documents d1–d7 linked to index terms k1, k2, k3; compute inner products from the weights shown.]
Cosine similarity
Example: Computing Cosine Similarity
Cosine similarity
cos(x, y) = (x · y) / (||x|| · ||y||)
where x · y is the dot product and ||x|| is the Euclidean length (norm) of vector x.
Example: find the similarity between two vectors, x and y, using cosine similarity.
x = {3, 2, 0, 5}
y = {1, 0, 0, 0}
The formula for calculating the cosine similarity is:
cos(x, y) = (x · y) / (||x|| · ||y||)
x · y = 3·1 + 2·0 + 0·0 + 5·0 = 3
||x|| = √(3² + 2² + 0² + 5²) = 6.16
||y|| = √(1² + 0² + 0² + 0²) = 1
∴ cos(x, y) = 3 / (6.16 · 1) = 0.49
The dissimilarity between the two vectors x and y is:
∴ dis(x, y) = 1 − cos(x, y) = 1 − 0.49 = 0.51
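The worked example above can be checked directly:

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x, y = [3, 2, 0, 5], [1, 0, 0, 0]
cos = cosine_similarity(x, y)
print(round(cos, 2), round(1 - cos, 2))  # 0.49 0.51
```

Because cosine similarity depends only on the angle between the vectors, scaling either vector (e.g., doubling every weight in x) leaves the result unchanged.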
Example: Computing Cosine Similarity
[Figure: chart of cosine similarity values (0.2–1.0) for the example documents; the chart itself was lost in extraction.]
Calculate Ranking
Similarity of query to documents in example:
Example
Terms | D1 | D2 | D3 |
affection | 0.996 | 0.993 | 0.847 |
jealous | 0.087 | 0.120 | 0.466 |
gossip | 0.017 | 0.000 | 0.254 |
Cosine Similarity vs. Inner Product
D1 = 2T1 + 3T2 + 5T3 CosSim(D1 , Q) = 10 / √(4+9+25)(0+0+4) = 0.81
D2 = 3T1 + 7T2 + 1T3 CosSim(D2 , Q) = 2 / √(9+49+1)(0+0+4) = 0.13
Q = 0T1 + 0T2 + 2T3
D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.
CosSim(dj, q) = (dj · q) / (|dj| · |q|) = Σi (wij · wiq) / √( Σi wij² · Σi wiq² )
InnerProduct(dj, q) = dj · q = Σi (wij · wiq)
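The comparison above is easy to verify numerically:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos_sim(a, b):
    """Dot product normalized by the two vector lengths."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))  # 0.81 0.13
print(dot(D1, Q), dot(D2, Q))                              # 10 2
```

Both measures rank D1 above D2, but cosine normalizes away document length, which is why the margin differs (0.81/0.13 ≈ 6 versus 10/2 = 5).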
Exercises
(ii) TF*IDF based on normalized and unnormalized TF
Measuring Distances in Vector Spaces
2.1. What Is Distance?
Both cosine similarity and Euclidean distance are methods for measuring the proximity between vectors in a vector space. It is therefore important to define what we mean by the distance between two vectors, because, as we'll soon see, this isn't exactly obvious.
Let’s start by studying the case described in this image:
Term Operations:
There are various operations that are performed on terms in an IR System.
They are:
Stemming
Weighting
Thesaurus
Stoplist
Truncation
2. Can the TF-IDF weight of a term in a document exceed 1?
Yes: the product of TF and IDF can be above 1. The last step is then to normalize these values (for example, to unit vector length) so that TF-IDF weights always lie between 0 and 1.
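One common way to do this normalization is to scale each document's tf-idf vector to unit (L2) length; this is a sketch of that step, reusing the weights from the earlier A/B/C example:

```python
import math

def l2_normalize(weights):
    """Scale a tf-idf vector to unit length; when all raw weights are
    non-negative, every normalized component then lies in [0, 1]."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else list(weights)

# tf-idf weights for terms A, B, C from the earlier worked example.
normalized = l2_normalize([7.644, 1.962, 1.774])
print(normalized)
```

After this step, the cosine similarity between two documents is just the inner product of their normalized vectors.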
IDF (Inverse Document Frequency) is used to determine how important a word is across all the documents.
How do you calculate IDF?
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
Exercise 6.8: Why is the idf of a term always finite? If we assume that DF(T) ≠ 0 (i.e., that the respective query term T appears at least once), we can give an upper bound for the IDF (for DF(T) = 1) as well as a lower bound (for DF(T) = N, see below). Because we are dealing with a fixed, finite document set, there is also a finite set of possible IDF values, all lying between these two extremes.
Exercise 6.9 What is the idf of a term that occurs in every document? Compare this with the use of stop word lists. It is 0. For a word that occurs in every document, putting it on the stop list has the same effect as idf weighting: the word is ignored.
Exercise 6.10 Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in Figure 6.9. Compute the tf–idf weights for the terms car , auto , insurance , and best , for each document, using the idf values from Figure 6.8 (806,791)
IDF, or inverse document frequency, is a measure of how much information the word provides, that is, whether the term is common or rare across all documents (source: Wikipedia). To calculate idf:
idf(t) = log( N / df(t) )
In case the term t isn't present in the corpus, to avoid division by zero the denominator is adjusted:
idf(t) = log( N / (1 + df(t)) )
So, for a term that occurs in every document: idf = log( N / (N + 1) ); for N >> 0, idf → 0 (for a corpus of size 1M, the idf of a term present in every document is on the order of 10⁻⁶ in magnitude).
IDF is used to find the relevance of a given term across all the documents and not just a single document.
Because TF increases proportionally to the number of times the word appears in a document, it needs to be offset by the frequency of the word in the corpus - this is done by IDF.
When the number of documents is very large, terms that appear rarely would be drowned out if IDF were not used. Since the numerator of the IDF is the total number of documents N, we obtain an inverse of the frequency that would otherwise be too small to work with.
In one sentence, IDF balances the values of the terms occurring too frequently which may add to incorrect statistic otherwise.
TF-IDF in itself helps us find the similarity between the documents and subsequently classify them.
Index Terms Selection:
o Tokenize: identify words in a document, so that each document is represented by a list of keywords or attributes
o Stop words: removal of high-frequency words; a stop list of words is used for comparing against the input text
o Stemming and Normalization: reduce words with similar meaning into their stem/root word; suffix stripping is the common method
o Weighting terms: different index terms have varying importance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document. There are different index term weighting methods (TF, DF, CF), from which the TF*IDF weight can be calculated during searching.