Measures of Similarity and Dissimilarity
Unit - II
Data Mining
Transformations
Dissimilarities between Data Objects
Euclidean Distance
If d(x, y) is the distance between two points, x and y, then the following properties hold.
1. Positivity
(a) d(x, y) ≥ 0 for all x and y,
(b) d(x, y) = 0 only if x = y.
2. Symmetry
d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
Note: Measures that satisfy all three properties are known as metrics.
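The three properties can be checked numerically for a concrete measure. The sketch below assumes Euclidean distance and three arbitrarily chosen points (both are illustrative choices, not taken from the slides) and tests each property in turn. Passing on a few points does not prove that a measure is a metric; it only shows what each property says.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Illustrative points, chosen arbitrarily.
x, y, z = (0, 2), (2, 0), (3, 1)

# 1. Positivity: distances are non-negative, and d(x, x) is 0.
positivity = euclidean(x, y) >= 0 and euclidean(x, x) == 0
# 2. Symmetry: the order of the arguments does not matter.
symmetry = euclidean(x, y) == euclidean(y, x)
# 3. Triangle inequality: going via y is never shorter than going directly.
triangle = euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)

print(positivity, symmetry, triangle)  # True True True
```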
Non-metric Dissimilarities: Set Differences
Let A = {1, 2, 3, 4} and B = {2, 3, 4}.
Then A − B = {1} and B − A = ∅, the empty set.
If d(A, B) = size(A − B), then d does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality.
The modified measure d(A, B) = size(A − B) + size(B − A) satisfies all three properties.
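The set-difference example can be reproduced with Python's built-in set type. This is a minimal sketch; the function names d_asym and d_sym are hypothetical labels for the two measures.

```python
def d_asym(a, b):
    """size(A - B): number of elements of a that are missing from b."""
    return len(a - b)

def d_sym(a, b):
    """size(A - B) + size(B - A): size of the symmetric difference."""
    return len(a - b) + len(b - a)

A = {1, 2, 3, 4}
B = {2, 3, 4}

# One-sided measure: d(B, A) is 0 although A != B, and d(A, B) != d(B, A).
print(d_asym(A, B), d_asym(B, A))  # 1 0

# Two-sided measure: symmetric, and 0 only when the sets are equal.
print(d_sym(A, B), d_sym(B, A))    # 1 1
```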
Non-metric Dissimilarities: Time
A dissimilarity measure that is not a metric but is still useful: the time that elapses from one time of day until the next occurrence of another.
d(1PM, 2PM) = 1 hour
d(2PM, 1PM) = 23 hours
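The two values above can be reproduced with modular arithmetic. The sketch assumes times are given as whole hours on a 24-hour clock (13 for 1PM, 14 for 2PM); the function name time_distance is hypothetical.

```python
def time_distance(t1, t2):
    """Hours that elapse going forward from hour t1 to hour t2 (both 0-23)."""
    return (t2 - t1) % 24

print(time_distance(13, 14))  # 1   (1PM to 2PM)
print(time_distance(14, 13))  # 23  (2PM to 1PM the next day)
```

The measure is not symmetric, which is why it is not a metric. Note that this version returns 0 for identical times.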
Distance in Python
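As a minimal sketch, Euclidean distance can be computed with NumPy either directly from the definition (the square root of the summed squared differences) or with numpy.linalg.norm applied to the difference vector. The two points are illustrative values, not taken from the slides.

```python
import numpy as np

# Illustrative points, chosen so the result is a round number.
x = np.array([0.0, 2.0])
y = np.array([3.0, 6.0])

# Directly from the definition: square root of the summed squared differences.
manual = np.sqrt(np.sum((x - y) ** 2))

# Equivalent: the L2 norm of the difference vector.
via_norm = np.linalg.norm(x - y)

print(manual, via_norm)  # 5.0 5.0
```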
Similarities between Data Objects
Examples of proximity measures
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient (SMC)
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
Jaccard Coefficient
J = f11 / (f01 + f10 + f11)
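The four counts and both coefficients can be computed in a few lines. This is a sketch using the standard definitions, which are written out in the docstrings; the function names and the two binary vectors are illustrative choices.

```python
def binary_counts(x, y):
    """Return (f00, f01, f10, f11) for two equal-length 0/1 sequences."""
    pairs = list(zip(x, y))
    f00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    f01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    f10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    f11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    return f00, f01, f10, f11

def smc(x, y):
    """SMC = (f11 + f00) / (f00 + f01 + f10 + f11): counts all matches."""
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f11 + f00) / (f00 + f01 + f10 + f11)

def jaccard(x, y):
    """J = f11 / (f01 + f10 + f11): ignores 0-0 matches.

    Undefined (division by zero) when both vectors are all zeros.
    """
    f00, f01, f10, f11 = binary_counts(x, y)
    return f11 / (f01 + f10 + f11)

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

print(binary_counts(x, y))  # (7, 2, 1, 0)
print(smc(x, y))            # 0.7
print(jaccard(x, y))        # 0.0
```

The two vectors share many 0-0 matches but no 1-1 matches, so SMC rates them as fairly similar while Jaccard rates them as completely dissimilar.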
Cosine similarity (Document similarity)
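If x and y are two document vectors, then cos(x, y) = (x · y) / (‖x‖ ‖y‖), where x · y is the dot product of the vectors and ‖x‖ is the length (Euclidean norm) of x.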
# import required libraries
import numpy as np
from numpy.linalg import norm
# define two arrays
A = np.array([2,1,2,3,2,9])
B = np.array([3,4,2,4,5,5])
print("A:", A)
print("B:", B)
# compute cosine similarity
cosine = np.dot(A,B)/(norm(A)*norm(B))
print("Cosine Similarity:", cosine)
Note: Dividing x and y by their lengths normalizes them to have a length of 1, which means cosine similarity does not take the magnitude of the vectors into account.
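The effect of normalization can be checked directly. The sketch below reuses the vectors A and B from the earlier example; the helper name cosine_similarity and the scaling factor 10 are illustrative assumptions.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(x, y):
    """Dot product divided by the product of the vector lengths."""
    return np.dot(x, y) / (norm(x) * norm(y))

A = np.array([2, 1, 2, 3, 2, 9])
B = np.array([3, 4, 2, 4, 5, 5])

original = cosine_similarity(A, B)
# Scaling one vector changes its magnitude but not its direction.
scaled = cosine_similarity(10 * A, B)
# Dot product of the unit-length versions of A and B.
unit = np.dot(A / norm(A), B / norm(B))

print(np.isclose(original, scaled), np.isclose(original, unit))  # True True
```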
Extended Jaccard Coefficient (Tanimoto Coefficient)
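A minimal sketch of the coefficient, assuming its usual definition EJ(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y). The function name and sample vectors are illustrative. For binary vectors the value reduces to the Jaccard coefficient.

```python
import numpy as np

def extended_jaccard(x, y):
    """EJ(x, y) = x.y / (||x||^2 + ||y||^2 - x.y)."""
    dot = np.dot(x, y)
    # np.dot(v, v) is the squared length of v.
    return dot / (np.dot(x, x) + np.dot(y, y) - dot)

# Binary vectors with f11 = 2, f10 = 1, f01 = 1, so Jaccard = 2 / 4.
x = np.array([1, 0, 1, 1])
y = np.array([1, 1, 0, 1])

print(extended_jaccard(x, y))  # 0.5
```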
Pearson’s correlation
Pearson’s correlation (automatic, with scipy.stats.pearsonr())
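A short sketch of the library route, assuming SciPy is installed. scipy.stats.pearsonr returns the correlation coefficient together with a p-value; the sample data are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# The result unpacks into the coefficient and the p-value.
r, p_value = pearsonr(x, y)

print(round(r, 4))  # 0.7746
```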
Pearson’s correlation (manual, in Python)
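A minimal manual version, assuming the usual definition: the covariance of x and y divided by the product of their standard deviations. The normalizing factors cancel, so the code works with sums of deviations from the mean. The function name and sample data are illustrative.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation from deviations about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    # Sum of cross-products divided by the root of the product of
    # the sums of squares.
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson(x, y), 4))  # 0.7746

# y = -x / 3 is a perfect negative linear relationship.
print(pearson([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))  # -1.0
```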