1 of 29

Measures of Similarity and Dissimilarity

Unit - II

Data Mining

2 of 29

  • Similarity and dissimilarity are important because many data mining techniques use them, such as
    • clustering,
    • nearest neighbor classification, and
    • anomaly detection.
  • Proximity refers to either similarity or dissimilarity. Two cases are covered:
    • proximity between objects having only one simple attribute, and
    • proximity measures for objects with multiple attributes.

3 of 29

  • Similarity between two objects is a numerical measure of the degree to which the two objects are alike.
    • Higher for pairs of objects that are more alike.
    • Usually non-negative,
    • often between 0 (no similarity) and 1 (complete similarity).
  • Dissimilarity between two objects is a numerical measure of the degree to which the two objects are different.
    • Lower for pairs of objects that are more similar.
    • Distance is often used as a synonym for dissimilarity.

4 of 29

Transformations

  • Transformations are often applied to
    • convert a similarity to a dissimilarity,
    • convert a dissimilarity to a similarity, or
    • transform a proximity measure to fall within a particular range, such as [0, 1].
  • Example
    • Suppose similarities between objects range from 1 (not at all similar) to 10 (completely similar).
    • We can make them fall within the range [0, 1] by using the transformation
      • s’ = (s − 1)/9
      • s - original similarity value
      • s’ - new similarity value
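
The rescaling in the example can be sketched in a few lines of Python. The helper names `rescale_similarity` and `similarity_to_dissimilarity` are made up for illustration, and d = 1 − s is only one common way (assumed here, not stated on the slide) to turn a [0, 1] similarity into a dissimilarity.

```python
def rescale_similarity(s, s_min=1.0, s_max=10.0):
    """Map a similarity s from [s_min, s_max] onto [0, 1]."""
    if not s_min <= s <= s_max:
        raise ValueError("s must lie within [s_min, s_max]")
    return (s - s_min) / (s_max - s_min)


def similarity_to_dissimilarity(s_unit):
    """Assumed conversion for values already in [0, 1]: d = 1 - s."""
    return 1.0 - s_unit


for s in (1, 5.5, 10):
    s_new = rescale_similarity(s)
    print(s, "->", s_new, "dissimilarity:", similarity_to_dissimilarity(s_new))
```

With the default bounds the function is exactly s’ = (s − 1)/9; other ranges only need different `s_min` and `s_max`.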

6 of 29

Dissimilarities between Data Objects

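Euclidean Distance

d(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)² )

where n is the number of dimensions (attributes), and x_k and y_k are the k-th attributes of x and y.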

7 of 29

If d(x, y) is the distance between two points, x and y, then the following properties hold.

1. Positivity

(a) d(x, y) ≥ 0 for all x and y,

(b) d(x, y) = 0 only if x = y.

2. Symmetry

d(x, y) = d(y, x) for all x and y.

3. Triangle Inequality

d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Note: Measures that satisfy all three properties are known as metrics.
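
The three properties can be spot-checked numerically. The sketch below uses hypothetical helpers `euclidean` and `looks_like_metric` and an arbitrary set of sample points. Passing the check is only evidence on those points, not a proof that a measure is a metric.

```python
import itertools
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def looks_like_metric(d, points, tol=1e-12):
    """Test positivity, symmetry and the triangle inequality on sample points."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                        # 1(a): never negative
            return False
        if (d(x, y) == 0) != (x == y):         # 1(b): zero only for identical points
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # 2: symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # 3: triangle inequality
            return False
    return True


points = [(0, 2), (2, 0), (3, 1), (5, 1)]
print(looks_like_metric(euclidean, points))

# Squared difference on 1-D points breaks the triangle inequality
squared = lambda x, y: (x[0] - y[0]) ** 2
print(looks_like_metric(squared, [(0,), (1,), (2,)]))
```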

12 of 29

Non-metric Dissimilarities: Set Differences

If A = {1, 2, 3, 4} and B = {2, 3, 4},

then A − B = {1} and

B − A = ∅, the empty set.

If d(A, B) = size(A − B), then d does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality.

Modified measure that satisfies all three properties: d(A, B) = size(A − B) + size(B − A)
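
A short sketch with Python's built-in `set` type makes the difference between the two definitions visible. The function names `d_one_sided` and `d_symmetric` are illustrative only.

```python
def d_one_sided(a, b):
    """size(A - B): number of elements of A that are missing from B."""
    return len(a - b)


def d_symmetric(a, b):
    """size(A - B) + size(B - A): size of the symmetric difference."""
    return len(a - b) + len(b - a)


A = {1, 2, 3, 4}
B = {2, 3, 4}

# One-sided version: the two directions disagree, and d(B, A) is 0 although B != A
print(d_one_sided(A, B), d_one_sided(B, A))

# Modified version: both directions agree
print(d_symmetric(A, B), d_symmetric(B, A))
```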

13 of 29

Non-metric Dissimilarities: Time

A dissimilarity measure that is not a metric, but is still useful:

d(1PM, 2PM) = 1 hour

d(2PM, 1PM) = 23 hours

  • Example: this measure answers the question “If an event occurs at 1PM every day, and it is now 2PM, how long do I have to wait for that event to occur again?”
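
One way to express this measure, assuming times are given as whole hours on a 24-hour clock, is modular arithmetic. The function name `hours_between` is hypothetical.

```python
def hours_between(t1, t2, period=24):
    """Hours that elapse going forward in time from t1 until the next t2.

    Times are hours on a 24-hour clock (13 means 1PM).
    Returns 0 when both times coincide.
    """
    return (t2 - t1) % period


print(hours_between(13, 14))  # d(1PM, 2PM)
print(hours_between(14, 13))  # d(2PM, 1PM): waiting for 1PM when it is now 2PM
```

Because the two directions give different values, the measure is not symmetric and therefore not a metric.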

14 of 29

Distance in Python
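
The code for this slide is not available in the text, so the following is only a sketch of how Euclidean distance might be computed with NumPy. The sample points are arbitrary.

```python
import numpy as np

x = np.array([0.0, 2.0])
y = np.array([2.0, 0.0])

# Euclidean distance written out from its definition
d_manual = np.sqrt(np.sum((x - y) ** 2))

# Same value via the 2-norm of the difference vector
d_norm = np.linalg.norm(x - y)
print(d_manual, d_norm)

# Pairwise distance matrix for several points using broadcasting
points = np.array([[0.0, 2.0], [2.0, 0.0], [3.0, 1.0], [5.0, 1.0]])
diff = points[:, None, :] - points[None, :, :]
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist_matrix, 3))
```

The matrix is symmetric with zeros on the diagonal, as expected for a metric.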

15 of 29

Similarities between Data Objects

  • Typical properties of similarities are the following:
    • 1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
    • 2. s(x, y) = s(y, x) for all x and y. (Symmetry)
  • A Non-symmetric Similarity Measure
    • Task: classify a small set of characters, each flashed on a screen.
    • Confusion matrix - records how often each character is classified as itself, and how often each is classified as another character.
    • “0” appeared 200 times and was classified as
      • “0” 160 times,
      • “o” 40 times.
    • “o” appeared 200 times and was classified as
      • “o” 170 times,
      • “0” only 30 times.
  • A similarity measure can be made symmetric by setting
    • s’(x, y) = s’(y, x) = (s(x, y) + s(y, x))/2,
      • s’ - new similarity measure.
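
The symmetrization can be illustrated with the counts from the example. Treating s(x, y) as the fraction of times x was shown and classified as y is an assumption made for this sketch; the names `counts`, `s` and `s_sym` are illustrative.

```python
# Rows: character shown. Columns: character it was classified as.
counts = {
    "0": {"0": 160, "o": 40},
    "o": {"o": 170, "0": 30},
}


def s(x, y):
    """Fraction of times that x was shown and classified as y."""
    shown = sum(counts[x].values())
    return counts[x][y] / shown


def s_sym(x, y):
    """Symmetrized similarity: the average of both directions."""
    return (s(x, y) + s(y, x)) / 2


print(s("0", "o"), s("o", "0"))          # the two directions differ
print(s_sym("0", "o"), s_sym("o", "0"))  # identical after averaging
```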

16 of 29

Examples of proximity measures

  • Similarity Measures for Binary Data
    • Similarity measures between objects that contain only binary attributes are called similarity coefficients

    • Let x and y be two objects that consist of n binary attributes.

    • The comparison of two objects (or two binary vectors), leads to the following four quantities (frequencies):

f00 = the number of attributes where x is 0 and y is 0

f01 = the number of attributes where x is 0 and y is 1

f10 = the number of attributes where x is 1 and y is 0

f11 = the number of attributes where x is 1 and y is 1

17 of 29

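Simple Matching Coefficient (SMC)

SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f01 + f10 + f11 + f00)

Jaccard Coefficient

J = (number of 11 matches) / (number of attributes not involved in 00 matches) = f11 / (f01 + f10 + f11)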
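
As a sketch, both coefficients can be computed from the four frequencies f00, f01, f10 and f11. The function names and the two example vectors are chosen for illustration, and returning 0.0 for two all-zero vectors is a convention assumed here.

```python
def binary_frequencies(x, y):
    """Return (f00, f01, f10, f11) for two equal-length 0/1 sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    f = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(x, y):
        f[(a, b)] += 1
    return f[(0, 0)], f[(0, 1)], f[(1, 0)], f[(1, 1)]


def smc(x, y):
    """SMC = (f11 + f00) / (f00 + f01 + f10 + f11)."""
    f00, f01, f10, f11 = binary_frequencies(x, y)
    return (f11 + f00) / (f00 + f01 + f10 + f11)


def jaccard(x, y):
    """J = f11 / (f01 + f10 + f11); 0-0 matches are ignored."""
    f00, f01, f10, f11 = binary_frequencies(x, y)
    denom = f01 + f10 + f11
    return f11 / denom if denom else 0.0  # assumed convention for all-zero vectors


x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_frequencies(x, y))
print("SMC:", smc(x, y))
print("Jaccard:", jaccard(x, y))
```

For sparse vectors like these, SMC is high because of the many 0-0 matches, while the Jaccard coefficient is 0 because the objects share no 1s.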

19 of 29

Cosine similarity (Document similarity)

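If x and y are two document vectors, then

cos(x, y) = (x · y) / (‖x‖ ‖y‖)

where x · y is the dot product of x and y, and ‖x‖ is the length of vector x.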

21 of 29

Cosine similarity (Document similarity)

# import required libraries
import numpy as np
from numpy.linalg import norm

# define two vectors as NumPy arrays
A = np.array([2, 1, 2, 3, 2, 9])
B = np.array([3, 4, 2, 4, 5, 5])
print("A:", A)
print("B:", B)

# compute cosine similarity: dot product divided by the product of the vector lengths
cosine = np.dot(A, B) / (norm(A) * norm(B))
print("Cosine Similarity:", cosine)

22 of 29

Cosine similarity (Document similarity)

  • Cosine similarity is a measure of the (cosine of the) angle between x and y.
  • Cosine similarity = 1: the angle is 0°, and x and y are the same except for magnitude (length).
  • Cosine similarity = 0: the angle is 90°, and x and y do not share any terms (words).
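
The two boundary cases can be checked with a small sketch. The term-count vectors below are made up for illustration, and `cosine_similarity` is a hypothetical helper that assumes non-zero input vectors.

```python
import numpy as np


def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||) for non-zero vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


doc = [3, 2, 0, 5]
same_direction = [6, 4, 0, 10]   # doc scaled by 2: angle of 0 degrees
no_shared_terms = [0, 0, 7, 0]   # non-zero only where doc is zero: angle of 90 degrees

print(cosine_similarity(doc, same_direction))
print(cosine_similarity(doc, no_shared_terms))
```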

23 of 29

Cosine similarity (Document similarity)

Note:

Dividing x and y by their lengths normalizes them to have a length of 1, so cosine similarity does not take the magnitude of the two vectors into account.
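
A brief sketch of this normalization, using arbitrary example vectors: once both vectors have length 1, the plain dot product equals the cosine similarity, and rescaling either original vector leaves the result unchanged.

```python
import numpy as np

x = np.array([3.0, 2.0, 0.0, 5.0])
y = np.array([1.0, 0.0, 0.0, 2.0])

# Scale each vector to unit length
x_unit = x / np.linalg.norm(x)
y_unit = y / np.linalg.norm(y)

# For unit vectors the dot product alone is the cosine similarity
cos_from_units = np.dot(x_unit, y_unit)
cos_direct = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Making x ten times longer changes its magnitude but not the cosine
cos_scaled = np.dot(10 * x, y) / (np.linalg.norm(10 * x) * np.linalg.norm(y))

print(np.linalg.norm(x_unit), np.linalg.norm(y_unit))
print(cos_from_units, cos_direct, cos_scaled)
```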

24 of 29

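Extended Jaccard Coefficient (Tanimoto Coefficient)

EJ(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y)

For binary attributes this reduces to the Jaccard coefficient.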
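
A minimal sketch of the extended Jaccard coefficient, assuming the usual definition given in the docstring. The function name `extended_jaccard` and the vectors are illustrative, and two all-zero vectors are not handled.

```python
import numpy as np


def extended_jaccard(x, y):
    """EJ(x, y) = (x . y) / (||x||^2 + ||y||^2 - x . y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dot = np.dot(x, y)
    return float(dot / (np.dot(x, x) + np.dot(y, y) - dot))


# On 0/1 vectors the value matches the ordinary Jaccard coefficient f11 / (f01 + f10 + f11)
a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1]
print(extended_jaccard(a, b))

# It also accepts non-binary vectors such as term counts
print(extended_jaccard([2, 1, 0, 3], [1, 1, 0, 2]))
```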

25 of 29

Pearson’s correlation

corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y))

26 of 29

Pearson’s correlation

  • The more tightly linear the relationship between two variables X and Y, the closer Pearson's correlation coefficient (PCC) is to −1 or +1.
    • PCC = +1: perfect positive linear relationship
      • an increase in the value of one variable is accompanied by an increase in the value of the other.
    • PCC = −1: perfect negative linear relationship
      • an increase in the value of one variable is accompanied by a decrease in the value of the other.
    • PCC = 0: no linear relationship (the variables are linearly uncorrelated).

28 of 29

Pearson’s correlation (computed automatically with scipy.stats.pearsonr())
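
The code for this slide is not available in the text. As a sketch, `scipy.stats.pearsonr` takes two equal-length sequences and returns the correlation coefficient together with a p-value. The data below is made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
datasets = {
    "positive": 2.0 * x + 1.0,     # perfect positive linear relationship
    "negative": -3.0 * x + 10.0,   # perfect negative linear relationship
    "mixed": np.array([2.0, 1.0, 4.0, 3.0, 5.0]),
}

results = {}
for name, y in datasets.items():
    r, p_value = pearsonr(x, y)    # coefficient and p-value
    results[name] = r
    print(f"{name}: r = {r:.3f}, p-value = {p_value:.3f}")
```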

29 of 29

Pearson’s correlation (computed manually in Python)
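
The code for this slide is not available in the text. One possible manual computation, assuming the definition covariance(x, y) / (std(x) × std(y)) with sample (n − 1) estimates, is sketched below. `pearson_manual` is a hypothetical name, and `np.corrcoef` is used only as a cross-check.

```python
import numpy as np


def pearson_manual(x, y):
    """Pearson's r as covariance(x, y) / (std(x) * std(y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    x_dev = x - x.mean()            # deviations from the mean
    y_dev = y - y.mean()
    covariance = np.sum(x_dev * y_dev) / (n - 1)
    std_x = np.sqrt(np.sum(x_dev ** 2) / (n - 1))
    std_y = np.sqrt(np.sum(y_dev ** 2) / (n - 1))
    return float(covariance / (std_x * std_y))


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_manual(x, y)
print("manual:", r)
print("np.corrcoef:", np.corrcoef(x, y)[0, 1])  # cross-check against NumPy
```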