
Course Name : MACHINE LEARNING

Course Code : 20AM01

Course Instructor : K.Rajasekhar

Semester : VI

Regulation : R23

Unit: 2



UNIT-2: SYLLABUS

Nearest Neighbor-Based Models: Introduction to Proximity Measures, Distance Measures, Non-Metric Similarity Functions, Proximity Between Binary Patterns, Different Classification Algorithms Based on the Distance Measures, K-Nearest Neighbor Classifier, Radius Distance Nearest Neighbor Algorithm, KNN Regression, Performance of Classifiers, Performance of Regression Algorithms.


Introduction to Proximity Measures

  • Proximity measures are mathematical techniques used to find how similar or how different two data points are. They help machine learning algorithms understand the relationship between observations in a dataset.

Similarity Measures

  • These tell us how close or alike two data objects are. A higher value → more similar.

Examples:

  • Cosine Similarity
  • Jaccard Similarity
  • Correlation Coefficient

Distance (Dissimilarity) Measures

  • These tell us how far apart two objects are.
  • The minimum dissimilarity is 0; a higher value → more different.

Examples:

  • Euclidean Distance
  • Manhattan Distance
  • Minkowski Distance
  • Hamming Distance


Measures of Feature Redundancy:

  • Feature redundancy arises when multiple features contribute similar information.

Three types of measures

  • Correlation-based measures
  • Distance-based measures
  • Other coefficient-based measures


Distance Measures

  • Distance measures (also called dissimilarity measures) are mathematical tools used to calculate how far apart two data points are.

These measures are mainly used in:

  • Clustering (e.g., K-Means, Hierarchical Clustering)
  • Classification (e.g., K-Nearest Neighbors)
  • Recommendation systems
  • Anomaly detection
Euclidean Distance:

  • Measures the straight-line distance between two points in Euclidean space: d(x, y) = √( Σ (xi − yi)² )
  • Works best for continuous numerical data.
  • Widely used in KNN, K-Means, and geometry-based ML tasks.


Distance Measures (Cont’d)

Euclidean Distance: d(x, y) = √( Σ (xi − yi)² )

Manhattan Distance

  • Calculates distance by summing the absolute differences of coordinates: d(x, y) = Σ |xi − yi|
  • Moves like a grid or city-block path.
  • Useful for high-dimensional or sparse datasets.


Distance Measures (Cont’d)

Minkowski Distance:

  • A generalized distance measure that includes Euclidean and Manhattan as special cases: d(x, y) = ( Σ |xi − yi|^p )^(1/p)
  • Controlled by a power parameter p: when p = 1 it is equivalent to Manhattan distance, and when p = 2 it is equivalent to Euclidean distance.
  • Flexible for different types of data distributions.
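The Minkowski family above can be sketched as one small pure-Python function (the name `minkowski` is our own, not from the slides); p = 1 recovers Manhattan and p = 2 recovers Euclidean distance.

```python
def minkowski(a, b, p=2):
    """Minkowski distance between two equal-length vectors.

    p = 1 → Manhattan distance, p = 2 → Euclidean distance.
    """
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# Euclidean: sqrt(3² + 4²) = 5.0; Manhattan: 3 + 4 = 7.0
print(minkowski((0, 0), (3, 4), p=2))
print(minkowski((0, 0), (3, 4), p=1))
```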



Distance Measures (Cont’d)

Hamming Distance

  • Counts how many positions differ between two strings or binary vectors.
  • Works only for categorical or binary data.
  • Used in error detection and correction codes.
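A minimal sketch of Hamming distance (function name is illustrative): count the positions where two equal-length sequences disagree.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# "karolin" vs "kathrin" differ in 3 positions
print(hamming("karolin", "kathrin"))
```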


Distance Measures (Cont’d)

Chebyshev Distance

  • Measures the maximum absolute difference along any dimension.
  • Useful when diagonal movement is allowed (like on a chessboard).
  • Useful for fast, maximum-based comparisons.
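Chebyshev distance as a one-line sketch (illustrative name): the maximum absolute difference across all dimensions, i.e. the number of chessboard-king moves between two squares.

```python
def chebyshev(a, b):
    """Maximum absolute coordinate difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# max(|1-4|, |2-6|) = max(3, 4) = 4
print(chebyshev((1, 2), (4, 6)))
```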


Distance Measures (Cont’d)

Cosine Distance

  • Measures dissimilarity based on the angle between two vectors.
  • Ignores magnitude and focuses on direction.
  • Best for text, embeddings, and high-dimensional feature vectors.


Distance Measures (Cont’d)

Jaccard Distance

  • Measures dissimilarity between two sets or binary vectors.
  • Based on the ratio of intersection to union.
  • Useful for recommendation systems and set comparisons.
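Jaccard distance on sets can be sketched as follows (function name is our own): one minus the ratio of intersection size to union size.

```python
def jaccard_distance(a, b):
    """Jaccard distance = 1 − |A ∩ B| / |A ∪ B| between two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty sets are conventionally identical
    return 1 - len(a & b) / len(a | b)


# intersection {2, 3}, union {1, 2, 3, 4} → similarity 0.5, distance 0.5
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
```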


Non-Metric Similarity Functions

  • Non-metric similarity functions are similarity measures that do NOT follow the properties of a metric, especially the triangle inequality.

These are widely used when data is:

  • Textual
  • Categorical
  • Probabilistic
  • Non-numeric
  • High-dimensional (e.g., embeddings)

Cosine Similarity (Non-Metric)

  • Measures similarity based on the angle between two vectors.
  • Does NOT satisfy triangle inequality → not a metric.
  • Great for text, embeddings, documents.


(c) Cosine Similarity:

  • Cosine similarity measures the angle between the vectors x and y.
  • If cosine similarity is 1, the angle between x and y is 0°, which means x and y are the same except for magnitude.
  • If cosine similarity is 0, the angle between x and y is 90°, so they do not share any similarity.

Let’s calculate the cosine similarity of x and y, where

x = (2, 4, 0, 0, 2, 1, 3, 0, 0) and y = (2, 1, 0, 0, 3, 2, 1, 0, 1).
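The worked example above can be checked with a small pure-Python sketch (the function name is ours). Here x · y = 19, ‖x‖ = √34 and ‖y‖ = √20, so the cosine similarity is 19/√680 ≈ 0.729.

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a · b) / (‖a‖ ‖b‖)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


x = (2, 4, 0, 0, 2, 1, 3, 0, 0)
y = (2, 1, 0, 0, 3, 2, 1, 0, 1)
print(round(cosine_similarity(x, y), 3))  # → 0.729
```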


Non-Metric Similarity Functions (Cont’d)

Pearson Correlation Similarity

  • Measures how strongly two variables change together.
  • Values range from −1 to +1.
  • Not a metric because the triangle inequality fails.
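The Pearson correlation just described can be sketched as a small pure-Python function (name and inputs are illustrative): covariance of the two variables divided by the product of their standard deviations, giving a value in [−1, +1].

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)


print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly correlated → 1.0
```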

Jaccard Similarity

  • Measures overlap between two sets.
  • Good for binary and set-based data.
  • Not a metric because it is a similarity, not a distance.


Non-Metric Similarity Functions (Cont’d)

Overlap Coefficient (Szymkiewicz–Simpson)

  • Measures the similarity of two sets based on the smaller set.
  • Useful for subset comparison.
  • Not a metric.

Dice Similarity (Sørensen–Dice)

  • Similar to Jaccard but gives more weight to intersection.
  • Used in text, medical image segmentation.
  • Not a metric
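The overlap coefficient and Dice similarity above can be sketched together (names are our own): overlap divides the intersection by the smaller set, while Dice divides twice the intersection by the sum of the set sizes.

```python
def overlap_coefficient(a, b):
    """Szymkiewicz–Simpson: |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))

def dice(a, b):
    """Sørensen–Dice: 2·|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))


# {1,2} is a subset of {1,2,3,4}: overlap = 1.0, Dice = 4/6 ≈ 0.667
print(overlap_coefficient({1, 2}, {1, 2, 3, 4}), dice({1, 2}, {1, 2, 3, 4}))
```

Note how overlap rewards subset relationships with a perfect score, while Dice still penalizes the size difference.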


Proximity Between Binary Patterns

  • Binary patterns contain only 0s and 1s.
  • To measure how similar or different two binary vectors are, we use proximity measures (similarity or distance).

Hamming Distance (Binary Distance)

  • Counts how many bit positions differ.
  • Measures dissimilarity between binary patterns.
  • Used in coding theory, clustering, KNN.


Proximity Between Binary Patterns (cont’d)

Simple Matching Coefficient (SMC)

  • Measures similarity by counting total matches (1–1 and 0–0).
  • Treats 0s and 1s equally.
  • Used in clustering of binary attributes.

Jaccard Similarity (for Asymmetric Binary Data)

  • Measures similarity using only 1s (ignores double 0s).
  • Best when 1 = important/rare event.
  • Used in text, web mining, recommendation.
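SMC and asymmetric Jaccard on binary vectors can be sketched as follows (function names and the example vectors are our own): SMC counts both 1–1 and 0–0 matches, while Jaccard ignores the 0–0 positions.

```python
def smc(a, b):
    """Simple Matching Coefficient: (1–1 and 0–0 matches) / total positions."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def jaccard_binary(a, b):
    """Jaccard for asymmetric binary data: 1–1 matches / positions with any 1."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0


a = (1, 0, 0, 1, 1, 0)
b = (1, 0, 1, 1, 0, 0)
print(smc(a, b), jaccard_binary(a, b))  # 4/6 matches vs 2 shared 1s out of 4
```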


(b) Simple Matching Coefficient (SMC):

  • SMC counts all matches (1–1 and 0–0) over all positions: SMC = (f11 + f00)/(f00 + f01 + f10 + f11)

(c) Cosine Similarity:

  • Cosine similarity, one of the most popular measures in text classification, is calculated as: cos(x, y) = (x · y)/(‖x‖ ‖y‖)


Proximity Between Binary Patterns

Jaccard Distance

  • Measures dissimilarity between two binary vectors.
  • Inverse of Jaccard similarity.
  • Higher distance → more different.

Dice Similarity (Sørensen–Dice)

  • Similar to Jaccard but gives double weight to 1–1 matches.
  • Better for imbalanced binary data.
  • Used in text analysis & image segmentation.


Different Classification Algorithms Based on the Distance Measures

  • Distance-based classifiers make decisions by measuring how close a new data point is to existing labeled data.
  • Nearest Neighbor Classifier (NNC)
  • k-Nearest Neighbor Classifier (kNNC)
  • Weighted k-Nearest Neighbor (WkNN)
  • Radius distance Near Neighbors
  • Tree Based Nearest Neighbors
  • KNN Regression

Nearest Neighbor Classifier (NNC):

  • The Nearest Neighbor Classifier is the simplest distance-based classification method.
  • It classifies a new data point by finding the closest point (nearest neighbor) in the training dataset and assigning the same class.
  • It uses a distance measure such as Euclidean, Manhattan, or Minkowski distance.


Different Classification Algorithms Based on the Distance Measures

  • Class(x)=Class of nearest training sample to x
  • This is also called 1-Nearest Neighbor (1-NN) because it uses only one closest point.

How It Works (Step-by-Step)

  • Take a new input point x.
  • Compute the distance between x and every training point (Euclidean, Manhattan, etc.).
  • Find the training point with the minimum distance.
  • Assign that point's class to the new point.
  • Distance formula (Euclidean): d(x, xj) = √( Σ (xi − xji)² )


Nearest Neighbour Classifier (NNC)

  • Let X = {(x1, l1), (x2, l2), ..., (xn, ln)}, where each pattern xi is a vector in some L-dimensional space and li is its class label.
  • The nearest neighbor of x (i.e., the test pattern) is given by NN(x) = argmin_j d(x, xj), where xj is the jth training pattern and d(x, xj) is the distance between x and xj.
  • Test pattern T(2.1, 0.7) is assigned to Class 1, since its Euclidean distance from x3 is minimum (1 unit).


Nearest Neighbour Classifier (NNC)

New Point

  • P = (2, 2)

Point | X | Y | Class
A | 1 | 2 | Red
B | 3 | 1 | Blue
C | 6 | 4 | Blue
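The slide's example can be checked with a minimal 1-NN sketch (the function name is ours): P = (2, 2) lies 1 unit from A, √2 from B and √20 from C, so it is assigned A's class.

```python
import math

def nearest_neighbor(query, data):
    """data: list of (point, label) pairs; return the label of the closest point."""
    return min(data, key=lambda d: math.dist(query, d[0]))[1]


train = [((1, 2), "Red"), ((3, 1), "Blue"), ((6, 4), "Blue")]
print(nearest_neighbor((2, 2), train))  # → Red (A is the nearest neighbor)
```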


K-Nearest Neighbor (kNN) Algorithm


k-Nearest Neighbors (k-NN) Classifier

  • To classify a new sample, find the k nearest training samples in the dataset using a distance measure.
  • Common distances: Euclidean, Manhattan, Minkowski.
  • The class most common among the neighbors becomes the predicted class.
  • New point: P(3,3)
  • Let k=3

Point | X | Y | Class
A | 1 | 2 | Red
B | 2 | 1 | Red
C | 5 | 4 | Blue
D | 6 | 2 | Blue
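The example above can be sketched as a small k-NN classifier (names are ours): for P = (3, 3) with k = 3, points A, B and C are all at distance √5 while D is at √10, so the neighborhood votes Red, Red, Blue → Red.

```python
import math
from collections import Counter

def knn_classify(query, data, k=3):
    """Majority vote among the k nearest (point, label) pairs."""
    neighbors = sorted(data, key=lambda d: math.dist(query, d[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


train = [((1, 2), "Red"), ((2, 1), "Red"), ((5, 4), "Blue"), ((6, 2), "Blue")]
print(knn_classify((3, 3), train, k=3))  # → Red (2 Red vs 1 Blue among the 3 nearest)
```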


Distance-Weighted k-Nearest Neighbors (DW-kNN)

  • Distance-Weighted k-NN is an improved version of the regular k-NN algorithm.
  • In normal k-NN, each of the k neighbors contributes equally.
  • But in distance-weighted k-NN, closer neighbors get more weight, and farther neighbors get less weight.

Why Use Distance Weighting?

  • Because a neighbor that is very close to the query point should influence the classification more strongly than one far away.

Working of Distance-Weighted k-NN (Step-by-Step)

  • Choose k
  • Compute distance of query point from all training points
  • Select k nearest neighbors
  • Compute weight for each neighbor (1/distance)
  • Add weights class-wise
  • The class with the highest total weight is the prediction


Distance-Weighted k-Nearest Neighbors (DW-kNN)

  • New point: P(3,3)
  • Let k=3


Distance-Weighted k-Nearest Neighbors (DW-kNN)

Point | X | Y | Class
A | 1 | 2 | Red
B | 2 | 1 | Red
C | 5 | 4 | Blue
D | 6 | 2 | Blue
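The six steps above can be sketched as follows (names are ours; a small epsilon guards against division by zero when a neighbor coincides with the query). For P = (3, 3), neighbors A, B and C each get weight 1/√5 ≈ 0.447, so Red totals ≈ 0.894 against Blue's ≈ 0.447.

```python
import math
from collections import defaultdict

def weighted_knn(query, data, k=3, eps=1e-9):
    """Distance-weighted k-NN: each neighbor votes with weight 1/distance."""
    neighbors = sorted(data, key=lambda d: math.dist(query, d[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        votes[label] += 1 / (math.dist(query, point) + eps)  # closer → larger weight
    return max(votes, key=votes.get)


train = [((1, 2), "Red"), ((2, 1), "Red"), ((5, 4), "Blue"), ((6, 2), "Blue")]
print(weighted_knn((3, 3), train, k=3))  # → Red
```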


Distance-Weighted k-Nearest Neighbors (DW-kNN)

Advantages of Distance-Weighted k-NN

  • Reduces the impact of far-away neighbors
  • Improves accuracy over plain k-NN
  • Produces better decision boundaries
  • Works well with small and moderate datasets


Radius Distance Nearest Neighbor (RDNN)

  • The Radius Nearest Neighbor classifier (also called the Radius-Based Neighbor classifier) is a variation of k-NN.
  • Instead of choosing a fixed number k of neighbors, it selects all neighbors within a user-defined distance radius r.
  • A point is classified according to the majority class of the neighbors inside this radius.
  • If no point lies within the radius, the classifier may reject the point or fall back to the single nearest neighbor.


Radius Distance Nearest Neighbor (RDNN)

Point | X | Y | Class
A | 1 | 2 | Red
B | 2 | 1 | Red
C | 5 | 4 | Blue
D | 6 | 2 | Blue


Radius Distance Nearest Neighbor (RDNN)

  • New point: P(3,3)
  • Let r=2 units
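A radius-based classifier with the fall-back-to-nearest rule can be sketched as follows (names are ours). Note that for P = (3, 3) with r = 2 none of the table's points lies inside the radius (the three closest are all at √5 ≈ 2.24), so the fallback decides; widening the radius to r = 2.5 captures A, B and C and gives a Red majority.

```python
import math
from collections import Counter

def radius_classify(query, data, r):
    """Majority vote among all (point, label) pairs within radius r of query."""
    inside = [label for point, label in data if math.dist(query, point) <= r]
    if not inside:  # empty radius → fall back to the single nearest neighbor
        return min(data, key=lambda d: math.dist(query, d[0]))[1]
    return Counter(inside).most_common(1)[0][0]


train = [((1, 2), "Red"), ((2, 1), "Red"), ((5, 4), "Blue"), ((6, 2), "Blue")]
print(radius_classify((3, 3), train, r=2))    # fallback → Red
print(radius_classify((3, 3), train, r=2.5))  # A, B, C inside → Red
```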


Radius Distance Near Neighbors Algorithm

  • Worked example (figure): the class assigned to test pattern T is Class 3.


KNN Regression

  • KNN Regression is the regression version of k-Nearest Neighbors.
  • Instead of predicting a class label, KNN Regression predicts a continuous numerical value by averaging (or weighted averaging) the values of its k nearest neighbors.

How KNN Regression Works (Step-by-Step)

  • Select a value of k
  • Compute distances between query point and all training data
  • Select k closest points
  • Take the average (or weighted average) of the target values
  • That average = predicted output

Query:

  • Predict the price for an 1100 sq ft house
  • Let k = 3


KNN Regression

Example of KNN Regression

House Size (sq ft) | Price (lakhs)
800 | 50
1000 | 65
1200 | 70
1500 | 85
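The house-price example can be sketched as a one-feature kNN regressor (names are ours): for the 1100 sq ft query, the three nearest houses are 1000, 1200 and 800 sq ft, so the prediction is (65 + 70 + 50)/3 ≈ 61.67 lakhs.

```python
def knn_regress(query, data, k=3):
    """data: list of (feature, target); return the mean target of the k nearest."""
    neighbors = sorted(data, key=lambda d: abs(d[0] - query))[:k]
    return sum(target for _, target in neighbors) / k


houses = [(800, 50), (1000, 65), (1200, 70), (1500, 85)]
print(round(knn_regress(1100, houses, k=3), 2))  # → 61.67
```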


KNN Regression

  • Let χ = {(x1, y1), (x2, y2), ..., (xn, yn)} be the training set. The regression model uses χ to find the value of y for a new vector x.
  • Idea:
    • Find the k nearest neighbors of x from the n data vectors. Let them be x1, x2, ..., xk.
    • Consider the y values associated with these xi's. Let them be y1, y2, ..., yk.
    • The predicted value of y is their average: ŷ = (y1 + y2 + ... + yk)/k

Performance of Classifiers:

  • To evaluate how well a classification model works, several metrics and evaluation techniques are used. These metrics are usually derived from the Confusion Matrix.

Confusion Matrix

  • A Confusion Matrix is a table used to evaluate the performance of a classification model by comparing the model’s predictions with the actual (true) values.
  • It helps you understand how many predictions were correct and where the model made mistakes.


                | Predicted Positive  | Predicted Negative
Actual Positive | True Positive (TP)  | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)


Performance of Classifiers:

Accuracy

  • Percentage of correctly predicted instances.

Accuracy = (TP + TN)/(TP + TN + FP + FN)

Precision

  • Out of all predicted positives, how many are actually positive.

Precision = TP/(TP + FP)

Recall (Sensitivity)

  • Out of all actual positives, how many are correctly identified.

Recall = TP/(TP + FN)

F1-Score

  • Harmonic mean of Precision and Recall.

F1 = 2 × (Precision × Recall)/(Precision + Recall)


2.5 EVALUATING PERFORMANCE OF A MODEL

  • The responsibility of the classification model is to assign class label to the target feature based on the value of the predictor features.
  • To evaluate the performance of the model, the number of correct classifications or predictions made by the model has to be recorded.
  • E.g.: in the problem of predicting the win/loss of a cricket match, a classification is said to be correct if the model predicted that the team would win and it actually won.
  • Based on the number of correct and incorrect classifications or predictions made by a model, the accuracy of the model is calculated.
  • For example, 99% accuracy may be reasonably good for a sports win-predictor model.
  • The same number may not be an acceptable threshold when the learning problem deals with predicting a critical illness; in that case, even the 1% incorrect predictions may lead to loss of many lives.
  • So the model performance needs to be evaluated in light of the learning problem in question.


2.5 EVALUATING PERFORMANCE OF A MODEL Cont…

  • There are four possibilities with regard to the cricket match win/loss prediction:
    1. The model predicted win and the team won: the model correctly classified the instance as the class of interest. These cases are referred to as True Positive (TP) cases.
    2. The model predicted win and the team lost: the model incorrectly classified the instance as the class of interest. These cases are referred to as False Positive (FP) cases.
    3. The model predicted loss and the team won: the model incorrectly classified the instance as not the class of interest. These cases are referred to as False Negative (FN) cases.
    4. The model predicted loss and the team lost: the model correctly classified the instance as not the class of interest. These cases are referred to as True Negative (TN) cases.


2.5 EVALUATING PERFORMANCE OF A MODEL Cont…

  • For any classification model, accuracy is given by the total number of correct classifications divided by the total number of classifications made:
    Accuracy = (TP + TN)/(TP + TN + FP + FN)
  • The percentage of misclassifications is indicated by the error rate:
    Error rate = (FP + FN)/(TP + TN + FP + FN) = 1 − Accuracy

2.5.1 Supervised learning – Classification

  • The Kappa value of a model indicates the adjusted model accuracy, i.e., accuracy corrected for agreement expected by chance.


2.5 EVALUATING PERFORMANCE OF A MODEL Cont…

2.5.1 Supervised learning – Classification

  • Let’s assume the confusion matrix of the win/loss prediction of cricket match problem to be as below:

            | Predicted Win | Predicted Loss
Actual Win  | 85 (TP)       | 2 (FN)
Actual Loss | 4 (FP)        | 9 (TN)

In the context of the above confusion matrix, the total count of TPs = 85, FPs = 4, FNs = 2, and TNs = 9.
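From these counts, the metrics defined earlier can be computed directly (a plain-Python sketch; the variable names are ours):

```python
TP, FP, FN, TN = 85, 4, 2, 9  # counts from the win/loss confusion matrix

accuracy = (TP + TN) / (TP + TN + FP + FN)           # 94/100 = 0.94
precision = TP / (TP + FP)                           # 85/89 ≈ 0.955
recall = TP / (TP + FN)                              # 85/87 ≈ 0.977 (sensitivity)
specificity = TN / (TN + FP)                         # 9/13  ≈ 0.692
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.966

print(accuracy, round(precision, 3), round(recall, 3), round(f1, 3))
```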


2.5 EVALUATING PERFORMANCE OF A MODEL Cont…

2.5.1 Supervised learning – Classification

  • Two measures of model performance are the sensitivity and specificity of the model.
  • The sensitivity of a model measures the proportion of positive cases that were correctly classified: Sensitivity = TP/(TP + FN)
  • For critical applications, a high value of sensitivity is more desirable than a high value of accuracy.
  • The specificity of a model measures the proportion of negative examples that were correctly classified: Specificity = TN/(TN + FP)
  • It indicates whether a model strikes a good balance between being excessively conservative and excessively aggressive.
  • A higher value of specificity indicates better model performance.


2.5 EVALUATING PERFORMANCE OF A MODEL Cont…

2.5.1 Supervised learning – Classification

  • Two other performance measures of a supervised learning model which are similar to sensitivity and specificity are precision and recall.

Precision:

  • Precision gives the proportion of positive predictions which are truly positive.
  • It indicates the reliability of a model in predicting a class of interest.
  • When the model is related to win / loss prediction of cricket, precision indicates how often it predicts the win correctly.
  • A model with higher precision is perceived to be more reliable

Recall:

  • Recall gives the proportion of TP cases over all actually positive cases: Recall = TP/(TP + FN)
  • In the case of win/loss prediction of cricket, recall indicates what proportion of the total wins were predicted correctly.


2.5 EVALUATING PERFORMANCE OF A MODEL Cont…

2.5.1 Supervised learning – Classification

F-Measure

  • F-measure is another measure of model performance which combines precision and recall.
  • It takes the harmonic mean of precision and recall: F = 2 × (Precision × Recall)/(Precision + Recall)

In the context of the confusion matrix for the cricket match win prediction problem: Precision = 85/89 ≈ 0.955 and Recall = 85/87 ≈ 0.977, so F ≈ 0.966.

Performance of Regression Algorithms

  • Regression models predict continuous numeric values (e.g., house price, temperature, stock price).
  • To evaluate how good a regression model is, we use error metrics that measure the difference between actual values and predicted values.

Mean Absolute Error (MAE)

  • Measures the average absolute difference between actual and predicted values: MAE = (1/n) Σ |yi − ŷi|
  • Easy to understand.
  • Less sensitive to outliers than MSE.

Mean Squared Error (MSE):

  • MSE = (1/n) Σ (yi − ŷi)²; squaring penalizes large errors heavily.
  • Useful when large mistakes are unacceptable.


Performance of Regression Algorithms

Root Mean Squared Error (RMSE)

Square root of MSE → gives the error in the original units (like dollars or meters): RMSE = √MSE

Most commonly used regression metric.

R² Score (Coefficient of Determination):

R² = 1 − SSres/SStot

Where:

SSres = sum of squared residuals

SStot = total variance in the data

Interpretation:

  • R² = 1 → Perfect model
  • R² = 0 → Model no better than mean
  • R² < 0 → Model is worse than mean prediction


Performance of Regression Algorithms

Adjusted R² Score

Adjusts R² based on the number of predictors k and the sample size n:

Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1)

Useful in multiple regression.

Prevents an artificially high R² caused by adding many features.
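All of these regression metrics can be sketched in one small pure-Python function (names and the toy data are ours, for illustration only):

```python
import math

def regression_metrics(y_true, y_pred, k=1):
    """Return (MAE, MSE, RMSE, R², adjusted R²) for k predictor variables."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # sum of squared residuals
    ss_tot = sum((t - mean) ** 2 for t in y_true)        # total variance in the data
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return mae, mse, rmse, r2, adj_r2


# toy data: small under/over-predictions around a perfect fit
print(regression_metrics([3, 5, 7, 9], [2.5, 5, 7.5, 9]))
```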