1 of 87

MACHINE LEARNING

Mr. P.T.Krishna Sai

Assistant Professor

UNIT-2: Nearest Neighbor-Based Models

2 of 87

UNIT-2: Nearest Neighbor-Based Models

Introduction to Proximity Measures, Distance Measures, Non-Metric Similarity Functions, Proximity Between Binary Patterns, Different Classification Algorithms Based on the Distance Measures, K-Nearest Neighbor Classifier, Radius Distance Nearest Neighbor Algorithm, KNN Regression, Performance of Classifiers, Performance of Regression Algorithms.

3 of 87

Proximity Measures

  • Proximity is used to refer to either similarity or dissimilarity.
  • The proximity between two data objects is a function of the proximity between the corresponding attributes of the two objects.
  • Proximity measures such as correlation & Euclidean distance are useful for dense data such as time series or 2-dimensional points.
  • Jaccard & Cosine similarity measures are useful for sparse data such as documents.

4 of 87

Similarity

Definition: A numerical measure of how alike two data objects are.

Characteristics:

It is higher when objects are more alike.

Often falls in the range [0, 1].

Similarity measure for two objects i & j returns:

0 – if they are unlike.

1 – if they are alike.

Key Principles:

The higher the similarity value, the greater the similarity between objects.

1 indicates complete similarity.

5 of 87

Dissimilarity

Definition: Numerical measures of how different two objects are.

Characteristics:

Lower when objects are more alike.

Minimum dissimilarity is often zero.

Upper limit varies.

Dissimilarity measure for two objects i & j returns:

0 – if they are alike.

1 – if they are unlike.

Key Principles:

The higher the dissimilarity value, the greater the dissimilarity between objects.

1 indicates complete dissimilarity.

General Relation

The general relation between similarity (s) and dissimilarity (d) is:

s = 1 – d (or) d = 1 - s

6 of 87

7 of 87

1. Nominal Attributes

Case 1: Equal Values

Problem: Let p = 10 and q = 10. Find the dissimilarity and similarity.

Calculation:

  1. Check Condition: Since 10 = 10, the objects are identical.
  2. Dissimilarity (d): Because they match, d = 0.
  3. Similarity (s): Because they match, s = 1.

Result: For p=q, the objects have 0 dissimilarity and 1 similarity.

Case 2: Unequal Values

Problem: Let p = 10 and q = 25. Find the dissimilarity and similarity.

Calculation:

  1. Check Condition: Since 10 ≠ 25, the objects are different.
  2. Dissimilarity (d): Because they do not match, d = 1.
  3. Similarity (s): Directly, because they do not match, s = 0.

Using the formula s = 1 - d:

s = 1 - 1 = 0

Result: For p ≠ q, the objects have 1 dissimilarity and 0 similarity.

8 of 87

2. Ordinal Attributes

For ordinal attributes, the dissimilarity between two points, p and q, is normalized by the number of states n. Here n = 2.

Case 1:

Given: p = 10, q = 5, and n = 2.

Formula used:

d = |p - q| / (n - 1)

Calculation:

d = |10 - 5| / (2 - 1) = 5 / 1 = 5

Note: Applying s = 1 - d here gives s = 1 - 5 = -4, which falls outside [0, 1]. This happens because 10 and 5 are not valid ranks for n = 2 states; in practice, ordinal values must first be mapped to their ranks in {1, ..., n} before applying the formula.

3. Interval or Ratio Attributes

This section explores how to convert dissimilarity values (d) into similarity values (s).

Given Dissimilarities:

The values of d are 0, 1, 10, 100.

Similarity formula used: s = 1 / (1 + d)

Calculations:

9 of 87

Dissimilarity (d) | Similarity Calculation (s)   | Result
d = 0             | s = 1 / (1 + 0)              | s = 1
d = 1             | s = 1 / (1 + 1) = 1/2        | s = 0.5
d = 10            | s = 1 / (1 + 10) = 1/11      | s ≈ 0.09
d = 100           | s = 1 / (1 + 100) = 1/101    | s ≈ 0.01

This demonstrates that as dissimilarity (d) increases, similarity (s) decreases toward zero, reflecting the inverse relationship between the two measures.
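The conversion from dissimilarity to similarity can be checked with a short Python sketch (illustrative only):

```python
# Convert a dissimilarity d into a similarity s via s = 1 / (1 + d).
def to_similarity(d):
    return 1 / (1 + d)

for d in [0, 1, 10, 100]:
    print(d, round(to_similarity(d), 2))  # 1.0, 0.5, 0.09, 0.01
```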

10 of 87

Data Matrix

Suppose that we have n objects (e.g., persons or courses) described by p attributes (e.g., age, height, weight & gender). The data matrix stores the objects in an n × p structure: one row per object and one column per attribute.

11 of 87

Dissimilarity Matrix :- (object-by-object structure)

This structure stores the proximities for all pairs of the n objects, typically as an n × n table.

The proximity measures can be classified into two types :-

  1. Distance measures (or) Dissimilarity
  2. Non-metric measures (or) Similarity

12 of 87

Distance Measures :-

  • It is one type of proximity measure.
  • It is also known as dissimilarity.
  • In ML, distance measures are mainly classified into four types:
    • Euclidean distance
    • Manhattan distance
    • Minkowski distance
    • Chebyshev distance

Euclidean Distance

  • It is the distance between two real-valued vectors. Mostly we use it to calculate the distance between two rows of data having numerical values (floating-point or integer values).
  • Mathematically, for two points x = (x1, ..., xn) and y = (y1, ..., yn):

d(x, y) = √((x1 - y1)² + (x2 - y2)² + ... + (xn - yn)²)

13 of 87

Final Distance Matrix

A = [[1, 2], [3, 4]]

B = [[5, 6], [7, 8]]

        B[0]      B[1]
A[0]    5.6569    8.4853
A[1]    2.8284    5.6569
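The distance matrix above can be reproduced with plain Python (a minimal sketch using only the standard library):

```python
import math

# Euclidean distance between two points of equal dimension.
def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Pairwise distances: rows of A against rows of B.
dist = [[round(euclidean(a, b), 4) for b in B] for a in A]
print(dist)  # [[5.6569, 8.4853], [2.8284, 5.6569]]
```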

14 of 87

3D Euclidean Distance

15 of 87

Applications

Euclidean distance is a useful metric in many algorithms:

  1. k-nearest neighbor (kNN)
  2. Clustering (k-means)
  3. Multi-dimensional Scaling
  4. Image Processing
  5. Robotics (drones or self-driving cars)

16 of 87

Manhattan Distance

Also known as: Taxicab distance or city block distance.

Definition: Calculates the distance between two points in a grid-like space. It measures the sum of absolute differences of their Cartesian coordinates, effectively representing the distance a taxi would travel in a city with a grid layout like Manhattan.

Key Characteristics

Grid-based: It is used when movement is restricted to horizontal and vertical lines like city streets.

Sum of absolute differences: Instead of a straight line (Euclidean distance), it adds up the distances along each axis.

Non-negative: Because of the absolute value, the distance is always non-negative (and zero only for identical points).

17 of 87

Mathematically, the Manhattan distance between two points in an n-dimensional space is the sum of the absolute differences of their Cartesian coordinates.

Mathematical Formula:

d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|

18 of 87

Example: Google Maps

Calculate the distance between two points, A and B.

Point A: (1, 1)

Point B: (4, 5)

Calculation Steps:

d = |1 - 4| + |1 - 5| = 3 + 4 = 7
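The example can be verified with a few lines of Python (a small sketch):

```python
# Manhattan (city-block) distance: sum of absolute coordinate differences.
def manhattan(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(manhattan((1, 1), (4, 5)))  # |1-4| + |1-5| = 7
```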

19 of 87

Applications:

Manhattan distance finds applications in various fields, from data analysis to geospatial technology. Some key areas where Manhattan distance is particularly useful are:

  1. path finding algorithms
  2. clustering technique (k-means clustering)
  3. Image Recognition
  4. Outlier detection
  5. Geographic Information Systems (GIS)

20 of 87

Minkowski Distance

It is a mathematical measure used to calculate the distance between two points in a multi-dimensional space. It generalizes both the Manhattan distance (p = 1) and the Euclidean distance (p = 2):

d(x, y) = (|x1 - y1|^p + |x2 - y2|^p + ... + |xn - yn|^p)^(1/p)

21 of 87

22 of 87

Example: The distance calculations between point A(2, 3) and point B(5, 7).

23 of 87

Point A: (2, 5, 8)

Point B: (3, 1, 6)
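For points A = (2, 5, 8) and B = (3, 1, 6), the Minkowski distance can be sketched in Python; p = 1 recovers the Manhattan distance and p = 2 the Euclidean distance:

```python
# Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean.
def minkowski(x, y, p):
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

A, B = (2, 5, 8), (3, 1, 6)
print(round(minkowski(A, B, 1), 4))  # 7.0    (Manhattan)
print(round(minkowski(A, B, 2), 4))  # 4.5826 (Euclidean)
print(round(minkowski(A, B, 3), 4))  # higher orders approach the Chebyshev distance
```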

24 of 87

Chebyshev Distance

Also known as: "Chessboard distance" or the L∞ (maximum) metric.

Definition: A distance metric that measures the distance between two points as the maximum absolute difference between their corresponding coordinates.

Usage: It is used to calculate distance in a space where movement can occur in any direction, including diagonally. It is especially useful in grid-based systems like chessboards or pathfinding in games where diagonal movement is allowed.

Characteristics: Simple and fast to compute; works well in environments with uniform movement in all directions.

25 of 87

26 of 87

Example Calculation

Given points:

A = (4, 7, 8)

B = (6, 9, 1)

Formula for 3D space:

D = max(|x1 - y1|, |x2 - y2|, |x3 - y3|)

Step-by-step Calculation:

  1. |4 - 6| = |-2| = 2
  2. |7 - 9| = |-2| = 2
  3. |8 - 1| = |7| = 7

Result:

D = max(2, 2, 7) = 7
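The same calculation in Python (a minimal sketch):

```python
# Chebyshev distance: the maximum absolute coordinate difference.
def chebyshev(p, q):
    return max(abs(pi - qi) for pi, qi in zip(p, q))

print(chebyshev((4, 7, 8), (6, 9, 1)))  # max(2, 2, 7) = 7
```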

27 of 87

Non-Metric Similarity Functions in Nearest Neighbor-Based Models

Proximity measures are categorized into a hierarchy of metric and non-metric measures.

  • Non-metric similarity functions are measures used to evaluate the similarity between data points without strictly adhering to the properties of a metric space (e.g., triangle inequality).
  • These functions are often used in scenarios where relationships between data points cannot be easily captured by traditional metric distance measures, such as Euclidean or Manhattan distance.

28 of 87

Characteristics of Non-metric Similarity Functions

  • They focus on measuring similarity rather than distance.
  • They may not satisfy properties like symmetry, non-negativity, or the triangle inequality.
  • They are particularly useful in domains where data relationships are non-linear or symbolic.

The Cosine Similarity Formula

To compute the Cosine similarity between vectors A and B, you can use the following formula:

cos(A, B) = (A · B) / (||A|| ||B||)

The Cosine similarity is one of the most common measures of document similarity, where A and B are two document vectors and the result is the cosine of the angle between them.

29 of 87

Mathematical Calculations

Vectors:

A = (4, 5, 6)

B = (2, 1, 2)
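The cosine similarity of A = (4, 5, 6) and B = (2, 1, 2) can be computed directly (an illustrative sketch):

```python
import math

# Cosine similarity: dot(a, b) / (|a| * |b|).
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity((4, 5, 6), (2, 1, 2)), 4))  # 25 / (sqrt(77) * 3)
```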

30 of 87

Applications

  1. Text Analysis
  2. Recommender Systems
  3. Information Retrieval
  4. Clustering
  5. Image Processing
  6. Natural Language Processing

Advantages

  1. Simple & Efficient
  2. Magnitude Independent
  3. Widely Applicable

Disadvantages

  1. Ignores Magnitude
  2. Sensitive to Sparse data
  3. Normalization dependency

31 of 87

Jaccard Similarity

  • It is a common proximity measurement used to compute the similarity between two objects, such as two text documents.
  • Jaccard similarity can be used to find the similarity between two asymmetric binary vectors or to find the similarity between two sets.

Key Characteristics

  • The score can range from 0 to 1.
  • The higher the number, the more similar the two sets of data.
  • The Jaccard similarity score is 0 if there are no common words between two documents.

32 of 87

Example:

doc1 = "Data is the new oil of the digital economy"

doc2 = "Data is a new oil"

Let's get the set of unique words for each document.

words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy'}

words_doc2 = {'data', 'is', 'a', 'new', 'oil'}

Now, we will calculate the intersection & union of these two sets of words & measure the Jaccard similarity between doc1 & doc2.
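The intersection and union can be computed with Python sets (a minimal sketch):

```python
# Jaccard similarity: |intersection| / |union| of the two word sets.
doc1 = "Data is the new oil of the digital economy"
doc2 = "Data is a new oil"

words1 = set(doc1.lower().split())
words2 = set(doc2.lower().split())

jaccard = len(words1 & words2) / len(words1 | words2)
print(round(jaccard, 4))  # 4 shared words out of 9 total
```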

33 of 87

Applications:

  1. Text mining
  2. E-commerce
  3. Recommendation systems

34 of 87

Pearson Correlation

  • It is used for measuring the strength and direction of a linear relationship between two variables.
  • It is important in fields like data science, finance, healthcare, and social sciences, where understanding relationships between different factors is important.
  • Quantifying the degree to which two variables are related helps analysts, researchers, and decision makers to find patterns and make predictions and inform decisions.

35 of 87

The interpretation of the Pearson coefficient values is given below.

Note on Correlation Values

The correlation coefficient (r) indicates the strength and direction of a linear relationship:

  • r = -1: Indicates a negative correlation. (As one variable increases, the other decreases).
  • r = 1: Indicates a positive correlation. (As one variable increases, the other also increases).
  • r = 0: Means no correlation. (There is no linear relationship between the variables).

36 of 87

Uses of Correlation

  1. Feature Selection: Helping to identify which variables are most relevant to a target outcome in data modeling.
  2. Redundancy Detection: Identifying variables that are so highly correlated that they provide the same information, allowing one to be removed.
  3. Understanding Relationships: Gaining insights into how different data points influence or relate to one another.

37 of 87

Proximity Between Binary Patterns

In Machine Learning & pattern recognition, measuring the proximity (similarity or dissimilarity) between binary patterns (vectors of 0s & 1s) is essential for tasks like clustering, classification & information retrieval.

Common Methods:-

  1. Hamming distance
  2. Simple Matching Coefficient (SMC)
  3. Jaccard Similarity & distance
  4. Cosine Similarity
  5. Dice Similarity Coefficient

38 of 87

Hamming Distance (HD):-

It counts the no. of bit positions where two binary vectors differ.

Formula:-

HD(x, y) = no. of positions i at which xᵢ ≠ yᵢ

39 of 87

Simple Matching Coefficient (SMC)

SMC is a commonly used similarity coefficient. It is represented mathematically as:

SMC = (f11 + f00) / (f11 + f10 + f01 + f00)

where f11 is the no. of positions where both bits are 1, f00 where both are 0, and f10, f01 where the bits differ.

40 of 87

Example: Two binary vectors, x and y, each containing 10 bits.

Vector x : 1 0 0 0 0 0 1 0 0 0

Vector y : 0 0 0 0 0 0 1 0 0 1

From these vectors, the following counts are derived based on pairwise bit comparisons:
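These counts, and the resulting SMC, can be derived in Python (an illustrative sketch):

```python
# Simple Matching Coefficient from pairwise bit comparisons.
x = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both bits 1
f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # both bits 0

smc = (f11 + f00) / len(x)          # counts 0-0 matches
jaccard = f11 / (len(x) - f00)      # ignores 0-0 matches
print(f11, f00, smc, round(jaccard, 4))  # 1 7 0.8 0.3333
```

Note how SMC (0.8) is much higher than Jaccard (≈0.33) because the many 0-0 matches count toward SMC but not toward Jaccard.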

41 of 87

42 of 87

Jaccard Similarity & Distance

The Jaccard Similarity (also known as the Jaccard Index or Jaccard Coefficient) and the Jaccard Distance are statistical measures used to understand the relationship between two sets of data. They are particularly popular in data science, ecology, and information retrieval (like comparing two documents).

Jaccard Similarity Formula

J(A, B) = |A ∩ B| / |A ∪ B|

For binary vectors, J = f11 / (f11 + f10 + f01), i.e., the 0-0 matches are ignored.

43 of 87

Example:

44 of 87

Cosine Similarity

Treats binary vectors as real vectors and computes the cosine of the angle between them.

Formula:

cos(x, y) = (x · y) / (||x|| ||y||)

45 of 87

Dice Similarity Coefficient

Gives more weight to matching 1's:

Dice(x, y) = 2·f11 / (2·f11 + f10 + f01)

46 of 87

Different classification algorithms based on distance measures

These algorithms determine the class of a sample by computing distances between data points in a feature space; a sample is typically classified by identifying which class's samples are closest to it.

Different types of classification algorithms based on distance measures in ML are:

  1. k-Nearest Neighbor
  2. Support Vector Machine (SVM)
  3. Nearest Centroid Classifier
  4. Learning Vector Quantization (LVQ)
  5. Radius Neighbors Classifier
  6. Fuzzy KNN

47 of 87

1. KNN :-

  • It is a non-parametric, supervised learning classifier which uses proximity to make predictions about the grouping of an individual data point.
  • KNN uses distance metrics to identify the nearest neighbors; these neighbors are then used for classification and regression tasks.

Distance metrics are :-

  1. Euclidean distance
  2. Manhattan distance
  3. Minkowski distance.

48 of 87

2. SVM (Support Vector Machine)

  • SVM is used for classification & regression.
  • It works by finding decision planes that separate data into different classes.
  • Support vectors are the closest data points to the hyperplane that define the class boundary.
  • A hyperplane is the decision boundary that separates the different classes.
  • SVM can use kernel functions that are based on similarity & distance between vectors.

49 of 87

3. Nearest Centroid Classifier

  • Definition: It is a prototype-based method where each class is represented by its centroid, calculated as the average (avg) of all training data points belonging to that class.
  • Concept: Essentially, it is a simplified version of k-NN, using only one prototype (the centroid) per class.
  • Mechanism: It is a simple classification algorithm that assigns data points to the class whose centroid (mean) is closest to it.

Distance Measures

  • Manhattan
  • Euclidean
  • Minkowski

50 of 87

4. Learning Vector Quantization (LVQ)

  • It is a supervised learning algorithm used for classification. It is a type of ANN that employs a competitive learning approach to classify data into predefined classes.
  • LVQ has two layers: one is the input layer & the other is the output layer. The architecture consists of "n" input features in the input layer and one output unit for each predefined class.

51 of 87

5. Radius Neighbors Classifier

  • It is a ML algorithm used for classification tasks, belonging to the family of instance-based or lazy learning methods.
  • Unlike the kNN algorithm, which finds a fixed number 'k' of closest neighbors, the radius neighbors classifier identifies all training examples within a specified fixed radius 'r' of a new, unclassified data point.

52 of 87

6. Fuzzy k-NN

  • It is an extension of the traditional kNN algorithm that incorporates principles of fuzzy set theory. While standard kNN assigns a crisp class label to a data point based on the majority class of its kNN, FKNN introduces the concept of fuzzy membership.
  • Fuzzy version of kNN where a data point can belong to more than one class with varying degrees of membership.

53 of 87

k-Nearest Neighbor Classifier

  • It is one of the simplest yet powerful supervised learning techniques used for classification and regression tasks in ML.
  • It is non-parametric, meaning it doesn't make any assumptions about the underlying data distribution, which makes it versatile for various applications.
  • kNN works by analyzing the proximity or "closeness" of data points based on specific distance metrics.

54 of 87

Types of kNN:

1. Classification:

  • kNN assigns a class label to a new data point based on the majority class of the nearest neighbors.
  • Ex: For instance, if a data point has 5 NN (Nearest Neighbors) and three of them belong to class A while two belong to class B, the algorithm will classify the point in class A.

55 of 87

2. Regression:-

  • kNN predicts continuous values by averaging the values of the kNN.
  • Example, if you're predicting house prices, kNN will use the average prices of the kNN to estimate the price of a new house.

kNN algorithm:-

  • Step 1:- Select the number k of the neighbors.
  • Step 2:- Calculate the Euclidean distance from the new data point to every training point.
  • Step 3:- Take the k nearest neighbors as per the calculated Euclidean distances.
  • Step 4:- Among these k neighbors, count the no. of data points in each category.
  • Step 5:- Assign the new data point to that category for which the no. of neighbors is maximum.
  • Step 6:- Our model is ready.
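The steps above can be sketched as a minimal kNN classifier (illustrative code with made-up training points, not the lecture's exact implementation):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (point, label) pairs; returns the majority label
    among the k training points nearest to the query (Euclidean distance)."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical 2-D training data.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 7), "B")]
print(knn_classify(train, (2, 2), k=3))  # "A": all 3 nearest neighbors are class A
```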

56 of 87

KNN FLOW CHART

57 of 87

Selecting the Value of k in the KNN Algorithm

  • Choosing the correct value of k is critical to the performance of the KNN algorithm.
  • Small k value can make the model too sensitive to noise, resulting in overfitting.
  • Large k value can oversimplify the model, causing underfitting.

Methods to Select the Optimal k:

  • Cross-Validation: A commonly used technique for choosing the value of k is cross-validation. By splitting the dataset into training and validation sets and evaluating model performance across different values of k, the optimal k value can be determined based on which k produces the lowest error rate.
  • Common Values of k: Values of k such as 3, 5, or 7 are typically chosen. Smaller values of k allow the model to capture local patterns, while larger k values generalize better across the dataset.

58 of 87

Advantages:

  • It is simple to implement.
  • It is robust to the noisy training data.
  • It can be more effective if the training data is large.

Disadvantages:

  • Always needs to determine the value of k which may be complex sometimes.
  • The computation cost is high because the distance from a new data point to every training sample must be calculated.

59 of 87

Radius Distance Nearest Neighbors Algorithm

  • It is a ML technique, primarily used for classification, that identifies and utilizes data points within a specified fixed radius of a query point for making predictions.
  • When the data is not uniformly sampled, radius-based neighbor classification (the Radius Neighbors Classifier) can be a better choice.
  • It is a method used in ML to identify data points that are in close proximity to a given point within a specified radius. This technique is particularly useful in various applications, such as clustering, classification, and anomaly detection.
  • It helps uncover patterns and trends within the data, enabling more accurate predictions and insights.
  • Unlike kNN, which considers a fixed number of 'k' closest neighbors, RNN (Radius-NN) defines a neighborhood based on a distance threshold.

60 of 87

Radius-NN Algorithm:

Step 1: Select the radius R.

Step 2: Calculate the Euclidean distance from the new data point to every training point.

Step 3: Take all the training points that fall within radius R of the new point.

Step 4: Among these points, count the no. of data points in each category.

Step 5: Assign the new data point to the category with the maximum no. of data points.

Step 6: Our model is ready.

It works by calculating the distance between a given data point and all the other data points in the dataset. It then identifies the data points that are within a specified radius of the given point.

The distance can be calculated using various metrics, such as Euclidean distance, Manhattan distance, or Cosine similarity.
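A minimal radius-neighbors classifier following these steps (illustrative, with hypothetical data; a real query may have no neighbors inside the radius, which this sketch reports as None):

```python
import math
from collections import Counter

def radius_classify(train, query, r):
    """Majority vote over all training points within Euclidean distance r."""
    labels = [label for point, label in train if math.dist(point, query) <= r]
    if not labels:
        return None  # no training points inside the radius
    return Counter(labels).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((6, 6), "B"), ((7, 7), "B")]
print(radius_classify(train, (2, 2), r=2.0))  # "A": only the two class-A points are within r
```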

61 of 87

RNN FLOW CHART

62 of 87

Advantages

  1. Robustness to Outliers
  2. Adaptability to Data Density
  3. Suitable for Sparse Data

Applications

  1. Outlier Detection
  2. Clustering
  3. Data Density Analysis
  4. Recommendation Systems
  5. Robotics and Motion Planning
  6. Materials Science and Astronomy
  7. Image Processing & Computer Vision

63 of 87

KNN Regression

  • It is one of the simplest yet powerful supervised learning techniques, used for both classification & regression.
  • It is non-parametric, meaning it doesn't make any assumptions about the underlying data distribution, which makes it versatile for various applications. KNN works by analyzing the proximity or "closeness" of data points based on specific distance metrics.

Types:

  1. Regression: KNN predicts continuous values by averaging the values of the k nearest neighbors.

Ex: If you're predicting house prices, KNN will use the average prices of the k nearest neighbors to estimate the price of a new house.

2. Classification: KNN assigns a class label to a new data point based on the majority class of its nearest neighbors.
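The regression variant can be sketched as follows (hypothetical house data, purely illustrative):

```python
import math

def knn_regress(train, query, k=3):
    """train: list of (point, value) pairs; predict the mean value
    of the k training points nearest to the query."""
    nearest = sorted(train, key=lambda pv: math.dist(pv[0], query))[:k]
    return sum(value for _, value in nearest) / k

# Hypothetical data: (size, rooms) -> price
train = [((50, 2), 100), ((60, 3), 120), ((55, 2), 110), ((120, 5), 300)]
print(knn_regress(train, (58, 3), k=3))  # mean of the 3 nearest prices
```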

64 of 87

Algorithm:

65 of 87

66 of 87

Select the value of k in KNN algorithm:

  • Choosing the correct value of k is critical to the performance of the KNN algorithm.
  • Small k value can make the model too sensitive to noise, resulting in overfitting.
  • Large k value can over-simplify the model, causing underfitting.

Methods to select the optimal k:

Cross Validation:

  • A commonly used technique for choosing the value of k is cross-validation. By splitting the dataset into training & validation sets & evaluating model performance across different values of k, the optimal k can be determined based on which value produces the lowest error rate.
  • Common values of k: In practice, values such as 3, 5, or 7 are typically chosen. Smaller values of k allow the model to capture local patterns, while larger k values generalize better across the dataset.

67 of 87

Advantages:

  1. Simple to Implement
  2. Robust to noisy training data
  3. Effective if training data is large

Disadvantages:

  • Need to determine the value of K & Computation cost is high.
  • A feedback loop can be added to determine the no. of neighbors.

Applications:

  1. Financial Forecasting
  2. Healthcare
  3. Recommendation systems
  4. Data preprocessing
  5. Environmental Science
  6. Real Estate

68 of 87

Performance of Classifiers

In data science, classifier performance measures the predictive capabilities of ML models with metrics like:

  • Precision
  • Accuracy
  • Recall
  • F1 Score
  • All metrics are based on the concepts of true & false predictions created by the model & measured against the actual outcome.
  • If the model is perfect, then all the predictions are true positive or true negative.
  • Predictions that do not match the actual outcomes are labelled false positives or false negatives.

Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or classifier) on a set of test data for which the true values are known.

69 of 87

Key Definitions

Predicted: Negative, Actual: Positive → False Negative (FN)

Predicted: Negative, Actual: Negative → True Negative (TN)

Predicted: Positive, Actual: Positive → True Positive (TP)

Predicted: Positive, Actual: Negative → False Positive (FP)

70 of 87

1. Accuracy

Accuracy is one of the metrics for evaluating a classification model. Formally, accuracy is the ratio of correct predictions to the total number of predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

71 of 87

2. Precision

Precision measures the accuracy of positive predictions. It answers the question, "Of all the items the model labeled as positive, how many are actually positive?"

Precision = TP / (TP + FP)

72 of 87

3. Recall

Recall measures the model's ability to find the positive instances. It answers the question, "Of all the actual positives, how many did the model correctly identify?"

Recall = TP / (TP + FN)

73 of 87

4. F1 Score

F1 Score is the harmonic mean of Precision & Recall. It balances the two metrics into a single number, making it especially useful when Precision & Recall are in trade-off.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
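The four metrics can be computed directly from confusion-matrix counts; the counts below are hypothetical, chosen only to illustrate the formulas:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, round(precision, 4), recall, round(f1, 4))
```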

74 of 87


75 of 87

76 of 87

77 of 87


78 of 87

79 of 87

Regression Performance

Regression performance refers to how well a regression model predicts continuous numerical values like prices, temperatures, or sales figures.

  • It's evaluated using metrics that quantify the difference between predicted & actual values.
  • Performance is measured by how close the model's predictions are to the actual observed values.
  • Different metrics are used to quantify the errors between predicted & actual values. They are:
    1. Mean Absolute Error (MAE)
    2. Mean Squared Error (MSE)
    3. R-Squared (R²)
    4. Root Mean Squared Error (RMSE)

80 of 87

1. Mean Absolute Error (MAE)

  • Calculates the average of the absolute differences between predicted & actual values. Less sensitive to outliers than MSE.
  • It's a measurement of the typical absolute discrepancies between a dataset's actual values & predicted values.

Formula:

MAE = (Σ |actual − predicted|) / (Total no. of data points)

81 of 87

82 of 87

2. Mean Squared Error (MSE)

  • Calculates the average squared difference between predicted & actual values.
  • Sensitive to outliers due to the squaring of errors.
  • It measures the average of the squared discrepancies between a dataset's actual values & predicted values.
  • MSE is frequently utilized in regression problems & is used to assess how well predictive models work.

Formula:

MSE = (Σ (actual − predicted)²) / (Total no. of data points)

83 of 87

84 of 87

3. R-Squared (Coefficient of Determination)

  • Indicates the proportion of variance in the dependent variable that is predictable from the independent variables. It typically ranges from 0 to 1, with higher values indicating a better fit (it can be negative when the model fits worse than simply predicting the mean).
  • It quantifies the percentage of the dependent variable's variation that the model's independent variables contribute to.
  • R2 is a useful statistic for evaluating the overall effectiveness and explanatory power of a regression model.

85 of 87

4.Root Mean Squared Error (RMSE):

  • The square root of MSE, providing a metric in the same units as the target variable, making it easier to interpret.
  • It is a commonly used metric in regression analysis & ML to measure the accuracy of a predictive model's fit, especially when large errors are particularly undesirable.
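The four regression metrics can be computed together from a small set of actual and predicted values (hypothetical numbers, for illustration only):

```python
import math

# Hypothetical actual values y and model predictions y_hat.
y     = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 6.0, 9.5]

n = len(y)
errors = [yi - pi for yi, pi in zip(y, y_hat)]

mae  = sum(abs(e) for e in errors) / n          # mean absolute error
mse  = sum(e * e for e in errors) / n           # mean squared error
rmse = math.sqrt(mse)                           # root mean squared error

mean_y = sum(y) / n
ss_res = sum(e * e for e in errors)             # residual sum of squares
ss_tot = sum((yi - mean_y) ** 2 for yi in y)    # total sum of squares
r2 = 1 - ss_res / ss_tot                        # coefficient of determination

print(mae, mse, round(rmse, 4), round(r2, 4))
```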

86 of 87

87 of 87