MACHINE LEARNING
Mr. P.T.Krishna Sai
Assistant Professor
UNIT-2: Nearest Neighbor-Based Models
Introduction to Proximity Measures, Distance Measures, Non-Metric Similarity Functions, Proximity Between Binary Patterns, Different Classification Algorithms Based on the Distance Measures, K-Nearest Neighbor Classifier, Radius Distance Nearest Neighbor Algorithm, KNN Regression, Performance of Classifiers, Performance of Regression Algorithms.
Proximity Measures
Similarity
Definition: A numerical measure of how alike two data objects are.
Characteristics:
It is higher when objects are more alike.
Often falls in the range [0, 1].
Similarity measure for two objects i & j returns:
0 – if they are unlike.
1 – if they are alike.
Key Principles:
The higher the similarity value, the greater the similarity between objects.
1 indicates complete similarity.
Dissimilarity
Definition: Numerical measures of how different two objects are.
Characteristics:
Lower when objects are more alike.
Minimum dissimilarity is often zero.
Upper limit varies.
Dissimilarity measure for two objects i & j returns:
0 – if they are alike.
1 – if they are unlike.
Key Principles:
The higher the dissimilarity value, the greater the dissimilarity between objects.
1 indicates complete dissimilarity.
General Relation
The general relation between similarity (s) and dissimilarity (d) is:
s = 1 – d (or) d = 1 - s
Case 1: Equal Values
Problem: Let p = 10 and q = 10. Find the dissimilarity and similarity.
Calculation: Since p = q, the dissimilarity is d = 0, so s = 1 - d = 1 - 0 = 1.
Result: For p = q, the objects have 0 dissimilarity and 1 similarity.
Case 2: Unequal Values
Problem: Let p = 10 and q = 25. Find the dissimilarity and similarity.
Calculation: Since p ≠ q, the dissimilarity is d = 1.
Using the formula s = 1 - d:
s = 1 - 1 = 0
Result: For p ≠ q, the objects have 1 dissimilarity and 0 similarity.
2. Ordinal Attributes
For ordinal attributes, dissimilarity is computed on ranks: if an attribute has M ordered states, each value is replaced by its rank r ∈ {1, ..., M}, and d = |r_p - r_q| / (M - 1), which always lies in [0, 1].
Case 1:
Given: p = 10 and q = 5 on an attribute with M = 2 states, so the two values map to ranks 2 and 1.
Formula used: d = |r_p - r_q| / (M - 1)
Calculation:
d = |2 - 1| / (2 - 1) = 1, so s = 1 - d = 0
Note: Substituting the raw values instead of their ranks would give d = 5 and s = 1 - 5 = -4, results outside [0, 1] that signal the error.
3. Interval or Ratio Attributes
This section explores how to convert dissimilarity values (d) into similarity values (s).
Given Dissimilarities:
The values of d are 0, 1, 10, 100.
Similarity formula used: s = 1 / (1 + d)
Calculations for s = 1 / (1 + d):
Dissimilarity (d) | Similarity Calculation (s)    | Result |
d = 0             | s = 1 / (1 + 0)               | s = 1 |
d = 1             | s = 1 / (1 + 1) = 1/2         | s = 0.5 |
d = 10            | s = 1 / (1 + 10) = 1/11       | s ≈ 0.09 |
d = 100           | s = 1 / (1 + 100) = 1/101     | s ≈ 0.01 |
This demonstrates that as dissimilarity (d) increases, similarity (s) decreases toward zero, reflecting the inverse relationship between the two measures.
Data Matrix
Suppose that we have n objects (e.g., persons or courses) described by p attributes (e.g., age, height, weight, and gender).
The objects are:
Dissimilarity matrix (object-by-object structure):
This structure stores the proximities available for all pairs of the n objects.
The proximity measures can be classified into two types:
Distance Measures :-
Euclidean Distance
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
Final Distance Matrix (row i = A[i], column j = B[j]):
       | B[0]   | B[1]   |
A[0]   | 5.6569 | 8.4853 |
A[1]   | 2.8284 | 5.6569 |
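As a minimal sketch in plain Python (standard library only), the distance matrix above can be reproduced as follows:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

A = [(1, 2), (3, 4)]  # rows A[0] and A[1]
B = [(5, 6), (7, 8)]  # rows B[0] and B[1]

# Pairwise distance matrix, matching the table above
dist = [[round(euclidean(a, b), 4) for b in B] for a in A]
print(dist)  # [[5.6569, 8.4853], [2.8284, 5.6569]]
```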
3D Euclidean Distance
Applications
Euclidean distance is a useful metric in many algorithms, including:
Manhattan Distance
Also known as: Taxicab distance or city block distance.
Definition: Calculates the distance between two points in a grid-like space. It measures the sum of absolute differences of their Cartesian coordinates, effectively representing the distance a taxi would travel in a city with a grid layout like Manhattan.
Key Characteristics
Grid-based: It is used when movement is restricted to horizontal and vertical lines like city streets.
Sum of absolute differences: Instead of a straight line (Euclidean distance), it adds up the distances along each axis.
Non-negative: Because of the absolute values, the distance is always non-negative (zero only when the points coincide).
Mathematically, the Manhattan distance between two points in an n-dimensional space is the sum of the absolute differences of their Cartesian coordinates.
Mathematical Formulas
Example: Google Maps
Calculate the distance between two points, A and B.
Point A: (1, 1)
Point B: (4, 5)
Calculation Steps:
d = |1 - 4| + |1 - 5| = 3 + 4 = 7
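The example above can be checked with a short sketch:

```python
def manhattan(p, q):
    # Sum of absolute coordinate differences (city-block distance)
    return sum(abs(a - b) for a, b in zip(p, q))

# Points A(1, 1) and B(4, 5) from the example above
print(manhattan((1, 1), (4, 5)))  # |1-4| + |1-5| = 3 + 4 = 7
```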
Applications:
Manhattan distance finds applications in various fields, from data analysis to geospatial technology. Some key areas where Manhattan distance is particularly useful are:
Minkowski Distance
It is a mathematical measure used to calculate the distance between two points in a multi-dimensional space, generalizing both the Manhattan and Euclidean distances: D(x, y) = (Σ |xᵢ - yᵢ|^r)^(1/r), where order r = 1 gives the Manhattan distance and r = 2 the Euclidean distance.
Example: The distance calculations between two points in 3D space:
Point A: (2, 5, 8)
Point B: (3, 1, 6)
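A minimal sketch of the Minkowski distance for the points above, evaluated at orders 1 and 2:

```python
def minkowski(p, q, r):
    """Minkowski distance of order r: r=1 is Manhattan, r=2 is Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

A, B = (2, 5, 8), (3, 1, 6)
print(minkowski(A, B, 1))            # order 1 (Manhattan): 1 + 4 + 2 = 7.0
print(round(minkowski(A, B, 2), 4))  # order 2 (Euclidean): sqrt(21) ≈ 4.5826
```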
Chebyshev Distance
Also known as: "chessboard distance" or the L∞ (maximum) metric.
Definition: A distance metric that measures the distance between two points as the maximum absolute difference between their corresponding coordinates.
Usage: It is used to calculate distance in a space where movement can occur in any direction, including diagonally. It is especially useful in grid-based systems like chessboards or pathfinding in games where diagonal movement is allowed.
Characteristics: Simple and fast to compute; works well in environments with uniform movement in all directions.
Example Calculation
Given points:
A = (4, 7, 8)
B = (6, 9, 1)
Formula for 3D space:
D = max(|x1 - y1|, |x2 - y2|, |x3 - y3|)
Step-by-step Calculation:
|4 - 6| = 2, |7 - 9| = 2, |8 - 1| = 7
Result:
D = max(2, 2, 7) = 7
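The same calculation, sketched in Python:

```python
def chebyshev(p, q):
    # Maximum absolute coordinate difference ("chessboard" distance)
    return max(abs(a - b) for a, b in zip(p, q))

print(chebyshev((4, 7, 8), (6, 9, 1)))  # max(2, 2, 7) = 7
```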
Non-Metric Similarity Functions in Nearest Neighbor-Based Models
Proximity Measures
Proximity measures are categorized into the hierarchy shown below:
Characteristics of Non-metric Similarity Functions
The Cosine Similarity Formula
Cosine similarity is one of the most common measures of document similarity. If A and B are two document vectors, then:
cos(A, B) = (A · B) / (||A|| ||B||)
where A · B is the dot product and ||A||, ||B|| are the Euclidean norms of the vectors.
Mathematical Calculations
Vectors:
A = (4, 5, 6)
B = (2, 1, 2)
A · B = 4×2 + 5×1 + 6×2 = 25; ||A|| = √77 ≈ 8.775; ||B|| = √9 = 3
cos(A, B) = 25 / (8.775 × 3) ≈ 0.9497
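A short sketch computing the cosine similarity of the two vectors above:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))      # A . B
    norm_a = math.sqrt(sum(x * x for x in a))   # ||A||
    norm_b = math.sqrt(sum(x * x for x in b))   # ||B||
    return dot / (norm_a * norm_b)

print(round(cosine_similarity((4, 5, 6), (2, 1, 2)), 4))  # ≈ 0.9497
```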
Applications
Advantages
Disadvantages
Jaccard Similarity
Key Characteristics
Example:
doc1 = "Data is the new oil of the digital economy"
doc2 = "Data is a new oil"
Let's get the set of unique words for each document.
words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy'}
words_doc2 = {'data', 'is', 'a', 'new', 'oil'}
Now, we calculate the intersection and union of these two word sets to measure the Jaccard similarity between doc1 and doc2:
Intersection = {'data', 'is', 'new', 'oil'}, so |A ∩ B| = 4
The union contains 9 unique words, so |A ∪ B| = 9
J(doc1, doc2) = 4 / 9 ≈ 0.44
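The two-document comparison can be sketched directly with Python sets:

```python
doc1 = "Data is the new oil of the digital economy"
doc2 = "Data is a new oil"

words_doc1 = set(doc1.lower().split())  # 8 unique words
words_doc2 = set(doc2.lower().split())  # 5 unique words

intersection = words_doc1 & words_doc2  # {'data', 'is', 'new', 'oil'}
union = words_doc1 | words_doc2         # 9 unique words in total

jaccard = len(intersection) / len(union)
print(round(jaccard, 4))  # 4 / 9 ≈ 0.4444
```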
Applications:
Pearson Correlation
The Pearson correlation coefficient between variables x and y is defined as:
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / (√Σ(xᵢ - x̄)² · √Σ(yᵢ - ȳ)²)
Note on Correlation Values
The correlation coefficient (r) indicates the strength and direction of a linear relationship:
Uses of Correlation
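The Pearson coefficient can be sketched directly from its definition; the data here is hypothetical, chosen so the relationship is perfectly linear:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: a perfect increasing linear relationship gives r = 1
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 4))  # 1.0
```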
Proximity Between Binary Patterns
In Machine Learning & pattern recognition, measuring the proximity (similarity or dissimilarity) between binary patterns (vectors of 0's & 1's) is essential for tasks like clustering, classification & information retrieval.
Common Methods:-
Hamming Distance (HD):-
It counts the number of bit positions where two binary vectors differ.
Formula: HD(x, y) = Σᵢ |xᵢ - yᵢ|
Simple Matching Coefficient (SMC)
SMC is a commonly used similarity coefficient. It is represented mathematically as:
SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
where f11 counts the positions where both bits are 1, f00 where both are 0, and f10, f01 where the bits disagree.
Example: Two binary vectors, x and y, each containing 10 bits.
Vector x : 1 0 0 0 0 0 1 0 0 0
Vector y : 0 0 0 0 0 0 1 0 0 1
From these vectors, the following counts are derived from pairwise bit comparisons: f11 = 1, f10 = 1, f01 = 1, and f00 = 7, giving SMC = (1 + 7) / 10 = 0.8.
Jaccard Similarity & Distance
The Jaccard Similarity (also known as the Jaccard Index or Jaccard Coefficient) and the Jaccard Distance are statistical measures used to understand the relationship between two sets of data. They are particularly popular in data science, ecology, and information retrieval (like comparing two documents).
Jaccard Similarity Formula
J(A, B) = |A ∩ B| / |A ∪ B|, and the Jaccard distance is d_J = 1 - J(A, B).
Example:
Cosine Similarity
Treats binary vectors as real vectors and computes the cosine of the angle between them.
Formula: cos(x, y) = (x · y) / (||x|| ||y||), which for binary vectors equals f11 / √((f11 + f10)(f11 + f01)).
Dice Similarity Coefficient
Gives more weight to matching 1's: Dice(x, y) = 2·f11 / (2·f11 + f10 + f01).
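The binary measures above can be sketched on the 10-bit example vectors x and y from the SMC section:

```python
x = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

pairs = list(zip(x, y))
f11 = pairs.count((1, 1))  # both bits 1
f10 = pairs.count((1, 0))
f01 = pairs.count((0, 1))
f00 = pairs.count((0, 0))  # both bits 0

hamming = f10 + f01                     # positions that differ: 2
smc = (f11 + f00) / len(pairs)          # (1 + 7) / 10 = 0.8
jaccard = f11 / (f11 + f10 + f01)       # 1 / 3
dice = 2 * f11 / (2 * f11 + f10 + f01)  # 2 / 4 = 0.5
print(hamming, smc, round(jaccard, 4), dice)
```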
Different classification algorithms based on distance measures
These algorithms determine the class of a sample by computing distances between data points in a feature space, typically assigning a sample to the class whose samples lie closest to it.
Different types of classification algorithms based on distance measures in ML are:
1. KNN :-
Distance metrics are :-
2. SVM (Support Vector Machine)
3. Nearest Centroid Classifier
Distance Measures
4. Learning Vector Quantization (LVQ)
5. Radius Neighbors Classifier
6. Fuzzy k-NN
k-Nearest Neighbor Classifier
Types of kNN:
1. Classification:
2. Regression:-
kNN algorithm:-
KNN FLOW CHART
Selecting the Value of k in the KNN Algorithm
Methods to Select the Optimal k:
Advantages:
Disadvantages:
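The kNN classification steps can be sketched from scratch; the 2D training points and labels below are hypothetical, chosen only to illustrate the majority vote:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (point, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical 2D training data with two classes
train = [((1, 1), 'A'), ((1, 2), 'A'), ((2, 1), 'A'),
         ((6, 6), 'B'), ((7, 6), 'B'), ((6, 7), 'B')]
print(knn_classify(train, (2, 2), k=3))  # 'A'
print(knn_classify(train, (6, 5), k=3))  # 'B'
```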
Radius Distance Nearest Neighbors Algorithm
Radius-NN Algorithm:
Step 1: Select the radius R.
Step 2: Calculate the Euclidean distance from the new data point to every training point.
Step 3: Identify all training points whose distance falls within the radius R.
Step 4: Among these neighbors, count the number of data points in each category.
Step 5: Assign the new data point to the category containing the maximum number of neighbors.
Step 6: The model is ready.
It works by calculating the distance between a given data point and all the other data points in the dataset. It then identifies the data points that are within a specified radius of the given point.
The distance can be calculated using various metrics, such as Euclidean distance, Manhattan distance, or Cosine similarity.
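The steps above can be sketched as follows; the training points are hypothetical, and the query with no neighbors inside the radius returns None:

```python
import math
from collections import Counter

def radius_classify(train, query, radius):
    """Vote among all training points within `radius` of the query point."""
    labels = [label for point, label in train
              if math.dist(point, query) <= radius]
    if not labels:
        return None  # no neighbors fall inside the radius
    return Counter(labels).most_common(1)[0][0]

train = [((1, 1), 'A'), ((2, 1), 'A'), ((6, 6), 'B'), ((7, 7), 'B')]
print(radius_classify(train, (1.5, 1.0), radius=2.0))  # 'A'
print(radius_classify(train, (10, 10), radius=2.0))    # None
```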
RNN FLOW CHART
Advantages
Applications
KNN Regression
Types:
1. Regression: KNN predicts a continuous value for a new data point as the average of the target values of its k nearest neighbors.
Ex: If you're predicting house prices, KNN will use the average price of the k nearest houses to estimate the price of a new house.
2. Classification: KNN assigns a class label to a new data point based on the majority class of its nearest neighbors.
Algorithm:
Select the value of k in KNN algorithm:
Methods to select the optimal k:
Cross Validation:
Advantages:
Disadvantages:
Applications:
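KNN regression averages the target values of the k nearest neighbors; the (size, bedrooms) → price data below is hypothetical:

```python
import math

def knn_regress(train, query, k=3):
    """train: list of (point, value) pairs; predicts the mean value of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(value for _, value in nearest) / k

# Hypothetical house data: (size_sq_m, bedrooms) -> price
train = [((50, 1), 100.0), ((60, 2), 120.0), ((55, 2), 110.0),
         ((120, 4), 300.0), ((130, 4), 320.0)]
print(knn_regress(train, (58, 2), k=3))  # mean of 120, 110, 100 = 110.0
```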
Performance of Classifiers
In data science, classifier performance measures the predictive capabilities of ML models with metrics like:
Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or classifier) on a set of test data for which the true values are known.
Key Definitions
Predicted: Negative, Actual: Positive → False Negative (FN)
Predicted: Negative, Actual: Negative → True Negative (TN)
Predicted: Positive, Actual: Positive → True Positive (TP)
Predicted: Positive, Actual: Negative → False Positive (FP)
1. Accuracy
Accuracy is one of the metrics for evaluating a classification model. Formally, accuracy is the ratio of correct predictions to the total number of predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN).
2. Precision
Precision measures the accuracy of positive predictions. It answers the question, "Of all the items the model labeled as positive, how many are actually positive?" Precision = TP / (TP + FP).
3. Recall
Recall measures the model's ability to find the positive instances. It answers the question, "Of all the actual positives, how many did the model correctly identify?" Recall = TP / (TP + FN).
4. F1 Score
F1 Score is the harmonic mean of Precision & Recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). It balances the two metrics into a single number, making it especially useful when Precision & Recall are in trade-off.
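The four metrics can be computed together from confusion-matrix counts; the counts below are hypothetical, chosen only to illustrate the formulas:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical confusion-matrix counts
acc, prec, rec, f1 = classification_metrics(tp=40, tn=30, fp=10, fn=20)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))  # 0.7 0.8 0.67 0.73
```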
Regression Performance
Regression performance refers to how well a regression model predicts continuous numerical values like prices, temperatures, or sales figures.
1. Mean Absolute Error (MAE)
Formula:
MAE = (1/n) Σ |yᵢ - ŷᵢ|, the average absolute difference between the actual values yᵢ and the predicted values ŷᵢ.
2. Mean Squared Error (MSE)
MSE = (1/n) Σ (yᵢ - ŷᵢ)², the average squared difference between actual and predicted values; it penalizes large errors more heavily than MAE.
3. R-Squared (Coefficient of Determination)
R² = 1 - SS_res / SS_tot, the proportion of variance in the target that the model explains; values closer to 1 indicate a better fit.
4. Root Mean Squared Error (RMSE)
RMSE = √MSE, which expresses the error in the same units as the target variable.
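All four regression metrics can be computed together; the true values and predictions below are hypothetical:

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and R-squared for a set of predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return mae, mse, rmse, r2

# Hypothetical true values and model predictions
mae, mse, rmse, r2 = regression_metrics([3, 5, 7, 9], [2.5, 5.0, 7.5, 9.0])
print(mae, mse, round(rmse, 4), r2)  # 0.25 0.125 0.3536 0.975
```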