K Nearest Neighbor Classification
Bayes Classifier: Recap
L
P( HILSA | L)
P( TUNA | L)
P( SHARK | L)
Maximum Aposteriori (MAP) Rule
Distributions assumed to be of particular family (e.g., Gaussian), and
parameters estimated from training data.
Bayes Classifier: Recap
L +- Ξ΄
P( HILSA | L)
P( TUNA | L)
P( SHARK | L)
Approximate Maximum Aposteriori (MAP) Rule
Non-parametric (data driven) approach: consider a small window around L,
Find which class is most populous in that window.
Nearest Neighbor Classifiers
Training Records
Test Record
Compute Distance
Choose k of the βnearestβ records
Basic Idea
Definition of Nearest Neighbor
K-nearest neighbors of a record x are data points that have the k smallest distance to x
Voronoi Diagram
Properties:
Distance-weighted k-NN
General Kernel functions like Parzen Windows may be considered
Instead of inverse distance.
Predicting Continuous Values
Nearest-Neighbor Classifiers: Issues
Value of K
Rule of thumb:
K = sqrt(N)
N: number of training points
Distance Metrics
Distance Measure: Scale Effects
Standardization
Nearest Neighbor : Dimensionality
1 1 1 1 1 1 1 1 1 1 1 0
0 1 1 1 1 1 1 1 1 1 1 1
1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1
vs
d = 1.4142
d = 1.4142
Distance for Nominal Attributes
Distance for Heterogeneous Data
Wilson, D. R. and Martinez, T. R., Improved Heterogeneous Distance Functions, Journal of Artificial Intelligence Research, vol. 6, no. 1, pp. 1-34, 1997
Nearest Neighbour : Computational Complexity
Reduction in Computational Complexity
Condensation: Decision Regions
Each cell contains one sample, and every location within the cell is closer to that sample than to any other sample.
A Voronoi diagram divides the space into such cells.
Every query point will be assigned the classification of the sample within that cell. The decision boundary separates the class regions based on the 1-NN decision rule.
Knowledge of this boundary is sufficient to classify new points.
The boundary itself is rarely computed; many algorithms seek to retain only those points necessary to generate an identical boundary.
Condensing
Original data
Condensed data
Minimum Consistent Set
Condensing
Condensing
Condensing
Condensing
Condensing
Condensing
Condensing
High dimensional search
Query
kd-tree: data structure for range search
x=5
y=3
y=6
x=6
A
B
C
D
E
x<5
x>=5
kd-tree example
X=5
y=5
y=6
x=3
y=2
x=8
x=7
X=5
X=8
X=7
X=3
Y=6
Y=2
KNN: Alternate Terminologies