
K Nearest Neighbor Classification


Bayes Classifier: Recap

[Figure: posterior probabilities P(HILSA | L), P(TUNA | L), and P(SHARK | L) plotted as functions of the feature L.]

Maximum A Posteriori (MAP) Rule

Distributions are assumed to belong to a particular family (e.g., Gaussian), and their parameters are estimated from the training data.


Bayes Classifier: Recap

[Figure: posterior probabilities P(HILSA | L), P(TUNA | L), and P(SHARK | L) evaluated over a small window L ± δ.]

Approximate Maximum A Posteriori (MAP) Rule

Non-parametric (data-driven) approach: consider a small window around L and find which class is most populous in that window (a rough sketch follows below).
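
As an illustration of this data-driven rule, the sketch below (function names, variable names, and the toy values are ours, not from the slides) counts the training samples of each class that fall inside the window L ± δ and returns the most populous class:

```python
import numpy as np
from collections import Counter

def window_classify(feature, labels, query, delta):
    """Approximate MAP: majority class among training samples falling in [query - delta, query + delta]."""
    in_window = np.abs(feature - query) <= delta
    if not in_window.any():
        return None                      # no training sample falls inside the window
    return Counter(labels[in_window]).most_common(1)[0][0]

# Toy usage with made-up values of the feature L and class labels
feature = np.array([20.0, 22.0, 35.0, 40.0, 41.0, 60.0])
labels = np.array(["HILSA", "HILSA", "TUNA", "TUNA", "TUNA", "SHARK"])
print(window_classify(feature, labels, query=38.0, delta=5.0))  # -> "TUNA"
```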


Nearest Neighbor Classifiers

  • Basic idea:
    • If it walks like a duck, quacks like a duck, then it’s probably a duck

[Diagram: for a test record, compute the distance to every training record, then choose the k "nearest" records.]


Basic Idea

  • The k-NN classification rule assigns to a test sample the majority class label of its k nearest training samples (a minimal sketch is given below)
  • In practice, k is usually chosen to be odd, so as to avoid ties
  • The k = 1 rule is generally called the nearest-neighbor classification rule
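
As a concrete illustration, here is a minimal, unoptimized sketch of the k-NN rule in Python/NumPy (function and variable names are ours): it computes the Euclidean distance from a query to every training sample and takes a majority vote over the k closest.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by the majority label of its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)      # distance to every training sample
    nearest = np.argsort(dists)[:k]                        # indices of the k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote over those labels

# Toy usage
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "duck"
```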


Definition of Nearest Neighbor

The k nearest neighbors of a record x are the training data points that have the k smallest distances to x


Voronoi Diagram

Properties:

  1. Every point inside a sample's Voronoi cell is closer to that sample than to any other sample, so the sample is the nearest neighbor of all points in its cell
  2. For any sample, its nearest neighboring sample lies across the closest edge of its own Voronoi cell (a small SciPy sketch follows below)
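
For reference, SciPy can compute the Voronoi diagram of a training set directly; a small sketch (the point coordinates are made up for illustration):

```python
import numpy as np
from scipy.spatial import Voronoi

# Each training sample owns one Voronoi cell; every location inside that cell
# is closer to the owning sample than to any other sample.
points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [3.0, 2.0]])
vor = Voronoi(points)

# ridge_points lists pairs of samples whose cells share an edge,
# i.e. the pairs of samples that are Voronoi neighbours of each other.
print(vor.ridge_points)
```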


Distance-weighted k-NN

  • Replace the unweighted majority vote by a distance-weighted vote, e.g. weight the vote of each neighbor xi by the inverse of its distance to the query xq: wi = 1 / d(xq, xi)

General kernel functions, such as Parzen windows, may be used instead of the inverse distance (a sketch of the weighted vote follows below).
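
A minimal sketch of the distance-weighted vote (inverse-distance weights, with a small epsilon to avoid division by zero; the naming is ours):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=5, eps=1e-12):
    """Distance-weighted k-NN: each of the k nearest neighbours votes with weight 1/d."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)        # inverse-distance weights w_i
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + w  # accumulate weighted votes per class
    return max(votes, key=votes.get)              # class with the largest total weight
```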


Predicting Continuous Values

  • Replace the vote over class labels by a weighted average of the neighbors' target values: f(xq) = Σ wi·yi / Σ wi, where the sums run over the k nearest neighbors (sketched below)
  • Note: the unweighted case corresponds to wi = 1 for all i
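
A small sketch of the corresponding regression rule, assuming NumPy arrays for the training data (names are ours):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5, weighted=True, eps=1e-12):
    """Predict a continuous value as the (inverse-distance-weighted) mean of the k nearest targets."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] + eps) if weighted else np.ones_like(dists[nearest])
    return np.sum(w * y_train[nearest]) / np.sum(w)   # weighted average; w_i = 1 gives the plain mean
```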


Nearest-Neighbor Classifiers: Issues

  • The value of k, the number of nearest neighbors to retrieve
  • Choice of Distance Metric to compute distance between records
  • Computational complexity
    • Size of training set
    • Dimension of data


Value of K

  • Choosing the value of k:
    • If k is too small, sensitive to noise points
    • If k is too large, neighborhood may include points from other classes

Rule of thumb: k = sqrt(N), where N is the number of training points (a simple validation-based alternative is sketched below).
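
The sqrt(N) rule is only a heuristic; a common alternative is to pick k by accuracy on a held-out validation set. A small sketch, reusing the knn_predict function from the earlier example (the helper name and the validation split are ours):

```python
def choose_k(X_train, y_train, X_val, y_val, k_values):
    """Return the k from k_values with the highest accuracy on the validation set."""
    best_k, best_acc = None, -1.0
    for k in k_values:
        correct = sum(
            knn_predict(X_train, y_train, x, k=k) == y   # knn_predict defined in the earlier sketch
            for x, y in zip(X_val, y_val)
        )
        acc = correct / len(y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```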


Distance Metrics
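
For reference, common distance metrics used with k-NN include the Minkowski family, which covers Euclidean (p = 2) and Manhattan (p = 1) distance; the sketch below shows these standard definitions (they may differ from the exact list on the original slide):

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance: p = 1 gives Manhattan distance, p = 2 gives Euclidean distance."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, p=1))  # Manhattan: 5.0
print(minkowski(x, y, p=2))  # Euclidean: ~3.606
```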


Distance Measure: Scale Effects

  • Different features may have different measurement scales
    • E.g., patient weight in kg (range [50,200]) vs. blood protein values in ng/dL (range [-3,3])
  • Consequences
    • Patient weight will have a much greater influence on the distance between samples
    • May bias the performance of the classifier


Standardization

  • Transform raw feature values into z-scores: zij = (xij − μj) / σj

    • xij is the value for the ith sample and jth feature
    • μj is the average of all xij for feature j
    • σj is the standard deviation of all xij over all input samples
  • Range and scale of z-scores should be similar (provided the distributions of raw feature values are alike); a small sketch follows below
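
A small sketch of the z-score transformation, using the training set's statistics for both training and test data (names are ours):

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score each feature using the training set's mean (mu_j) and standard deviation (sigma_j)."""
    mu = X_train.mean(axis=0)        # per-feature average mu_j
    sigma = X_train.std(axis=0)      # per-feature standard deviation sigma_j
    sigma[sigma == 0] = 1.0          # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```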


Nearest Neighbor: Dimensionality

  • Problem with Euclidean measure:
    • High dimensional data
      • curse of dimensionality
    • Can produce counter-intuitive results
    • Shrinking density – sparsification effect

Example with 12-dimensional binary vectors:

  1 1 1 1 1 1 1 1 1 1 1 0
  0 1 1 1 1 1 1 1 1 1 1 1

  vs

  1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 1

Each pair differs in exactly two positions, so both pairs have the same Euclidean distance d = 1.4142 (= sqrt(2)), even though the first pair agree on ten 1s and the second pair agree on none.


Distance for Nominal Attributes
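
A common choice for nominal attributes is the simple overlap (matching) measure: per-attribute distance 0 when the values match and 1 otherwise. The sketch below shows this standard definition (it is an illustration, not necessarily the exact function on the original slide):

```python
def overlap_distance(a, b):
    """Count mismatching nominal attributes: per-attribute distance is 0 if equal, 1 otherwise."""
    return sum(ai != bi for ai, bi in zip(a, b))

print(overlap_distance(("red", "circle", "small"), ("red", "square", "small")))  # -> 1
```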


Distance for Heterogeneous Data

Wilson, D. R. and Martinez, T. R., Improved Heterogeneous Distance Functions, Journal of Artificial Intelligence Research, vol. 6, no. 1, pp. 1-34, 1997
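
One of the functions proposed in that paper is the Heterogeneous Euclidean-Overlap Metric (HEOM), which uses the overlap measure for nominal attributes and a range-normalized difference for numeric ones. A sketch under those assumptions (argument names are ours):

```python
import numpy as np

def heom(x, y, is_nominal, ranges):
    """Heterogeneous Euclidean-Overlap Metric over mixed nominal/numeric attributes.
    is_nominal[j]: whether attribute j is nominal; ranges[j]: max - min of numeric attribute j."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(x, y)):
        if xj is None or yj is None:          # missing values get the maximum per-attribute distance
            d = 1.0
        elif is_nominal[j]:
            d = 0.0 if xj == yj else 1.0      # overlap measure for nominal attributes
        else:
            d = abs(xj - yj) / ranges[j]      # range-normalized difference for numeric attributes
        total += d * d
    return np.sqrt(total)                     # Euclidean combination of per-attribute distances
```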


Nearest Neighbour: Computational Complexity

  • Expensive
    • To determine the nearest neighbour of a query point q, must compute the distance to all N training examples
      • Pre-sort training examples into fast data structures (kd-trees)
      • Compute only an approximate distance (LSH)
      • Remove redundant data (condensing)
  • Storage Requirements
    • Must store all training data
      • Remove redundant data (condensing)
      • Pre-sorting often increases the storage requirements
  • High Dimensional Data
    • β€œCurse of Dimensionality”
      • Required amount of training data increases exponentially with dimension
      • Computational cost also increases dramatically
      • Partitioning techniques degrade to linear search in high dimension


Reduction in Computational Complexity

  • Reduce size of training set
    • Condensation, editing

  • Use geometric data structure for high dimensional search


Condensation: Decision Regions

Each cell contains one sample, and every location within the cell is closer to that sample than to any other sample.

A Voronoi diagram divides the space into such cells.

Every query point will be assigned the classification of the sample within that cell. The decision boundary separates the class regions based on the 1-NN decision rule.

Knowledge of this boundary is sufficient to classify new points.

The boundary itself is rarely computed; many algorithms seek to retain only those points necessary to generate an identical boundary.


Condensing

  • Aim is to reduce the number of training samples
  • Retain only the samples that are needed to define the decision boundary

  • Decision Boundary Consistent – a subset whose nearest neighbour decision boundary is identical to the boundary of the entire training set
  • Minimum Consistent Set – the smallest subset of the training data that correctly classifies all of the original training data

[Figures: the original data, the condensed data, and the minimum consistent set.]


Condensing

  • Condensed Nearest Neighbour (CNN)
  1. Initialize subset with a single (or K) training example
  2. Classify all remaining samples using the subset, and transfer any incorrectly classified samples to the subset
  3. Return to 2 until no transfers occurred or the subset is full

    • Incremental
    • Order dependent
    • Neither minimal nor decision-boundary consistent
    • O(n³) for the brute-force method (a sketch of the procedure follows below)
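
A minimal sketch of the CNN procedure, assuming NumPy arrays X (samples) and y (labels); the function names are ours:

```python
import numpy as np
from collections import Counter

def knn_label(X_sub, y_sub, x, k=1):
    """Label x by majority vote among its k nearest points in the current subset."""
    dists = np.linalg.norm(X_sub - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_sub[nearest]).most_common(1)[0][0]

def condensed_nn(X, y, k=1):
    """Condensed Nearest Neighbour: keep only the samples the current subset misclassifies."""
    keep = [0]                                  # 1. initialize the subset with a single training example
    changed = True
    while changed:                              # 3. repeat until a full pass makes no transfers
        changed = False
        for i in range(len(X)):                 # 2. classify every remaining sample with the subset
            if i in keep:
                continue
            if knn_label(X[keep], y[keep], X[i], k) != y[i]:
                keep.append(i)                  #    transfer misclassified samples into the subset
                changed = True
    return X[keep], y[keep]
```

Because samples are added in the order they are visited, the result is order dependent, as noted above.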


High dimensional search

  • Given a point set and a nearest neighbor query point

  • Find the points enclosed in a rectangle (range) around the query
  • Perform linear search for nearest neighbor only in the rectangle



kd-tree: data structure for range search

  • Index data into a tree
  • Search on the tree
  • Tree construction: at each level we use a different dimension to split (a small construction sketch follows after the figure)

[Figure: example kd-tree over 2-D points — the root splits at x = 5 (branches x < 5 and x >= 5), with further splits at y = 3, y = 6, and x = 6, partitioning the plane into regions A–E.]
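
A small sketch of the construction rule described above (cycling the split dimension level by level and splitting at the median); the dictionary representation and point values are ours:

```python
def build_kdtree(points, depth=0):
    """Build a kd-tree: at each level split the points on a different dimension."""
    if not points:
        return None
    axis = depth % len(points[0])              # alternate the splitting dimension level by level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                     # the median point becomes the splitting node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),       # smaller coordinates on this axis
        "right": build_kdtree(points[mid + 1:], depth + 1),  # larger coordinates on this axis
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["point"], tree["axis"])   # root split point and its splitting dimension
```

In practice one would typically use a library implementation such as scipy.spatial.KDTree rather than a hand-rolled tree.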


kd-tree example

[Figure: worked example of kd-tree construction with alternating splits on x and y; the split values shown include x = 5, x = 3, x = 7, x = 8, y = 2, y = 5, and y = 6.]


KNN: Alternate Terminology

  • Instance Based Learning
  • Lazy Learning
  • Case Based Reasoning
  • Exemplar Based Learning