
Multivariate Analysis Practical STA701

Performing Classification Analysis

Copyright © 2023 by Retha Luus. All rights reserved.


Introduction

First two topics of block week 2:

    • Discriminant analysis
    • Classification analysis

Both concerned with separation of observations into groups

Discriminant analysis

    • Focus: exploratory analysis of grouped data
      • Understand differences between groups
      • Assess contribution of variables to group separation

Classification analysis

    • AKA: Predictive discriminant analysis
    • Focus:
      • Build a classification model/rule to assign new observations to groups (perform)
      • Assess the classification model’s predictive performance (assess)

  • Confusion matrix provides summary of classification rule’s predictive performance
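As a minimal sketch of how a confusion matrix turns into an error rate, the block below re-enters the two-group resubstitution counts reported later in these slides. The helper name `misclass_rate` is hypothetical, not part of any package.

```r
# Two-group resubstitution confusion matrix as reported in these slides
# (rows = actual group, columns = predicted group).
conf <- matrix(c(28,  4,
                  4, 28), nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("1", "2"), predicted = c("1", "2")))

# Hypothetical helper: the share of off-diagonal counts is the
# misclassification rate.
misclass_rate <- function(conf) 1 - sum(diag(conf)) / sum(conf)

err <- misclass_rate(conf)
print(err)
```

The same helper works unchanged for three or more groups, because it only relies on the diagonal holding the correctly classified counts.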

 


Resubstitution method

    • Use same data to compute and test classification functions
    • Provides apparent error rate
    • Too optimistic

Validation set method

    • AKA: Sample partitioning method
    • Partition data into training and validation/test sets
    • Compute classification functions on training data
    • Test classification functions on validation/test data
    • Improved estimate of misclassification error

LOOCV (leave-one-out cross-validation) method

    • AKA: Holdout method
    • Use all but one observation as training data to compute classification functions
    • Use left-out observation as validation/test data to test classification functions
    • Repeat n times (once per observation); misclassification error estimate is the average of the n estimates
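The LOOCV steps above can be spelled out as an explicit loop. This is only a sketch: it uses R's built-in `iris` data as a stand-in (an assumption, since the slide data sets are not bundled with R), with `Species` playing the role of `Group`.

```r
library(MASS)  # lda()

dat  <- iris                 # stand-in data; Species plays the role of Group
n    <- nrow(dat)
pred <- character(n)

for (i in seq_len(n)) {
  # Train on all observations except the i-th one
  fit <- lda(Species ~ ., data = dat[-i, ])
  # Test on the single left-out observation
  pred[i] <- as.character(predict(fit, newdata = dat[i, ])$class)
}

# Average of the n single-observation error estimates (each is 0 or 1)
loocv_err <- mean(pred != as.character(dat$Species))
loocv_err
```

In practice `lda(..., CV = TRUE)` avoids writing the loop by hand; the loop is shown only to make the procedure visible.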


Agenda

Two-Group Classification Analysis

Performing Two-Group Classification Analysis

Hands-on Practice

Several-Group Classification: Equal Covariances

Several-Group Classification: Unequal Covariances

Performing Several-Group Classification Analysis

Hands-on Practice

Summary

Two-Group Classification Analysis

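Data for Demo

  • Data frame dat5.1: grouping variable Group (two groups of 32 observations each) and four variables y1, y2, y3, y4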

 

Performing Two-Group Classification Analysis

  • Equal covariance matrices?

library(heplots)
boxM(cbind(y1,y2,y3,y4) ~ Group, data=dat5.1)

##
##  Box's M-test for Homogeneity of Covariance Matrices
##
## data:  Y
## Chi-Sq (approx.) = 13.551, df = 10, p-value = 0.1945

  • p-value = 0.1945 > 0.05: equal covariance matrices not rejected, so use the linear classification function

  • Resubstitution method:
    1. Construct classification rule:

library(MASS)
dat5.1.fit <- lda(Group~., data=dat5.1)

    2. Use classification rule to predict same data:

dat5.1.pred <- predict(dat5.1.fit)

    3. Calculate confusion matrix (rows: actual group, columns: predicted group):

dat5.1.conf <- table(dat5.1$Group, dat5.1.pred$class)
dat5.1.conf
##      1  2
##   1 28  4
##   2  4 28

    4. Calculate estimate of misclassification error (apparent error):

dat5.1.apperr <- (dat5.1.conf[1,2]+dat5.1.conf[2,1])/sum(dat5.1.conf)
dat5.1.apperr

## [1] 0.125

  • 8 misclassifications out of 64 observations
  • Apparent error rate: 12.5%

  • Validation set approach:
    1. Partition the data (70:30):

library(caret)
set.seed(1)
inds <- createDataPartition(dat5.1$Group, times=1, p=0.7)
dat5.1.train <- dat5.1[inds[[1]],]
dat5.1.test <- dat5.1[-inds[[1]],]

    2. Construct the classification rule on training data:

library(MASS)
dat5.1.fit <- lda(Group~., data=dat5.1.train)

    3. Use classification rule to predict test data:

dat5.1.pred <- predict(dat5.1.fit, newdata = dat5.1.test)

    4. Calculate confusion matrix:

dat5.1.conf <- table(dat5.1.test$Group, dat5.1.pred$class)
dat5.1.conf
##     1 2
##   1 9 0
##   2 3 6

    5. Calculate misclassification error estimate:

dat5.1.vserr <- (dat5.1.conf[1,2]+dat5.1.conf[2,1])/sum(dat5.1.conf)
dat5.1.vserr

## [1] 0.1666667

  • 3 misclassifications out of 18 test observations
  • Misclassification rate estimate: 16.7%
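For readers without caret, a stratified 70:30 partition can be sketched in base R. The helper `stratified_split` is hypothetical and only approximates the idea of sampling within each group; it is not a reimplementation of `createDataPartition`. The built-in `iris` data is used as a stand-in.

```r
library(MASS)  # lda()

set.seed(1)
dat <- iris  # stand-in data set; Species plays the role of Group

# Hypothetical helper: sample a fraction p of the row indices within each
# group, so the training set keeps the group proportions.
stratified_split <- function(groups, p = 0.7) {
  idx_by_group <- split(seq_along(groups), groups)
  unlist(lapply(idx_by_group, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  }), use.names = FALSE)
}

train_idx <- stratified_split(dat$Species, p = 0.7)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit on the training data, predict the held-out test data
fit  <- lda(Species ~ ., data = train)
pred <- predict(fit, newdata = test)$class

conf  <- table(actual = test$Species, predicted = pred)
vserr <- 1 - sum(diag(conf)) / sum(conf)
vserr
```

The estimate depends on the random split, so a different seed will generally give a different value.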

  • LOOCV approach:
    1. Set CV=TRUE in lda for LOOCV:

library(MASS)
dat5.1.fit <- lda(Group~., data=dat5.1, CV=TRUE)

    2. Calculate confusion matrix:

dat5.1.conf <- table(dat5.1$Group, dat5.1.fit$class)
dat5.1.conf
##      1  2
##   1 28  4
##   2  5 27

    3. Calculate misclassification error estimate:

dat5.1.loocverr <- (dat5.1.conf[1,2]+dat5.1.conf[2,1])/sum(dat5.1.conf)
dat5.1.loocverr

## [1] 0.140625

  • 9 misclassifications out of 64 observations
  • Misclassification rate estimate: 14.1%


Hands-on Practice

  • Using the Beetle data (T5_5_FBEETLES.DAT)
    1. Find the classification function
    2. Assess this classification rule’s predictive performance:
      1. Resubstitution method
      2. Validation set method (80:20)
      3. LOOCV method

Several-Group Classification: Equal Covariances

Several-Group Classification: Unequal Covariances

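Several-group classification analysis

  • Equal covariance matrices?
    • Yes: linear classification functions (LDA)
    • No: quadratic classification functions (QDA)
  • Assess:
    • Resubstitution
    • Validation Set
    • LOOCV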


Data for Demo

  • Table 8.3 of Rencher and Christensen (2012) (T8_3_FOOTBALL.DAT)
  • 30 subjects in each of three groups: high school football players (1), college football players (2) and non-football players (3)
  • Six head measurements made on each subject (variables):
    • WDIM: head width
    • CIRCUM: head circumference
    • FBEYE: front-to-back measurement at eye level
    • EYEHD: eye-to-top-of-head measurement
    • EARHD: ear-to-top-of-head measurement
    • JAW: jaw width

## Data
dat8.3 <- read.table("Your path//T8_3_FOOTBALL.DAT", header=FALSE)
colnames(dat8.3) <- c("Group","WDIM","CIRCUM","FBEYE","EYEHD","EARHD","JAW")
dat8.3$Group <- factor(dat8.3$Group)

 

Performing Several-Group Classification Analysis

  • Equal covariance matrices?

library(heplots)
boxM(cbind(WDIM,CIRCUM,FBEYE,EYEHD,EARHD,JAW) ~ Group, data=dat8.3)

##
##  Box's M-test for Homogeneity of Covariance Matrices
##
## data:  Y
## Chi-Sq (approx.) = 57.472, df = 42, p-value = 0.05622

  • p-value = 0.05622 > 0.05: equal covariance matrices not rejected at the 5% level, so use linear discriminant analysis (LDA)

  • Resubstitution method:
    1. Use lda function to construct classification rules:

library(MASS)
dat8.3.lda <- lda(Group~., data=dat8.3)

    2. Use LDA classification rules to predict same data:

dat8.3.pred <- predict(dat8.3.lda)

    3. Calculate confusion matrix (rows: actual group, columns: predicted group):

dat8.3.lda.conf <- table(dat8.3$Group, dat8.3.pred$class)
dat8.3.lda.conf
##      1  2  3
##   1 26  1  3
##   2  1 20  9
##   3  2  8 20

    4. Calculate apparent error rate:

dat8.3.lda.apperr <- (sum(dat8.3.lda.conf) - dat8.3.lda.conf[1,1] -
                      dat8.3.lda.conf[2,2] - dat8.3.lda.conf[3,3]) / sum(dat8.3.lda.conf)
dat8.3.lda.apperr

## [1] 0.2666667

  • 24 misclassifications out of 90 observations
  • Apparent error rate: 26.7%
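The overall rate hides which groups are being confused. As a small sketch, the block below re-enters the LDA resubstitution confusion matrix reported above and computes the error rate per actual group; the object names are illustrative only.

```r
# LDA resubstitution confusion matrix as reported in these slides
# (rows = actual group, columns = predicted group).
conf <- matrix(c(26,  1,  3,
                  1, 20,  9,
                  2,  8, 20), nrow = 3, byrow = TRUE,
               dimnames = list(actual    = c("1", "2", "3"),
                               predicted = c("1", "2", "3")))

# Overall apparent error rate: share of off-diagonal counts
overall <- 1 - sum(diag(conf)) / sum(conf)

# Error rate within each actual group: 1 - correct / row total
by_group <- 1 - diag(conf) / rowSums(conf)

round(overall, 4)
round(by_group, 4)
```

Here groups 2 and 3 (college and non-football players) account for most of the errors, mainly by being assigned to each other.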

  • Validation set approach:
    1. Partition the data (70:30):

library(caret)
set.seed(1)
inds <- createDataPartition(dat8.3$Group, times=1, p=0.7)
dat8.3.train <- dat8.3[inds[[1]],]
dat8.3.test <- dat8.3[-inds[[1]],]

    2. Construct classification rules on training data:

library(MASS)
dat8.3.lda <- lda(Group~., data=dat8.3.train)

    3. Use LDA classification rules to predict test data:

dat8.3.lda.pred <- predict(dat8.3.lda, newdata = dat8.3.test)

    4. Calculate confusion matrix:

dat8.3.lda.conf <- table(dat8.3.test$Group, dat8.3.lda.pred$class)
dat8.3.lda.conf
##     1 2 3
##   1 7 1 1
##   2 0 6 3
##   3 1 2 6

    5. Calculate misclassification error estimate:

dat8.3.lda.vserr <- (sum(dat8.3.lda.conf) - dat8.3.lda.conf[1,1] -
                     dat8.3.lda.conf[2,2] - dat8.3.lda.conf[3,3]) / sum(dat8.3.lda.conf)
dat8.3.lda.vserr

## [1] 0.2962963

  • 8 misclassifications out of 27 test observations
  • Misclassification error estimate: 29.6%

  • LOOCV approach:
    1. Use lda function with CV=TRUE for LOOCV:

library(MASS)
dat8.3.lda <- lda(Group~., data=dat8.3, CV=TRUE)

    2. Calculate confusion matrix:

dat8.3.lda.conf <- table(dat8.3$Group, dat8.3.lda$class)
dat8.3.lda.conf
##      1  2  3
##   1 26  1  3
##   2  1 18 11
##   3  2  9 19

    3. Calculate misclassification error estimate:

dat8.3.lda.loocverr <- (sum(dat8.3.lda.conf) - dat8.3.lda.conf[1,1] -
                        dat8.3.lda.conf[2,2] - dat8.3.lda.conf[3,3]) / sum(dat8.3.lda.conf)
dat8.3.lda.loocverr

## [1] 0.3

  • 27 misclassifications out of 90 observations
  • Misclassification error estimate: 30.0%


Hands-on Practice

  • For the Fish data (T6_17_FISH.DAT):
    • Assess the LDA classification rules’ predictive performance:
      1. Resubstitution method
      2. Validation set method (80:20)
      3. LOOCV method
    • Assess the QDA classification rules’ predictive performance:
      1. Resubstitution method
      2. Validation set method (80:20)
      3. LOOCV method
    • Which performed better?

For Block Week 2 Practical Question 3:

Submit summary table of LDA and QDA error rates. Which would you choose?

QDA?

  • Equal covariance matrices?

library(heplots)
boxM(cbind(WDIM,CIRCUM,FBEYE,EYEHD,EARHD,JAW) ~ Group, data=dat8.3)

##
##  Box's M-test for Homogeneity of Covariance Matrices
##
## data:  Y
## Chi-Sq (approx.) = 57.472, df = 42, p-value = 0.05622

  • p-value is only just above 0.05, so also consider quadratic discriminant analysis (QDA)

  • Resubstitution method:
    1. Use qda function to construct classification rules:

library(MASS)
dat8.3.qda <- qda(Group~., data=dat8.3)

    2. Use QDA classification rules to predict same data:

dat8.3.pred <- predict(dat8.3.qda)

    3. Calculate confusion matrix:

dat8.3.qda.conf <- table(dat8.3$Group, dat8.3.pred$class)
dat8.3.qda.conf
##      1  2  3
##   1 27  1  2
##   2  2 21  7
##   3  1  4 25

    4. Calculate apparent error rate:

dat8.3.qda.apperr <- (sum(dat8.3.qda.conf) - dat8.3.qda.conf[1,1] -
                      dat8.3.qda.conf[2,2] - dat8.3.qda.conf[3,3]) / sum(dat8.3.qda.conf)
dat8.3.qda.apperr

## [1] 0.1888889

  • 17 misclassifications out of 90 observations
  • Apparent error rate: 18.9%

  • Validation set approach:
    1. Use same training and test data sets from LDA (to compare)
    2. Use training data to construct QDA classification rules:

library(MASS)
dat8.3.qda <- qda(Group~., data=dat8.3.train)

    3. Use QDA classification rules to predict test data:

dat8.3.qda.pred <- predict(dat8.3.qda, newdata = dat8.3.test)

    4. Calculate confusion matrix:

dat8.3.qda.conf <- table(dat8.3.test$Group, dat8.3.qda.pred$class)
dat8.3.qda.conf
##     1 2 3
##   1 6 2 1
##   2 0 5 4
##   3 2 1 6

    5. Calculate misclassification error estimate:

dat8.3.qda.vserr <- (sum(dat8.3.qda.conf) - dat8.3.qda.conf[1,1] -
                     dat8.3.qda.conf[2,2] - dat8.3.qda.conf[3,3]) / sum(dat8.3.qda.conf)
dat8.3.qda.vserr

## [1] 0.3703704

  • 10 misclassifications out of 27 test observations
  • Misclassification error estimate: 37.0%

  • LOOCV approach:
    1. Use qda function with CV=TRUE for LOOCV:

library(MASS)
dat8.3.qda <- qda(Group~., data=dat8.3, CV=TRUE)

    2. Calculate confusion matrix:

dat8.3.qda.conf <- table(dat8.3$Group, dat8.3.qda$class)
dat8.3.qda.conf
##      1  2  3
##   1 26  2  2
##   2  3 16 11
##   3  4  9 17

    3. Calculate misclassification error estimate:

dat8.3.qda.loocverr <- (sum(dat8.3.qda.conf) - dat8.3.qda.conf[1,1] -
                        dat8.3.qda.conf[2,2] - dat8.3.qda.conf[3,3]) / sum(dat8.3.qda.conf)
dat8.3.qda.loocverr

## [1] 0.3444444

  • 31 misclassifications out of 90 observations
  • Misclassification error estimate: 34.4%
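Because `lda` and `qda` share the same interface, the LOOCV comparison can be wrapped in one small function. This is a sketch only: `loocv_error` is a hypothetical helper, and the built-in `iris` data stands in for the slide data sets.

```r
library(MASS)  # lda(), qda()

# Hypothetical helper: LOOCV misclassification rate for a MASS fitting
# function that accepts CV = TRUE (lda or qda).
loocv_error <- function(fit_fun, formula, data, group) {
  cv <- fit_fun(formula, data = data, CV = TRUE)
  mean(as.character(cv$class) != as.character(data[[group]]))
}

errs <- c(LDA = loocv_error(lda, Species ~ ., iris, "Species"),
          QDA = loocv_error(qda, Species ~ ., iris, "Species"))
errs
```

Using the same function for both methods keeps the comparison like for like: same data, same estimation method, only the classification rule differs.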


Classification Analysis Summary

                         Misclassification Error Estimation Method
Classification Method    Resubstitution    Validation Set    LOOCV
Two-group                0.125             0.167             0.141
Several-group LDA        0.267             0.296             0.300
Several-group QDA        0.189             0.370             0.344

  • Apparent error (resubstitution) is too optimistic: it is the lowest estimate for all three classification methods
  • LOOCV is recommended for small data sets
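To make the optimism of the apparent error concrete, the reported estimates can be collected in a data frame and the gap to the LOOCV estimate computed. The column names below are illustrative choices; the numbers are the rounded values from the summary.

```r
# Error estimates as reported in the summary (rounded to three decimals)
summary_tab <- data.frame(
  method         = c("Two-group", "Several-group LDA", "Several-group QDA"),
  resubstitution = c(0.125, 0.267, 0.189),
  validation_set = c(0.167, 0.296, 0.370),
  loocv          = c(0.141, 0.300, 0.344)
)

# How far the apparent (resubstitution) error sits below the LOOCV estimate
summary_tab$optimism <- summary_tab$loocv - summary_tab$resubstitution
summary_tab
```

The gap is largest for QDA, which estimates a separate covariance matrix per group and so has more parameters to fit to the training data.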


Summary

Decide whether to use linear or quadratic classification functions

    • Are the covariance matrices equal?

Perform two-group and several-group classification analysis by:

    • writing your own classification functions (see notes)
    • using the lda and qda functions in MASS
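As a rough illustration of the first option, a two-group linear classification function can be written by hand and cross-checked against `lda`. This sketch assumes equal prior probabilities and equal covariance matrices, uses two species of the built-in `iris` data as stand-in groups, and is not the version from the course notes.

```r
library(MASS)  # lda(), used only to cross-check the hand computation

# Stand-in two-group data: two iris species, 50 observations each
dat <- droplevels(subset(iris, Species != "setosa"))
X   <- as.matrix(dat[, 1:4])
g   <- dat$Species
lev <- levels(g)

X1 <- X[g == lev[1], , drop = FALSE]
X2 <- X[g == lev[2], , drop = FALSE]
ybar1 <- colMeans(X1)
ybar2 <- colMeans(X2)

# Pooled sample covariance matrix
n1 <- nrow(X1); n2 <- nrow(X2)
S_pl <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)

# Linear classification function z = a'y, with the cutoff halfway between
# the two group means of z
a      <- solve(S_pl, ybar1 - ybar2)
cutoff <- 0.5 * sum(a * (ybar1 + ybar2))
z      <- drop(X %*% a)

by_hand  <- factor(ifelse(z > cutoff, lev[1], lev[2]), levels = lev)
from_lda <- predict(lda(Species ~ ., data = dat))$class

conf <- table(actual = g, predicted = by_hand)
conf
mean(by_hand == from_lda)  # share of observations where the two rules agree
```

With equal group sizes, `lda` uses equal priors by default, which is why the two rules are comparable here; with unequal group sizes the cutoff would need a prior adjustment.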


Calculate apparent error rate of classification functions

    • Resubstitution method

Calculate improved estimates of error rate

    • Validation set/Sample partition method
    • LOOCV/Holdout method