
Capstone Project

Seongjae Ryu

Machine Learning Engineer Nanodegree

August 20, 2021

CNN Project: Dog Breed Classifier

Project Report

I. Definition

Project Overview

The dog breed classifier is one of the popular CNN (convolutional neural network) projects [1]. The main problem is to identify a dog's breed from an input image. The model also needs to determine whether the input is an image of a dog at all. When a given input is identified as an image of a human face, the most similar dog breed is returned instead.

This is a supervised machine learning problem, specifically multi-class classification. The intention is to deploy the model behind APIs and build an application on top of those APIs.

Problem Statement

The problem is to build a dog breed classification model and deploy this model for an application.

Metrics

Of the datasets used in this project, only the one for the main problem, dog breed classification, is balanced. 'Accuracy' is an appropriate measure of performance only when the dataset is balanced, so it is used only to measure dog-breed-classification performance. The two detectors are instead evaluated by their rate of correct detections (see Benchmark).

  1. Human-face-image identifier: rate of correct detections (imbalanced data)

  2. Dog-image identifier: rate of correct detections

  3. Dog-breed classifier: accuracy (balanced data)

II. Analysis

Data Exploration

Dog images and human-face images are included in the dataset, which is labeled, organized, and provided by Udacity [2]. The dog image dataset is balanced, but the human-face dataset is imbalanced.

  1. Dog image dataset [3]
     1. It contains 8,351 RGB images of dogs, each with a labeled file name. The images are organized into train, test, and valid directories: 6,680 training images, 836 test images, and 835 validation images. Each of the 133 breeds is grouped in its own folder named with a label number and the breed name, e.g. '/dog_images/train/103.Mastiff/Mastiff_06826.jpg'. Image sizes vary.
     2. Balanced dataset, for which 'accuracy' is an appropriate metric.

  2. Human-face dataset [4]
     1. It contains 13,233 RGB human-face images, each with a labeled file name. Images are grouped by person, with a folder for each person's name, e.g. '/Daniele_Bergamin/Daniele_Bergamin_0001.jpg'. There are 5,750 folders and the data is imbalanced: some people have only a few images while others have many. Every image is 250 by 250 px.
     2. Imbalanced dataset, for which 'accuracy' is not an appropriate metric.

Exploratory Visualization

  1. Dog image dataset: It is suitable for modeling. Each label has enough data, and the data is properly divided into training, validation, and test sets.

  2. Human-face dataset: It is suitable for testing whether an image contains a human face, but not for training a new model, since some labels have only one image. This is the reason for using a pre-trained face detection model.

Algorithms and Techniques

A convolutional neural network is used for the multi-class classification task, the dog breed classifier. VGG16 is chosen as the pre-trained CNN model because it achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1,000 classes [5]. One drawback of pre-trained CNN models is that they fit only the specific situation they were trained for: if the conditions change, for example through parameter modification, the modified model needs to be retrained extensively to adjust its very large number of weights.

A Haar cascade classifier is used as the face detection model. Its primary benefits are speed and light weight, which suit a situation where face detection is not the primary problem [6].

  1. Dog-image identifier: Pre-trained VGG-16 torchvision model [7]. A minimal sketch follows this item.
     1. Load the pre-trained VGG-16 model via torchvision.
     2. Load the image and convert it into a tensor.
     3. Infer the image and check whether the predicted ImageNet class is one of the dog categories (see the dog detector in the Implementation section).
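A minimal sketch of this identifier is shown below, assuming the ImageNet dog range 151-268 described in the Implementation section; the ImageNet normalization statistics are an assumption, since the report only specifies the random 224-by-224 crop.

    import torch
    import torchvision.models as models
    import torchvision.transforms as transforms
    from PIL import Image

    # Load the pre-trained VGG-16 model (ImageNet weights) once.
    vgg16 = models.vgg16(pretrained=True)
    vgg16.eval()

    def VGG16_predict(img_path):
        """Return the ImageNet class index predicted by VGG-16 for one image."""
        transform = transforms.Compose([
            transforms.RandomResizedCrop(224),           # random 224x224 crop, as in the report
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406],  # ImageNet statistics (assumed)
                                 [0.229, 0.224, 0.225]),
        ])
        img = transform(Image.open(img_path).convert('RGB')).unsqueeze(0)
        with torch.no_grad():
            output = vgg16(img)
        return int(output.argmax(dim=1))                 # index of the maximum class score

    def dog_detector(img_path):
        """True if the predicted ImageNet index falls in the dog range 151-268."""
        return 151 <= VGG16_predict(img_path) <= 268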

  2. Dog-breed classifier: Fitted pre-trained VGG-16 torchvision model with a custom classifier for 133 breeds.
     1. Load the pre-trained VGG-16 model via torchvision.
     2. Define a new custom classifier to fit the model to the dataset with 133 breed categories.
     3. Load the image and convert it into a tensor with transforms.
     4. Train the model with the train and validation datasets.
     5. Test the model with the test dataset.
     6. Deploy the model and infer the most likely breed.

  3. Human-face-image identifier: OpenCV with the pre-trained face detector haarcascade_frontalface_alt [8]. A minimal sketch follows this item.
     1. Load the pre-trained OpenCV model with haarcascade_frontalface_alt.xml.
     2. Load the image and convert it from RGB to grayscale.
     3. Infer the image.
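A minimal sketch of the face detector; the local path to the cascade XML file is an assumption.

    import cv2

    # Load OpenCV's pre-trained frontal-face Haar cascade (local path is an assumption).
    face_cascade = cv2.CascadeClassifier('haarcascades/haarcascade_frontalface_alt.xml')

    def face_detector(img_path):
        """True if at least one human face is detected in the image."""
        img = cv2.imread(img_path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)    # the cascade works on grayscale images
        faces = face_cascade.detectMultiScale(gray)
        return len(faces) > 0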

  4. Human-face-image Resemble-dog classifier: Combination of the human-face-image identifier and the dog-breed classifier.
     1. Infer the image with the human-face-image identifier; if the inference says it is a human-face image, infer the resembling breed with the dog-breed classifier.

Benchmark

  1. The human-face-image identifier must correctly detect a face in at least 70% of human-face images and, conversely, report no face in at least 70% of dog images.

  2. The dog-image identifier must correctly detect a dog in at least 70% of dog images and, conversely, report no dog in at least 70% of human-face images.

  3. The dog-breed classifier must reach 60% or higher accuracy.

III. Methodology

Data Processing

Dog images and human-face images are included in the dataset, which is labeled, organized, and provided by Udacity [2]. The path of each data file is stored in a NumPy array, and the arrays are sliced for model tests. Further transformations are still needed for each model after this data processing.

  1. Data path list creation: the file paths of each dataset are collected into NumPy arrays using the 'glob' and 'numpy' libraries.

  2. Data path list division for tests: each NumPy array is sliced, as sketched below.
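A sketch of this data processing; the directory names follow the dataset archives [3][4], and the slice size of 100 is an assumption.

    import numpy as np
    from glob import glob

    # Collect every image path of each dataset into a NumPy array.
    human_files = np.array(glob('lfw/*/*'))
    dog_files = np.array(glob('dog_images/*/*/*'))

    # Slice a small subset of each array for quick detector tests
    # (the subset size of 100 is an assumption).
    human_files_short = human_files[:100]
    dog_files_short = dog_files[:100]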

Implementation

  1. Import libraries: Python (3.6.3), PIL (5.2.0), cv2 (3.3.1), matplotlib (2.1.0), numpy (1.12.1), pandas (0.23.3), seaborn (0.8.1), torch (0.4.0), torchvision (0.2.1), tqdm (4.11.2)

  2. Import the dataset
     1. Load the dataset by 'data directory' and check that each file and label number is correct.
     2. Check that the number of dog labels is correct.

  3. Explore the data
     1. Check the distribution of human images. If the number of labels is above 150, the plot_human_data_distribution function shows a chart of the top 50 and bottom 50 labels. Since there are 5,749 human labels, the function shows the top 50 and bottom 50 (a hedged sketch of such a function follows).
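A hedged reconstruction of such a distribution check (the report's actual plot_human_data_distribution implementation is not shown, so this is only a sketch), counting images per label folder with pandas:

    import os
    import pandas as pd
    import matplotlib.pyplot as plt

    def plot_human_data_distribution(files, top_n=50):
        """Plot the top and bottom top_n label counts when there are many labels."""
        labels = pd.Series([os.path.basename(os.path.dirname(f)) for f in files])
        counts = labels.value_counts()
        if len(counts) > 150:                # many labels: show only the extremes
            counts = pd.concat([counts.head(top_n), counts.tail(top_n)])
        counts.plot(kind='bar', figsize=(16, 4))
        plt.ylabel('number of images')
        plt.show()

    plot_human_data_distribution(human_files)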

     2. Check the distribution of dog images.

  4. Detect humans
     1. Implement a face detector: the model detects a face in a human-face image, and the detected area is then displayed.

     2. Test the face detector on part of the dataset to get the percentage of human-face images inferred as containing a human face and the percentage of dog images inferred as not containing one, as sketched below.
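A sketch of this test, reusing face_detector and the sliced path arrays from the sketches above; the same harness applies to the dog detector.

    # Percentage of human images in which a face is detected (should be high)
    # and percentage of dog images in which a face is detected (should be low).
    human_hits = sum(face_detector(f) for f in human_files_short)
    dog_hits = sum(face_detector(f) for f in dog_files_short)

    print('Faces detected in human images: {:.0%}'.format(human_hits / len(human_files_short)))
    print('Faces detected in dog images:   {:.0%}'.format(dog_hits / len(dog_files_short)))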

  5. Dog detector
     1. Implement a pre-trained classifier model through the 'torchvision' library: VGG16 is used. The input image is transformed into a randomly cropped 224-by-224 tensor. If CUDA is available, the tensor and the model run on the GPU. The model returns class scores, and the index of the maximum score is the final output of the VGG16_predict function.

     2. Use the pre-trained classifier model as a dog detector: in the ImageNet label file (imagenet 1000 class idx to human readable labels) [9], the dog categories correspond to dictionary keys 151-268, inclusive, from 'Chihuahua' to 'Mexican hairless'. So if an input image is a dog, the predicted index will fall in that range.

     3. Test the dog detector: test the performance of the dog_detector function on the images in human_files_short and dog_files_short.

     4. Compare torchvision models: compare the performance of several 'torchvision' models to decide whether to use VGG16. For the comparison, define functions that create and test multiple models [10].

Run the test 5 times to compare the performance of each model: 'resnet18', 'alexnet', 'squeezenet1_0', 'vgg16', 'densenet161' [10].

Compare the results and select the model with the most true results and the fewest false results. Among the five models, VGG16 fits the dataset best. A sketch of the comparison loop follows.
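A hedged sketch of the comparison loop, reusing the sliced path arrays from above; the deterministic preprocessing and normalization statistics are assumptions, and the report's exact helper functions are not reproduced here.

    import torch
    import torchvision.models as models
    import torchvision.transforms as transforms
    from PIL import Image

    model_names = ['resnet18', 'alexnet', 'squeezenet1_0', 'vgg16', 'densenet161']

    # Deterministic 224x224 ImageNet-style preprocessing (normalization values assumed).
    preprocess = transforms.Compose([
        transforms.Resize(255),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])

    def count_dog_predictions(model, files):
        """Count the images classified into the ImageNet dog range 151-268."""
        hits = 0
        for path in files:
            img = preprocess(Image.open(path).convert('RGB')).unsqueeze(0)
            with torch.no_grad():
                hits += 151 <= int(model(img).argmax(dim=1)) <= 268
        return hits

    for name in model_names:
        model = getattr(models, name)(pretrained=True).eval()
        print('{}: {}/{} dog images detected, {}/{} human images misdetected'.format(
            name,
            count_dog_predictions(model, dog_files_short), len(dog_files_short),
            count_dog_predictions(model, human_files_short), len(human_files_short)))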

  6. Create a CNN to classify dog breeds (from scratch)
     1. Specify data loaders for the dog dataset: for the training set, an image is randomly rotated by up to 30 degrees, randomly resized and cropped to a 224-by-224 tensor, and randomly flipped horizontally. For the validation and test sets, an image is resized to 255 and center-cropped to a 224-by-224 tensor. Both input tensors are 224 by 224; this input size is chosen because it is a common choice for pre-trained CNN models. A sketch of the loaders follows.
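A sketch of these loaders under the stated transforms; the normalization statistics and the batch size are assumptions.

    import torch
    from torchvision import datasets, transforms

    normalize = transforms.Normalize([0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                                     [0.229, 0.224, 0.225])

    train_transform = transforms.Compose([
        transforms.RandomRotation(30),          # random rotation of up to 30 degrees
        transforms.RandomResizedCrop(224),      # random resize and crop to 224x224
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ])
    eval_transform = transforms.Compose([
        transforms.Resize(255),                 # resize, then center-crop to 224x224
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ])

    loaders = {
        split: torch.utils.data.DataLoader(
            datasets.ImageFolder('dog_images/' + split,
                                 transform=train_transform if split == 'train' else eval_transform),
            batch_size=20,                      # batch size is an assumption
            shuffle=(split == 'train'))
        for split in ['train', 'valid', 'test']
    }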

     2. Implement the model architecture: the first convolutional 2D layer takes 3 input channels, one per RGB channel. LeakyReLU with a negative slope of 0.2 is the activation function after each convolutional layer, and MaxPool2d(2, 2) halves each spatial dimension. After 4 sequential sets of Conv2d, LeakyReLU, and MaxPool2d, AvgPool2d(14) applies 2D average pooling over the resulting feature maps, and the 'view' method flattens the tensor. The classifier then takes 128 inputs, equal to the number of channels after AvgPool2d, and returns 133 outputs, equal to the number of dog labels. A sketch of this architecture follows.
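A sketch of this architecture; the intermediate channel counts and the 3x3 kernels with padding 1 are assumptions, while the 3-channel input, LeakyReLU(0.2), MaxPool2d(2, 2), AvgPool2d(14), the 128-channel output, and the 133-way classifier follow the description above.

    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            # 4 sets of Conv2d + LeakyReLU(0.2) + MaxPool2d(2, 2): 224 -> 112 -> 56 -> 28 -> 14
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.LeakyReLU(0.2), nn.MaxPool2d(2, 2),
                nn.Conv2d(16, 32, 3, padding=1), nn.LeakyReLU(0.2), nn.MaxPool2d(2, 2),
                nn.Conv2d(32, 64, 3, padding=1), nn.LeakyReLU(0.2), nn.MaxPool2d(2, 2),
                nn.Conv2d(64, 128, 3, padding=1), nn.LeakyReLU(0.2), nn.MaxPool2d(2, 2),
            )
            self.avgpool = nn.AvgPool2d(14)          # 128 x 14 x 14 -> 128 x 1 x 1
            self.classifier = nn.Linear(128, 133)    # 133 dog breeds

        def forward(self, x):
            x = self.features(x)
            x = self.avgpool(x)
            x = x.view(x.size(0), -1)                # flatten to (batch, 128)
            return self.classifier(x)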

     3. Implement a loss function and an optimizer.

     4. Implement a model training function with validation. Then train the model, saving and using the weights with the best validation accuracy.

     5. Test the model: keep training until the test accuracy is at least above 10%. If the accuracy is below 60%, the model will not be used. A sketch of the test step follows.
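A sketch of the test step, assuming the loaders dictionary above; the CrossEntropyLoss choice is an assumption, since the report does not name the loss function. The same check can be reused for the transfer-learning model below.

    import torch
    import torch.nn as nn

    def test(loader, model, criterion, use_cuda):
        """Print the average test loss and the test accuracy."""
        test_loss, correct, total = 0.0, 0, 0
        model.eval()
        with torch.no_grad():
            for data, target in loader:
                if use_cuda:
                    data, target = data.cuda(), target.cuda()
                output = model(data)
                test_loss += criterion(output, target).item() * data.size(0)
                correct += (output.argmax(dim=1) == target).sum().item()
                total += data.size(0)
        print('Test loss: {:.4f}  Test accuracy: {:.1f}% ({}/{})'.format(
            test_loss / total, 100.0 * correct / total, correct, total))

    # Example usage with the from-scratch model sketched above (weights assumed trained).
    criterion = nn.CrossEntropyLoss()        # assumed loss choice; the report does not name it
    test(loaders['test'], Net(), criterion, use_cuda=torch.cuda.is_available())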

  7. Create a CNN to classify dog breeds (using transfer learning)
     1. Specify data loaders for the dog dataset; the previous data loaders are deleted first to avoid memory issues.

     2. Model architecture: in the comparison of torchvision models, VGG16 fit the dataset best, so it is used as the base model. After turning off the gradient option of the loaded base model, the last linear layer of the classifier is redefined; its output size is the only change, from 1,000 to 133, the number of dog labels. This minimizes the deviation from the original model in order to preserve as much of its performance as possible. A minimal sketch follows.
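A minimal sketch of this modification:

    import torch.nn as nn
    import torchvision.models as models

    # Start from the pre-trained VGG-16 and turn off gradients for all original weights.
    model_transfer = models.vgg16(pretrained=True)
    for param in model_transfer.parameters():
        param.requires_grad = False

    # Redefine only the last linear layer of the classifier: its output size changes
    # from 1,000 ImageNet classes to 133 dog breeds; the new layer is trainable.
    model_transfer.classifier[6] = nn.Linear(model_transfer.classifier[6].in_features, 133)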

     3. Implement a loss function and an optimizer: the same loss function and optimizer as in the from-scratch model are used, again to minimize the deviation from the original.

     4. Implement a model training function with validation. Then train the model, saving and using the weights with the best validation accuracy.

     5. Test the model: keep training until the test accuracy is at least above 10%. If the accuracy is below 60%, the model will not be used.

  8. Implement a dog breed predictor with the best-accuracy model, as sketched below.
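A hedged sketch of the predictor, reusing eval_transform, the loaders dictionary, and model_transfer from the sketches above; deriving the breed names from the training folder names follows the '103.Mastiff' naming shown in the Data Exploration section.

    import torch
    from PIL import Image

    # Breed names in class-index order, taken from the training ImageFolder
    # (the '001.'-style numeric prefix of each folder name is stripped).
    class_names = [name[4:].replace('_', ' ') for name in loaders['train'].dataset.classes]

    def predict_breed_transfer(img_path, model=model_transfer):
        """Return the most likely breed name for the image."""
        model.eval()
        img = eval_transform(Image.open(img_path).convert('RGB')).unsqueeze(0)
        with torch.no_grad():
            idx = int(model(img).argmax(dim=1))
        return class_names[idx]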

  9. Define an app function: it displays the input file name and image, then displays the prediction result. There are 3 cases: the human-face case, the dog case, and the neither case. A sketch follows.
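A sketch of the app function, built from the detector and predictor sketches above; run_app is a reconstruction, not the report's exact code.

    import matplotlib.pyplot as plt
    from PIL import Image

    def run_app(img_path):
        """Display the input file name and image, then print one of the three cases."""
        print(img_path)
        plt.imshow(Image.open(img_path))
        plt.axis('off')
        plt.show()

        if face_detector(img_path):
            print('Hello, human! You resemble a ... {}'.format(predict_breed_transfer(img_path)))
        elif dog_detector(img_path):
            print('This dog looks like a ... {}'.format(predict_breed_transfer(img_path)))
        else:
            print('Neither a dog nor a human face was detected.')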

Refinement

Minimizing the deformation from the original model yields the best performance; here, the minimization means changing only the output size of the classifier's last linear layer. Better performance can be expected with more training, but the performance gap between the minimal modification and the alternatives is far too large. One alternative, replacing all 3 linear layers of the classifier, required a total training time of over 20 minutes, about 5 times longer than the minimal case, yet reached only 1% accuracy compared with 74% for the minimal case.

IV. Results

Model Evaluation and Validation

  1. The human-face-image identifier must correctly detect a face in at least 70% of human-face images and, conversely, report no face in at least 70% of dog images.

  2. The dog-image identifier must correctly detect a dog in at least 70% of dog images and, conversely, report no dog in at least 70% of human-face images.

  3. The dog-breed classifier must reach 60% or higher accuracy.

  4. Human-face-image Resemble-dog classifier: 6 true results from 6 tests.

Justification

All test results meet each criterion. However, better prediction results could still be obtained through the options below.

  1. Feed more new training images.
  2. Adjust the learning rate and train for more epochs.
  3. Vary the transformations of the training images.
  4. Try various pre-trained models and tune the hyperparameters appropriately.

V. References

References

[1] Dog Breed Identification, Kaggle - Kaggle: https://www.kaggle.com/c/dog-breed-identification

[2] Dog Project, Udacity - Github: https://github.com/udacity/dog-project/blob/master/dog_app.ipynb

[3] Dog image dataset, Dog App - Udacity: https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip

[4] Human image dataset, Dog App - Udacity: https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/lfw.zip

[5] VGG16 – Convolutional Network for Classification and Detection - Neurohive: https://neurohive.io/en/popular-networks/vgg16

[6] Cascade Classifier, OpenCV - OpenCV: https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html

[7] torchvision.models - PyTorch: https://pytorch.org/vision/stable/models.html

[8] haarcascades, OpenCV - Github: https://github.com/opencv/opencv/tree/master/data/haarcascades

[9] imagenet 1000 class idx to human readable labels, yrevar, Gist - Github: https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a

[10] torchvision.models, Torchvision - PyTorch: https://pytorch.org/vision/stable/models.html