1 of 62

MRI-Based Alzheimer’s Disease Classification Using Deep Learning: A Novel Small-Data Approach

Raja Haseeb

(raja@rit.kaist.ac.kr)

Advisor: Prof. Jong-Hwan Kim

May 26, 2021

M.S. Dissertation

School of Electrical Engineering, KAIST

RIT

Robot Intelligence Technology Laboratory: Challenge for Knowledge Creation and Innovative Technology

2 of 62

Contents

  1. Introduction

1.1 Research Background

1.2 Research Motivation

1.3 Research Outline

  2. Proposed Framework

2.1 Overall Approach

2.2 Data Augmentation

2.3 Attention Mechanism

2.4 Contrastive Learning

2.5 Classification Network

  3. Experiments and Results

3.1 Dataset

3.2 Data Augmentation

3.3 Comparison of Various Architectures

3.4 Proposed Architecture

3.5 Results

3.6 Comparison with Existing Methods

  4. Conclusion and Future Work

3 of 62

1. Introduction


4 of 62

1.1 Research Background

    • What is Dementia
      • Dementia is a word used to describe a group of symptoms that occur when brain cells stop working properly.
      • There are over 100 diseases that may cause dementia
      • Types
        • Alzheimer’s disease
        • Vascular dementia
        • Dementia with Lewy bodies
        • Frontotemporal dementia
        • Alcohol related dementia

      • Although often thought of as a disease of older people, around 4% of people with Alzheimer’s are under 65. This is called early-onset or young-onset Alzheimer’s. It usually affects people in their 40s, 50s and early 60s.

    • What is Alzheimer’s Disease?
      • Alzheimer’s is a progressive neurodegenerative disease
      • Most common cause of Dementia (70% of the cases)
      • 5.7 million Americans had AD in 2018; the number is projected to rise to 14 million by 2050
      • Causes cognitive impairment and problems with memory, thinking and behavior

  1. Alzheimer's Association. "2018 Alzheimer's disease facts and figures." Alzheimer's & Dementia 14.3 (2018): 367-429.


5 of 62

1.1 Research Background

    • Classification
      • Cognitive Normal (CN)
      • Alzheimer’s Disease (AD)
      • Mild Cognitive Impairment (MCI)
        • MCI describes people with mild symptoms of brain impairment. MCI patients are still able to perform daily activities to some extent; however, this ability declines as the disease progresses, and patients in this phase have a high chance of progressing to dementia

      • Stable MCI (sMCI), stable Normal Controls (sNC), progressive Normal Controls (pNC), progressive MCI (pMCI), stable AD (sAD)

    • Traditionally, physicians diagnosed patients themselves using clinical methods
      • Cerebrospinal fluid (CSF) concentration in the brain is reported to indicate the presence of AD.
      • A ventricular puncture is used for the collection of CSF
      • This process can be arduous and can cause bleeding in the brain


6 of 62

1.1 Research Background

    • Medical Imaging Techniques
      • A lot of focus has been put on the development of medical imaging techniques in recent years.
        • MRI, PET, CT are used to diagnose functional and structural changes in the brain.
      • Observe changes in the brain structure (Changes in WM, GM, CSF, ventricles etc. caused by Alzheimer’s disease)
      • This process can be costly, laborious, time-consuming and prone to human errors

    • The healthcare sector is no longer small
      • A large number of patients and records
      • Therefore, there is a need for an automated diagnosis method that takes less time and effort, is reliable, less costly, and helps practitioners.

    • Machine learning to the rescue
      • Use of traditional ML algorithms for medical diagnosis
        • SVM, Random forests, kNN and so on.
      • Deep learning, the new paradigm
      • Huge success in the medical domain


7 of 62

1.1 Research Background

    • Various methods in recent years for AD classification and prediction using machine learning
      • Single modal and multi-modal approaches
      • Multimodal and Multiscale Deep Neural Networks (Lu et al., 2018), GRU-based (Lee et al., 2019), Non-linear SVM (Rallabandi et al., 2020), Multi-modal deep learning (Goto et al., 2020), Transfer learning (Khan et al., 2019), LSTM-based (Hong et al., 2019)

    • Good accuracy results (range 80%~90%)


8 of 62

1.1 Research Background

  • Multimodal and Multiscale Deep Neural Networks for the Early Diagnosis of Alzheimer’s disease using structural MRI and FDG-PET images
    • ADNI (Alzheimer's Disease Neuroimaging Initiative) dataset (2402 T1 MRI + 2402 FDG-PET images)
    • Segment gray matter, then divide into patches and extract features
    • Six independent DNNs, corresponding to each scale of single modality
    • Features from these 6 fused together by another DNN to predict final score
    • 3 scales for each MRI and FDG-PET image (based on different patch sizes)
    • Accuracy up to 82%

Multimodal and Multiscale Deep Neural Networks (Lu et al., 2018)

  1. Lu, Donghuan, et al. "Multimodal and multiscale deep neural networks for the early diagnosis of Alzheimer's disease using structural MR and FDG-PET images." Scientific Reports 8.1 (2018): 1-13.

9 of 62

1.1 Research Background

  • Predicting Alzheimer’s disease progression using multi-modal deep learning approach
    • 1,618 ADNI participants aged 55 to 91 were used, including 415 cognitively normal older adult controls (CN), 865 MCI patients, and 338 AD patients
    • Separately build GRU feature extractors for each modality
    • Each GRU component takes both time series and non-time series data
    • Integrate the four extracted features at the end for final prediction
    • Accuracy up to 81%

Multimodal Deep Learning (Lee et al., 2019)

  1. Lee, Garam, et al. "Predicting Alzheimer’s disease progression using multi-modal deep learning approach." Scientific reports 9.1 (2019): 1-12.


10 of 62

1.1 Research Background

  • Convolution neural network–based Alzheimer's disease classification using hybrid enhanced independent component analysis based segmented gray matter of T2 weighted magnetic resonance imaging with clinical evaluation
    • 1820 T2-weighted brain MRI (635 AD MRIs, 548 MCI MRIs, 637 CN MRIs)
    • Extract gray matter from brain voxels and then perform classification using CNN
    • Accuracy up to 90.47%

Architecture design (Basheera et al., 2019)

  1. Basheera, Shaik, and M. Satya Sai Ram. "Convolution neural network–based Alzheimer's disease classification using hybrid enhanced independent component analysis based segmented gray matter of T2 weighted magnetic resonance imaging with clinical evaluation." Alzheimer's & Dementia: Translational Research & Clinical Interventions 5 (2019): 974-986.

11 of 62

1.1 Research Background

  • Automatic classification of cognitively normal, mild cognitive impairment and Alzheimer’s disease using structural MRI analysis
    • 1167 whole-brain T1 MRI from ADNI
    • Used libraries/tools to extract brain tissues and segment them into gray matter, white matter and cerebrospinal fluid
    • Compute the regional cortical thickness (CT) of several anatomical regions
    • AD progression affects regional cortical thickness
    • Cortical thickness is the distance between the white-gray interface and the gray-CSF interface

    • 68 CT features extracted
    • Training using several ML algorithms with Auto-WEKA 2.6 tool
    • Non-linear SVM found to be best classifier
    • Accuracy up to 75%

  1. Rallabandi, VP Subramanyam, et al. "Automatic classification of cognitively normal, mild cognitive impairment and Alzheimer's disease using structural MRI analysis." Informatics in Medicine Unlocked 18 (2020): 100305.

12 of 62

1.1 Research Background

  • Automatic classification of cognitively normal, mild cognitive impairment and Alzheimer’s disease using structural MRI analysis

Schematic diagram of proposed approach (Rallabandi et al., 2020)

  1. Rallabandi, VP Subramanyam, et al. "Automatic classification of cognitively normal, mild cognitive impairment and Alzheimer's disease using structural MRI analysis." Informatics in Medicine Unlocked 18 (2020): 100305.

13 of 62

1.1 Research Background

  • Ensembles of Patch-Based Classifiers for Diagnosis of Alzheimer Diseases
    • Structural Magnetic Resonance Imaging (sMRI) as the modality (352 MRI scans belonging to AD, NC and MCI)
    • Hippocampus region focused as the input feature for the CNN
    • Localize hippocampus manually, then generate 32 x 32 patches from the local region
    • 32 x 32 patches from each of the sagittal, axial and coronal view and merged them as a single sample
    • These three view patches (TVPs) are fed into the network
    • Three individual models are trained and their results combined, i.e., a CNN for the left hippocampus, a CNN for the right hippocampus, and a CNN for combined left-and-right hippocampus classification
    • The ensemble of the three models achieves accuracy of 85.55% on ADNI dataset

  1. Ahmed, Samsuddin, et al. "Ensembles of patch-based classifiers for diagnosis of Alzheimer diseases." IEEE Access 7 (2019): 73373-73383.


14 of 62

1.1 Research Background

  • Ensembles of Patch-Based Classifiers for Diagnosis of Alzheimer Diseases

Schematic diagram of proposed approach (Ahmed et al., 2019)

  1. Ahmed, Samsuddin, et al. "Ensembles of patch-based classifiers for diagnosis of Alzheimer diseases." IEEE Access 7 (2019): 73373-73383.


15 of 62

1.1 Research Background

  • Multi-modal deep learning for predicting progression of Alzheimer's disease using bi-linear shake fusion
  • Predicting Alzheimer’s Disease Using LSTM
  • Transfer Learning With Intelligent Training Data Selection for Prediction of Alzheimer’s Disease

  1. Goto, Tsubasa, et al. "Multi-modal deep learning for predicting progression of Alzheimer's disease using bi-linear shake fusion."
  2. Hong, Xin, et al. "Predicting Alzheimer's disease using LSTM." IEEE Access 7 (2019): 80893-80901.
  3. Khan, Naimul Mefraz, Nabila Abraham, and Marcia Hon. "Transfer learning with intelligent training data selection for prediction of Alzheimer's disease." IEEE Access 7 (2019): 72726-72735.

16 of 62

1.2 Research Motivation

    • Timely diagnosis and treatment of AD patients

    • Limitations of past work
      • No focus on small-data regime
        • All the previous work on AD diagnosis utilized a large amount of labeled data
        • There are many cases where there is not enough labeled data
        • One main reason is that patient data is well protected by patient data laws
        • It is also expensive and time-consuming to obtain more medical data, especially annotated data, since a medical expert is needed for annotation
        • For rare or newly emerging diseases (such as Covid-19 at its outbreak), enough data simply does not exist
        • Imaging and data retrieval standards also differ from country to country, and even from one hospital to another, which makes the process even more complicated

  1. Wen, Junhao, et al. "Convolutional neural networks for classification of Alzheimer's disease: Overview and reproducible evaluation." Medical image analysis 63 (2020): 101694.


17 of 62

1.2 Research Motivation

      • Transfer learning
        • Another limitation is the use of transfer learning, e.g., pre-training on ImageNet data
        • Medical data differs from natural images, so pre-training on real-world images like ImageNet is of limited use for medical diagnosis

      • Different data and metrics
        • Also, it is not possible to make a direct comparison between various past works
        • These approaches differ in the participants involved, image processing procedures, cross-validation methods, and the evaluation metrics

      • Different data selection methods
        • There have been limitations in data selection methods as well
        • Some work utilized all the slices from a patient's data; not all slices are informative for AD classification, and bad slices can degrade performance
        • Entropy-based slice selection also does not show good results

  1. Wen, Junhao, et al. "Convolutional neural networks for classification of Alzheimer's disease: Overview and reproducible evaluation." Medical image analysis 63 (2020): 101694.


18 of 62

1.2 Research Motivation

      • Biased evaluation (Data Leakage)
        • Data Leakage, which refers to the presence of test data in any part of the training process, is a major source of bias during the evaluation.
        • Since DL approaches are flexible and complex, data leakage can be very hard to detect

        • Wrong data split
            • Training, validation, and test set should be separated at the subject-level and not the data-level.
            • If not, then data from the same patient can appear in several sets, resulting in biased evaluation of the model.

        • Late split
            • Procedures like feature selection, data augmentation, or pre-training must never use test data.
            • These steps should be performed on the training set after separating the data into training, validation, and test set.

  1. Wen, Junhao, et al. "Convolutional neural networks for classification of Alzheimer's disease: Overview and reproducible evaluation." Medical image analysis 63 (2020): 101694.


19 of 62

1.2 Research Motivation

        • Absence of independent test set
            • To correctly evaluate the performance of the classifier, the test set should be kept separate and used only in the final stage to assess the classifier.
            • Most authors reported only high training accuracy, cross-validation results, or 80/20 train-test split results where the split is made within the same set, which is not a good way to evaluate model performance.
            • To properly evaluate a model, a separate and unseen test set is needed

  1. Wen, Junhao, et al. "Convolutional neural networks for classification of Alzheimer's disease: Overview and reproducible evaluation." Medical image analysis 63 (2020): 101694.


20 of 62

1.3 Research Outline

    • ADNI dataset
      • The Alzheimer's Disease Neuroimaging Initiative (ADNI) began in 2004 under the leadership of Dr. Michael W. Weiner
      • Foundation for the National Institutes of Health and National Institute on Aging
      • The Alzheimer’s Disease Neuroimaging Initiative (ADNI) unites researchers with study data as they work to define the progression of Alzheimer’s disease
      • ADNI researchers collect, validate and utilize data, including MRI and PET images, genetics, cognitive tests, CSF and blood biomarkers as predictors of the disease
      • Study resources and data from the North American ADNI study, including Alzheimer's disease patients, mild cognitive impairment subjects, and elderly controls, are available through the ADNI website


21 of 62

1.3 Research Outline

    • Research Objective
      • Classification of Alzheimer’s Disease (AD) and Cognitive Normal (CN) subjects
      • AD vs. CN vs. MCI (Mild Cognitive Impairment)

    • Dealing with medical data scarcity
      • Various approaches
        • N-shot learning methods, Matching networks, Siamese networks, GAN-based methods, Meta-learning, Surrogate data methods, self-supervised methods and so on.
      • GAN-based medical image synthesis

    • Architecture selection
      • Testing various CNN architectures
        • Custom CNN, Inception network, ResNet, DenseNet, Multi-scale CNN, Residual Attention Network, Transfer Learning


22 of 62

1.3 Research Outline

    • Proposed architecture
      • ResNet-18 [1] architecture with CBAM [2] (Convolutional Block Attention Module)
      • Pretraining with the SimCLR [3] framework

  1. Wang, Fei, et al. "Residual attention network for image classification." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
  2. Woo, Sanghyun, et al. "Cbam: Convolutional block attention module." Proceedings of the European conference on computer vision (ECCV). 2018.
  3. Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.

[Figure: proposed pipeline. Input Data → Data Augmentation (Traditional + PGGAN-Based) → Contrastive Learning (pretraining with SimCLR to learn useful representations) → Classification Network (ResNet-18 + CBAM) → Results (AD / CN / MCI)]

23 of 62

1.3 Research Outline

    • Experiments and Results
      • Two category classification (AD vs. CN)
      • Three category classification (AD vs. CN vs. MCI)


24 of 62

2. Proposed Framework


25 of 62

2.1 Overall Approach

[Figure: overall approach. Pre-training phase: contrastive learning on training data plus generated data, using a ResNet-18 + CBAM encoder. Classification phase: the pretrained ResNet-18 + CBAM network (input → stacked 3x3 conv blocks → AvgPool → FC → Softmax) classifies AD / CN / MCI]

26 of 62

2.2 Data Augmentation

    • Dealing with data scarcity

    • Adopted two data augmentation methods for a small dataset with only a few MRI scans
      • Traditional data augmentation
      • GAN-based augmentation


27 of 62

2.2 Data Augmentation

  • Traditional data augmentation
      • Remove class imbalance
      • Increase data set for GAN training
      • Rotation, shear, zoom, shift (translation)

  • GAN-based image generation
    • Progressive Growing of GAN with Wasserstein Gradient Penalty [1] (PGGAN-WGP)
    • Generate same number of images as the original training data
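The traditional augmentations listed above can be sketched as random parameter sampling per slice. The ranges below are illustrative assumptions, not the values used in this work:

```python
import random

def sample_augmentation(seed=None):
    """Sample one set of traditional augmentation parameters
    (rotation, shear, zoom, shift) for a single MRI slice.
    All ranges are illustrative assumptions."""
    rng = random.Random(seed)
    return {
        "rotation_deg": rng.uniform(-10, 10),  # small rotations keep anatomy plausible
        "shear_deg": rng.uniform(-5, 5),
        "zoom": rng.uniform(0.9, 1.1),         # mild scaling in/out
        "shift_frac": (rng.uniform(-0.05, 0.05),   # (dx, dy) translation as a
                       rng.uniform(-0.05, 0.05)),  # fraction of the image size
    }
```

In practice these parameters would be handed to an image library (e.g., torchvision's RandomAffine) to warp the slice.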

  1. Karras, Tero, et al. "Progressive Growing of GANs for Improved Quality, Stability, and Variation." International Conference on Learning Representations. 2018.


28 of 62

2.2 Data Augmentation

    • What is GAN
      • Generator and a Discriminator
      • Discriminator acts as a critic

      • The formula derives from the cross-entropy between the real and generated distributions
      • Discriminator tries to maximize this loss function while Generator tries to minimize it
      • The goal of the generator is to fool the discriminator network by generating as real images as possible
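The objective referenced above is the standard GAN minimax game from the cited Goodfellow et al. (2014) paper:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator D maximizes V (correctly separating real from generated samples) while the generator G minimizes it.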

[Figure: GAN training loop. The Generator produces generated data; the Discriminator receives both generated data and training data and classifies each sample as real or fake]

  1. Goodfellow, Ian J., et al. "Generative adversarial networks." arXiv preprint arXiv:1406.2661 (2014).


29 of 62

2.2 Data Augmentation

    • Classical GAN can only produce low resolution images
      • Higher resolution makes it easier to tell the generated images apart from training images
      • Large resolutions also necessitate using smaller mini-batches due to memory constraints, further compromising training stability
      • Going straight from the latent z variable to a 1024² image contains an enormous amount of variance in the space.

    • PGGAN, released by Nvidia
      • Outputs good-quality high-resolution images of up to 1024x1024

    • Progressively increase network size
      • Grow both generator and discriminator progressively

  1. Karras, Tero, et al. "Progressive Growing of GANs for Improved Quality, Stability, and Variation." International Conference on Learning Representations. 2018.


30 of 62

2.2 Data Augmentation

    • Training starts from low-resolution and proceeds to higher resolutions
      • Starting from easier low-resolution images (4x4, 8x8), and add new layers that introduce higher-resolution details as the training progresses (up to 1024x1024). This greatly speeds up training and improves stability in high resolutions
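The doubling schedule described above can be sketched as a simple helper (illustrative only):

```python
def progressive_schedule(start=4, final=1024):
    """Resolutions visited during progressive growing: training begins
    at start x start and doubles until final x final is reached."""
    sizes = []
    res = start
    while res <= final:
        sizes.append(res)
        res *= 2
    return sizes  # e.g. [4, 8, 16, ..., 1024]
```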

    • Wasserstein Gradient Penalty
      • WGAN-GP enhances training stability.
      • Produces better results
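The WGAN-GP critic objective from the cited Gulrajani et al. (2017) paper is:

```latex
L = \mathbb{E}_{\tilde{x} \sim p_g}\big[D(\tilde{x})\big]
  - \mathbb{E}_{x \sim p_r}\big[D(x)\big]
  + \lambda \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\big]
```

The gradient penalty term (weight λ) replaces weight clipping and is what improves training stability.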

PGGAN architecture

  1. Karras, Tero, et al. "Progressive Growing of GANs for Improved Quality, Stability, and Variation." International Conference on Learning Representations. 2018.
  2. Gulrajani, Ishaan, et al. "Improved training of Wasserstein GANs." arXiv preprint arXiv:1704.00028 (2017).


31 of 62

2.2 Data Augmentation

Generator and Discriminator architecture

  1. Karras, Tero, et al. "Progressive Growing of GANs for Improved Quality, Stability, and Variation." International Conference on Learning Representations. 2018.


32 of 62

2.2 Data Augmentation

  1. Karras, Tero, et al. "Progressive Growing of GANs for Improved Quality, Stability, and Variation." International Conference on Learning Representations. 2018.


33 of 62

2.3 Attention Mechanism

    • What is attention?
      • Initially in NLP
      • Tells the model which part of the input sentence to focus on
      • Attention modules are used in computer vision to make the model learn and focus more on the important information, rather than learning background information.
      • A typical attention module generates a mask of the input feature map using a simple 2D-convolutional layer, multi-layer perceptron (MLP), and a sigmoid function at the end.

    • CBAM
      • Provided an input feature map, it computes the attention maps along two dimensions i.e. channel and spatial.
      • These inferred attention maps are then multiplied with the input feature map to further refine the features.
      • The intuition behind this idea is that blindly attaching an attention module would produce a full 3D attention map, which can be computationally expensive
      • Results indicate that this factorized method achieves a similar effect with far fewer parameters


  1. Woo, Sanghyun, et al. "Cbam: Convolutional block attention module." Proceedings of the European conference on computer vision (ECCV). 2018.


34 of 62

2.3 Attention Mechanism


  1. Woo, Sanghyun, et al. "Cbam: Convolutional block attention module." Proceedings of the European conference on computer vision (ECCV). 2018.

[Figure: Convolutional Block Attention Module. Input Feature → Channel Attention Module → Spatial Attention Module → Refined Feature]

35 of 62

2.3 Attention Mechanism

    • Channel attention module
      • Two pooling methods i.e. average and max pooling, are used at the same time to compute channel-wise attention for the given input feature map.
      • This results in two Cx1x1 vectors, one produced by max-pooling and the other by average pooling.
      • These are then passed through a simple bottleneck dense layer and then combined with a summation.
      • The sigmoid function is applied at the end to obtain a Cx1x1 vector which shows the importance of each channel in the original feature map.
      • This channel attention vector is applied to the input feature map in a pointwise manner, creating a new feature map F' with the same shape as the original input feature map F
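In equation form, the channel attention described above (Woo et al., 2018) is:

```latex
M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big),
\qquad F' = M_c(F) \otimes F
```

where σ is the sigmoid function and ⊗ denotes pointwise (broadcast) multiplication.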


  1. Woo, Sanghyun, et al. "Cbam: Convolutional block attention module." Proceedings of the European conference on computer vision (ECCV). 2018.

[Figure: channel attention module. AvgPool and MaxPool descriptors of the input feature F are passed through a shared MLP and summed to produce the channel attention map Mc]

36 of 62

2.3 Attention Mechanism

    • Spatial attention module
      • After channel dimension, the next step is to process features in width and height dimensions
      • From the channel attention module, we obtain a CxHxW map. In the spatial attention module, average and max pooling are applied along the channel axis, which results in two 1xHxW features.
      • A 7x7 convolution is applied after concatenating the two features
      • In the end, the sigmoid function is applied to get a 1xHxW shaped feature, which is called a spatial attention map.
      • This spatial attention map is applied to F' in a pointwise manner, resulting in a CxHxW feature map, which is the final output of CBAM.
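The corresponding equation for the spatial attention described above (Woo et al., 2018) is:

```latex
M_s(F') = \sigma\big(f^{7 \times 7}\big([\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')]\big)\big),
\qquad F'' = M_s(F') \otimes F'
```

where f^{7×7} is the 7x7 convolution and [ ; ] denotes channel-wise concatenation.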


  1. Woo, Sanghyun, et al. "Cbam: Convolutional block attention module." Proceedings of the European conference on computer vision (ECCV). 2018.

[Figure: spatial attention module. The channel-refined feature F' is pooled along the channel axis ([MaxPool, AvgPool]), passed through a conv layer, and squashed by a sigmoid to produce the spatial attention map Ms]

37 of 62

2.4 Contrastive Learning

    • A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)
    • The basic intuition behind contrastive learning is to teach a machine how to distinguish between similar and dissimilar things. (maximize similarity)
    • Self-supervised learning
      • Self-supervised learning empowers us to exploit a variety of labels that come with the data for free. 
      • No human supervision. Data itself provides supervision


  1. Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.
  2. https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html


38 of 62

2.4 Contrastive Learning

    • The idea behind the SimCLR framework is quite simple. Given an input image, random transformations are applied to obtain two augmented versions of the image, xi and xj
    • Representations are then obtained for these augmented images by passing them through an encoder network. These encoded vectors are denoted hi and hj
    • A non-linear fully connected projection head is then applied to obtain the representations z
    • The objective is to maximize the similarity between the two representations zi and zj


  1. Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.
  2. https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html

An illustration of SimCLR by Google AI Blog: [Figure: two augmented views xi and xj of an input image are encoded into representations hi and hj, then projected to zi and zj]

39 of 62

2.4 Contrastive Learning

    • We use ResNet-18 with CBAM as the encoder network; the original paper used ResNet-50
    • Cosine similarity, sim(u, v) = u·v / (||u|| ||v||), is calculated between the representations zi and zj

    • The similarity of augmented patches belonging to the same image/class will be higher than the similarity between images from different classes.
    • The augmented pairs in the batch are taken one by one, and the probability of the two images being similar is calculated by applying the softmax function.
    • The loss is calculated by taking the negative log of this probability

    • Loss is computed for the same pair a second time, by interchanging the positions of the images in the pair. Then we take the average.
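The steps above amount to the NT-Xent loss. A minimal pure-Python sketch (an illustrative reimplementation, not the thesis code; `tau` is the temperature):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nt_xent(z, tau=0.5):
    """z: list of 2N projections, where z[2k] and z[2k+1] are a positive pair.
    Returns the NT-Xent loss averaged over all 2N anchors (both orderings
    of each pair, matching the description above)."""
    n2 = len(z)
    sim = [[cosine(z[i], z[j]) for j in range(n2)] for i in range(n2)]
    total = 0.0
    for i in range(n2):
        j = i + 1 if i % 2 == 0 else i - 1  # index of i's positive partner
        denom = sum(math.exp(sim[i][k] / tau) for k in range(n2) if k != i)
        total += -math.log(math.exp(sim[i][j] / tau) / denom)
    return total / n2
```

The loss drops as positive pairs become more similar than negatives, which is exactly the behavior the pretraining exploits.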


  1. Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.
  2. https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html


40 of 62

2.4 Contrastive Learning

    • Motivated by this, we leverage the SimCLR framework to pretrain our model for the AD classification task


  1. Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.
  2. https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html


41 of 62

2.5 Classification Network

    • ResNet-18 with the CBAM module

  1. Wang, Fei, et al. "Residual attention network for image classification." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
  2. Woo, Sanghyun, et al. "Cbam: Convolutional block attention module." Proceedings of the European conference on computer vision (ECCV). 2018.

[Figure: ResNet-18 architecture (input → stacked 3x3 conv blocks → AvgPool → FC → Softmax) with CBAM inserted in each residual block: the feature F from the previous conv blocks passes through channel attention Mc and spatial attention Ms to give the refined feature F'' fed to the next conv blocks (ResBlock + CBAM)]

42 of 62

2.5 Classification Network

    • Cross-Entropy Loss is used:

      L = -Σ_{i=1..n} t_i log(p_i)

      • Where n is the number of classes, t_i is the truth label and p_i is the softmax probability for the i-th class.

    • For binary classification, this reduces to the binary cross-entropy:

      L = -(t log(p) + (1 - t) log(1 - p))
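As a quick numeric check, both losses in pure Python (illustrative, not the training code):

```python
import math

def cross_entropy(t, p):
    """Multi-class CE: t is a one-hot truth vector, p the softmax probabilities."""
    return -sum(ti * math.log(pi) for ti, pi in zip(t, p) if ti > 0)

def binary_cross_entropy(t, p):
    """Binary CE for a scalar label t in {0, 1} and predicted probability p."""
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))
```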


43 of 62

3. Experiments and Results


44 of 62

3.1 Dataset

  • The Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset
    • T2-weighted MRI scans (GE Medical Systems)
    • DICOM/NIFTI to PNG (256x256)
    • 246 subjects (82 AD, 82 MCI, 82 CN) for training
    • All images were manually reviewed to select slices with good brain regions
    • Validation data consisted of 115 slices (51 AD, 64 CN)
    • Test data consisted of 10 subjects for each category

Demographic representation of training data


45 of 62

3.2 Data Augmentation

  • Traditional data augmentation
    • Used rotation, shear, zoom, shift augmentation
    • To remove class imbalance and increase dataset size for GAN training


46 of 62

3.2 Data Augmentation

  • GAN-based image generation
    • Comparison of RaLSGAN and PGGAN

    • The Fréchet Inception Distance (FID), which captures the similarity of generated images to real ones, is used for evaluation

    • PGGAN-WGP achieved an FID score of 35-40
      • Better than RaLSGAN
      • More realistic images

  1. Jolicoeur-Martineau, Alexia. "The relativistic discriminator: a key element missing from standard GAN." arXiv preprint arXiv:1807.00734 (2018).
  2. Karras, Tero, et al. "Progressive Growing of GANs for Improved Quality, Stability, and Variation." International Conference on Learning Representations. 2018.


47 of 62

3.2 Data Augmentation

  • Final data set

Training data set after augmentation


48 of 62

3.3 Comparison of various architectures

    • Comparison on AD vs. CN classification task
      • After PGGAN generation, a detailed analysis was made on the AD vs. CN classification accuracy using different architectures.
      • These architectures include a custom CNN architecture, ResNet [1], the Inception model [2], the residual attention network [3], the CBAM architecture [4], and a multi-scale CNN. We also evaluate models pretrained on ImageNet data.
      • Testing results are reported based on the majority voting decision for the patients.
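The patient-level majority voting mentioned above can be sketched as follows (a hypothetical helper, not the thesis code):

```python
from collections import Counter

def patient_decision(slice_predictions):
    """Aggregate per-slice class predictions (e.g. 'AD', 'CN', 'MCI')
    into a single patient-level decision by majority vote."""
    votes = Counter(slice_predictions)
    label, _count = votes.most_common(1)[0]
    return label
```

For example, `patient_decision(["AD", "AD", "CN"])` yields `"AD"`.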

  1. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
  2. Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
  3. Wang, Fei, et al. "Residual attention network for image classification." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

  4. Woo, Sanghyun, et al. "Cbam: Convolutional block attention module." Proceedings of the European conference on computer vision (ECCV). 2018.

[Figure: custom CNN architecture. Input → 5x5 conv, 32 → 5x5 conv, 64 → 3x3 conv, 128 → 3x3 conv, 256 → 3x3 conv, 512 → AvgPool → FC1-FC5 → Softmax]

49 of 62

3.3 Comparison of various architectures

    • Results

    • Observations
      • All these architectures showed high accuracy during training; however, the validation accuracy was much lower.
      • The best cases achieved 2-3 misclassifications per category.
      • One reason may be that these large ImageNet models are over-parametrized for very small data sets.
      • Transfer learning from real-world data was also not very effective

Comparison of various architectures for AD vs. CN classification task

  1. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
  2. Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
  3. Wang, Fei, et al. "Residual attention network for image classification." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

  4. Woo, Sanghyun, et al. "Cbam: Convolutional block attention module." Proceedings of the European conference on computer vision (ECCV). 2018.

49 / 59

50 of 62

3.4 Proposed Architecture

  • Training Details
    • System and environment
      • NVIDIA RTX 2080 Ti GPU.
      • PyTorch

    • Model
      • ResNet-18 with CBAM

    • SimCLR Pretraining
      • 1000 epochs
      • Batch size of 128
      • SGD optimizer
      • Learning rate of 0.05
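The SimCLR pretraining stage optimizes the NT-Xent contrastive loss over pairs of augmented views; a minimal NumPy sketch of that loss (batch size, embedding dimension, and temperature below are illustrative, not the training settings above):

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss from SimCLR.

    z1, z2: (N, D) embeddings of the two augmented views of the same N images."""
    z = np.concatenate([z1, z2], axis=0)              # (2N, D) stacked views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T / temperature                       # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = z1.shape[0]
    # the positive for sample i is its other augmented view (offset by N)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))       # denominator over all others
    return (-(sim[np.arange(2 * n), pos] - logsumexp)).mean()

# toy check: perfectly aligned view pairs give a lower loss than mismatched ones
z = np.eye(2)
print(nt_xent_loss(z, z) < nt_xent_loss(z, z[::-1].copy()))  # True
```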

50 / 59

51 of 62

3.4 Proposed Architecture

  • Training Details
    • Classification task
      • 100 epochs
      • SGD optimizer
      • Learning rate 0.001
      • Batch size 128
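A minimal sketch of this fine-tuning stage with the stated optimizer and learning rate; the stand-in model, dummy batch, momentum setting, and shortened loop are placeholders, not the actual ResNet-18 + CBAM pipeline.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(64, 2))  # stand-in for ResNet-18+CBAM
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # stated lr; no momentum assumed
criterion = nn.CrossEntropyLoss()                      # applies softmax internally

x = torch.randn(128, 64)                # one dummy batch (batch size 128 as stated)
y = torch.randint(0, 2, (128,))
for epoch in range(2):                  # 100 epochs in the actual setup
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```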

51 / 59

52 of 62

3.5 Results

  • Two category classification
    • AD vs. CN classification
      • We trained a plain ResNet-18 as a baseline, and compared results with and without the CBAM and SimCLR components to isolate their effect. We also compared against a custom-designed CNN.
      • When trained from scratch, the model overfits and training is unstable.
      • With SimCLR pretraining, classification training is more stable and accuracy rises to 75%.
      • Combining with CBAM refines performance further, since the attention modules learn to focus on the most informative features.
      • Only 1 misclassification per category.

AD vs. CN classification results of proposed framework

52 / 59

53 of 62

3.5 Results

  • Three category classification
    • AD vs. CN vs. MCI classification
      • Accuracy up to 65%
      • The MCI class is very hard to distinguish from the AD and CN classes.

AD vs. CN vs. MCI classification results of proposed framework

53 / 59

54 of 62

3.6 Comparison with existing methods

Study                   | Total Subjects     | Performance | Approach         | Data Leakage
Aderghal et al., 2017   | 815 (T1 MRI)       | ACC=0.84    | ROI-based        | None
Cheng and Liu, 2017     | 193 (T1 MRI + PET) | ACC=0.85    | 3D subject-level | None
Korolev et al., 2017    | 231 (T1 MRI)       | ACC=0.80    | 3D subject-level | None
Valliani and Soni, 2017 | 417 (T1 MRI)       | ACC=0.81    | 2D slice-level   | None
Senanayake et al., 2018 | 515 (T1 MRI)       | ACC=0.76    | 3D subject-level | None
Li et al., 2017         | 427 (T1 MRI)       | ACC=0.88    | 3D patch-level   | None
Basaia et al., 2019     | 646 (T1 MRI)       | ACC=0.99    | 3D subject-level | Unclear (b)
Hon and Khan, 2017      | 416 (T1 MRI)       | ACC=0.96    | 2D slice-level   | Unclear (a, c)
Hosseini et al., 2018   | 140 (T1 MRI)       | ACC=0.99    | 3D subject-level | Unclear (a)
Lin et al., 2018        | 417 (T1 MRI)       | ACC=0.89    | ROI-based        | Unclear (b)
Taqi et al., 2018       | 400 (T2 MRI)       | ACC=1.00    | 2D slice-level   | Unclear (b)
Vu et al., 2017         | 317 (T1 MRI)       | ACC=0.85    | 3D subject-level | Unclear (a)
Vu et al., 2018         | 400 (T1 MRI)       | ACC=0.86    | 3D subject-level | Clear (a, c)
Wang et al., 2019       | 400 (T1 MRI)       | ACC=0.99    | 3D subject-level | Clear (b)
Basheera et al., 2019   | 242 (T2 MRI)       | ACC=1.00    | 2D slice-level   | Clear (b)
Proposed                | 164 (T2 MRI)       | ACC=0.83    | 2D slice-level   | None

  1. Wen, Junhao, et al. "Convolutional neural networks for classification of Alzheimer's disease: Overview and reproducible evaluation." Medical image analysis 63 (2020): 101694.

    • Types of data leakage
      • a: wrong data split; b: absence of independent test set; c: late split

Comparison of AD vs. CN classification task

54 / 59

55 of 62

3.6 Comparison with existing methods

    • Observations
      • Most of these approaches suffer from data leakage.
      • They used larger datasets (more subjects).
      • T1-weighted scans yield more slices than T2-weighted scans.
      • Most of these approaches lack a proper evaluation with separate validation and test sets.
      • Their reported numbers therefore amount to training accuracy.

    • Even with small data and very few slices, our approach achieves good results

    • In our approach, we avoided all forms of data leakage and provided an unbiased evaluation of our model, which is essential for clinical applications.

  1. Wen, Junhao, et al. "Convolutional neural networks for classification of Alzheimer's disease: Overview and reproducible evaluation." Medical image analysis 63 (2020): 101694.

55 / 59

56 of 62

3.6 Comparison with existing methods

Study                   | Total Subjects | Performance | Approach         | Data Leakage
Valliani and Soni, 2017 | 660 (T1 MRI)   | ACC=0.57    | 2D slice-level   | None
Hosseini et al., 2018   | 210 (T1 MRI)   | ACC=0.97    | 3D subject-level | Unclear (a)
Farooq et al., 2017     | 355 (T1 MRI)   | ACC=0.99    | 2D slice-level   | Clear (a, c)
Vu et al., 2018         | 615 (T1 MRI)   | ACC=0.80    | 3D subject-level | Clear (a, c)
Wang et al., 2019       | 624 (T1 MRI)   | ACC=0.97    | 3D subject-level | Clear (b)
Basheera et al., 2019   | 349 (T2 MRI)   | ACC=0.86    | 2D slice-level   | Clear (b)
Proposed                | 246 (T2 MRI)   | ACC=0.65    | 2D slice-level   | None

  1. Wen, Junhao, et al. "Convolutional neural networks for classification of Alzheimer's disease: Overview and reproducible evaluation." Medical image analysis 63 (2020): 101694.

    • Types of data leakage
      • a: wrong data split; b: absence of independent test set; c: late split

    • Only one existing approach (Valliani and Soni, 2017) is free of data leakage

Comparison of AD vs. CN vs. MCI classification task

56 / 59

57 of 62

4. Conclusion and Future Work

57

58 of 62

4. Conclusion

    • Achieved good results despite the small dataset
      • Two-category (AD vs. CN): 83% accuracy
      • Three-category (AD vs. CN vs. MCI): 65% accuracy

58 / 59

59 of 62

4. Conclusion

    • Contributions
      • Proposed a novel framework for the AD classification task 
        • ResNet-18 + CBAM
        • Self-supervised pretraining

      • Dealt with medical data scarcity
        • PGGAN based image synthesis
        • Novel diseases

      • Proper unbiased evaluation
        • Separate test and validation data

      • Application of SimCLR to AD classification

      • An analysis of various approaches and architectures

59 / 59

60 of 62

4. Future Work

    • Improving results
      • Three category classification

    • Test model with large data sets and other modalities
      • PET, cognitive scores, APOE genotype, CSF biomarkers, demographic data etc.

    • Working with further preprocessing techniques
      • FMRIB Software Library
      • FreeSurfer
      • Skull stripping, segmentation, bias correction etc.

    • Alzheimer's disease prediction

    • Diagnosis of other types of dementia
      • Vascular dementia, Frontotemporal dementia, Dementia with Lewy bodies

60 / 59

61 of 62

4. Future Work

    • FMRIB Software Library (FSL)
      • Skull stripping, bias correction etc.

61 / 59

62 of 62

Thank you!

62