1 of 30

Eye-Deep

Detecting Diabetes

with Convolutional Neural Networks

team o_O

Mathis Antony

sveitser@gmail.com

Stephan Brüggemann

2 of 30

Intro

  • Supervised Learning
    • Training set (features + labels) and test set (only features)
    • Training
      • learn relationship between features and labels (on training set)
    • Testing
      • predict labels from test data and measure performance
  • Deep learning
    • Deep → many layers
    • Concepts are not “new”; recent success driven by:
      • more data (the internet)
      • more computational power (GPUs)
      • advances in the field
      • great open source software
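
A minimal sketch of the supervised workflow above using scikit-learn (which appears later in the deck); the digits dataset and classifier choice here are placeholders for illustration.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)          # features + labels

# training set (features + labels) and held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)                    # training: learn features -> labels
print(clf.score(X_test, y_test))             # testing: predict and measure accuracy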

3 of 30

Neurons

[Figure: complete neuron cell diagram by Mariana Ruiz Villarreal, https://en.wikipedia.org/wiki/Neuron]

4 of 30

Artificial Neurons

An artificial neuron:

  1. sum inputs → x
  2. activation function → output y

Rectified Linear Unit (ReLU): max(x, 0)
Leaky ReLU: max(x/100, x)

[Figure: plot of the ReLU and Leaky ReLU activation functions; diagram of a neuron (inputs → sum → activation → output)]
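
A minimal sketch of one artificial neuron as described above; numpy and the specific input/weight values are assumptions for illustration.

import numpy as np

def relu(x):
    # Rectified Linear Unit
    return np.maximum(x, 0)

def neuron(inputs, weights):
    # 1. sum the weighted inputs
    x = np.dot(weights, inputs)
    # 2. apply the activation function
    return relu(x)

print(neuron(np.array([1.0, 3.0]), np.array([0.5, -0.2])))  # example inputs and weights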

5 of 30

Forward Pass on Toy Neural Network

[Figure: toy network diagram with inputs “tail” and “age”, a small hidden layer, and one output; edges labeled with the weights used below]

Example: tail? = yes (1) | age = 3 | true grumpiness = 10

Forward pass:

  • hidden units: 1·1 - 2·3 = -5, -1·1 + 3·3 = 8, 1·1 + 2·2 = 5
  • output: 1·5 + 2·8 - 2·5 = 11

loss/error:

  • (prediction - truth)²
  • → (11 - 10)² = 1

  • Creature grumpiness
  • Features: “has tail?”, age
  • Target: grumpiness
  • Loss: mean squared error
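
A numpy sketch of a forward pass through a network of this shape (two inputs, a small hidden layer, one output) followed by the squared-error loss. The weight values here are placeholders, not the ones from the diagram, so the numbers will not match the slide exactly.

import numpy as np

# hypothetical weights: 2 inputs -> 3 hidden units -> 1 output
W_hidden = np.array([[ 1.0, -2.0],
                     [-1.0,  3.0],
                     [ 1.0,  2.0]])
w_output = np.array([1.0, 2.0, -2.0])

x = np.array([1.0, 3.0])            # features: tail? = yes (1), age = 3
truth = 10.0                        # target: grumpiness

hidden = W_hidden.dot(x)            # forward pass through the hidden layer
prediction = w_output.dot(hidden)   # forward pass through the output layer
loss = (prediction - truth) ** 2    # squared error
print(prediction, loss)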

6 of 30

Gradient Descent

  1. Compute derivative of loss function with respect to weights
  2. Update weights: w ← w − η · ∂loss/∂w (η: learning rate)
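
A minimal numpy sketch of one gradient descent step for the squared-error loss above; the function name and example data are illustrative assumptions.

import numpy as np

def gradient_descent_step(w, x, truth, eta=0.01):
    # prediction of a single linear neuron
    prediction = w.dot(x)
    # derivative of (prediction - truth)^2 with respect to w
    grad = 2 * (prediction - truth) * x
    # update weights: w <- w - eta * grad
    return w - eta * grad

w = np.array([0.5, -0.2])
w = gradient_descent_step(w, x=np.array([1.0, 3.0]), truth=10.0, eta=0.01)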

7 of 30

Training

  1. Initialize weights randomly
  2. Until happy, repeat
    1. Forward pass through network (make prediction)
    2. Calculate error
    3. Backward propagation of errors (backprop)
    4. Update weights
  • Training is done in mini batches (see the sketch below)
  • Only one batch needs to be in memory at a time if necessary
  • Libraries provide almost everything
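
A sketch of the mini-batch training loop described above, using plain numpy and a single linear layer; real libraries handle the backward pass and the updates for you.

import numpy as np

def train(X, y, n_epochs=20, batch_size=32, eta=0.01):
    rng = np.random.RandomState(0)
    w = rng.normal(scale=0.01, size=X.shape[1])    # 1. initialize weights randomly
    for epoch in range(n_epochs):                  # 2. until happy, repeat
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]  # one mini batch at a time
            xb, yb = X[idx], y[idx]
            pred = xb.dot(w)                       # 2.1 forward pass
            err = pred - yb                        # 2.2 calculate error
            grad = 2 * xb.T.dot(err) / len(xb)     # 2.3 backprop (single layer: just the gradient)
            w -= eta * grad                        # 2.4 update weights
    return w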

8 of 30

Image Convolutions

Example 3×3 filters:

  -1 -1 -1        -1  0 -1
  -1  8 -1         0 -1  0
  -1 -1 -1        -1  0 -1

[Figure: a filter sliding over an image,
deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution]
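
A minimal sketch of applying the left filter above (a standard edge-detection kernel) to an image with a 2-D convolution; scipy and the random test image are assumptions for illustration.

import numpy as np
from scipy.signal import convolve2d

# 3x3 edge-detection filter from the slide
edge_filter = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

image = np.random.rand(8, 8)                       # placeholder grayscale image
features = convolve2d(image, edge_filter, mode='valid')
print(features.shape)                              # (6, 6): each 3x3 window -> one output value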

9 of 30

Max Pooling

Example: 5×5 input, pool size 3, stride 2

input:

  1 1 1 1 7
  0 2 1 1 2
  1 2 4 0 1
  2 4 5 5 6
  2 4 1 4 2

max pooling (pool size 3, stride 2) → output:

  4 7
  5 6
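
A small numpy sketch of max pooling with pool size 3 and stride 2, which should reproduce the 2×2 output above; the helper name is an illustrative assumption.

import numpy as np

def max_pool(x, pool_size=3, stride=2):
    # slide a pool_size x pool_size window over x with the given stride
    # and keep the maximum of each window
    rows = range(0, x.shape[0] - pool_size + 1, stride)
    cols = range(0, x.shape[1] - pool_size + 1, stride)
    return np.array([[x[r:r + pool_size, c:c + pool_size].max() for c in cols]
                     for r in rows])

x = np.array([[1, 1, 1, 1, 7],
              [0, 2, 1, 1, 2],
              [1, 2, 4, 0, 1],
              [2, 4, 5, 5, 6],
              [2, 4, 1, 4, 2]])
print(max_pool(x))   # expected: [[4 7], [5 6]]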

10 of 30

from sklearn.datasets import load_digits
d = load_digits()

X = d.images

# reshape to n_samples, n_channels, n_x, n_y
# and convert to 32-bit float (to train on GPU)
X = X.reshape((-1, 1, 8, 8)).astype('f4')

# standardize
X = (X - X.mean()) / X.std()

# convert target to 32-bit int
y = d.target.astype('i4')

from lasagne import layers
from lasagne.nonlinearities import softmax

my_layers = [
    (layers.InputLayer, {'shape': (None, 1, 8, 8)}),
    (layers.Conv2DLayer, {'num_filters': 64,
                          'filter_size': (3, 3)}),
    (layers.MaxPool2DLayer, {'pool_size': (3, 3),
                             'stride': (2, 2)}),
    (layers.DenseLayer, {'num_units': 20}),
    (layers.DenseLayer, {'num_units': 10,
                         'nonlinearity': softmax}),
]

from nolearn.lasagne import NeuralNet
net = NeuralNet(my_layers, verbose=1,
                max_epochs=20,
                update_learning_rate=0.02)
# train network
net.fit(X, y)
# make predictions
y_pred = net.predict(X)

11 of 30

$ python nn_example.py
Using gpu device 0: GeForce GTX 980 Ti (CNMeM is disabled)

## Layer information

  #  name        size
---  ----------  ------
  0  input0      1x8x8
  1  conv2d1     64x6x6
  2  maxpool2d2  64x3x3
  3  dense3      20
  4  dense4      10

 epoch    train loss    valid loss    train/val    valid acc  dur
------  ------------  ------------  -----------  -----------  -----
     1       2.17409       1.94451      1.11807      0.55025  0.09s
     2       1.67648       1.35972      1.23297      0.64005  0.09s
     3       1.03381       0.75170      1.37530      0.89149  0.10s
     4       0.56765       0.41487      1.36825      0.90712  0.10s
     5       0.33763       0.27013      1.24991      0.94387  0.09s
...
    18       0.02589       0.07183      0.36048      0.98438  0.10s
    19       0.02325       0.07053      0.32962      0.98438  0.09s
    20       0.02129       0.06951      0.30625      0.98698  0.10s

12 of 30

13 of 30

Problem

  • input data
    • high resolution color retinal images
    • training set: 35126 images
    • test set: 53576 images
  • target
    • stage of diabetic retinopathy
      • 0 No DR
      • 1 Mild
      • 2 Moderate
      • 3 Severe
      • 4 Proliferative DR
  • Highly unbalanced dataset

14 of 30

Metric

  • Quadratic (Weighted) Cohen’s kappa (κ)
    • Agreement between the ratings of two parties
      • 0 agreement by chance
      • 0 - 0.2 poor
      • ...
      • 0.8 - 1.0 very good
      • 1 total agreement
  • “Weighted” → Ordinal classification problem
  • “Less penalty for classifying a 0 as a 1 than as a 2”
  • Our “solution”:
    • Regression with mean squared error
    • thresholding at [0.5, 1.5, 2.5, 3.5] (see the sketch below)
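
A minimal numpy sketch of turning the continuous regression output into the five ordinal classes by thresholding at [0.5, 1.5, 2.5, 3.5].

import numpy as np

thresholds = [0.5, 1.5, 2.5, 3.5]
predictions = np.array([0.2, 0.7, 1.9, 3.4, 4.6])   # continuous regression outputs

# np.digitize returns the index of the interval each prediction falls into,
# which is exactly the ordinal class 0-4
labels = np.digitize(predictions, thresholds)
print(labels)   # [0 1 2 3 4]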

15 of 30

Dataset

[Figure: example retinal images for stages 0 to 4]

16 of 30

What are we looking for?

Saiprasad Ravishankar, Arpit Jain, Anurag Mittal

IEEE Conf. on Computer Vision Pattern Recognition (CVPR) 2009 http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5206763

17 of 30

Preprocessing

  • Simple heuristics to isolate and crop foreground
  • Resize to 512 pixel squares
  • Standardize each channel (RGB) to have zero mean and unit variance
  • That’s it!
  • But, training large networks requires a lot of data.
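
A rough sketch of the preprocessing steps listed above (crop the foreground, resize to a 512 pixel square, standardize each channel); the threshold-based cropping heuristic, PIL, and numpy here are assumptions, not the exact pipeline used.

import numpy as np
from PIL import Image

def preprocess(path, size=512, threshold=10):
    img = np.asarray(Image.open(path), dtype=np.float32)      # H x W x 3 (RGB)

    # simple heuristic: keep rows/columns that contain non-background pixels
    mask = img.mean(axis=2) > threshold
    rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
    img = img[rows][:, cols]

    # resize to a 512 pixel square
    img = np.asarray(Image.fromarray(img.astype(np.uint8)).resize((size, size)),
                     dtype=np.float32)

    # standardize each channel (RGB) to zero mean and unit variance
    img -= img.mean(axis=(0, 1))
    img /= img.std(axis=(0, 1))
    return img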

18 of 30

Augmentation

  • Problem: “Small” Dataset
  • Artificially increase size of dataset
    • translation
    • rotation
      • can become the bottleneck for large images
    • flipping
    • shearing
    • stretching
    • color augmentation*
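
A hedged sketch of a few of the augmentations above (translation, rotation, flipping) using numpy and scipy.ndimage; the parameter ranges are illustrative assumptions.

import numpy as np
from scipy.ndimage import rotate, shift

def augment(img, rng=np.random):
    # random translation by up to 20 pixels in each direction (H x W x C image)
    img = shift(img, (rng.uniform(-20, 20), rng.uniform(-20, 20), 0))
    # random rotation (can become the bottleneck for large images)
    img = rotate(img, rng.uniform(0, 360), reshape=False)
    # random horizontal flip
    if rng.rand() < 0.5:
        img = img[:, ::-1]
    return img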

19 of 30

Layer Types

  • ☑Convolutional Layer (find features)
  • ☑Max Pooling Layer (find features + reduce size)
  • ☑Fully Connected Layer (prediction from features)
  • Dropout Layer (model averaging, against overfitting)
    • Zero half the neurons
    • Network becomes different for each mini batch

[Figure: example activation values before and after dropout (half the values set to zero)]

  • Maxout Layer
    • Take maximum value over 2 Neurons
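
A tiny numpy sketch of what dropout and maxout do to a vector of activations; the example values are illustrative, and real dropout implementations also rescale the kept activations.

import numpy as np

rng = np.random.RandomState(0)
activations = np.array([5., 2., 8., 1., 9., 2., 2., 5.])

# dropout: zero half the neurons (a fresh random mask for each mini batch)
mask = rng.rand(activations.size) < 0.5
dropped = activations * mask

# maxout: take the maximum value over pairs of 2 neurons
maxed = activations.reshape(-1, 2).max(axis=1)   # [5. 8. 9. 5.]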

20 of 30

Network Architecture

  • The network turns the input image into many tiny “images” (feature maps) a few pixels wide.
  • Extract features on the way through the network.
  • Layers with stride 2 halve width and height of feature maps.
  • Handy “Units”
    • 2 - 4 convolutional layers with small filters (2 x 2 to 5 x 5)
    • followed by max pooling layer with stride 2 and pool size 3
  • Add ReLUs (or similar)
  • 1 or 2 fully connected layers with dropout at the end
  • Weight decay for convolutional layers.
  • “If it doesn’t overfit, you should probably make it bigger.”
  • In competition:
    • Tiny features → larger input images: 64 → 128 → 256 → 512 (→ 768)
    • More and more convolution and pooling layers
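
A sketch of one such “unit” (a few convolutional layers with small filters followed by max pooling with stride 2), written in the nolearn/lasagne layer-list style used earlier; the filter counts and padding are placeholder assumptions.

from lasagne import layers

# one "unit": 2 convolutional layers with small filters, then max pooling with
# pool size 3 and stride 2 (halves width and height of the feature maps)
def conv_pool_unit(num_filters):
    return [
        (layers.Conv2DLayer, {'num_filters': num_filters,
                              'filter_size': (3, 3), 'pad': 1}),
        (layers.Conv2DLayer, {'num_filters': num_filters,
                              'filter_size': (3, 3), 'pad': 1}),
        (layers.MaxPool2DLayer, {'pool_size': (3, 3), 'stride': (2, 2)}),
    ]

# stack units with more and more filters as the feature maps shrink
my_layers = ([(layers.InputLayer, {'shape': (None, 3, 512, 512)})]
             + conv_pool_unit(32) + conv_pool_unit(64) + conv_pool_unit(128))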

21 of 30

Network Architecture

[Figure: network architecture diagram. Legend: Convolution (4x4), Pooling (3x3, stride 2), Dropout, Maxout, Fully Connected. Layer sizes shown: 32, 64, 128, 1024, 256, 512, 1024, 512, 512]
22 of 30

Training

  • Deep networks (many layers) are sometimes hard to train.
  • Initialization strategy is very important.
  • Learning rate:
    1. Find the largest value for which the loss still converges
    2. When the loss stops decreasing, lower the learning rate by a factor of 5 or 10 (see the sketch below)
  • Use “Adam” optimizer or “Nesterov Momentum”.
  • In competition
    • Dynamic resampling to deal with class imbalance.
    • Train smaller network and use learned weights to initialize bigger networks.
    • 200 - 250 epochs
    • ~ 2 days to train one network
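
A small sketch of that learning rate policy (drop by a factor of 10 when the validation loss stops improving); the patience value and structure are assumptions, not the exact schedule used in the competition.

def adjust_learning_rate(valid_losses, learning_rate, patience=5, factor=10.0):
    # if the validation loss has not improved for `patience` epochs,
    # decrease the learning rate
    recent_best = min(valid_losses[-patience:])
    earlier_best = min(valid_losses[:-patience])
    if len(valid_losses) > patience and recent_best >= earlier_best:
        return learning_rate / factor
    return learning_rate

learning_rate = 0.02
valid_losses = [1.9, 1.3, 0.9, 0.8, 0.79, 0.80, 0.81, 0.80, 0.79, 0.80]
learning_rate = adjust_learning_rate(valid_losses, learning_rate)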

23 of 30

24 of 30

What does it “see”?

[Figure: input image (stage 1), prediction map with 5x5 pixel patches occluded, and overlay]

Visualizing and Understanding Convolutional Networks

Matthew D Zeiler, Rob Fergus

http://arxiv.org/abs/1311.2901

25 of 30

What does it “see”?

[Figure: input image (stage 1)]

Visualizing and Understanding Convolutional Networks

Matthew D Zeiler, Rob Fergus

http://arxiv.org/abs/1311.2901

26 of 30

Feature Extraction

  • Output of any layer can be used as features
  • Pretrained networks could also be used for feature extraction (unless Kaggle rules forbid it)

output of last pooling layer → features

  • Original score: κ 0.79 (~ rank 13 on final kaggle leaderboard)
  • Features of last pooling layer:
    • Blend Network
      • features → FC 32 → maxout → FC 32 → maxout → output
    • κ ~ 0.80 (~ rank 12)
    • suggests the fully connected layers in our convolutional network were not well trained

27 of 30

Test Time Averaging (TTA)

  • From the winners of the Kaggle plankton competition (early 2015): https://github.com/benanne/kaggle-ndsb
  • Average output of last pooling layer over multiple augmentations for each eye
  • Use mean and standard deviation of each feature
  • Same blend network
    • features → FC 32 → maxout → FC 32 → maxout → output
  • with TTA mean κ ~ 0.81 (~ rank 11)
  • with TTA mean + standard deviation κ ~ 0.815 (~ rank 10)
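
A short numpy sketch of the test-time averaging idea: run several augmented versions of one eye through the network, then use the per-feature mean and standard deviation; extract_features and augment are hypothetical stand-ins for the network's last pooling layer output and the augmentation pipeline.

import numpy as np

def tta_features(image, extract_features, augment, n_augmentations=20):
    # features of the last pooling layer for several augmented copies of one eye
    feats = np.array([extract_features(augment(image))
                      for _ in range(n_augmentations)])
    # use mean and standard deviation of each feature as input to the blend network
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])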

28 of 30

“Per Patient” Blend

  • the dataset contains both eyes of each patient
  • some images look very “different”
  • correlation between labels for left and right eye is very high: ρ ~ 0.85
  • use TTA features from left and right eye and blend
    • [features of this eye, features of patient’s other eye, left eye indicator]
      • left: [left eye features, right eye features, 1] → left eye label
      • right: [right eye features, left eye features, 0] → right eye label
    • mean, standard deviation, indicator: 8193 features
  • Train Blend Network: κ ~ 0.84 (~ rank 2 - 3)

[Figure: distribution of right eye labels for patients whose left eye label is 3]
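
A numpy sketch of how the per-patient feature vectors described above could be assembled from the TTA features of the two eyes; the variable names and the 4096-dimensional TTA vectors are illustrative assumptions (2 x 4096 + 1 indicator = 8193 features).

import numpy as np

def patient_features(left_tta, right_tta):
    # left_tta, right_tta: TTA mean + std feature vectors for the two eyes
    # [features of this eye, features of patient's other eye, left eye indicator]
    left_row = np.concatenate([left_tta, right_tta, [1.0]])    # -> left eye label
    right_row = np.concatenate([right_tta, left_tta, [0.0]])   # -> right eye label
    return left_row, right_row

left_row, right_row = patient_features(np.zeros(4096), np.zeros(4096))
print(left_row.shape)   # (8193,)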

29 of 30

Final Result

  • Ensembling
    • averaged results from 2 similar network architectures and 3 sets of weights each: κ → ~ 0.845

HK Electric wins too

30 of 30

Thank you

Q&A