1 of 23

DLCV04

Manel Baradad, Míriam Bellver, Martí Cervià, Hector Esteban, Carlos Roig

Visit our GitHub page here


2 of 23

Task 1: Resources

  • Software used:
    • Keras with GPU-accelerated TensorFlow as backend, using Python 2.7.
  • Hardware used:
    • PC with GPU (NVIDIA GTX 970)


3 of 23

Task 1: Architecture

For the first task, we modified a convnet designed for the MNIST dataset.


Input (1x28x28) → ConvLayer_1 (32x26x26) → ReLU → ConvLayer_2 (32x24x24) → ReLU → MaxPooling (2x2 kernel) → FullyConnectedLayer (128) → Softmax (10 classes)

Epoch time: 18 s. Total parameters: 600,810.
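The architecture above can be sketched in Keras as follows. This is a minimal sketch against the current TensorFlow-bundled Keras API (the original used Keras 1 on Python 2.7), and the optimizer choice is an assumption, not taken from the slides:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Input(shape=(28, 28, 1)),                # 1x28x28 grayscale input
    Conv2D(32, (3, 3), activation='relu'),   # -> 32x26x26
    Conv2D(32, (3, 3), activation='relu'),   # -> 32x24x24
    MaxPooling2D(pool_size=(2, 2)),          # 2x2 kernel -> 32x12x12
    Flatten(),
    Dense(128, activation='relu'),           # fully connected, 128 units
    Dense(10, activation='softmax'),         # 10 classes
])
model.compile(loss='categorical_crossentropy',
              optimizer='adadelta', metrics=['accuracy'])
```

With these shapes the model has exactly the 600,810 parameters reported above, most of them in the 4608→128 fully connected layer.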

4 of 23

Task 1: Architecture

Removing one Conv layer: epoch time 11 s, total parameters 693,962

Adding 3 more Conv layers: epoch time 39 s, total parameters 370,506

Adding new Conv layers is expensive in terms of total run time!

5 of 23

Task 1: Architecture

Adding a new FC layer with output size 1024 between the Conv layers and the original FC layer


Epoch time: 28 s

Total parameters: 4.9M

Adding new FC layers is expensive in terms of memory consumption!

6 of 23

Task 2: Batch Size


Test with MNIST

Only 10 classes → no perceptible difference when changing the batch size (the same happened with CIFAR10)

[Training curves for two batch sizes; epoch times: 28 s and 40 s]

7 of 23

Task 2: Data Augmentation


  • Trained with CIFAR10

  • Results were worse with augmentation than without.

Possible causes:

  • Not enough epochs
  • This is real-time augmentation, so images are randomly flipped / translated, etc., but the number of images per epoch stays the same
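The real-time augmentation described above can be sketched as follows. This uses current Keras preprocessing layers; the original experiment used Keras's generator-based augmentation, and the specific flip/translation ranges here are assumptions:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import RandomFlip, RandomTranslation

# Each batch is randomly flipped/translated on the fly, so the number of
# images per epoch stays the same; only their appearance varies.
augment = Sequential([
    RandomFlip('horizontal'),
    RandomTranslation(height_factor=0.1, width_factor=0.1),
])

batch = np.random.rand(4, 32, 32, 3).astype('float32')  # a CIFAR10-sized batch
augmented = augment(batch, training=True)               # new random views each call
```

Because each epoch sees the same number of (differently distorted) images, more epochs are needed before the augmented model catches up, which matches the "not enough epochs" explanation above.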

8 of 23

Task 2: Batch Normalization


  • Trained with CIFAR10

  • With Batch Normalization the model learns faster, with fewer epochs
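A minimal sketch of where Batch Normalization sits in such a net, assuming illustrative layer sizes (the slides do not give the exact CIFAR10 architecture used):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     Activation, MaxPooling2D, Flatten, Dense)

# BatchNormalization between each convolution and its ReLU normalizes
# activations per mini-batch, which speeds up convergence.
model = Sequential([
    Input(shape=(32, 32, 3)),   # CIFAR10 input
    Conv2D(32, (3, 3)),
    BatchNormalization(),       # normalize, then learn scale/shift
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation='softmax'),
])
```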

9 of 23

Task 2: Overfitting

We trained a network with 2 conv layers and 1 fully connected layer.

Database: Terrassa, only 450 training images, no data augmentation


[Training curves: without dropout vs. with dropout]
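A sketch of such a 2-conv + 1-FC net with dropout added; the layer sizes and dropout rates are assumptions, since the slides do not specify them:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D,
                                     Flatten, Dense, Dropout)

# Dropout randomly zeroes a fraction of activations during training,
# which combats overfitting on a tiny dataset like Terrassa (450 images).
model = Sequential([
    Input(shape=(32, 32, 3)),
    Conv2D(32, (3, 3), activation='relu'),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Dropout(0.25),                     # drop 25% of conv activations
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),                      # drop 50% before the classifier
    Dense(13, activation='softmax'),   # 13 Terrassa classes
])
```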

10 of 23

Task 3 - Filter Visualization


[Filter visualizations for Conv Layers 1-4]

CIFAR10 NET: Input (32x32 pixels) → ConvLayer_1 → ReLU → ConvLayer_2 → ReLU → MaxPooling → ConvLayer_3 → ReLU → ConvLayer_4 → ReLU → MaxPooling → FC Layer (512) → Softmax (10 classes)

11 of 23

Task 3 - Filter Visualization

Filters of pre-trained VGG16 'conv5_1'

They are calculated by defining a loss function that maximizes the activation of a specific filter in a specific layer; we simply ran the Keras example at the following link
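The gradient-ascent idea behind these visualizations can be sketched as follows. This is written with TensorFlow's GradientTape rather than the original Keras example script, and the default image shape and step size are assumptions:

```python
import tensorflow as tf

def maximize_filter(model, layer_name, filter_index,
                    img_shape=(1, 32, 32, 3), steps=20, lr=1.0):
    """Gradient ascent on a random input so one filter's mean activation grows."""
    # Sub-model that outputs the activations of the chosen layer
    feat = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    img = tf.Variable(tf.random.uniform(img_shape))
    for _ in range(steps):
        with tf.GradientTape() as tape:
            # Loss = mean activation of the chosen filter
            loss = tf.reduce_mean(feat(img)[..., filter_index])
        img.assign_add(lr * tape.gradient(loss, img))  # ascend the gradient
    return img.numpy()[0]
```

The returned image is the pattern that most excites that filter, which is what the 'conv5_1' visualizations above show.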


12 of 23

Task 3 - T-SNE

t-SNE is a tool to visualize high-dimensional data.

It converts similarities between data points into joint probabilities and minimizes the KL divergence between the low-dimensional embedding and the original data.

In this example we used the MNIST dataset with 2500 images.
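A minimal sketch with scikit-learn's TSNE, using random stand-in data instead of the 2500 MNIST images:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: in the slides this was 2500 flattened MNIST images (784-dim)
X = np.random.RandomState(0).rand(200, 64)

# t-SNE embeds the points into 2-D while preserving local neighborhoods,
# so same-digit images end up in visible clusters.
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(X)
```

The 2-D `embedding` is what gets scattered-plotted, one point per image, colored by digit class.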


13 of 23

Task 3 - Off-the-shelf VGG-16 Local Classification

[Cropped test images with the displayed probabilities: 0.113, 0.0493, 0.25]

We ran these tests with a VGG16 network trained on the ImageNet dataset; the probability displayed is the value for the class of the original image.

14 of 23

Task 4 - Fine tuning

  1. Train a network on CIFAR10 and fine-tune it for Terrassa Buildings 900


Terrassa database (640 labeled samples)

Preprocessing:

  1. resize to (32, 32)
  2. normalize the color channels

CIFAR model from Keras, pre-trained for 200 epochs (acc: 0.7361)

We removed the last layer and added a dense layer with 13 neurons, then fine-tuned for Terrassa: 3000 epochs, batch size 32 → val acc: 0.7750
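The swap of the last layer can be sketched as follows; the base model here is a tiny stand-in for the actual CIFAR10-pretrained Keras model:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

# Stand-in for the CIFAR10-pretrained model (in the experiment this was
# the Keras CIFAR10 example net trained for 200 epochs)
base = Sequential([
    Input(shape=(32, 32, 3)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation='softmax'),   # original 10-way CIFAR10 softmax
])

base.pop()                                 # remove the 10-class output layer
base.add(Dense(13, activation='softmax'))  # new 13-neuron layer for Terrassa
base.compile(loss='categorical_crossentropy', optimizer='sgd',
             metrics=['accuracy'])
```

After the swap, training continues on the Terrassa images; the earlier layers start from the CIFAR10 features instead of random weights.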

15 of 23

Task 4 - Fine tuning

  • Train a network on CIFAR10 and fine-tune it for Terrassa Buildings 900

[Accuracy and loss curves]

16 of 23

Task 4 - Fine tuning

2. Use pre-trained weights of VGG16 and train a classifier on top for Terrassa Buildings 900


Terrassa database (640 labeled samples)

Preprocessing:

  • resize to (224, 224)
  • subtract the mean of the ImageNet dataset

VGG16 with pre-trained ImageNet weights

We removed the last layer and added a dense layer with 13 neurons, then fine-tuned for Terrassa: 350 epochs → val acc: 0.85
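The VGG16 setup can be sketched as follows, assuming the current Keras applications API; `weights=None` stands in here for `weights='imagenet'` only to avoid the download, and the exact classifier head is an assumption:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

# weights='imagenet' reproduces the experiment; None skips the download
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False            # keep the pre-trained features frozen

x = Flatten()(base.output)
out = Dense(13, activation='softmax')(x)   # new 13-neuron Terrassa classifier
model = Model(inputs=base.input, outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])
```

Freezing the convolutional base means only the new classifier is trained, which is why 350 epochs on 640 images suffice.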

17 of 23

Task 4 - Fine tuning

2. Use pre-trained weights of VGG16 and train a classifier on top for Terrassa Buildings 900

[Accuracy and loss curves]

18 of 23

Task 5 - Open Project

Neural Style

Goal: Generate a new image with the content of image1 and the style of image2


[base image | style image]

How to encode content? Layer 'conv4_2' of VGG16.

How to encode style? Gram matrices of layers ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1'] of VGG16.
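The Gram-matrix style encoding can be sketched in NumPy as:

```python
import numpy as np

def gram_matrix(features):
    """Style representation: channel-wise correlations of a conv activation.

    features: array of shape (channels, height, width)
    returns:  (channels, channels) Gram matrix
    """
    c = features.shape[0]
    f = features.reshape(c, -1)   # flatten the spatial dimensions
    return f.dot(f.T)             # inner products between channel maps
```

Because the spatial dimensions are summed out, the Gram matrix captures which filters fire together (texture) while discarding where they fire (content), which is exactly why it encodes style.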

19 of 23

Task 5 - Open Project

Neural Style


[Results with the default style and content features after 1, 5, 10, and 20 iterations, shown next to the base and style images]

20 of 23

Task 5 - Open Project

Test 1: only the first conv layers as style features: feature_layers = ['conv1_1', 'conv2_1']


[Comparison after 1, 5, 10, and 20 iterations: default features vs. first-conv-layer style features]

21 of 23

Task 5 - Open Project

Test 2: only the last conv layers as style features: feature_layers = ['conv4_1', 'conv5_1']


[Comparison after 1, 5, 10, and 20 iterations: default features vs. last-conv-layer style features]

22 of 23

Thanks!! Questions?


23 of 23

References
