1 of 169

Deep Learning in a Nutshell

Md Sohel Rana

(Graduate Researcher)

La Trobe University

1

2 of 169

How Do We Learn?

  • Do we get our knowledge from tables and spreadsheets?

  • This is what traditional machine learning is mainly used for.

Slide from Zhen He

2

3 of 169

How Do We Learn?

  • Listening
    • Deep Learning is best for speech recognition
    • 30% better performance than non-deep learning techniques
• MuseNet – music generation

Slide from Zhen He

3

4 of 169

How Do We Learn?

  • Looking
    • Deep Learning is best for computer vision
    • Over 100% better performance than non-deep learning algorithms

Slide from Zhen He

4

5 of 169

How Do We Learn?

  • Reading text
    • Deep Learning is becoming the best for natural language understanding/processing
• Language translation, word embeddings, thought vectors, question answering, etc.

Slide from Zhen He

5

6 of 169

Deep Learning versus Traditional Machine Learning

  • Traditional machine learning
    • Structured data
    • Hand engineered features
    • Fixed length input and output

  • Deep learning
    • Unstructured data
    • Feature learning
    • Variable length input and output

Slide from Zhen He

6

7 of 169

Image representation

7

8 of 169

Hand Engineered Features

Slide from Zhen He

8

9 of 169

Image Recognition: Hand Crafted Features

[Andrew Ng, UCLA deep learning summer school 2012]

9

10 of 169

Traditional Machine Learning

• Traditional machine learning assumes the features are given to it

(Figure: Input → Feature Representation → Learning algorithm, where only the learning algorithm is the "machine learning" part.)

Slide from Zhen He

10

11 of 169

Traditional Machine Learning

(Figure: hand engineered features feed a machine learning algorithm that outputs prob(car) and prob(motorbike).)

Slide from Zhen He

11

12 of 169

Feature Learning

  • Extend machine learning to include learning the feature representation
  • Learn features automatically

(Figure: Input → Feature Representation → Learning algorithm, where machine learning now covers both the feature representation and the learning algorithm.)

Slide from Zhen He

12

13 of 169

Feature Learning

  • During training the deep learning algorithm automatically builds a hierarchy of features.
  • The top levels are high level features of the input data.
  • Classifying using high level features is much easier.

(Figure: the network outputs prob(car) and prob(motorbike) from high level features such as X1 (number of lights) and X2 (number of handlebars); in this feature space a simple decision boundary separates cars from motor bikes.)

13

14 of 169

Playing Lego

  • Deep learning algorithms are composed of many different plug and playable blocks.
    • E.g. one block turns an image into a vector; a 2nd block turns the vector into text.

(Figure: Encode image (CNN) → vector → decode to text (NN) → "Small bird with orange chest". Lego blocks connect via vectors.)

Slide from Zhen He

14

15 of 169

Deep Learning Lego

(Figure: English sentence → Encode English → vector → Decode to Spanish → Spanish sentence.
English sentence → Encode English → vector → Decode to German → German sentence.)

Slide from Zhen He

15

16 of 169

Many many Lego pieces

  • A combinatorial number of algorithms can be created by recombining the blocks

16

17 of 169

Who is Using Deep Learning?

17

18 of 169

Who is using Deep Learning?

  • Google Search
    • RankBrain
      • Ranking using deep learning

  • Google's traditional rule-based search team was initially reluctant to use deep learning for search
  • Head to head
    • RankBrain versus the traditional Google search algorithm
      • RankBrain: 80% correct
      • Traditional Google search algorithm: 70% correct
  • Next-generation Google Assistant
    • More automation
    • Real-time translation thanks to a compressed model downloaded to the Android phone

  • http://www.wired.com/2016/02/ai-is-changing-the-technology-behind-google-searches/

Slide from Zhen He

18

19 of 169

Language Translation

  • Google is using deep learning for language translation

Reduces translation errors by an average of 60% compared to Google's phrase-based production system.

Slide from Zhen He

19

20 of 169

Who is using Deep Learning?

  • Google Android Speech Recognition

Deep recurrent neural networks reduced word errors by more than 30%

Slide from Zhen He

20

21 of 169

Who is using Deep Learning?

  • Google has over 1500 deep learning projects
  • Facebook AI research
  • Jeff Dean
    • Co-creator of MapReduce; leads deep learning at Google Brain

21

22 of 169

Who is using Deep Learning?

  • Google CEO Sundar Pichai
    • Over time computers in all shapes (mobile device, watch, in car, etc.) will act as an intelligent assistant helping you through your day.
  • Using deep learning, personal assistants will become much smarter and better at speech recognition
    • Microsoft Cortana, Facebook M
    • Next-generation Google Assistant, Amazon Echo

22

23 of 169

Who is using Deep Learning?

  • Self driving cars
    • Google (Waymo)
    • Nvidia

23

24 of 169

Just getting started

  • Deep learning has only been around for about 7 years!
  • Many many more new applications coming!

24

25 of 169

Introduction to Neural Networks

25

26 of 169

Neural network

(Figure: a neuron computes a weighted sum a(x) followed by an activation h(x).)

26

27 of 169

Activation Function

(Figure: the activation function is usually drawn this way.)

27

28 of 169

28

29 of 169

XOR solution

XOR truth table (4 examples):

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

(Figure: a small network solving XOR, with weights W11, W12, W21, W22 from the inputs to two hidden units and Wh1, Wh2 from the hidden units to the output.)

29

30 of 169

Deep Learning

  • Using more layers you can learn much more complex representations of the data.
    • Effectively much more complex decision boundaries

http://playground.tensorflow.org/

(Figure: a deep network mapping inputs to outputs through L weight layers W1, W2, …, WL.)

Slide from Zhen He

30

31 of 169

Forward Pass

  • Take the inputs x and then compute the activations of the hidden units
  • Then compute the outputs from the hidden activations

Slide from Zhen He

31

32 of 169

Loss Function

  • Compare the outputs to the ground truth labels y1 … yK; how they are compared depends on the chosen loss function

  • N is the number of training examples
  • K is the number of outputs
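One common choice (a sketch, not necessarily this slide's exact formula) is the mean squared error over all N examples and K outputs:

L = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left(\hat{y}_k^{(n)} - y_k^{(n)}\right)^2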

32

33 of 169

Back Propagation

  • Use the calculus chain rule to propagate the gradient of the loss from the output all the way back to all the weights.
  • Adjust the weights using the gradients.
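For a weight w deep in the network, the chain rule multiplies the local derivatives along the path from w to the loss, for example:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w}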


33

34 of 169

Optimization: Gradient Descent

  • The way to minimize the loss function is to find the gradient (slope) of the loss with respect to the weight W.
  • Then move in the negative direction of the slope, so you descend the loss function.

(Figure: the loss Loss(W) plotted against a weight W; follow the negative slope downhill.)

The update rule: W ← W - η · ∂Loss/∂W, where η is the learning rate.

Convex optimization problem (in this one-weight illustration; real deep-network losses are not convex)

34

35 of 169

The optimal learning rate

  • Very small learning rate: converges too slowly.
  • Very large learning rate: jumps out of the minima.

(The size of each update step is the gradient scaled by the learning rate η.)

35

36 of 169

Popular Optimization Method

  • Stochastic Gradient Descent (SGD)
    • Manual learning-rate schedule. For example:
      • Epochs 1–18: learning rate = 1e-2
      • Epochs 19–29: learning rate = 5e-3
      • Epochs 30–43: learning rate = 1e-3
      • Epochs 44–52: learning rate = 5e-4
    • Momentum

(The learning rate goes down over the course of training.)

36

37 of 169

Popular Optimization Methods

  • Adam
    • Automatically adjusts learning rate
    • A good initial learning rate is 1e-4
    • Too high an initial learning rate can lead to bad results

  • RMSProp
    • Also adjusts learning rate automatically.

37

38 of 169

Tuning Initial Learning Rate

  • Usually it is a good idea to try several different initial learning rates to see which one performs the best.
    • 1Cycle learning rate finder

38

39 of 169

Three Types of Gradient Descent

  • Full batch gradient descent
    • Use all examples in each iteration

  • Stochastic gradient descent
    • Use 1 example in each iteration

  • Mini-batch gradient descent
    • Use b examples in each iteration
    • This is the one that is most commonly used for deep learning
      • Usually b is around 128.
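A minimal sketch of one epoch of mini-batch gradient descent in PyTorch (model, loss_fn, optimizer and the tensors X, y are assumed to already exist):

import torch

def run_epoch(model, loss_fn, optimizer, X, y, b=128):
    perm = torch.randperm(X.shape[0])        # shuffle once per epoch
    for i in range(0, X.shape[0], b):        # b examples per iteration
        idx = perm[i:i + b]
        optimizer.zero_grad()                # clear old gradients
        loss = loss_fn(model(X[idx]), y[idx])
        loss.backward()                      # backpropagate
        optimizer.step()                     # take one descent step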

39

40 of 169

Back Propagation

(Figure: the red numbers are gradients.)

40

41 of 169

Gradient At Branches

(Figure: a value c branches into two paths p and q on the way to the loss; during backpropagation the gradients arriving from the branches are summed: ∂Loss/∂c = ∂Loss/∂p · ∂p/∂c + ∂Loss/∂q · ∂q/∂c.)

41

42 of 169

The Softmax Layer and the Cross Entropy Loss Function

42

43 of 169

How to Remember Cross Entropy Loss

  • The cross-entropy loss is the negative of the one-hot encoded true label multiplied by the log of the predicted probability, summed over the classes.
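Written out, with y the one-hot target vector and p the vector of predicted probabilities over K classes:

L = -\sum_{k=1}^{K} y_k \log(p_k)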

43

44 of 169

Consider the Following Classification Problem

  • ab – activation for banana
  • ag – activation for grape
  • ap – activation for pear

  • pb – predicted probability for banana
  • pg – predicted probability for grape
  • pp – predicted probability for pear

(Figure: the input X and weights W feed the neural network, which outputs the activations ab, ag, ap; a softmax converts these into the probabilities.)

44

45 of 169

Consider the Following Classification Problem

  • ab – activation for banana
  • ag – activation for grape
  • ap – activation for pear

  • pb – predicted probability for banana
  • pg – predicted probability for grape
  • pp – predicted probability for pear

  • The way we compute the probability of class banana given X and the weights W using the softmax function is as follows:

pb = e^ab / (e^ab + e^ag + e^ap)

(Figure: the network outputs the logits ab, ag, ap; softmax turns them into probabilities, e.g. (0.2, 0.1, 0.7).)

45

46 of 169

What if I don’t care about the maths and just want to use it?

  • Targets should be one-hot encoded. For example:
    • Banana: 1, 0, 0
    • Pear: 0, 1, 0
    • Apple: 0, 0, 1
  • Each target should be mutually exclusive
    • For example the object is either a banana or a pear. It cannot be both.

(Figure: the neural network layers output the logits (5.0, 2.0, 1.0); softmax converts them to probabilities (0.94, 0.04, 0.02); against the one-hot encoded target (1, 0, 0) the cross entropy loss is 0.067.)

Pytorch code:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # softmax + cross entropy in one; targets are class indices, so no one-hot encoding is needed
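A minimal usage sketch matching the example above (the logit values and the target class index are taken from the figure):

logits = torch.tensor([[5.0, 2.0, 1.0]])   # one example, three classes
target = torch.tensor([0])                 # class index 0 ("banana") instead of the one-hot (1, 0, 0)
loss = criterion(logits, target)
print(loss.item())                         # approximately 0.066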

46

47 of 169

Regression Loss Function

  • When you want to predict a real value instead of a class (regression), people usually use the mean squared error (L2) loss function
  • The L1 (absolute error) loss is a common alternative
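Both are one-liners in PyTorch; a small sketch with made-up prediction and target values:

import torch
import torch.nn as nn

pred = torch.tensor([2.5, 0.0])
target = torch.tensor([3.0, -0.5])
print(nn.MSELoss()(pred, target))  # mean squared error: (0.5^2 + 0.5^2) / 2 = 0.25
print(nn.L1Loss()(pred, target))   # mean absolute error: (0.5 + 0.5) / 2 = 0.5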

47

48 of 169

Activation Functions

48

49 of 169

Sigmoid Function

49

50 of 169

Sigmoid Function

Saturated neurons kill the gradients
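The definitions behind this point:

\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \le \frac{1}{4}

For inputs of large magnitude the neuron saturates near 0 or 1, so σ'(x) ≈ 0 and almost no gradient flows back through it.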

50

51 of 169

Rectified Linear Unit (ReLU)
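For reference, the definition:

\mathrm{ReLU}(x) = \max(0, x)

Its gradient is 1 for x > 0 and 0 otherwise, so active units pass gradients through without attenuation and do not saturate for positive inputs.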

51

52 of 169

Convolutional Neural Networks

52

53 of 169

Convolutional Neural Networks

  • Sliding window of image patches to detect each feature
    • The neurons of the same feature share the same weights
  • The output feature map tells us how well the filter matches the input image patch

(Figure: several different filters slide over the same input; each filter produces its own output feature map.)

53

54 of 169

What is the result of training?

  • These names all mean the same
    • Filter / Kernel / Weights
  • The filter contains the learnt weights
    • During training the neural network will determine which filters it should use to give the highest classification accuracy.
  • Spatial invariance
    • Sliding the same filter across the entire image means you look for the same feature anywhere in the image.

(Figure: each filter / kernel / weight set is slid over the input and produces its own output feature map.)
54

55 of 169

Spatial Invariance

  • Even though the ball only appears in the top left in the training set, the filters we learn during training can be used to detect the ball in the bottom right in the testing set.

Training Set

Testing Set

55

56 of 169

Multiple layers

(Figure: the output feature map of layer 1 becomes the input for the next layer, where another filter produces the layer-2 output feature map.)

56

57 of 169

How to Compute a Convolution?

  • We slide the filter across the input feature map and compute the dot product between the filter and the current window of values in the input feature map.

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Filter:

 0.2   1.2
 3.1   1.1

Output Feature Map (first value):

 5.81 = 0.2 * 1.2 + 1.2 * 2.2 + 3.1 * 0.2 + 1.1 * 2.1
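This can be checked in PyTorch (a sketch; note that deep-learning "convolution" layers actually compute cross-correlation, which is also what the slide computes):

import torch
import torch.nn.functional as F

x = torch.tensor([[ 1.2,  2.2,  1.2,  3.1],
                  [ 0.2,  2.1,  2.5, -1.2],
                  [-1.1,  2.3,  1.1, -0.2],
                  [ 1.2,  2.1, -2.3, -1.1]]).reshape(1, 1, 4, 4)  # batch, channel, H, W
w = torch.tensor([[0.2, 1.2],
                  [3.1, 1.1]]).reshape(1, 1, 2, 2)                # out_ch, in_ch, kH, kW
print(F.conv2d(x, w))  # a 3x3 output feature map whose top-left value is 5.81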

57

58 of 169

How to Compute a Convolution?

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Filter:

 0.2   1.2
 3.1   1.1

Output Feature Map (so far):

 5.81   11.14

 11.14 = 0.2 * 2.2 + 1.2 * 1.2 + 3.1 * 2.1 + 1.1 * 2.5

58

59 of 169

Padding with Zeros

  • Padding the input feature map with zeros allows you to output a larger feature map.

Input Feature Map (padded with zeros):

 0     0     0     0     0     0
 0     1.2   2.2   1.2   3.1   0
 0     0.2   2.1   2.5  -1.2   0
 0    -1.1   2.3   1.1  -0.2   0
 0     1.2   2.1  -2.3  -1.1   0
 0     0     0     0     0     0

(Figure: a filter applied to the padded input produces a larger output feature map than it would without padding.)

59

60 of 169

Convolution generates an output feature map

https://community.arm.com/graphics/b/blog/posts/when-parallelism-gets-tricky-accelerating-floyd-steinberg-on-the-mali-gpu

Output feature map

60

61 of 169

The depth represents three feature maps / channels concatenated together. Usually in the first layer these three channels represent the three colour channels: R, G, B.

61

62 of 169

62

63 of 169

63

64 of 169

64

65 of 169

65

66 of 169

Layer 1 and 2 of CNN

  • Left shows the filters
  • Right shows which parts of the image are most strongly activated by each filter

source: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html

66

67 of 169

Layer 3 of CNN

source: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html

67

68 of 169

Layer 4 and 5 of CNN

source: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html

68

69 of 169

Convolution with Larger Stride

  • In this example we will slide the window across with a stride of 2.

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Filter:

 0.2   1.2
 3.1   1.1

Output Feature Map (first value):

 5.81 = 0.2 * 1.2 + 1.2 * 2.2 + 3.1 * 0.2 + 1.1 * 2.1

69

70 of 169

Convolution with Larger Stride

  • In this example we will slide the window across with a stride of 2.

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Filter:

 0.2   1.2
 3.1   1.1

Output Feature Map (stride 2, so the window jumps two columns):

 5.81   10.39

 10.39 = 0.2 * 1.2 + 1.2 * 3.1 + 3.1 * 2.5 + 1.1 * -1.2

70

71 of 169

Pooling / Subsampling

  • In order to reduce the size of feature maps, pooling/subsampling is often used.
  • Max pooling is the most popular type of pooling.
    • Allows a certain amount of translational invariance
  • In max pooling the maximum value within each window is output to the output feature map.
    • Other types of pooling include average pooling, etc.
  • In this example we apply max pooling with a 2 x 2 pooling window and a stride of 2.

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Output Feature Map (first value):

 2.2   (the max of the top-left 2 x 2 window)

71

72 of 169

Pooling / Subsampling

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Output Feature Map (so far):

 2.2   3.1
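The same operation in PyTorch (a sketch, reusing the 4 x 4 input from the convolution example):

import torch
import torch.nn.functional as F

x = torch.tensor([[ 1.2,  2.2,  1.2,  3.1],
                  [ 0.2,  2.1,  2.5, -1.2],
                  [-1.1,  2.3,  1.1, -0.2],
                  [ 1.2,  2.1, -2.3, -1.1]]).reshape(1, 1, 4, 4)
print(F.max_pool2d(x, kernel_size=2, stride=2))  # [[2.2, 3.1], [2.3, 1.1]]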

72

73 of 169

A Typical Convolutional Network

  • Notice that as we move deeper into the network the feature maps get smaller and there are more feature maps per layer.
    • This is analogous to having higher-level concepts which cover larger regions, and more of them (one concept per feature map)
  • All the convolutional layers together form the feature extractor
  • The fully connected layer takes the features as input and then outputs the classification

73

74 of 169

Where to use Fully Connected Layers (Linear Layers)?

  • Suppose we take an image as input and we want to predict which of 3 possible objects it is.
  • We should use a linear layer to make the network output 3 predicted values
    • After we pass that through a softmax we will have a predicted probability for each class.

(Figure: convolutional layers → linear layer → logits (5.0, 2.0, 1.0) → softmax → probabilities (0.94, 0.04, 0.02) for banana, pear and grapes.)

74

75 of 169

Fully Connected Network (Linear Layer)

  • Massive number of weights
    • Every input neuron is connected to every output neuron via a weight

    • M x N weights
      • M input neurons
      • N output neurons

(Figure: input neurons fully connected to output neurons, with a weight wij for each connection.)

75

76 of 169

Convolutional layers are much more efficient than fully connected layers

It is common for fully connected layers to hold around 99% of a network's weights, while the convolutional layers hold around 1%.

76

77 of 169

Convolutional Layers Have Small Number of Weights

  • Convolution layers do massive weight sharing
  • Using just 3 (width) x 3 (height) x 3 (number of input feature maps) x 6 (number of output feature maps) weights = 162 weights
    • Connects 224 x 224 x 3 (150,528) input neurons to 222 x 222 x 6 (295,704) output neurons.
  • If we were to use a fully connected layer to do the same then we would need 150,528 x 295,704 = 44,511,731,712 weights!!!

(Figure: a 224 x 224 x 3 input convolved with a 3 x 3 x 3 x 6 filter produces a 222 x 222 x 6 output.)
77

78 of 169

Three Famous Convolutional Architectures

  • AlexNet
    • 2012
    • 8 layers
    • ImageNet top 5 error: 16.4%
  • VGG
    • 2014
    • 19 layers
    • ImageNet top 5 error: 6.8%
  • ResNet
    • Nov 2015
    • 152 layers
    • ImageNet top 5 error: 3.57%

78

79 of 169

AlexNet

79

80 of 169

AlexNet

  • ImageNet Classification with Deep Convolutional Neural Networks
    • Alex Krizhevsky et al.
    • NIPS 2012
  • The paper that started it all in 2012
  • Ground breaking for a number of reasons
    • ReLU
    • Dropout
    • Using GPUs

80

81 of 169

VGG

81

82 of 169

VGG

  • Very Deep Convolutional Networks for Large Scale Image Recognition
    • Karen Simonyan et al.
    • ICLR 2015
    • Oxford computer vision group
  • Won the ImageNet ILSVRC-2014 localization competition
  • Second in the ImageNet ILSVRC-2014 classification competition
    • Error rate of 6.8%

82

83 of 169

VGG16 architecture

83

84 of 169

VGG

  • Much deeper than AlexNet
    • Up to 19 layers instead of 8 layers
    • They can afford the depth due to the much smaller filters
    • The number of parameters is not that much larger than AlexNet's
      • VGG: 133–144 million
      • AlexNet: 60 million

84

85 of 169

ResNet!

85

86 of 169

Deep Residual Learning for Image Recognition

  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
  • arXiv 2015

  • With existing techniques, accuracy saturates when going deeper
  • The degradation is not caused by overfitting!
  • Adding more layers leads to higher training error.
  • (The paper's experiments demonstrate this degradation.)

86

87 of 169

Wouldn’t it be cool!

  • Wouldn’t it be cool to be able to shut down entire layers based on the input?
    • i.e. selectively keep layers depending on the input

(Figure: different inputs A and B activate different subsets of layers.)

87

88 of 169

One Simple Awesome Trick! (deep residual networks)

  • Using the above approach we can produce much deeper nets
    • Successfully trained over 100 layers and explored over 1000 layers
  • ImageNet
    • 152-layer residual net
    • Lower complexity than VGG
    • 3.57% top-5 error
  • 1st place in the 2015 results (ImageNet classification, ImageNet detection, ImageNet localization, COCO detection, COCO segmentation)
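A minimal sketch of the trick itself: a residual block whose skip connection lets the layers default to the identity (a simplified two-convolution block, not the paper's exact architecture):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv2(F.relu(self.conv1(x)))
        return F.relu(out + x)   # skip connection: the layers only learn a residual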

88

89 of 169

Generalization

  • Data augmentation
  • Dropout

89

90 of 169

Data Augmentation

  • Increase data size to help generalization
  • Types of augmentation
    • Random flip
    • Random rotate
    • Random scale
    • Random noise
    • Random crop
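A sketch of these augmentations using torchvision transforms (the specific parameter values are illustrative assumptions):

from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),   # random flip
    transforms.RandomRotation(10),       # random rotate, up to 10 degrees
    transforms.RandomResizedCrop(64),    # random scale + random crop
    transforms.ToTensor(),
])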

90

91 of 169

Dropout Method

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. R. (2012), Improving neural networks by preventing co-adaptation of feature detectors

91

92 of 169

Amount of Dropout

  • Dropout is usually applied to the fully connected layers rather than the initial layers
    • The initial layers learn low-level features
  • Usually dropout is applied before the activation function
  • A typical dropout probability is 0.5

Dropout code in pytorch: nn.Dropout(0.5)

92

93 of 169

Use Dropout to Avoid Overfitting

93

94 of 169

Normalization is Important

  • Deep learning algorithms train much better and faster when activations are normalized.

  • When the data is normalized, the weights of the neural network can stay in a more reasonable range and have close to a normal distribution.
    • E.g. trying to predict the price of a house: features include the number of bedrooms, the land size, etc., which are of very different scales.

  • Normalization
    • Make the data have zero mean
    • Unit variance

  • Two places to normalize
    • Normalize the input
    • Normalize the internal layers

Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).

94

95 of 169

(Figure: a batch of activations arranged as batch size x number of dimensions; batch normalization normalizes each dimension across the batch.)
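The batch normalization transform from the paper cited on the previous slide, applied per dimension over a mini-batch B:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta

where \mu_B and \sigma_B^2 are the mini-batch mean and variance, and \gamma, \beta are learned scale and shift parameters.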

95

96 of 169

96

97 of 169

Results

  • Notice the massive increase in the convergence rate of the Batch Norm (BN) variants

97

98 of 169

Batch Normalization Popularity

  • Basically everyone uses batch normalization now!

Pytorch code:

torch.nn.BatchNorm1d(num_features)   # num_features = the size of the feature dimension being normalized

98

99 of 169

Both Theano and TensorFlow use the Idea of Computational Graphs

  • The computational graph is created by connecting symbolic expressions that take symbolic variables as inputs.
  • The symbolic expressions are very simple but can be used to build very complex functions.
  • A major benefit of this is that these systems can do automatic differentiation.

99

100 of 169

PyTorch

  • Developed by Facebook
  • Python frontend
  • Built on top of the Torch C++ backend
  • Does automatic differentiation
  • Dynamic graph calculation
  • More pythonic
  • Easy to debug (Python debugger pdb)
  • Almost all pretrained models are available:
    • AlexNet
    • ResNet
    • VGG
    • Faster R-CNN detector weights
    • etc.
  • Intuitive and easy to use
    • Overall people like it

100

101 of 169

Deep Dive into Pytorch

101

102 of 169

What is a Tensor?

  • A multi-dimensional array of numbers
    • 1D Tensor – Vector (Rank 1)
      • E.g. v = torch.tensor([1.1, 2.2, 3.3])
    • 2D Tensor – Matrix (Rank 2)
      • E.g. m = torch.tensor([ [1, 2, 3], [4, 5, 6], [7, 8, 9] ])
    • 3D Tensor – Cube of numbers ( Rank 3)
      • E.g. t = torch.tensor([[[2], [4], [6]], [[8], [10], [12]], [[14], [16], [18]]])


102

103 of 169

Some tensor processing

Permute:

x = torch.tensor([[4, 5, 8], [1, 2, 0]])
out = x.permute(1, 0)   # swap dimensions 0 and 1 (transpose)
print(out)              # tensor([[4, 1], [5, 2], [8, 0]])

Squeezing and unsqueezing:

x = torch.randn(4, 5)
out = x.unsqueeze(dim=0)   # add a dimension of size 1 at the front
print(out.shape)           # torch.Size([1, 4, 5])
out = out.squeeze(dim=0)   # remove it again
print(out.shape)           # torch.Size([4, 5])

Stacking a tensor list:

xs = []
xs.append(torch.tensor([4, 5]))
xs.append(torch.tensor([2, 1]))
xs.append(torch.tensor([0, 3]))
out = torch.stack(xs)      # tensor([[4, 5], [2, 1], [0, 3]])

Reshaping:

x = torch.randn(4, 5)
out = x.view(-1, 10)       # -1 lets PyTorch infer the first dimension
print(out.shape)           # torch.Size([2, 10])

103

104 of 169

How do we train in Pytorch?

    • Dataset – prepare the dataset
    • Dataloader – batch management, shuffling the data
    • Model – design a NN model
    • Train – train the model, evaluate

104

105 of 169

Writing code in pytorch

  • Custom dataloader part
  • Model part
  • Training part

105

106 of 169

Custom dataloader structure

  • __init__: load all annotations of examples and labels
  • __getitem__: return a single example with its label

106

107 of 169

Library and configuration

import torch
import os
from PIL import Image
from torch.utils.data.dataset import Dataset
import numpy as np
import csv
from torch.utils.data import DataLoader
from torchvision import transforms
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

TRAIN_IMG_FILE_ANNOTATION = './dataset/train.txt'
TRAIN_IMG_DIR = './dataset/train_img'
TEST_IMG_FILE_ANNOTATION = './dataset/test.txt'
TEST_IMG_DIR = './dataset/test_img'
NLABELS = 5
batch_size = 32
num_epoch = 4

107

108 of 169

Custom dataloader part

class DataPreparation(Dataset):
    def __init__(self, annotation_file, img_root_path, datatypes):
        self.img_root_path = img_root_path
        self.images = []
        self.labels = []
        if datatypes == "train":
            self.transforms = transforms.Compose([transforms.Resize((64, 64)),
                                                  transforms.RandomRotation((-4, 4)),
                                                  transforms.ToTensor()])
        if datatypes == "val":
            self.transforms = transforms.Compose([transforms.Resize((64, 64)),
                                                  transforms.ToTensor()])
        with open(annotation_file, 'r') as annotation_reader:
            annotation = csv.reader(annotation_reader, delimiter=',')
            for row in annotation:
                self.images.append(row[0])
                self.labels.append(int(row[1]))

    def __getitem__(self, index):
        img = Image.open(os.path.join(self.img_root_path, self.images[index]))
        img = img.convert('RGB')
        img = self.transforms(img)
        labels = self.labels[index]
        return {"img": img, "labels": labels}

    def __len__(self):
        return len(self.images)

train_dataset = DataPreparation(TRAIN_IMG_FILE_ANNOTATION, TRAIN_IMG_DIR, datatypes="train")
train_size = len(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
test_dataset = DataPreparation(TEST_IMG_FILE_ANNOTATION, TEST_IMG_DIR, datatypes="val")
test_size = len(test_dataset)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
dataloader = {"train": train_loader, "val": test_loader}
datasize = {"train": train_size, "val": test_size}

108

109 of 169

Model part

class MultiLabelNN(nn.Module):
    def __init__(self, nlabel):
        super(MultiLabelNN, self).__init__()
        self.nlabel = nlabel
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(10816, 512)   # 16 x 26 x 26 = 10816 for a 64 x 64 input
        self.fc2 = nn.Linear(512, nlabel)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.pool(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = x.view(-1, 10816)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return x

109

110 of 169

Training part

model = MultiLabelNN(NLABELS)
optimizer = optim.Adam(model.parameters(), lr=0.001)  # optimizer
criterion = nn.CrossEntropyLoss()  # loss function

for epoch in range(num_epoch):
    print("Epoch: {0}/{1}".format(epoch + 1, num_epoch))
    for phase in ["train", "val"]:
        if phase == "train":
            model.train()
        if phase == "val":
            model.eval()
        running_loss = 0.0
        running_correct = 0.0
        for index, data in enumerate(dataloader[phase]):
            images = data["img"]
            labels = data["labels"]
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            _, predicted = torch.max(outputs.data, 1)
            if phase == "train":
                loss.backward()
                optimizer.step()
            running_loss += loss.item() * labels.shape[0]
            running_correct += (predicted == labels).sum().item()
        if phase == "train":
            train_loss = running_loss / datasize["train"]
            train_accuracy = running_correct / datasize["train"]
            print("Training loss: {0}".format(train_loss))
            print("Training accuracy: {0}".format(train_accuracy))
        if phase == "val":
            test_loss = running_loss / datasize["val"]
            test_accuracy = running_correct / datasize["val"]
            print("Validation loss: {0}".format(test_loss))
            print("Validation accuracy: {0}".format(test_accuracy))

110

111 of 169

Use GPU in pytorch

  • Move both the model and the data onto the GPU

Code:

model=model.cuda()

……….

……….

images=data["img"].cuda()

labels=data["labels"].cuda()

111

112 of 169

Tensorboard code

from tensorboardX import SummaryWriter                  ## Add library

train_writer = SummaryWriter("./logs/train")            ## Create these two writers at the top
val_writer = SummaryWriter("./logs/val")

if phase == "train":
    train_loss = running_loss / datasize[phase]
    train_accuracy = running_correct / datasize[phase]
    train_writer.add_scalar("loss", train_loss, epoch)          ## Plot training loss via train_writer
    train_writer.add_scalar("accuracy", train_accuracy, epoch)  ## Plot training accuracy via train_writer
    print("Training loss: {0}".format(train_loss))
    print("Training accuracy: {0}".format(train_accuracy))
if phase == "val":
    validation_loss = running_loss / datasize[phase]
    validation_accuracy = running_correct / datasize[phase]
    val_writer.add_scalar("loss", validation_loss, epoch)            ## Plot validation loss via val_writer
    val_writer.add_scalar("accuracy", validation_accuracy, epoch)    ## Plot validation accuracy via val_writer

112

113 of 169

Tensorboard command

tensorboard --logdir="./logs" --port 6006

113

114 of 169

Tensorboard output & Early Stopping

(Figure: TensorBoard training and validation curves; stop training when the validation curve stops improving.)

114

115 of 169

Understand how a model is performing

(Figure: three loss-curve plots illustrating overfitting, underfitting and good fitting.)

115

116 of 169

Small data set?

  • What happens when you only have a small labelled data set but you want to use deep learning?

    • Transfer learning
      • Works really well.
      • Used very widely.

116

117 of 169

Transfer learning

Slide source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-transfer-learning-and-domain-adaptation-upc-2016

117

118 of 169

Slide source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-transfer-learning-and-domain-adaptation-upc-2016

118

119 of 169

Transfer Learning Training

  • Two options when transferring
    • Freeze the convolutional layers
    • Fine-tune the convolutional layers

119

120 of 169

Transfer learning code

from torchvision import models   # needed for the pretrained ResNet

class TModel(nn.Module):   ## Model
    def __init__(self, nlabel):
        super(TModel, self).__init__()
        self.resnet = models.resnet18(pretrained=True)
        self.sliced_resnet = torch.nn.Sequential(*(list(self.resnet.children())[:-1]))   # drop the final fc layer
        self.fc = nn.Linear(512, nlabel)

    def forward(self, x):
        x = self.sliced_resnet(x)
        x = x.view(-1, x.shape[1])
        x = self.fc(x)
        return x

model = TModel(NLABELS)
for param in model.resnet.parameters():
    param.requires_grad = False   ## Freeze part of the model

120

121 of 169

How to combine information

  • What if you have two separate pieces of information to feed in
    • E.g. Image and attribute (e.g. age)

  • Four options
    • Gating
    • Addition
    • Concatenation
      • Early fusion
      • Late fusion

121

122 of 169

Gating (one side acts like a switch)

  • A linear layer brings the attribute (e.g. age) to the same dimensionality as the image features.
  • A sigmoid layer on the attribute side acts like a switch: its values are between 0 and 1.
  • The gating function is an element-wise multiply operation: switch values such as (1.0, 0.2, 1.0, 0.8, 0.1, 0.3, 0.0) multiply image feature values such as (0.2, 2.1, 4.1, 2.2, 5.5, 6.1, 1.2).
  • This will have a large influence due to it being a switch.

122

123 of 169

Addition

(Figure: the image features and the attribute (e.g. age), after a linear layer to make the dimensionality the same, are combined by an element-wise add.)

123

124 of 169

Concatenation

(Figure: the image feature vector and the attribute (e.g. age) vector are concatenated.)

124

125 of 169

Types of concatenation

125

126 of 169

Gating, addition and concatenation code

class TModel(nn.Module):   ## Model
    def __init__(self, nlabel):
        super(TModel, self).__init__()
        self.resnet = models.resnet18(pretrained=True)
        self.sliced_resnet = torch.nn.Sequential(*(list(self.resnet.children())[:-1]))
        self.fc = nn.Sequential(nn.Linear(1, 512), nn.ReLU(inplace=True))

    def forward(self, image, age):
        f1 = self.sliced_resnet(image)
        f1 = f1.view(-1, f1.shape[1])   # image features
        f2 = self.fc(age)               # attribute features
        out = f1 * f2                   ## Gating of features f1 and f2
        return out

Adding: out = f1 + f2
Concatenation: out = torch.cat((f1, f2), dim=1)   # concatenate along the feature dimension, not the batch dimension

126

127 of 169

Autoencoders

  • Forcing the encoder to output a reduced-dimensionality representation forces the encoder to learn high-level features.
    • A good autoencoder will be able to reproduce the input almost perfectly despite the bottleneck layer z.
  • An L2 loss function is often used.

(Figure: convolutional encoder layers → reduced-dimensionality bottleneck layer z → up-convolutional (UpConv) decoder layers.)

127

128 of 169

Autoencoders – Why Use Them?

(Figure: the trained encoder weights give a compressed latent space; the trained weights can be reused for downstream tasks such as semantic segmentation.)

128

129 of 169

Autoencoder – Configuration and dataloader

import torch
import torchvision
from torch import nn
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.utils import save_image
from torchvision.datasets import MNIST

num_epoch = 100
batch_size = 128

img_transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])
train_dataset = MNIST('./data', transform=img_transform, train=True, download=True)
train_size = len(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
test_dataset = MNIST('./data', transform=img_transform, train=False, download=True)
test_size = len(test_dataset)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
dataloader = {"train": train_loader, "val": test_loader}
datasize = {"train": train_size, "val": test_size}

model = autoencoder()   # the autoencoder class is defined on the next slide
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

def to_img(x):
    x = 0.5 * (x + 1)       # undo the (0.5, 0.5) normalization
    x = x.clamp(0, 1)
    x = x.view(x.size(0), 1, 28, 28)
    return x

129

130 of 169

Autoencoder – Model

class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=3, padding=1),
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(16, 8, 3, stride=2, padding=1),
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=1)
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 16, 3, stride=2),
            nn.ReLU(True),
            nn.ConvTranspose2d(16, 8, 5, stride=3, padding=1),
            nn.ReLU(True),
            nn.ConvTranspose2d(8, 1, 2, stride=2, padding=1),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

130

131 of 169

Autoencoder – Training

for epoch in range(num_epoch):
    print("Epoch: {0}/{1}".format(epoch + 1, num_epoch))
    for phase in ["train", "val"]:
        if phase == "train":
            model.train()
        if phase == "val":
            model.eval()
        running_loss = 0.0
        for index, data in enumerate(dataloader[phase]):
            img, _ = data                   # the labels are not needed
            optimizer.zero_grad()
            output = model(img)
            loss = criterion(output, img)   # reconstruct the input itself
            if phase == "train":
                loss.backward()
                optimizer.step()
            running_loss += loss.item() * img.shape[0]
        if phase == "train":
            train_loss = running_loss / datasize[phase]
            print("Training loss: {0}".format(train_loss))
        if phase == "val":
            validation_loss = running_loss / datasize[phase]
            print("Validation loss: {0}".format(validation_loss))
    pic = to_img(output.data)
    save_image(pic, './dc_img/image_{}.png'.format(epoch))

torch.save(model.state_dict(), './conv_autoencoder.pth')

131

132 of 169

Why use GANs

  • Generative adversarial network

Style transfer

MuseGAN – Music generation

Generate new face

Image inpainting

Deep fake videos

132

133 of 169

How do GANs work?


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

133

134 of 169


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

134

135 of 169


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

135

136 of 169


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

136

137 of 169


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

137

138 of 169


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

Change generator weights to make the generated examples look more real.

138

139 of 169

GAN – Configuration and dataloader

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision import datasets
from torchvision.utils import save_image

def to_img(x):
    out = 0.5 * (x + 1)
    out = out.clamp(0, 1)
    out = out.view(-1, 1, 28, 28)
    return out

batch_size = 128
num_epoch = 100
z_dimension = 100   # noise dimension

img_transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])  # MNIST has a single channel
mnist = datasets.MNIST('./data', transform=img_transform)
dataloader = DataLoader(mnist, batch_size=batch_size, shuffle=True, num_workers=0)

139

140 of 169

GAN – Model

class discriminator(nn.Module):
    def __init__(self):
        super(discriminator, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(1, 32, 5, padding=2),   # batch, 32, 28, 28
            nn.LeakyReLU(0.2, True),
            nn.AvgPool2d(2, stride=2),        # batch, 32, 14, 14
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(32, 64, 5, padding=2),  # batch, 64, 14, 14
            nn.LeakyReLU(0.2, True),
            nn.AvgPool2d(2, stride=2)         # batch, 64, 7, 7
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 7 * 7, 1024),
            nn.LeakyReLU(0.2, True),
            nn.Linear(1024, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        '''
        x: batch, width, height, channel=1
        '''
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

class generator(nn.Module):
    def __init__(self, input_size, num_feature):
        super(generator, self).__init__()
        self.fc = nn.Linear(input_size, num_feature)   # batch, 3136 = 1 x 56 x 56
        self.br = nn.Sequential(
            nn.BatchNorm2d(1),
            nn.ReLU(True)
        )
        self.downsample1 = nn.Sequential(
            nn.Conv2d(1, 50, 3, stride=1, padding=1),   # batch, 50, 56, 56
            nn.BatchNorm2d(50),
            nn.ReLU(True)
        )
        self.downsample2 = nn.Sequential(
            nn.Conv2d(50, 25, 3, stride=1, padding=1),  # batch, 25, 56, 56
            nn.BatchNorm2d(25),
            nn.ReLU(True)
        )
        self.downsample3 = nn.Sequential(
            nn.Conv2d(25, 1, 2, stride=2),              # batch, 1, 28, 28
            nn.Tanh()
        )

    def forward(self, x):
        x = self.fc(x)
        x = x.view(x.size(0), 1, 56, 56)
        x = self.br(x)
        x = self.downsample1(x)
        x = self.downsample2(x)
        x = self.downsample3(x)
        return x

140

141 of 169

GAN - Training

D = discriminator()                  # discriminator model
G = generator(z_dimension, 3136)     # generator model
criterion = nn.BCELoss()             # binary cross entropy
d_optimizer = torch.optim.Adam(D.parameters(), lr=0.0003)
g_optimizer = torch.optim.Adam(G.parameters(), lr=0.0003)

# train
for epoch in range(num_epoch):
    for i, (img, _) in enumerate(dataloader):
        num_img = img.size(0)
        # ================= train discriminator
        real_img = img
        real_label = torch.ones(num_img, 1)    # shaped to match D's (batch, 1) output
        fake_label = torch.zeros(num_img, 1)
        # compute loss of real_img
        real_out = D(real_img)
        d_loss_real = criterion(real_out, real_label)
        real_scores = real_out                 # closer to 1 means better
        # compute loss of fake_img
        z = torch.randn(num_img, z_dimension)
        fake_img = G(z)
        fake_out = D(fake_img)
        d_loss_fake = criterion(fake_out, fake_label)
        fake_scores = fake_out                 # closer to 0 means better
        # bp and optimize
        d_loss = d_loss_real + d_loss_fake
        d_optimizer.zero_grad()
        d_loss.backward()
        d_optimizer.step()
        # ================= train generator
        # compute loss of fake_img
        z = torch.randn(num_img, z_dimension)
        fake_img = G(z)
        output = D(fake_img)
        g_loss = criterion(output, real_label)   # fool D: make fakes score as real
        # bp and optimize
        g_optimizer.zero_grad()
        g_loss.backward()
        g_optimizer.step()
        if (i + 1) % 100 == 0:
            print('Epoch [{}/{}], d_loss: {:.6f}, g_loss: {:.6f} '
                  'D real: {:.6f}, D fake: {:.6f}'
                  .format(epoch, num_epoch, d_loss.item(), g_loss.item(),
                          real_scores.data.mean(), fake_scores.data.mean()))
    if epoch == 0:
        real_images = to_img(real_img.cpu().data)
        save_image(real_images, './dc_img/real_images.png')
    fake_images = to_img(fake_img.cpu().data)
    save_image(fake_images, './dc_img/fake_images-{}.png'.format(epoch + 1))

torch.save(G.state_dict(), './generator.pth')
torch.save(D.state_dict(), './discriminator.pth')

141

142 of 169

Understanding GAN loss
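The underlying minimax objective from the original GAN paper, which the code above optimizes in alternating steps:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

The discriminator loss in the code is the binary cross entropy form of the inner maximization; the generator step uses the common "non-saturating" variant, maximizing log D(G(z)) rather than minimizing log(1 - D(G(z))).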

142

143 of 169

Object Detection Algorithms

Object Detector

  • Region proposal based
    • Faster R-CNN (2015)
    • Mask R-CNN (2017)
    • FPN (2017)
  • Regression based
    • YOLO (2016)
    • SSD (2016)
    • AttentionNet (2015)
143

144 of 169

Intersection over union (IOU)
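IoU = area of overlap / area of union. A small sketch for axis-aligned boxes given as (x1, y1, x2, y2) (the box format is an assumption):

def iou(a, b):
    # intersection rectangle (zero if the boxes do not overlap)
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)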

144

145 of 169

Faster RCNN Detector

145

146 of 169

Faster RCNN Detector

146

147 of 169

Long Short Term Memory (LSTM)

  • I will give a very simplified description of how a Long Short Term Memory (LSTM) works

  • In practice when using RNNs, people almost always use LSTMs or some variant of them such as the GRU

147

148 of 169

Recurrent Neural Network Configurations

(Figure: RNN configurations; the one-to-one case is like what regular neural networks can do.)

148

149 of 169

We will start with many to many

149

150 of 169

Many to Many (Generative Model)

  • The RNN can be trained by reading lots and lots of text (e.g. Wikipedia) one character at a time
  • Once trained, the RNN can be given a few start characters and asked to generate more text.
  • For example, an RNN trained on Shakespeare’s works can generate convincing Shakespeare-like text.

(Figure: the RNN reads "h", "e", "l", "l" one character at a time and is trained to predict the next character at each step: "e", "l", "l", "o".)

150

151 of 169

Many to Many (Generative Model)

  • Word at a time input instead of character at a time input

(Figure: the RNN reads "The cat sat down" one word at a time and predicts the next word at each step: "cat sat down .".)

151

152 of 169

Recurrent Neural Network Configurations

152

153 of 169

One to Many (Image Caption Generation)

153

154 of 169

Recurrent Neural Network Configurations

154

155 of 169

Many to One – example 1

  • Example use
    • Text classification
    • Sentiment analysis

(Figure: the RNN reads "The best acting ever" and outputs a single label: Positive.)

155

156 of 169

Many to One – example 2

  • Example use
    • Video classification

(Figure: a CNN encodes each video frame; the RNN consumes the sequence of CNN features and outputs a single label, e.g. "Playing Basketball".)

156

157 of 169

Many to One – example 3

  • Example use
    • Tracking

157

158 of 169

Recurrent Neural Network Configurations

158

159 of 169

Sequence to Sequence Transformations

159

160 of 169

Many to Many (Sequence to Sequence Learning)

  • Uses two different LSTM RNNs
    • One encoder
    • One decoder
    • Using two different LSTMs means we get more parameters at low cost
    • Can also be used to train on multiple languages at the same time.
  • The thought vector encodes the concept that the encoder outputs.
  • The end of a sequence is detected when <EOS> is found.
  • The decoder's output is fed back into itself as input until <EOS> is output.

(Figure: Encoder → thought vector → Decoder.)

160

161 of 169

Problem with this model

  • The problem with this model is that all the information must be stored in a single fixed-size vector.
  • This model has trouble translating sentences that are longer than around 20 words.

(Figure: Encoder → thought vector → Decoder.)

161

162 of 169

The solution is to use an attention model

162

163 of 169

163

164 of 169

END

164

165 of 169

Limitations of deep learning

    • Limitation:
      • Cognitive awareness of the environment
      • Integrated understanding of the environment

165

166 of 169

Deep Learning Team

  • Associate Professor Zhen He

  • Associate Professor Stuart Morgan
    • Expert in computer vision for sports analytics
    • Now working for the AIS

  • Dr Matthias Langer
    • Completed PhD (distributed deep learning)

  • Aiden Nibali
    • Completed PhD (2D and 3D pose estimation)

  • Ash Hall
    • Research assistant (Automatic swimming annotation)

  • Brandon Victor
    • Research assistant (Automatic swimming annotation)

166

167 of 169

Deep Learning Team

  • Haritha Thilakarathne
    • PhD student (Using deep learning to retrieve and detect group movement characteristics in sports)
  • Sohel Rana
    • PhD student (Using deep learning to classify and temporally localize individual actions in sports)

  • Josh Millward
    • Honours (Plant phenotyping and meta learning for semi-supervised learning)
  • Albert Bonela
    • Masters of data science (Linking plant phenotypes to genetics)

  • Neha Neha
    • Masters of IT (Plant phenotyping and Telstra)

  • Richard Morton
    • Bachelor of Computer Science (Plant phenotyping and Telstra)

167

168 of 169

Deep Learning Journal Club

  • At La Trobe we have a deep learning journal club
  • We usually present two to three papers per week
  • It started in July 2015.
  • We have presented approximately 210 papers in total so far
  • Topics include
    • Computer vision
    • Natural Language processing
    • Reinforcement learning

168

169 of 169

Questions

169