1 of 169

Deep Learning in a Nutshell

Md Sohel Rana

(Graduate Researcher)

La Trobe University

1

2 of 169

How Do We Learn?

  • Do we get our knowledge from tables and spreadsheets?

  • This is what traditional machine learning is mainly used for.

Slide from Zhen He

2

3 of 169

How Do We Learn?

  • Listening
    • Deep Learning is best for speech recognition
    • 30% better performance than non-deep learning techniques
• MuseNet – music generation

Slide from Zhen He

3

4 of 169

How Do We Learn?

  • Looking
    • Deep Learning is best for computer vision
    • Over 100% better performance than non-deep learning algorithms

Slide from Zhen He

4

5 of 169

How Do We Learn?

  • Reading text
    • Deep Learning is becoming the best for natural language understanding/processing
• Language translation, word embeddings, thought vectors, question answering, etc.

Slide from Zhen He

5

6 of 169

Deep Learning versus Traditional Machine Learning

  • Traditional machine learning
    • Structured data
    • Hand engineered features
    • Fixed length input and output

  • Deep learning
    • Unstructured data
    • Feature learning
    • Variable length input and output

Slide from Zhen He

6

7 of 169

Image representation

7

8 of 169

Hand Engineered Features

Slide from Zhen He

8

9 of 169

Image Recognition: Hand Crafted Features

[Andrew Ng, UCLA deep learning summer school 2012]

9

10 of 169

Traditional Machine Learning

• Traditional machine learning assumes the features are given to it

(Figure: Input → Feature Representation → Learning algorithm, where only the learning algorithm is the "machine learning" part.)

Slide from Zhen He

10

11 of 169

Traditional Machine Learning

(Figure: hand engineered features feed a machine learning algorithm that outputs prob(car) and prob(motorbike).)

Slide from Zhen He

11

12 of 169

Feature Learning

  • Extend machine learning to include learning the feature representation
  • Learn features automatically

(Figure: Input → Feature Representation → Learning algorithm, where machine learning now covers both the feature representation and the learning algorithm.)

Slide from Zhen He

12

13 of 169

Feature Learning

  • During training the deep learning algorithm automatically builds a hierarchy of features.
  • The top levels are high level features of the input data.
  • Classifying using high level features is much easier.

(Figure: the network outputs prob(car) and prob(motorbike) from high level features such as X1 (number of lights) and X2 (number of handlebars); in this feature space a simple decision boundary separates cars from motor bikes.)

13

14 of 169

Playing Lego

  • Deep learning algorithms are composed of many different plug and playable blocks.
    • E.g. one block turns an image into a vector; a 2nd block turns the vector into text.

(Figure: Encode image (CNN) → vector → decode to text (NN) → "Small bird with orange chest". Lego blocks connect via vectors.)

Slide from Zhen He

14

15 of 169

Deep Learning Lego

(Figure: English sentence → Encode English → vector → Decode to Spanish → Spanish sentence.
English sentence → Encode English → vector → Decode to German → German sentence.)

Slide from Zhen He

15

16 of 169

Many many Lego pieces

  • A combinatorial number of algorithms can be created by recombining the blocks

16

17 of 169

Who is Using Deep Learning?

17

18 of 169

Who is using Deep Learning?

  • Google Search
    • RankBrain
      • Ranking using deep learning

  • Google's traditional rule-based search team was initially reluctant to use deep learning for search
  • Head to head
    • RankBrain versus the traditional Google search algorithm
      • RankBrain: 80% correct
      • Traditional Google search algorithm: 70% correct
  • Next-generation Google Assistant
    • More automation
    • Real-time translation thanks to a compressed model downloaded to the Android phone

  • http://www.wired.com/2016/02/ai-is-changing-the-technology-behind-google-searches/

Slide from Zhen He

18

19 of 169

Language Translation

  • Google is using deep learning for language translation

Reduces translation errors by an average of 60% compared to Google's phrase-based production system.

Slide from Zhen He

19

20 of 169

Who is using Deep Learning?

  • Google Android Speech Recognition

Deep recurrent neural networks reduced word errors by more than 30%

Slide from Zhen He

20

21 of 169

Who is using Deep Learning?

  • Google has over 1500 deep learning projects
  • Facebook AI research
  • Jeff Dean
    • Co-creator of MapReduce; leads deep learning at Google Brain

21

22 of 169

Who is using Deep Learning?

  • Google CEO Sundar Pichai
    • Over time computers in all shapes (mobile device, watch, in car, etc.) will act as an intelligent assistant helping you through your day.
  • Using deep learning, personal assistants will become much smarter and better at speech recognition
    • Microsoft Cortana, Facebook M
    • Next-generation Google Assistant, Amazon Echo

22

23 of 169

Who is using Deep Learning?

  • Self driving cars
    • Google (Waymo)
    • Nvidia

23

24 of 169

Just getting started

  • Deep learning has only been around for about 7 years!
  • Many many more new applications coming!

24

25 of 169

Introduction to Neural Networks

25

26 of 169

Neural network

(Figure: a neuron computes a weighted sum a(x) followed by an activation h(x).)

26

27 of 169

Activation Function

(Figure: the activation function is usually drawn this way.)

27

28 of 169

28

29 of 169

XOR solution

XOR truth table (4 examples):

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

(Figure: a small network solving XOR, with weights W11, W12, W21, W22 from the inputs to two hidden units and Wh1, Wh2 from the hidden units to the output.)

29

30 of 169

Deep Learning

  • Using more layers you can learn much more complex representations of the data.
    • Effectively much more complex decision boundaries

http://playground.tensorflow.org/

(Figure: a deep network mapping inputs to outputs through L weight layers W1, W2, …, WL.)

Slide from Zhen He

30

31 of 169

Forward Pass

  • Take the inputs x and then compute the activations of the hidden units
  • Then compute the outputs from the hidden activations

Slide from Zhen He

31

32 of 169

Loss Function

  • Compare the outputs to the ground truth labels y1 … yK; how they are compared depends on the chosen loss function

  • N is the number of training examples
  • K is the number of outputs
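One common choice (a sketch, not necessarily this slide's exact formula) is the mean squared error over all N examples and K outputs:

L = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left(\hat{y}_k^{(n)} - y_k^{(n)}\right)^2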

32

33 of 169

Back Propagation

  • Use the calculus chain rule to propagate the gradient of the loss from the output all the way back to all the weights.
  • Adjust the weights using the gradients.
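For a weight w deep in the network, the chain rule multiplies the local derivatives along the path from w to the loss, for example:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w}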


33

34 of 169

Optimization: Gradient Descent

  • The way to minimize the loss function is to find the gradient (slope) of the loss with respect to the weight W.
  • Then move in the negative direction of the slope, so you descend the loss function.

(Figure: the loss Loss(W) plotted against a weight W; follow the negative slope downhill.)

The update rule: W ← W - η · ∂Loss/∂W, where η is the learning rate.

Convex optimization problem (in this one-weight illustration; real deep-network losses are not convex)

34

35 of 169

The optimal learning rate

  • Very small learning rate: converges too slowly.
  • Very large learning rate: jumps out of the minima.

(The size of each update step is the gradient scaled by the learning rate η.)

35

36 of 169

Popular Optimization Method

  • Stochastic Gradient Descent (SGD)
    • Manual learning-rate schedule. For example:
      • Epochs 1–18: learning rate = 1e-2
      • Epochs 19–29: learning rate = 5e-3
      • Epochs 30–43: learning rate = 1e-3
      • Epochs 44–52: learning rate = 5e-4
    • Momentum

(The learning rate goes down over the course of training.)

36

37 of 169

Popular Optimization Methods

  • Adam
    • Automatically adjusts learning rate
    • A good initial learning rate is 1e-4
    • Too high an initial learning rate can lead to bad results

  • RMSProp
    • Also adjusts learning rate automatically.

37

38 of 169

Tuning Initial Learning Rate

  • Usually it is a good idea to try several different initial learning rates to see which one performs the best.
    • 1Cycle learning rate finder

38

39 of 169

Three Types of Gradient Descent

  • Full batch gradient descent
    • Use all examples in each iteration

  • Stochastic gradient descent
    • Use 1 example in each iteration

  • Mini-batch gradient descent
    • Use b examples in each iteration
    • This is the one that is most commonly used for deep learning
      • Usually b is around 128.
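A minimal sketch of one epoch of mini-batch gradient descent in PyTorch (model, loss_fn, optimizer and the tensors X, y are assumed to already exist):

import torch

def run_epoch(model, loss_fn, optimizer, X, y, b=128):
    perm = torch.randperm(X.shape[0])        # shuffle once per epoch
    for i in range(0, X.shape[0], b):        # b examples per iteration
        idx = perm[i:i + b]
        optimizer.zero_grad()                # clear old gradients
        loss = loss_fn(model(X[idx]), y[idx])
        loss.backward()                      # backpropagate
        optimizer.step()                     # take one descent step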

39

40 of 169

Back Propagation

(Figure: the red numbers are gradients.)

40

41 of 169

Gradient At Branches

(Figure: a value c branches into two paths p and q on the way to the loss; during backpropagation the gradients arriving from the branches are summed: ∂Loss/∂c = ∂Loss/∂p · ∂p/∂c + ∂Loss/∂q · ∂q/∂c.)

41

42 of 169

The Softmax Layer and the Cross Entropy Loss Function

42

43 of 169

How to Remember Cross Entropy Loss

  • The cross-entropy loss is the negative of the one-hot encoded true label multiplied by the log of the predicted probability, summed over the classes.
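Written out, with y the one-hot target vector and p the vector of predicted probabilities over K classes:

L = -\sum_{k=1}^{K} y_k \log(p_k)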

43

44 of 169

Consider the Following Classification Problem

  • ab – activation for banana
  • ag – activation for grape
  • ap – activation for pear

  • pb – predicted probability for banana
  • pg – predicted probability for grape
  • pp – predicted probability for pear

(Figure: the input X and weights W feed the neural network, which outputs the activations ab, ag, ap; a softmax converts these into the probabilities.)

44

45 of 169

Consider the Following Classification Problem

  • ab – activation for banana
  • ag – activation for grape
  • ap – activation for pear

  • pb – predicted probability for banana
  • pg – predicted probability for grape
  • pp – predicted probability for pear

  • The way we compute the probability of class banana given X and the weights W using the softmax function is as follows:

pb = e^ab / (e^ab + e^ag + e^ap)

(Figure: the network outputs the logits ab, ag, ap; softmax turns them into probabilities, e.g. (0.2, 0.1, 0.7).)

45

46 of 169

What if I don’t care about the maths and just want to use it?

  • Targets should be one-hot encoded. For example:
    • Banana: 1, 0, 0
    • Pear: 0, 1, 0
    • Apple: 0, 0, 1
  • Each target should be mutually exclusive
    • For example the object is either a banana or a pear. It cannot be both.

(Figure: the neural network layers output the logits (5.0, 2.0, 1.0); softmax converts them to probabilities (0.94, 0.04, 0.02); against the one-hot encoded target (1, 0, 0) the cross entropy loss is 0.067.)

Pytorch code:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # softmax + cross entropy in one; targets are class indices, so no one-hot encoding is needed
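A minimal usage sketch matching the example above (the logit values and the target class index are taken from the figure):

logits = torch.tensor([[5.0, 2.0, 1.0]])   # one example, three classes
target = torch.tensor([0])                 # class index 0 ("banana") instead of the one-hot (1, 0, 0)
loss = criterion(logits, target)
print(loss.item())                         # approximately 0.066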

46

47 of 169

Regression Loss Function

  • When you want to predict a real value instead of a class (regression), people usually use the mean squared error (L2) loss function
  • The L1 (absolute error) loss is a common alternative
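Both are one-liners in PyTorch; a small sketch with made-up prediction and target values:

import torch
import torch.nn as nn

pred = torch.tensor([2.5, 0.0])
target = torch.tensor([3.0, -0.5])
print(nn.MSELoss()(pred, target))  # mean squared error: (0.5^2 + 0.5^2) / 2 = 0.25
print(nn.L1Loss()(pred, target))   # mean absolute error: (0.5 + 0.5) / 2 = 0.5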

47

48 of 169

Activation Functions

48

49 of 169

Sigmoid Function

49

50 of 169

Sigmoid Function

Saturated neurons kill the gradients
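The definitions behind this point:

\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \le \frac{1}{4}

For inputs of large magnitude the neuron saturates near 0 or 1, so σ'(x) ≈ 0 and almost no gradient flows back through it.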

50

51 of 169

Rectified Linear Unit (ReLU)
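For reference, the definition:

\mathrm{ReLU}(x) = \max(0, x)

Its gradient is 1 for x > 0 and 0 otherwise, so active units pass gradients through without attenuation and do not saturate for positive inputs.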

51

52 of 169

Convolutional Neural Networks

52

53 of 169

Convolutional Neural Networks

  • Sliding window of image patches to detect each feature
    • The neurons of the same feature share the same weights
  • The output feature map tells us how well the filter matches the input image patch

(Figure: several different filters slide over the same input; each filter produces its own output feature map.)

53

54 of 169

What is the result of training?

  • These names all mean the same
    • Filter / Kernel / Weights
  • The filter contains the learnt weights
    • During training the neural network will determine which filters it should use to give the highest classification accuracy.
  • Spatial invariance
    • Sliding the same filter across the entire image means you look for the same feature anywhere in the image.

(Figure: each filter / kernel / weight set is slid over the input and produces its own output feature map.)
54

55 of 169

Spatial Invariance

  • Even though the ball only appears in the top left in the training set, the filters we learn during training can be used to detect the ball in the bottom right in the testing set.

Training Set

Testing Set

55

56 of 169

Multiple layers

(Figure: the output feature map of layer 1 becomes the input for the next layer, where another filter produces the layer-2 output feature map.)

56

57 of 169

How to Compute a Convolution?

  • We slide the filter across the input feature map and compute the dot product between the filter and the current window of values in the input feature map.

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Filter:

 0.2   1.2
 3.1   1.1

Output Feature Map (first value):

 5.81 = 0.2 * 1.2 + 1.2 * 2.2 + 3.1 * 0.2 + 1.1 * 2.1
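This can be checked in PyTorch (a sketch; note that deep-learning "convolution" layers actually compute cross-correlation, which is also what the slide computes):

import torch
import torch.nn.functional as F

x = torch.tensor([[ 1.2,  2.2,  1.2,  3.1],
                  [ 0.2,  2.1,  2.5, -1.2],
                  [-1.1,  2.3,  1.1, -0.2],
                  [ 1.2,  2.1, -2.3, -1.1]]).reshape(1, 1, 4, 4)  # batch, channel, H, W
w = torch.tensor([[0.2, 1.2],
                  [3.1, 1.1]]).reshape(1, 1, 2, 2)                # out_ch, in_ch, kH, kW
print(F.conv2d(x, w))  # a 3x3 output feature map whose top-left value is 5.81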

57

58 of 169

How to Compute a Convolution?

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Filter:

 0.2   1.2
 3.1   1.1

Output Feature Map (so far):

 5.81   11.14

 11.14 = 0.2 * 2.2 + 1.2 * 1.2 + 3.1 * 2.1 + 1.1 * 2.5

58

59 of 169

Padding with Zeros

  • Padding the input feature map with zeros allows you to output a larger feature map.

Input Feature Map (padded with zeros):

 0     0     0     0     0     0
 0     1.2   2.2   1.2   3.1   0
 0     0.2   2.1   2.5  -1.2   0
 0    -1.1   2.3   1.1  -0.2   0
 0     1.2   2.1  -2.3  -1.1   0
 0     0     0     0     0     0

(Figure: a filter applied to the padded input produces a larger output feature map than it would without padding.)

59

60 of 169

Convolution generates an output feature map

https://community.arm.com/graphics/b/blog/posts/when-parallelism-gets-tricky-accelerating-floyd-steinberg-on-the-mali-gpu

Output feature map

60

61 of 169

The depth represents three feature maps / channels concatenated together. Usually in the first layer these three channels represent the three colour channels: R, G, B.

61

62 of 169

62

63 of 169

63

64 of 169

64

65 of 169

65

66 of 169

Layer 1 and 2 of CNN

  • Left shows the filters
  • Right shows which parts of the image are most strongly activated by each filter

source: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html

66

67 of 169

Layer 3 of CNN

source: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html

67

68 of 169

Layer 4 and 5 of CNN

source: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html

68

69 of 169

Convolution with Larger Stride

  • In this example we will slide the window across with a stride of 2.

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Filter:

 0.2   1.2
 3.1   1.1

Output Feature Map (first value):

 5.81 = 0.2 * 1.2 + 1.2 * 2.2 + 3.1 * 0.2 + 1.1 * 2.1

69

70 of 169

Convolution with Larger Stride

  • In this example we will slide the window across with a stride of 2.

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Filter:

 0.2   1.2
 3.1   1.1

Output Feature Map (stride 2, so the window jumps two columns):

 5.81   10.39

 10.39 = 0.2 * 1.2 + 1.2 * 3.1 + 3.1 * 2.5 + 1.1 * -1.2

70

71 of 169

Pooling / Subsampling

  • In order to reduce the size of feature maps, pooling/subsampling is often used.
  • Max pooling is the most popular type of pooling.
    • Allows a certain amount of translational invariance
  • In max pooling the maximum value within each window is output to the output feature map.
    • Other types of pooling include average pooling, etc.
  • In this example we apply max pooling with a 2 x 2 pooling window and a stride of 2.

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Output Feature Map (first value):

 2.2   (the max of the top-left 2 x 2 window)

71

72 of 169

Pooling / Subsampling

Input Feature Map:

 1.2   2.2   1.2   3.1
 0.2   2.1   2.5  -1.2
-1.1   2.3   1.1  -0.2
 1.2   2.1  -2.3  -1.1

Output Feature Map (so far):

 2.2   3.1
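The same operation in PyTorch (a sketch, reusing the 4 x 4 input from the convolution example):

import torch
import torch.nn.functional as F

x = torch.tensor([[ 1.2,  2.2,  1.2,  3.1],
                  [ 0.2,  2.1,  2.5, -1.2],
                  [-1.1,  2.3,  1.1, -0.2],
                  [ 1.2,  2.1, -2.3, -1.1]]).reshape(1, 1, 4, 4)
print(F.max_pool2d(x, kernel_size=2, stride=2))  # [[2.2, 3.1], [2.3, 1.1]]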

72

73 of 169

A Typical Convolutional Network

  • Notice that as we move deeper into the network the feature maps get smaller and there are more feature maps per layer.
    • This is analogous to having higher-level concepts which cover larger regions, and more of them (one concept per feature map)
  • All the convolutional layers together form the feature extractor
  • The fully connected layer takes the features as input and then outputs the classification

73

74 of 169

Where to use Fully Connected Layers (Linear Layers)?

  • Suppose we take an image as input and we want to predict which of 3 possible objects it is.
  • We should use a linear layer to make the network output 3 predicted values
    • After we pass that through a softmax we will have a predicted probability for each class.

(Figure: convolutional layers → linear layer → logits (5.0, 2.0, 1.0) → softmax → probabilities (0.94, 0.04, 0.02) for banana, pear and grapes.)

74

75 of 169

Fully Connected Network (Linear Layer)

  • Massive number of weights
    • Every input neuron is connected to every output neuron via a weight

    • M x N weights
      • M input neurons
      • N output neurons

(Figure: input neurons fully connected to output neurons, with a weight wij for each connection.)

75

76 of 169

Convolutional layers are much more efficient than fully connected layers

It is common for fully connected layers to hold around 99% of a network's weights, while the convolutional layers hold around 1%.

76

77 of 169

Convolutional Layers Have Small Number of Weights

  • Convolution layers do massive weight sharing
  • Using just 3 (width) x 3 (height) x 3 (number of input feature maps) x 6 (number of output feature maps) weights = 162 weights
    • Connects 224 x 224 x 3 (150,528) input neurons to 222 x 222 x 6 (295,704) output neurons.
  • If we were to use a fully connected layer to do the same then we would need 150,528 x 295,704 = 44,511,731,712 weights!!!

(Figure: a 224 x 224 x 3 input convolved with a 3 x 3 x 3 x 6 filter produces a 222 x 222 x 6 output.)
77

78 of 169

Three Famous Convolutional Architectures

  • AlexNet
    • 2012
    • 8 layers
    • ImageNet top 5 error: 16.4%
  • VGG
    • 2014
    • 19 layers
    • ImageNet top 5 error: 6.8%
  • ResNet
    • Nov 2015
    • 152 layers
    • ImageNet top 5 error: 3.57%

78

79 of 169

AlexNet

79

80 of 169

AlexNet

  • ImageNet Classification with Deep Convolutional Neural Networks
    • Alex Krizhevsky et al.
    • NIPS 2012
  • The paper that started it all in 2012
  • Ground breaking for a number of reasons
    • ReLU
    • Dropout
    • Using GPUs

80

81 of 169

VGG

81

82 of 169

VGG

  • Very Deep Convolutional Networks for Large Scale Image Recognition
    • Karen Simonyan et al.
    • ICLR 2015
    • Oxford computer vision group
  • Won the ImageNet ILSVRC-2014 localization competition
  • Second in the ImageNet ILSVRC-2014 classification competition
    • Error rate of 6.8%

82

83 of 169

VGG16 architecture

83

84 of 169

VGG

  • Much deeper than AlexNet
    • Up to 19 layers instead of 8 layers
    • They can afford the depth due to the much smaller filters
    • The number of parameters is not that much larger than AlexNet's
      • VGG: 133–144 million
      • AlexNet: 60 million

84

85 of 169

ResNet!

85

86 of 169

Deep Residual Learning for Image Recognition

  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
  • arXiv 2015

  • With existing techniques, accuracy saturates when going deeper
  • The degradation is not caused by overfitting!
  • Adding more layers leads to higher training error.
  • (The paper's experiments demonstrate this degradation.)

86

87 of 169

Wouldn’t it be cool!

  • Wouldn’t it be cool to be able to shut down entire layers based on the input?
    • i.e. selectively keep layers depending on the input

(Figure: different inputs A and B activate different subsets of layers.)

87

88 of 169

One Simple Awesome Trick! (deep residual networks)

  • Using the above approach we can produce much deeper nets
    • Successfully trained over 100 layers and explored over 1000 layers
  • ImageNet
    • 152-layer residual net
    • Lower complexity than VGG
    • 3.57% top-5 error
  • 1st place in the 2015 results (ImageNet classification, ImageNet detection, ImageNet localization, COCO detection, COCO segmentation)
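A minimal sketch of the trick itself: a residual block whose skip connection lets the layers default to the identity (a simplified two-convolution block, not the paper's exact architecture):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv2(F.relu(self.conv1(x)))
        return F.relu(out + x)   # skip connection: the layers only learn a residual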

88

89 of 169

Generalization

  • Data augmentation
  • Dropout

89

90 of 169

Data Augmentation

  • Increase data size to help generalization
  • Types of augmentation
    • Random flip
    • Random rotate
    • Random scale
    • Random noise
    • Random crop
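A sketch of these augmentations using torchvision transforms (the specific parameter values are illustrative assumptions):

from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),   # random flip
    transforms.RandomRotation(10),       # random rotate, up to 10 degrees
    transforms.RandomResizedCrop(64),    # random scale + random crop
    transforms.ToTensor(),
])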

90

91 of 169

Dropout Method

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. R. (2012), Improving neural networks by preventing co-adaptation of feature detectors

91

92 of 169

Amount of Dropout

  • Dropout is usually applied to the fully connected layers rather than the initial layers
    • The initial layers learn low-level features
  • Usually dropout is applied before the activation function
  • A typical dropout probability is 0.5

Dropout code in pytorch: nn.Dropout(0.5)

92

93 of 169

Use Dropout to Avoid Overfitting

93

94 of 169

Normalization is Important

  • Deep learning algorithms train much better and faster when activations are normalized.

  • When the data is normalized, the weights of the neural network can stay in a more reasonable range and have close to a normal distribution.
    • E.g. trying to predict the price of a house: features include the number of bedrooms, the land size, etc., which are of very different scales.

  • Normalization
    • Make the data have zero mean
    • Unit variance

  • Two places to normalize
    • Normalize the input
    • Normalize the internal layers

Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).

94

95 of 169

(Figure: a batch of activations arranged as batch size x number of dimensions; batch normalization normalizes each dimension across the batch.)
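The batch normalization transform from the paper cited on the previous slide, applied per dimension over a mini-batch B:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta

where \mu_B and \sigma_B^2 are the mini-batch mean and variance, and \gamma, \beta are learned scale and shift parameters.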

95

96 of 169

96

97 of 169

Results

  • Notice the massive increase in the convergence rate of the Batch Norm (BN) variants

97

98 of 169

Batch Normalization Popularity

  • Basically everyone uses batch normalization now!

Pytorch code:

torch.nn.BatchNorm1d(num_features)   # num_features = the size of the feature dimension being normalized

98

99 of 169

Both Theano and TensorFlow use the Idea of Computational Graphs

  • The computational graph is created by connecting symbolic expressions that take symbolic variables as inputs.
  • The symbolic expressions are very simple but can be used to build very complex functions.
  • A major benefit of this is that these systems can do automatic differentiation.

99

100 of 169

PyTorch

  • Developed by Facebook
  • Python frontend
  • Built on top of the Torch C++ backend
  • Does automatic differentiation
  • Dynamic graph calculation
  • More pythonic
  • Easy to debug (Python debugger pdb)
  • Almost all pretrained models are available:
    • AlexNet
    • ResNet
    • VGG
    • Faster R-CNN detector weights
    • etc.
  • Intuitive and easy to use
    • Overall people like it

100

101 of 169

Deep Dive into Pytorch

101

102 of 169

What is a Tensor?

  • A multi-dimensional array of numbers
    • 1D Tensor – Vector (Rank 1)
      • E.g. v = torch.tensor([1.1, 2.2, 3.3])
    • 2D Tensor – Matrix (Rank 2)
      • E.g. m = torch.tensor([ [1, 2, 3], [4, 5, 6], [7, 8, 9] ])
    • 3D Tensor – Cube of numbers ( Rank 3)
      • E.g. t = torch.tensor([[[2], [4], [6]], [[8], [10], [12]], [[14], [16], [18]]])


102

103 of 169

Some tensor processing

Permute:

x = torch.tensor([[4, 5, 8], [1, 2, 0]])
out = x.permute(1, 0)   # swap dimensions 0 and 1 (transpose)
print(out)              # tensor([[4, 1], [5, 2], [8, 0]])

Squeezing and unsqueezing:

x = torch.randn(4, 5)
out = x.unsqueeze(dim=0)   # add a dimension of size 1 at the front
print(out.shape)           # torch.Size([1, 4, 5])
out = out.squeeze(dim=0)   # remove it again
print(out.shape)           # torch.Size([4, 5])

Stacking a tensor list:

xs = []
xs.append(torch.tensor([4, 5]))
xs.append(torch.tensor([2, 1]))
xs.append(torch.tensor([0, 3]))
out = torch.stack(xs)      # tensor([[4, 5], [2, 1], [0, 3]])

Reshaping:

x = torch.randn(4, 5)
out = x.view(-1, 10)       # -1 lets PyTorch infer the first dimension
print(out.shape)           # torch.Size([2, 10])

103

104 of 169

How do we train in Pytorch?

    • Dataset – prepare the dataset
    • Dataloader – batch management, shuffling the data
    • Model – design a NN model
    • Train – train the model, evaluate

104

105 of 169

Writing code in pytorch

  • Custom dataloader part
  • Model part
  • Training part

105

106 of 169

Custom dataloader structure

  • __init__: load all annotations of examples and labels
  • __getitem__: return a single example with its label

106

107 of 169

Library and configuration

import torch
import os
from PIL import Image
from torch.utils.data.dataset import Dataset
import numpy as np
import csv
from torch.utils.data import DataLoader
from torchvision import transforms
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

TRAIN_IMG_FILE_ANNOTATION = './dataset/train.txt'
TRAIN_IMG_DIR = './dataset/train_img'
TEST_IMG_FILE_ANNOTATION = './dataset/test.txt'
TEST_IMG_DIR = './dataset/test_img'
NLABELS = 5
batch_size = 32
num_epoch = 4

107

108 of 169

Custom dataloader part

class DataPreparation(Dataset):
    def __init__(self, annotation_file, img_root_path, datatypes):
        self.img_root_path = img_root_path
        self.images = []
        self.labels = []
        if datatypes == "train":
            self.transforms = transforms.Compose([transforms.Resize((64, 64)),
                                                  transforms.RandomRotation((-4, 4)),
                                                  transforms.ToTensor()])
        if datatypes == "val":
            self.transforms = transforms.Compose([transforms.Resize((64, 64)),
                                                  transforms.ToTensor()])
        with open(annotation_file, 'r') as annotation_reader:
            annotation = csv.reader(annotation_reader, delimiter=',')
            for row in annotation:
                self.images.append(row[0])
                self.labels.append(int(row[1]))

    def __getitem__(self, index):
        img = Image.open(os.path.join(self.img_root_path, self.images[index]))
        img = img.convert('RGB')
        img = self.transforms(img)
        labels = self.labels[index]
        return {"img": img, "labels": labels}

    def __len__(self):
        return len(self.images)

train_dataset = DataPreparation(TRAIN_IMG_FILE_ANNOTATION, TRAIN_IMG_DIR, datatypes="train")
train_size = len(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
test_dataset = DataPreparation(TEST_IMG_FILE_ANNOTATION, TEST_IMG_DIR, datatypes="val")
test_size = len(test_dataset)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
dataloader = {"train": train_loader, "val": test_loader}
datasize = {"train": train_size, "val": test_size}

108

109 of 169

Model part

class MultiLabelNN(nn.Module):
    def __init__(self, nlabel):
        super(MultiLabelNN, self).__init__()
        self.nlabel = nlabel
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(10816, 512)   # 16 x 26 x 26 = 10816 for a 64 x 64 input
        self.fc2 = nn.Linear(512, nlabel)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.pool(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = x.view(-1, 10816)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return x

109

110 of 169

Training part

model = MultiLabelNN(NLABELS)
optimizer = optim.Adam(model.parameters(), lr=0.001)  # optimizer
criterion = nn.CrossEntropyLoss()  # loss function

for epoch in range(num_epoch):
    print("Epoch: {0}/{1}".format(epoch + 1, num_epoch))
    for phase in ["train", "val"]:
        if phase == "train":
            model.train()
        if phase == "val":
            model.eval()
        running_loss = 0.0
        running_correct = 0.0
        for index, data in enumerate(dataloader[phase]):
            images = data["img"]
            labels = data["labels"]
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            _, predicted = torch.max(outputs.data, 1)
            if phase == "train":
                loss.backward()
                optimizer.step()
            running_loss += loss.item() * labels.shape[0]
            running_correct += (predicted == labels).sum().item()
        if phase == "train":
            train_loss = running_loss / datasize["train"]
            train_accuracy = running_correct / datasize["train"]
            print("Training loss: {0}".format(train_loss))
            print("Training accuracy: {0}".format(train_accuracy))
        if phase == "val":
            test_loss = running_loss / datasize["val"]
            test_accuracy = running_correct / datasize["val"]
            print("Validation loss: {0}".format(test_loss))
            print("Validation accuracy: {0}".format(test_accuracy))

110

111 of 169

Use GPU in pytorch

  • Move both the model and the data onto the GPU

Code:

model=model.cuda()

……….

……….

images=data["img"].cuda()

labels=data["labels"].cuda()

111

112 of 169

Tensorboard code

from tensorboardX import SummaryWriter                  ## Add library

train_writer = SummaryWriter("./logs/train")            ## Create these two writers at the top
val_writer = SummaryWriter("./logs/val")

if phase == "train":
    train_loss = running_loss / datasize[phase]
    train_accuracy = running_correct / datasize[phase]
    train_writer.add_scalar("loss", train_loss, epoch)          ## Plot training loss via train_writer
    train_writer.add_scalar("accuracy", train_accuracy, epoch)  ## Plot training accuracy via train_writer
    print("Training loss: {0}".format(train_loss))
    print("Training accuracy: {0}".format(train_accuracy))
if phase == "val":
    validation_loss = running_loss / datasize[phase]
    validation_accuracy = running_correct / datasize[phase]
    val_writer.add_scalar("loss", validation_loss, epoch)            ## Plot validation loss via val_writer
    val_writer.add_scalar("accuracy", validation_accuracy, epoch)    ## Plot validation accuracy via val_writer

112

113 of 169

Tensorboard command

tensorboard --logdir="./logs" --port 6006

113

114 of 169

Tensorboard output & Early Stopping

(Figure: TensorBoard training and validation curves; stop training when the validation curve stops improving.)

114

115 of 169

Understand how a model is performing

(Figure: three loss-curve plots illustrating overfitting, underfitting and good fitting.)

115

116 of 169

Small data set?

  • What happens when you only have a small labelled data set but you want to use deep learning?

    • Transfer learning
      • Works really well.
      • Used very widely.

116

117 of 169

Transfer learning

Slide source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-transfer-learning-and-domain-adaptation-upc-2016

117

118 of 169

Slide source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-transfer-learning-and-domain-adaptation-upc-2016

118

119 of 169

Transfer Learning Training

  • Two options when transferring
    • Freeze the convolutional layers
    • Fine-tune the convolutional layers

119

120 of 169

Transfer learning code

from torchvision import models   # needed for the pretrained ResNet

class TModel(nn.Module):   ## Model
    def __init__(self, nlabel):
        super(TModel, self).__init__()
        self.resnet = models.resnet18(pretrained=True)
        self.sliced_resnet = torch.nn.Sequential(*(list(self.resnet.children())[:-1]))   # drop the final fc layer
        self.fc = nn.Linear(512, nlabel)

    def forward(self, x):
        x = self.sliced_resnet(x)
        x = x.view(-1, x.shape[1])
        x = self.fc(x)
        return x

model = TModel(NLABELS)
for param in model.resnet.parameters():
    param.requires_grad = False   ## Freeze part of the model

120

121 of 169

How to combine information

  • What if you have two separate pieces of information to feed in
    • E.g. Image and attribute (e.g. age)

  • Four options
    • Gating
    • Addition
    • Concatenation
      • Early fusion
      • Late fusion

121

122 of 169

Gating (one side acts like a switch)

  • A linear layer brings the attribute (e.g. age) to the same dimensionality as the image features.
  • A sigmoid layer on the attribute side acts like a switch: its values are between 0 and 1.
  • The gating function is an element-wise multiply operation: switch values such as (1.0, 0.2, 1.0, 0.8, 0.1, 0.3, 0.0) multiply image feature values such as (0.2, 2.1, 4.1, 2.2, 5.5, 6.1, 1.2).
  • This will have a large influence due to it being a switch.

122

123 of 169

Addition

(Figure: the image features and the attribute (e.g. age), after a linear layer to make the dimensionality the same, are combined by an element-wise add.)

123

124 of 169

Concatenation

(Figure: the image feature vector and the attribute (e.g. age) vector are concatenated.)

124

125 of 169

Types of concatenation

125

126 of 169

Gating, addition and concatenation code

class TModel(nn.Module):   ## Model
    def __init__(self, nlabel):
        super(TModel, self).__init__()
        self.resnet = models.resnet18(pretrained=True)
        self.sliced_resnet = torch.nn.Sequential(*(list(self.resnet.children())[:-1]))
        self.fc = nn.Sequential(nn.Linear(1, 512), nn.ReLU(inplace=True))

    def forward(self, image, age):
        f1 = self.sliced_resnet(image)
        f1 = f1.view(-1, f1.shape[1])   # image features
        f2 = self.fc(age)               # attribute features
        out = f1 * f2                   ## Gating of features f1 and f2
        return out

Adding: out = f1 + f2
Concatenation: out = torch.cat((f1, f2), dim=1)   # concatenate along the feature dimension, not the batch dimension

126

127 of 169

Autoencoders

  • Forcing the encoder to output a reduced-dimensionality representation forces the encoder to learn high-level features.
    • A good autoencoder will be able to reproduce the input almost perfectly despite the bottleneck layer z.
  • An L2 loss function is often used.

(Figure: convolutional encoder layers → reduced-dimensionality bottleneck layer z → up-convolutional (UpConv) decoder layers.)

127

128 of 169

Autoencoders – Why Use Them?

(Figure: the trained encoder weights give a compressed latent space; the trained weights can be reused for downstream tasks such as semantic segmentation.)

128

129 of 169

Autoencoder – Configuration and dataloader

import torch
import torchvision
from torch import nn
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.utils import save_image
from torchvision.datasets import MNIST

num_epoch = 100
batch_size = 128

img_transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])
train_dataset = MNIST('./data', transform=img_transform, train=True, download=True)
train_size = len(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
test_dataset = MNIST('./data', transform=img_transform, train=False, download=True)
test_size = len(test_dataset)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
dataloader = {"train": train_loader, "val": test_loader}
datasize = {"train": train_size, "val": test_size}

model = autoencoder()   # the autoencoder class is defined on the next slide
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

def to_img(x):
    x = 0.5 * (x + 1)       # undo the (0.5, 0.5) normalization
    x = x.clamp(0, 1)
    x = x.view(x.size(0), 1, 28, 28)
    return x

129

130 of 169

Autoencoder – Model

class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=3, padding=1),
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(16, 8, 3, stride=2, padding=1),
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=1)
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 16, 3, stride=2),
            nn.ReLU(True),
            nn.ConvTranspose2d(16, 8, 5, stride=3, padding=1),
            nn.ReLU(True),
            nn.ConvTranspose2d(8, 1, 2, stride=2, padding=1),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

130

131 of 169

Autoencoder – Training

for epoch in range(num_epoch):
    print("Epoch: {0}/{1}".format(epoch + 1, num_epoch))
    for phase in ["train", "val"]:
        if phase == "train":
            model.train()
        if phase == "val":
            model.eval()
        running_loss = 0.0
        for index, data in enumerate(dataloader[phase]):
            img, _ = data                   # the labels are not needed
            optimizer.zero_grad()
            output = model(img)
            loss = criterion(output, img)   # reconstruct the input itself
            if phase == "train":
                loss.backward()
                optimizer.step()
            running_loss += loss.item() * img.shape[0]
        if phase == "train":
            train_loss = running_loss / datasize[phase]
            print("Training loss: {0}".format(train_loss))
        if phase == "val":
            validation_loss = running_loss / datasize[phase]
            print("Validation loss: {0}".format(validation_loss))
    pic = to_img(output.data)
    save_image(pic, './dc_img/image_{}.png'.format(epoch))

torch.save(model.state_dict(), './conv_autoencoder.pth')

131

132 of 169

Why use GANs

  • Generative adversarial network

Style transfer

MuseGAN – Music generation

Generate new face

Image inpainting

Deep fake videos

132

133 of 169

How do GANs work?


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

133

134 of 169


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

134

135 of 169


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

135

136 of 169


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

136

137 of 169


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

137

138 of 169


source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

Change generator weights to make the generated examples look more real.

138

139 of 169

GAN – Configuration and dataloader

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision import datasets
from torchvision.utils import save_image

def to_img(x):
    out = 0.5 * (x + 1)
    out = out.clamp(0, 1)
    out = out.view(-1, 1, 28, 28)
    return out

batch_size = 128
num_epoch = 100
z_dimension = 100   # noise dimension

img_transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])  # MNIST has a single channel
mnist = datasets.MNIST('./data', transform=img_transform)
dataloader = DataLoader(mnist, batch_size=batch_size, shuffle=True, num_workers=0)

139

140 of 169

GAN – Model

class discriminator(nn.Module):
    def __init__(self):
        super(discriminator, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(1, 32, 5, padding=2),   # batch, 32, 28, 28
            nn.LeakyReLU(0.2, True),
            nn.AvgPool2d(2, stride=2),        # batch, 32, 14, 14
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(32, 64, 5, padding=2),  # batch, 64, 14, 14
            nn.LeakyReLU(0.2, True),
            nn.AvgPool2d(2, stride=2)         # batch, 64, 7, 7
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 7 * 7, 1024),
            nn.LeakyReLU(0.2, True),
            nn.Linear(1024, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        '''
        x: batch, width, height, channel=1
        '''
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

class generator(nn.Module):
    def __init__(self, input_size, num_feature):
        super(generator, self).__init__()
        self.fc = nn.Linear(input_size, num_feature)   # batch, 3136 = 1 x 56 x 56
        self.br = nn.Sequential(
            nn.BatchNorm2d(1),
            nn.ReLU(True)
        )
        self.downsample1 = nn.Sequential(
            nn.Conv2d(1, 50, 3, stride=1, padding=1),   # batch, 50, 56, 56
            nn.BatchNorm2d(50),
            nn.ReLU(True)
        )
        self.downsample2 = nn.Sequential(
            nn.Conv2d(50, 25, 3, stride=1, padding=1),  # batch, 25, 56, 56
            nn.BatchNorm2d(25),
            nn.ReLU(True)
        )
        self.downsample3 = nn.Sequential(
            nn.Conv2d(25, 1, 2, stride=2),              # batch, 1, 28, 28
            nn.Tanh()
        )

    def forward(self, x):
        x = self.fc(x)
        x = x.view(x.size(0), 1, 56, 56)
        x = self.br(x)
        x = self.downsample1(x)
        x = self.downsample2(x)
        x = self.downsample3(x)
        return x

140

141 of 169

GAN - Training

D = discriminator()                  # discriminator model
G = generator(z_dimension, 3136)     # generator model
criterion = nn.BCELoss()             # binary cross entropy
d_optimizer = torch.optim.Adam(D.parameters(), lr=0.0003)
g_optimizer = torch.optim.Adam(G.parameters(), lr=0.0003)

# train
for epoch in range(num_epoch):
    for i, (img, _) in enumerate(dataloader):
        num_img = img.size(0)
        # ================= train discriminator
        real_img = img
        real_label = torch.ones(num_img, 1)    # shaped to match D's (batch, 1) output
        fake_label = torch.zeros(num_img, 1)
        # compute loss of real_img
        real_out = D(real_img)
        d_loss_real = criterion(real_out, real_label)
        real_scores = real_out                 # closer to 1 means better
        # compute loss of fake_img
        z = torch.randn(num_img, z_dimension)
        fake_img = G(z)
        fake_out = D(fake_img)
        d_loss_fake = criterion(fake_out, fake_label)
        fake_scores = fake_out                 # closer to 0 means better
        # bp and optimize
        d_loss = d_loss_real + d_loss_fake
        d_optimizer.zero_grad()
        d_loss.backward()
        d_optimizer.step()
        # ================= train generator
        # compute loss of fake_img
        z = torch.randn(num_img, z_dimension)
        fake_img = G(z)
        output = D(fake_img)
        g_loss = criterion(output, real_label)   # fool D: make fakes score as real
        # bp and optimize
        g_optimizer.zero_grad()
        g_loss.backward()
        g_optimizer.step()
        if (i + 1) % 100 == 0:
            print('Epoch [{}/{}], d_loss: {:.6f}, g_loss: {:.6f} '
                  'D real: {:.6f}, D fake: {:.6f}'
                  .format(epoch, num_epoch, d_loss.item(), g_loss.item(),
                          real_scores.data.mean(), fake_scores.data.mean()))
    if epoch == 0:
        real_images = to_img(real_img.cpu().data)
        save_image(real_images, './dc_img/real_images.png')
    fake_images = to_img(fake_img.cpu().data)
    save_image(fake_images, './dc_img/fake_images-{}.png'.format(epoch + 1))

torch.save(G.state_dict(), './generator.pth')
torch.save(D.state_dict(), './discriminator.pth')

141

142 of 169

Understanding GAN loss
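The underlying minimax objective from the original GAN paper, which the code above optimizes in alternating steps:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

The discriminator loss in the code is the binary cross entropy form of the inner maximization; the generator step uses the common "non-saturating" variant, maximizing log D(G(z)) rather than minimizing log(1 - D(G(z))).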

142

143 of 169

Object Detection Algorithms

Object Detector

  • Region proposal based
    • Faster R-CNN (2015)
    • Mask R-CNN (2017)
    • FPN (2017)
  • Regression based
    • YOLO (2016)
    • SSD (2016)
    • AttentionNet (2015)
143

144 of 169

Intersection over union (IOU)
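IoU = area of overlap / area of union. A small sketch for axis-aligned boxes given as (x1, y1, x2, y2) (the box format is an assumption):

def iou(a, b):
    # intersection rectangle (zero if the boxes do not overlap)
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)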

144

145 of 169

Faster RCNN Detector

145

146 of 169

Faster RCNN Detector

146

147 of 169

Long Short Term Memory (LSTM)

  • I will give a very simplified description of how a Long Short Term Memory (LSTM) works

  • In practice when using RNNs, people almost always use LSTMs or some variant of them such as the GRU

147

148 of 169

Recurrent Neural Network Configurations

(Figure: RNN configurations; the one-to-one case is like what regular neural networks can do.)

148

149 of 169

We will start with many to many

149

150 of 169

Many to Many (Generative Model)

  • The RNN can be trained by reading lots and lots of text (e.g. Wikipedia) one character at a time
  • Once trained, the RNN can be given a few start characters and asked to generate more text.
  • For example, an RNN trained on Shakespeare’s works can generate convincing Shakespeare-like text.

(Figure: the RNN reads "h", "e", "l", "l" one character at a time and is trained to predict the next character at each step: "e", "l", "l", "o".)

150

151 of 169

Many to Many (Generative Model)

  • Word at a time input instead of character at a time input

(Figure: the RNN reads "The cat sat down" one word at a time and predicts the next word at each step: "cat sat down .".)

151

152 of 169

Recurrent Neural Network Configurations

152

153 of 169

One to Many (Image Caption Generation)

153

154 of 169

Recurrent Neural Network Configurations

154

155 of 169

Many to One – example 1

  • Example use
    • Text classification
    • Sentiment analysis

(Figure: the RNN reads "The best acting ever" and outputs a single label: Positive.)

155

156 of 169

Many to One – example 2

  • Example use
    • Video classification

(Figure: a CNN encodes each video frame; the RNN consumes the sequence of CNN features and outputs a single label, e.g. "Playing Basketball".)

156

157 of 169

Many to One – example 3

  • Example use
    • Tracking

157

158 of 169

Recurrent Neural Network Configurations

158

159 of 169

Sequence to Sequence Transformations

159

160 of 169

Many to Many (Sequence to Sequence Learning)

  • Uses two different LSTM RNNs
    • One encoder
    • One decoder
    • Using two different LSTMs means we get more parameters at low cost
    • Can also be used to train on multiple languages at the same time.
  • The thought vector encodes the concept that the encoder outputs.
  • The end of a sequence is detected when <EOS> is found.
  • The decoder's output is fed back into itself as input until <EOS> is output.

(Figure: Encoder → thought vector → Decoder.)

160

161 of 169

Problem with this model

  • The problem with this model is that all the information must be stored in a single fixed-size vector.
  • This model has trouble translating sentences that are longer than around 20 words.

(Figure: Encoder → thought vector → Decoder.)

161

162 of 169

The solution is to use an attention model

162

163 of 169

163

164 of 169

END

164

165 of 169

Limitations of deep learning

    • Limitation:
      • Cognitive awareness of the environment
      • Integrated understanding of the environment

165

166 of 169

Deep Learning Team

  • Associate Professor Zhen He

  • Associate Professor Stuart Morgan
    • Expert in computer vision for sports analytics
    • Now working for the AIS

  • Dr Matthias Langer
    • Completed PhD (distributed deep learning)

  • Aiden Nibali
    • Completed PhD (2D and 3D pose estimation)

  • Ash Hall
    • Research assistant (Automatic swimming annotation)

  • Brandon Victor
    • Research assistant (Automatic swimming annotation)

166

167 of 169

Deep Learning Team

  • Haritha Thilakarathne
    • PhD student (Using deep learning to retrieve and detect group movement characteristics in sports)
  • Sohel Rana
    • PhD student (Using deep learning to classify and temporally localize individual actions in sports)

  • Josh Millward
    • Honours (Plant phenotyping and meta learning for semi-supervised learning)
  • Albert Bonela
    • Masters of data science (Linking plant phenotypes to genetics)

  • Neha Neha
    • Masters of IT (Plant phenotyping and Telstra)

  • Richard Morton
    • Bachelor of Computer Science (Plant phenotyping and Telstra)

167

168 of 169

Deep Learning Journal Club

  • At La Trobe we have a deep learning journal club
  • We usually present two to three papers per week
  • It started in July 2015.
  • We have presented approximately 210 papers in total so far
  • Topics include
    • Computer vision
    • Natural Language processing
    • Reinforcement learning

168

169 of 169

Questions

169