1 of 92

(Artificial) Neural Networks: From Perceptron to MLP

2 of 92

Binary Linear Classifier

  • Given weights


3 of 92

Binary Linear Classifier

  • Given weights


4 of 92

Binary Linear Classifier with New Data

  • Forward propagation with new data

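To make the forward propagation concrete, here is a minimal NumPy sketch of a binary linear classifier applied to new data; the weights, bias, and inputs are illustrative, not taken from the slides.

```python
import numpy as np

# Hypothetical weights and bias for a 2-input binary linear classifier.
w = np.array([1.0, -2.0])
b = 0.5

def predict(x):
    """Forward propagation: compute w.x + b and threshold at zero."""
    score = np.dot(w, x) + b
    return 1 if score > 0 else 0

# New data points pushed through the given classifier.
print(predict(np.array([3.0, 1.0])))   # score = 3 - 2 + 0.5 = 1.5  -> class 1
print(predict(np.array([0.0, 1.0])))   # score = -2 + 0.5 = -1.5    -> class 0
```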

5 of 92

Binary Linear Classifier


6 of 92

Binary Linear Classifier in High Dimension

  • More neurons mean
    • a hyperplane in a higher dimension


7 of 92

From Perceptron to MLP


8 of 92

XOR Problem

  • Not linearly separable
  • Limitation of linear classifier

  • Single neuron = one linear classification boundary

  x1   x2   x1 XOR x2
   0    0       0
   0    1       1
   1    0       1
   1    1       0

9 of 92

Nonlinear Curve Approximated by Multiple Lines

  • Nonlinear regression
  • Nonlinear classification


10 of 92

XOR Problem

  • At least two lines are required


11 of 92

Artificial Neural Networks: MLP

  • Multi-layer Perceptron (MLP) = Artificial Neural Network (ANN)
    • Multiple neurons = multiple linear classification boundaries (see the sketch below)

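As a sketch of how multiple neurons give multiple linear boundaries, the classic hand-built two-layer solution to XOR combines two linear units in the hidden layer with one more at the output; the weights below are the textbook construction, chosen by hand rather than learned.

```python
import numpy as np

def step(z):
    # Hard-threshold activation: 1 if z > 0, else 0.
    return (z > 0).astype(int)

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: two linear boundaries (an OR-like unit and an AND-like unit).
    h1 = step(np.dot([1, 1], x) - 0.5)   # fires when x1 OR x2
    h2 = step(np.dot([1, 1], x) - 1.5)   # fires when x1 AND x2
    # Output layer: OR minus AND gives XOR.
    return step(h1 - h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_mlp(a, b)))   # reproduces the XOR truth table
```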

12 of 92

Artificial Neural Networks: Activation Function

  • Differentiable nonlinear activation function


13 of 92

Artificial Neural Networks

  • In a compact representation


14 of 92

Two Ways of Looking at Artificial Neural Networks

  • Still represent lines

  • Can represent nonlinear relationships between inputs and outputs due to the nonlinear activation function


15 of 92

Common Activation Functions


Source: 6.S191 Intro. to Deep Learning at MIT

(discussed later)
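
For reference, a minimal NumPy sketch of three of the common activation functions and their derivatives (the exact set shown on the slide is assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # maximum value 0.25 at z = 0

def tanh(z):
    return np.tanh(z)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2  # maximum value 1 at z = 0

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)  # 1 for positive inputs, 0 otherwise
```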

16 of 92

Artificial Neural Networks

  • Why do we need multiple layers?

17 of 92

Artificial Neural Networks

  • Why do we need multiple layers?

18 of 92

Artificial Neural Networks

  • Why do we need multiple layers?

19 of 92

Another Perspective: ANN as Kernel Learning

20 of 92

Nonlinear Classification


https://www.youtube.com/watch?v=3liCbRZPrZA

21 of 92

Neuron

  • We can represent this “neuron” as follows:


22 of 92

XOR Problem

  • The main weakness of linear predictors is their lack of capacity.
  • For classification, the populations have to be linearly separable.


23 of 92

Nonlinear Mapping

  • The XOR example can be solved by pre-processing the data to make the two populations linearly separable.


Source: Dr. Francois Fleuret at EPFL

24 of 92

Nonlinear Mapping

  • The XOR example can be solved by pre-processing the data to make the two populations linearly separable.


Source: Dr. Francois Fleuret at EPFL

25 of 92

Nonlinear Mapping

  • The XOR example can be solved by pre-processing the data to make the two populations linearly separable.


Source: Dr. Francois Fleuret at EPFL

26 of 92

Neuron

  • We can represent this “neuron” as follows:

  • Not linearly separable


27 of 92

Kernel + Neuron

  • Nonlinear mapping + neuron


28 of 92

Neuron + Neuron

  • Nonlinear mapping can be represented by other neurons

  • Nonlinear Kernel
    • Nonlinear activation functions


29 of 92

Multi Layer Perceptron

  • Nonlinear mapping can be represented by other neurons
  • We can generalize this into an MLP


30 of 92

Summary

  • Universal function approximator
  • Universal function classifier

  • Parameterized


  1. Value propagation point of view
  2. Weight point of view

31 of 92

Deep Artificial Neural Networks

  • Complex/Nonlinear universal function approximator
    • Linearly connected networks
    • Simple nonlinear neurons

[Figure: deep network pipeline from Input to Output: nonlinear feature-learning layers followed by a linear classification layer separating Class 1 from Class 2.]

32 of 92

Deep Artificial Neural Networks

  • Complex/Nonlinear universal function approximator
    • Linearly connected networks
    • Simple nonlinear neurons


33 of 92

Machine Learning vs. Deep Learning

  • Feature engineering

  • Feature learning


  • Artificial intelligence (AI) refers to the ability of machines to mimic human intelligence without explicit programming

34 of 92

Deep Learning


35 of 92

Looking at Parameters


36 of 92

Logistic Regression in a Form of Neural Network


37 of 92

Logistic Regression in a Form of Neural Network

  • Neural network convention

Bias units are not shown.

38 of 92

Nonlinearly Distributed Data

  • Example to understand network’s behavior
    • Include a hidden layer


39 of 92

Nonlinearly Distributed Data

  • Example to understand network’s behavior
    • Include a hidden layer

Bias units are not shown.

40 of 92

Multi Layers

Bias units are not shown.

41 of 92

Multi Layers

Bias units are not shown.

42 of 92

Multi Layers

Bias units are not shown.

43 of 92

Nonlinearly Distributed Data

  • More neurons in hidden layer


44 of 92

Nonlinearly Distributed Data

  • More neurons in hidden layer

Bias units are not shown.

45 of 92

Multi Layers

  • Multiple linear classification boundaries

Bias units are not shown.

46 of 92

(Artificial) Neural Networks: Training

47 of 92

Training Neural Networks: Optimization


48 of 92

Training Neural Networks: Loss Function

  • Measures error between target values and predictions

  • Example
    • Squared loss (for regression):

    • Cross entropy (for classification):
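In standard notation (assumed here, not taken from the slide), the two example losses over N training pairs of targets y_i and predictions ŷ_i are:

```latex
\[
L_{\text{squared}} = \frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2,
\qquad
L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right) \right]
\]
```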

49 of 92

Training Neural Networks: Gradient Descent

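A minimal sketch of the gradient-descent update on a toy one-parameter loss; the learning rate, starting point, and loss are illustrative.

```python
# Plain gradient descent on a toy loss L(w) = (w - 3)^2, illustrating the
# update rule w <- w - alpha * dL/dw; alpha and the starting point are arbitrary.
w = 0.0
alpha = 0.1                    # learning rate
for step in range(100):
    grad = 2.0 * (w - 3.0)     # dL/dw
    w = w - alpha * grad
print(w)                       # approaches the minimizer w = 3
```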

50 of 92

Gradients in ANN


51 of 92

Dynamic Programming


52 of 92

Recursive Algorithm


[Figure: recursion diagram showing each call's input and output, terminating at a base case.]

53 of 92

Dynamic Programming

  • Dynamic Programming: general, powerful algorithm design technique

  • Fibonacci numbers: F(n) = F(n-1) + F(n-2), with F(1) = F(2) = 1

54 of 92

Naïve Recursive Algorithm

  • It works. Is it good?
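
A straightforward Python version of the naive recursion (the function name fib is assumed):

```python
def fib(n):
    """Naive recursion straight from the definition: exponential time,
    because the same subproblems are recomputed over and over."""
    if n <= 2:
        return 1          # base cases: fib(1) = fib(2) = 1
    return fib(n - 1) + fib(n - 2)

print(fib(10))   # 55; already painfully slow around n = 40
```

Because fib(n - 1) and fib(n - 2) recompute the same subproblems, the running time grows exponentially in n.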


55 of 92

Memoized Recursive Algorithm

  • Benefit?
    • fib(n) only recurses the first time it's called
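
A minimal Python sketch of the memoized version, using a dictionary as the memo table:

```python
memo = {}

def fib(n):
    """Memoized recursion: each fib(n) is computed once, then looked up."""
    if n in memo:
        return memo[n]
    if n <= 2:
        result = 1
    else:
        result = fib(n - 1) + fib(n - 2)
    memo[n] = result
    return result

print(fib(100))   # 354224848179261915075, computed in linear time
```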


56 of 92

Dynamic Programming Algorithm
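
A minimal Python sketch of the bottom-up dynamic-programming version, which fills a table from the base cases up instead of recursing:

```python
def fib(n):
    """Bottom-up dynamic programming: fill the table iteratively."""
    if n <= 2:
        return 1
    table = [0] * (n + 1)
    table[1] = table[2] = 1
    for i in range(3, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib(100))   # same answer as the memoized version, no recursion at all
```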


57 of 92

Backpropagation


58 of 92

Gradients in ANN


59 of 92

Training Neural Networks: Backpropagation Learning

  • Forward propagation
    • the initial information propagates forward through the hidden units at each layer and finally produces the output

  • Backpropagation
    • allows the information from the cost to flow backwards through the network in order to compute the gradients

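To make the two passes concrete, here is a minimal NumPy sketch of forward propagation and backpropagation for a hypothetical 2-3-1 network with a sigmoid hidden layer, a linear output, and squared loss; every size and value is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-3-1 network: sigmoid hidden layer, linear output, squared loss.
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, y = np.array([1.0, -1.0]), np.array([0.5])

# Forward propagation: keep the intermediate values needed for the backward pass.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
y_hat = W2 @ a1 + b2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backpropagation: apply the chain rule layer by layer, reusing upstream gradients.
d_yhat = y_hat - y                  # dL/dy_hat
dW2 = np.outer(d_yhat, a1)
db2 = d_yhat
d_a1 = W2.T @ d_yhat                # gradient flowing into the hidden layer
d_z1 = d_a1 * a1 * (1.0 - a1)       # multiply by the sigmoid derivative
dW1 = np.outer(d_z1, x)
db1 = d_z1

# One gradient-descent step on every parameter.
lr = 0.1
W1 -= lr * dW1
b1 -= lr * db1
W2 -= lr * dW2
b2 -= lr * db2
print(loss)
```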

60 of 92

Backpropagation


61 of 92

Backpropagation


62 of 92

Backpropagation


63 of 92

Backpropagation


64 of 92

Training Neural Networks with TensorFlow

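One way to train such a network in TensorFlow 2 is a custom training step built on tf.GradientTape; the model, loss, and optimizer below are illustrative assumptions rather than the slide's exact code.

```python
import tensorflow as tf

# A sketch of one training step written with tf.GradientTape.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        predictions = model(x_batch, training=True)             # forward propagation
        loss = loss_fn(y_batch, predictions)
    grads = tape.gradient(loss, model.trainable_variables)      # backpropagation
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```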

65 of 92

Core Foundation Review


Source: 6.S191 Intro. to Deep Learning at MIT

66 of 92

(Artificial) Neural Networks with TensorFlow


67 of 92

MNIST database

  • Handwritten digit images: 28 × 28 grayscale, 10 classes (digits 0-9); 60,000 training and 10,000 test examples

68 of 92

ANN in TensorFlow: MNIST

69 of 92

Our Network Model

Input image (28 × 28), flattened → input layer (784) → hidden layer (100) → output layer (10): digit prediction in one-hot encoding
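
One possible tf.keras sketch of this 784-100-10 network; the ReLU hidden activation and softmax output are assumptions, since the slide only fixes the layer sizes.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),               # flattened 28 x 28 image
    tf.keras.layers.Dense(100, activation='relu'),     # hidden layer (100)
    tf.keras.layers.Dense(10, activation='softmax'),   # one-hot digit prediction
])
model.summary()
```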

70 of 92

Iterative Optimization

  • We will use
    • Mini-batch gradient descent
    • Adam optimizer

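Continuing with the model sketched above, mini-batch gradient descent with the Adam optimizer can be run through model.compile and model.fit; the batch size and number of epochs below are illustrative.

```python
import tensorflow as tf

# Load MNIST, flatten the images, one-hot encode the labels, then train.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

model.compile(optimizer='adam',                  # Adam optimizer
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=50,                         # mini-batch gradient descent
          epochs=10,
          validation_data=(x_test, y_test))
model.evaluate(x_test, y_test)
```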

71 of 92

Implementation in Python

Input image (28 × 28), flattened → input layer (784) → hidden layer (100) → output layer (10)

72 of 92

Evaluation


73 of 92

(Artificial) Neural Networks: Advanced


74 of 92

Nonlinear Activation Function


75 of 92

The Vanishing Gradient Problem

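A standard toy illustration of the problem (not from the slide): the sigmoid's derivative is at most 0.25, and the chain rule multiplies one such factor per layer, so gradients shrink exponentially with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The local gradient of a sigmoid unit is s(z) * (1 - s(z)), never above 0.25.
local_grad = sigmoid(0.0) * (1.0 - sigmoid(0.0))   # 0.25, the best case
for depth in (5, 10, 20):
    print(depth, local_grad ** depth)              # ~1e-3, ~1e-6, ~9e-13
```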

76 of 92

Rectifiers

  • The use of the ReLU activation function was a great improvement compared to the historical tanh.


77 of 92

Rectifiers

  • This can be explained by the derivative of ReLU itself not vanishing, and by the resulting coding being sparse (Glorot et al., 2011).


78 of 92

Batch Normalization


79 of 92

Batch Normalization

  • Batch normalization is a technique for improving the performance and stability of artificial neural networks.
  • It is used to normalize the input layer by adjusting and scaling the activations.


S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

80 of 92

Batch Normalization

  • Batch normalization is a technique for improving the performance and stability of artificial neural networks.
  • It is used to normalize the input layer by adjusting and scaling the activations.


S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

81 of 92

Batch Normalization

  • During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch.

  • During test, it simply shifts and rescales according to the empirical moments estimated during training.


S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

82 of 92

Implementation of Batch Normalization


S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
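
A minimal sketch of batch normalization in a tf.keras model; the layer sizes and the placement before the ReLU are assumptions. During training the layer normalizes with the batch mean and variance, and at test time it uses the moving averages accumulated during training, as described above.

```python
import tensorflow as tf

# During training each unit is normalized as (x - mean) / sqrt(var + eps),
# then rescaled by the learned parameters gamma and beta; at test time the
# moving averages collected during training replace the batch statistics.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),     # normalize, then scale and shift
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
```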

83 of 92

Dropout as Regularization


84 of 92

Regularization (Shrinkage Methods)

  • Overfitting is often associated with very large estimated parameters
  • We want to balance
    • how well function fits data
    • magnitude of coefficients 

    • multi-objective optimization
    • 𝜆 is a tuning parameter

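One common form of this trade-off (L2 shrinkage, in notation assumed here) adds the penalty to the squared-error fit term:

```latex
\[
\min_{\theta} \; \underbrace{\sum_{i=1}^{N} \bigl( y_i - f(x_i;\theta) \bigr)^2}_{\text{fit to the data}}
\;+\; \lambda \, \underbrace{\lVert \theta \rVert_2^2}_{\text{magnitude of the coefficients}}
\]
```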

85 of 92

Different Regularization Techniques

  • Big Data
  • Data augmentation
    • The simplest way to reduce overfitting is to increase the size of the training data.


86 of 92

Different Regularization Techniques

  • Early stopping
    • When we see that the performance on the validation set is getting worse, we immediately stop training the model.

[Figure: training and testing error vs. training steps; training error keeps decreasing while testing error starts to rise, and training is stopped at that point.]
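
In tf.keras, early stopping is available as a callback; the monitored quantity and patience below are assumptions.

```python
import tensorflow as tf

# Training halts once validation loss stops improving, and the best weights
# seen so far are restored.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=3,
                                              restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.1,
#           epochs=100, callbacks=[early_stop])
```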

87 of 92

Different Regularization Techniques in Deep Learning

  • Dropout
    • This is one of the most interesting types of regularization techniques.
    • It also produces very good results and is consequently the most frequently used regularization technique in the field of deep learning.
    • At every iteration, it randomly selects some nodes and removes them.
    • It can also be thought of as an ensemble technique in machine learning.


N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15:1929-1958, 2014.

88 of 92

Dropout Illustration

  • Effectively, a different architecture at every training epoch
  • It can also be thought of as an ensemble technique in machine learning.


Original model

89 of 92

Dropout Illustration

  • Effectively, a different architecture at every training epoch
  • It can also be thought of as an ensemble technique in machine learning.


tf.nn.dropout(layer, rate = p)

Epoch 1

rate: the probability that each element is dropped. For example, setting rate = 0.1 would drop 10% of input elements

90 of 92

Dropout Illustration

  • Effectively, a different architecture at every training epoch
  • It can also be thought of as an ensemble technique in machine learning.


91 of 92

Dropout


92 of 92

Implementation of Dropout

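A minimal sketch of dropout in a tf.keras model; the dropout rate and layer sizes are illustrative. The Dropout layer randomly zeroes hidden activations during training only and is inactive at test time.

```python
import tensorflow as tf

# rate=0.2 drops 20% of the hidden units at each training step.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation='softmax'),
])
```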