1 of 92

(Artificial) Neural Networks: From Perceptron to MLP

2 of 92

Binary Linear Classifier

  • Given weights


3 of 92

Binary Linear Classifier

  • Given weights


4 of 92

Binary Linear Classifier with New Data

  • Forward propagation with new data

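To make the forward propagation concrete, here is a minimal NumPy sketch of a binary linear classifier applied to new data; the weights, bias, and inputs are illustrative, not taken from the slides.

```python
import numpy as np

# Hypothetical weights and bias for a 2-input binary linear classifier.
w = np.array([1.0, -2.0])
b = 0.5

def predict(x):
    """Forward propagation: compute w.x + b and threshold at zero."""
    score = np.dot(w, x) + b
    return 1 if score > 0 else 0

# New data points pushed through the given classifier.
print(predict(np.array([3.0, 1.0])))   # score = 3 - 2 + 0.5 = 1.5  -> class 1
print(predict(np.array([0.0, 1.0])))   # score = -2 + 0.5 = -1.5    -> class 0
```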

5 of 92

Binary Linear Classifier


6 of 92

Binary Linear Classifier in High Dimension

  • More neurons mean
    • a hyperplane in a higher dimension


7 of 92

From Perceptron to MLP


8 of 92

XOR Problem

  • Not linearly separable
  • Limitation of linear classifier

  • Single neuron = one linear classification boundary

  x1   x2   x1 XOR x2
   0    0       0
   0    1       1
   1    0       1
   1    1       0

9 of 92

Nonlinear Curve Approximated by Multiple Lines

  • Nonlinear regression
  • Nonlinear classification


10 of 92

XOR Problem

  • At least two lines are required


11 of 92

Artificial Neural Networks: MLP

  • Multi-layer Perceptron (MLP) = Artificial Neural Network (ANN)
    • Multiple neurons = multiple linear classification boundaries (see the sketch below)

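As a sketch of how multiple neurons give multiple linear boundaries, the classic hand-built two-layer solution to XOR combines two linear units in the hidden layer with one more at the output; the weights below are the textbook construction, chosen by hand rather than learned.

```python
import numpy as np

def step(z):
    # Hard-threshold activation: 1 if z > 0, else 0.
    return (z > 0).astype(int)

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: two linear boundaries (an OR-like unit and an AND-like unit).
    h1 = step(np.dot([1, 1], x) - 0.5)   # fires when x1 OR x2
    h2 = step(np.dot([1, 1], x) - 1.5)   # fires when x1 AND x2
    # Output layer: OR minus AND gives XOR.
    return step(h1 - h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_mlp(a, b)))   # reproduces the XOR truth table
```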

12 of 92

Artificial Neural Networks: Activation Function

  • Differentiable nonlinear activation function


13 of 92

Artificial Neural Networks

  • In a compact representation


14 of 92

Two Ways of Looking at Artificial Neural Networks

  • Still represent lines

  • Can represent nonlinear relationships between inputs and outputs due to the nonlinear activation function


15 of 92

Common Activation Functions


Source: 6.S191 Intro. to Deep Learning at MIT

(discussed later)
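
For reference, a minimal NumPy sketch of three of the common activation functions and their derivatives (the exact set shown on the slide is assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # maximum value 0.25 at z = 0

def tanh(z):
    return np.tanh(z)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2  # maximum value 1 at z = 0

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)  # 1 for positive inputs, 0 otherwise
```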

16 of 92

Artificial Neural Networks

  • Why do we need multiple layers?

17 of 92

Artificial Neural Networks

  • Why do we need multiple layers?

18 of 92

Artificial Neural Networks

  • Why do we need multiple layers?

19 of 92

Another Perspective: ANN as Kernel Learning

20 of 92

Nonlinear Classification


https://www.youtube.com/watch?v=3liCbRZPrZA

21 of 92

Neuron

  • We can represent this “neuron” as follows:


22 of 92

XOR Problem

  • The main weakness of linear predictors is their lack of capacity.
  • For classification, the populations have to be linearly separable.


23 of 92

Nonlinear Mapping

  • The XOR example can be solved by pre-processing the data to make the two populations linearly separable.


Source: Dr. Francois Fleuret at EPFL

24 of 92

Nonlinear Mapping

  • The XOR example can be solved by pre-processing the data to make the two populations linearly separable.


Source: Dr. Francois Fleuret at EPFL

25 of 92

Nonlinear Mapping

  • The XOR example can be solved by pre-processing the data to make the two populations linearly separable.


Source: Dr. Francois Fleuret at EPFL

26 of 92

Neuron

  • We can represent this “neuron” as follows:

  • Not linearly separable


27 of 92

Kernel + Neuron

  • Nonlinear mapping + neuron


28 of 92

Neuron + Neuron

  • Nonlinear mapping can be represented by other neurons

  • Nonlinear Kernel
    • Nonlinear activation functions


29 of 92

Multi Layer Perceptron

  • Nonlinear mapping can be represented by other neurons
  • We can generalize this into an MLP


30 of 92

Summary

  • Universal function approximator
  • Universal function classifier

  • Parameterized


  1. Value propagation point of view
  2. Weight point of view

31 of 92

Deep Artificial Neural Networks

  • Complex/Nonlinear universal function approximator
    • Linearly connected networks
    • Simple nonlinear neurons

[Figure: deep network pipeline from Input to Output: nonlinear feature-learning layers followed by a linear classification layer separating Class 1 from Class 2.]

32 of 92

Deep Artificial Neural Networks

  • Complex/Nonlinear universal function approximator
    • Linearly connected networks
    • Simple nonlinear neurons


33 of 92

Machine Learning vs. Deep Learning

  • Feature engineering

  • Feature learning


  • Artificial intelligence (AI) refers to the ability of machines to mimic human intelligence without explicit programming

34 of 92

Deep Learning


35 of 92

Looking at Parameters


36 of 92

Logistic Regression in a Form of Neural Network


37 of 92

Logistic Regression in a Form of Neural Network

  • Neural network convention

Bias units are not shown.

38 of 92

Nonlinearly Distributed Data

  • Example to understand network’s behavior
    • Include a hidden layer


39 of 92

Nonlinearly Distributed Data

  • Example to understand network’s behavior
    • Include a hidden layer

Bias units are not shown.

40 of 92

Multi Layers

Bias units are not shown.

41 of 92

Multi Layers

Bias units are not shown.

42 of 92

Multi Layers

Bias units are not shown.

43 of 92

Nonlinearly Distributed Data

  • More neurons in hidden layer


44 of 92

Nonlinearly Distributed Data

  • More neurons in hidden layer

Bias units are not shown.

45 of 92

Multi Layers

  • Multiple linear classification boundaries

Bias units are not shown.

46 of 92

(Artificial) Neural Networks: Training

47 of 92

Training Neural Networks: Optimization


48 of 92

Training Neural Networks: Loss Function

  • Measures error between target values and predictions

  • Example
    • Squared loss (for regression):

    • Cross entropy (for classification):
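In standard notation (assumed here, not taken from the slide), the two example losses over N training pairs of targets y_i and predictions ŷ_i are:

```latex
\[
L_{\text{squared}} = \frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2,
\qquad
L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right) \right]
\]
```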

49 of 92

Training Neural Networks: Gradient Descent

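A minimal sketch of the gradient-descent update on a toy one-parameter loss; the learning rate, starting point, and loss are illustrative.

```python
# Plain gradient descent on a toy loss L(w) = (w - 3)^2, illustrating the
# update rule w <- w - alpha * dL/dw; alpha and the starting point are arbitrary.
w = 0.0
alpha = 0.1                    # learning rate
for step in range(100):
    grad = 2.0 * (w - 3.0)     # dL/dw
    w = w - alpha * grad
print(w)                       # approaches the minimizer w = 3
```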

50 of 92

Gradients in ANN


51 of 92

Dynamic Programming


52 of 92

Recursive Algorithm


[Figure: recursion diagram showing each call's input and output, terminating at a base case.]

53 of 92

Dynamic Programming

  • Dynamic Programming: general, powerful algorithm design technique

  • Fibonacci numbers: F(n) = F(n-1) + F(n-2), with F(1) = F(2) = 1

54 of 92

Naïve Recursive Algorithm

  • It works. Is it good?
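
A straightforward Python version of the naive recursion (the function name fib is assumed):

```python
def fib(n):
    """Naive recursion straight from the definition: exponential time,
    because the same subproblems are recomputed over and over."""
    if n <= 2:
        return 1          # base cases: fib(1) = fib(2) = 1
    return fib(n - 1) + fib(n - 2)

print(fib(10))   # 55; already painfully slow around n = 40
```

Because fib(n - 1) and fib(n - 2) recompute the same subproblems, the running time grows exponentially in n.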


55 of 92

Memoized Recursive Algorithm

  • Benefit?
    • fib(n) only recurses the first time it's called
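
A minimal Python sketch of the memoized version, using a dictionary as the memo table:

```python
memo = {}

def fib(n):
    """Memoized recursion: each fib(n) is computed once, then looked up."""
    if n in memo:
        return memo[n]
    if n <= 2:
        result = 1
    else:
        result = fib(n - 1) + fib(n - 2)
    memo[n] = result
    return result

print(fib(100))   # 354224848179261915075, computed in linear time
```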


56 of 92

Dynamic Programming Algorithm
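
A minimal Python sketch of the bottom-up dynamic-programming version, which fills a table from the base cases up instead of recursing:

```python
def fib(n):
    """Bottom-up dynamic programming: fill the table iteratively."""
    if n <= 2:
        return 1
    table = [0] * (n + 1)
    table[1] = table[2] = 1
    for i in range(3, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib(100))   # same answer as the memoized version, no recursion at all
```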


57 of 92

Backpropagation


58 of 92

Gradients in ANN


59 of 92

Training Neural Networks: Backpropagation Learning

  • Forward propagation
    • the initial information propagates forward through the hidden units at each layer and finally produces the output

  • Backpropagation
    • allows the information from the cost to flow backwards through the network in order to compute the gradients

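To make the two passes concrete, here is a minimal NumPy sketch of forward propagation and backpropagation for a hypothetical 2-3-1 network with a sigmoid hidden layer, a linear output, and squared loss; every size and value is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-3-1 network: sigmoid hidden layer, linear output, squared loss.
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, y = np.array([1.0, -1.0]), np.array([0.5])

# Forward propagation: keep the intermediate values needed for the backward pass.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
y_hat = W2 @ a1 + b2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backpropagation: apply the chain rule layer by layer, reusing upstream gradients.
d_yhat = y_hat - y                  # dL/dy_hat
dW2 = np.outer(d_yhat, a1)
db2 = d_yhat
d_a1 = W2.T @ d_yhat                # gradient flowing into the hidden layer
d_z1 = d_a1 * a1 * (1.0 - a1)       # multiply by the sigmoid derivative
dW1 = np.outer(d_z1, x)
db1 = d_z1

# One gradient-descent step on every parameter.
lr = 0.1
W1 -= lr * dW1
b1 -= lr * db1
W2 -= lr * dW2
b2 -= lr * db2
print(loss)
```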

60 of 92

Backpropagation


61 of 92

Backpropagation


62 of 92

Backpropagation


63 of 92

Backpropagation


64 of 92

Training Neural Networks with TensorFlow

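One way to train such a network in TensorFlow 2 is a custom training step built on tf.GradientTape; the model, loss, and optimizer below are illustrative assumptions rather than the slide's exact code.

```python
import tensorflow as tf

# A sketch of one training step written with tf.GradientTape.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        predictions = model(x_batch, training=True)             # forward propagation
        loss = loss_fn(y_batch, predictions)
    grads = tape.gradient(loss, model.trainable_variables)      # backpropagation
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```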

65 of 92

Core Foundation Review


Source: 6.S191 Intro. to Deep Learning at MIT

66 of 92

(Artificial) Neural Networks with TensorFlow


67 of 92

MNIST database

  • Handwritten digit images: 28 × 28 grayscale, 10 classes (digits 0-9); 60,000 training and 10,000 test examples

68 of 92

ANN in TensorFlow: MNIST

69 of 92

Our Network Model

Input image (28 × 28), flattened → input layer (784) → hidden layer (100) → output layer (10): digit prediction in one-hot encoding
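
One possible tf.keras sketch of this 784-100-10 network; the ReLU hidden activation and softmax output are assumptions, since the slide only fixes the layer sizes.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),               # flattened 28 x 28 image
    tf.keras.layers.Dense(100, activation='relu'),     # hidden layer (100)
    tf.keras.layers.Dense(10, activation='softmax'),   # one-hot digit prediction
])
model.summary()
```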

70 of 92

Iterative Optimization

  • We will use
    • Mini-batch gradient descent
    • Adam optimizer

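Continuing with the model sketched above, mini-batch gradient descent with the Adam optimizer can be run through model.compile and model.fit; the batch size and number of epochs below are illustrative.

```python
import tensorflow as tf

# Load MNIST, flatten the images, one-hot encode the labels, then train.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

model.compile(optimizer='adam',                  # Adam optimizer
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=50,                         # mini-batch gradient descent
          epochs=10,
          validation_data=(x_test, y_test))
model.evaluate(x_test, y_test)
```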

71 of 92

Implementation in Python

Input image (28 × 28), flattened → input layer (784) → hidden layer (100) → output layer (10)

72 of 92

Evaluation


73 of 92

(Artificial) Neural Networks: Advanced


74 of 92

Nonlinear Activation Function


75 of 92

The Vanishing Gradient Problem

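A standard toy illustration of the problem (not from the slide): the sigmoid's derivative is at most 0.25, and the chain rule multiplies one such factor per layer, so gradients shrink exponentially with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The local gradient of a sigmoid unit is s(z) * (1 - s(z)), never above 0.25.
local_grad = sigmoid(0.0) * (1.0 - sigmoid(0.0))   # 0.25, the best case
for depth in (5, 10, 20):
    print(depth, local_grad ** depth)              # ~1e-3, ~1e-6, ~9e-13
```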

76 of 92

Rectifiers

  • The use of the ReLU activation function was a great improvement compared to the historical tanh.


77 of 92

Rectifiers

  • This can be explained by the derivative of ReLU itself not vanishing, and by the resulting coding being sparse (Glorot et al., 2011).


78 of 92

Batch Normalization


79 of 92

Batch Normalization

  • Batch normalization is a technique for improving the performance and stability of artificial neural networks.
  • It is used to normalize the input layer by adjusting and scaling the activations.


S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

80 of 92

Batch Normalization

  • Batch normalization is a technique for improving the performance and stability of artificial neural networks.
  • It is used to normalize the input layer by adjusting and scaling the activations.


S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

81 of 92

Batch Normalization

  • During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch.

  • During test, it simply shifts and rescales according to the empirical moments estimated during training.


S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

82 of 92

Implementation of Batch Normalization


S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
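
A minimal sketch of batch normalization in a tf.keras model; the layer sizes and the placement before the ReLU are assumptions. During training the layer normalizes with the batch mean and variance, and at test time it uses the moving averages accumulated during training, as described above.

```python
import tensorflow as tf

# During training each unit is normalized as (x - mean) / sqrt(var + eps),
# then rescaled by the learned parameters gamma and beta; at test time the
# moving averages collected during training replace the batch statistics.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),     # normalize, then scale and shift
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
```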

83 of 92

Dropout as Regularization


84 of 92

Regularization (Shrinkage Methods)

  • Overfitting is often associated with very large estimated parameters
  • We want to balance
    • how well function fits data
    • magnitude of coefficients 

    • multi-objective optimization
    • 𝜆 is a tuning parameter

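One common form of this trade-off (L2 shrinkage, in notation assumed here) adds the penalty to the squared-error fit term:

```latex
\[
\min_{\theta} \; \underbrace{\sum_{i=1}^{N} \bigl( y_i - f(x_i;\theta) \bigr)^2}_{\text{fit to the data}}
\;+\; \lambda \, \underbrace{\lVert \theta \rVert_2^2}_{\text{magnitude of the coefficients}}
\]
```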

85 of 92

Different Regularization Techniques

  • Big Data
  • Data augmentation
    • The simplest way to reduce overfitting is to increase the size of the training data.


86 of 92

Different Regularization Techniques

  • Early stopping
    • When we see that the performance on the validation set is getting worse, we immediately stop training the model.

[Figure: training and testing error vs. training steps; training error keeps decreasing while testing error starts to rise, and training is stopped at that point.]
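
In tf.keras, early stopping is available as a callback; the monitored quantity and patience below are assumptions.

```python
import tensorflow as tf

# Training halts once validation loss stops improving, and the best weights
# seen so far are restored.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=3,
                                              restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.1,
#           epochs=100, callbacks=[early_stop])
```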

87 of 92

Different Regularization Techniques in Deep Learning

  • Dropout
    • This is one of the most interesting types of regularization techniques.
    • It also produces very good results and is consequently the most frequently used regularization technique in the field of deep learning.
    • At every iteration, it randomly selects some nodes and removes them.
    • It can also be thought of as an ensemble technique in machine learning.


N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15:1929-1958, 2014.

88 of 92

Dropout Illustration

  • Effectively, a different architecture at every training epoch
  • It can also be thought of as an ensemble technique in machine learning.


Original model

89 of 92

Dropout Illustration

  • Effectively, a different architecture at every training epoch
  • It can also be thought of as an ensemble technique in machine learning.


tf.nn.dropout(layer, rate = p)

Epoch 1

rate: the probability that each element is dropped. For example, setting rate = 0.1 would drop 10% of input elements

90 of 92

Dropout Illustration

  • Effectively, a different architecture at every training epoch
  • It can also be thought of as an ensemble technique in machine learning.


91 of 92

Dropout


92 of 92

Implementation of Dropout

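A minimal sketch of dropout in a tf.keras model; the dropout rate and layer sizes are illustrative. The Dropout layer randomly zeroes hidden activations during training only and is inactive at test time.

```python
import tensorflow as tf

# rate=0.2 drops 20% of the hidden units at each training step.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation='softmax'),
])
```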