Lecture 10: Anatomy of NN, based on the course

CS109A Introduction to Data Science

Pavlos Protopapas, Kevin Rader and Chris Tanner


Outline

Anatomy of a NN

Design choices

    • Activation function
    • Loss function
    • Output units
    • Architecture

CS109A, Protopapas, Rader, Tanner


Anatomy of artificial neural network (ANN)

[Diagram: input X feeds a neuron (node), which applies an affine transformation (weights W) followed by an activation, producing the output Y.]

    • Affine transformation and activation: we will talk later about the choice of activation function. So far we have only talked about the sigmoid, but there are other choices.
    • Layers: nodes are organized into an input layer, hidden layers, and an output layer, followed by an output function and a loss function. We will talk later about the choice of the output layer and the loss function; so far we have considered a sigmoid output with a log-Bernoulli loss.
    • Depth: a network can have several hidden layers (hidden layer 1 through hidden layer n). We will talk later about the choice of the number of layers.
    • Width: each hidden layer has some number of nodes (m nodes). We will talk later about the choice of the number of nodes.
    • Inputs: the number of inputs d is specified by the data.
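The pieces above (an affine transformation Wx + b followed by an element-wise activation, stacked into layers) can be sketched in plain Python. The weights and biases below are made-up toy values, not learned parameters:

```python
import math

def sigmoid(z):
    # logistic activation: squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(x, W, b, activation):
    # affine transformation W x + b, then an element-wise activation
    return [activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

# toy network: 2 inputs -> hidden layer of 3 sigmoid nodes -> 1 sigmoid output
x = [1.0, 2.0]
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]  # hypothetical hidden weights
b1 = [0.0, 0.1, -0.1]
h = dense_layer(x, W1, b1, sigmoid)

W2 = [[0.3, -0.1, 0.2]]                      # hypothetical output weights
b2 = [0.05]
y = dense_layer(h, W2, b2, sigmoid)
```

Here the number of inputs (2) is fixed by the data, while the hidden width (3 nodes) and depth (1 hidden layer) are design choices.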


Why layers? Representation

Representation matters!


Learning Multiple Components


Depth = Repeated Compositions


Neural Networks

Hand-written digit recognition: MNIST data


Beyond Linear Models

 


Traditional ML

 


Deep Learning

 

 

  • Non-convex optimization
  • Can encode prior beliefs, generalizes well


Sigmoid (aka Logistic)

The derivative is close to zero over much of the domain, which leads to “vanishing gradients” in backpropagation.
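A minimal numeric sketch of the vanishing-gradient effect: the sigmoid’s derivative peaks at 0.25 at the origin and is essentially zero away from it:

```python
import math

def sigmoid(x):
    # logistic function: squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative sigma'(x) = sigma(x) * (1 - sigma(x)), maximal at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the peak value
print(sigmoid_grad(10.0))  # ~4.5e-05: the gradient has "vanished"
```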

 


Hyperbolic Tangent (Tanh)

Same problem of “vanishing gradients” as sigmoid.

 


Rectified Linear Unit (ReLU)

Two major advantages:

  1. No vanishing gradient when x > 0
  2. Provides sparsity (regularization) since y = 0 when x < 0
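Both advantages are visible in a direct sketch of the function and its derivative:

```python
def relu(x):
    # rectified linear unit: identity for positive inputs, zero otherwise
    return x if x > 0 else 0.0

def relu_grad(x):
    # gradient is 1 for x > 0 (no vanishing) and 0 for x < 0 (sparsity)
    return 1.0 if x > 0 else 0.0

# active unit: gradient passes through at full strength
assert relu(3.0) == 3.0 and relu_grad(3.0) == 1.0
# inactive unit: output exactly zero, contributing sparsity
assert relu(-2.0) == 0.0 and relu_grad(-2.0) == 0.0
```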

 


Leaky ReLU

  • Tries to fix “dying ReLU” problem: derivative is non-zero everywhere.
  • Some people report success with this form of activation function, but the results are not always consistent.
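A sketch of the leaky variant, assuming the common default slope of 0.01 for negative inputs:

```python
def leaky_relu(x, alpha=0.01):
    # like ReLU, but with a small slope alpha for negative inputs
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # the derivative is non-zero everywhere, so units cannot "die"
    return 1.0 if x > 0 else alpha

assert abs(leaky_relu(-5.0) + 0.05) < 1e-12   # small negative output, not 0
assert leaky_relu_grad(-5.0) == 0.01          # gradient still flows
```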

 


Generalized ReLU

 


softplus

The logistic sigmoid function is a smooth approximation of the derivative of the rectifier; equivalently, the derivative of the softplus is exactly the sigmoid.
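This relationship can be checked numerically with a finite-difference approximation of the softplus derivative:

```python
import math

def softplus(x):
    # smooth approximation of the rectifier: ln(1 + e^x)
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# central-difference check that d/dx softplus(x) == sigmoid(x)
h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)
    assert abs(numeric - sigmoid(x)) < 1e-6
```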

 


Maxout

Takes the max of k linear functions of the input, so the activation function itself is learned directly.
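A sketch with k = 2 made-up pieces; choosing the slopes 0 and 1 makes maxout reproduce the ReLU, which illustrates how it generalizes simpler activations:

```python
# k affine pieces; each entry is a hypothetical (weight, bias) pair
pieces = [(0.0, 0.0), (1.0, 0.0)]

def maxout(x, pieces):
    # max over k learned affine functions of the input
    return max(w * x + b for w, b in pieces)

assert maxout(3.0, pieces) == 3.0   # matches relu(3.0)
assert maxout(-2.0, pieces) == 0.0  # matches relu(-2.0)
```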

 


Swish: A Self-Gated Activation Function

 

Currently, the most successful and widely-used activation function is the ReLU. Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
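The “self-gated” form is x scaled by its own sigmoid, so Swish behaves like ReLU for large |x| but is smooth and slightly negative near zero:

```python
import math

def swish(x, beta=1.0):
    # self-gated: the input x is scaled by sigmoid(beta * x)
    return x / (1.0 + math.exp(-beta * x))

assert abs(swish(10.0) - 10.0) < 1e-3  # ~identity for large positive x
assert abs(swish(-10.0)) < 1e-3        # ~zero for large negative x
assert swish(-1.0) < 0                 # unlike ReLU, allows small negatives
```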


Loss Function

 


 

 

Cross-Entropy
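With a sigmoid output, the cross-entropy loss reduces to binary cross-entropy, the negative log-likelihood under a Bernoulli model. A minimal sketch:

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # average negative log-likelihood under a Bernoulli model
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip for numerical safety
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# confident, correct predictions give a small loss ...
low = binary_cross_entropy([1, 0], [0.99, 0.01])
# ... while confident, wrong predictions are punished heavily
high = binary_cross_entropy([1, 0], [0.01, 0.99])
assert low < 0.1 < high
```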


Design Choices

    • Activation function
    • Loss function
    • Output units
    • Architecture
    • Optimizer


Output Units

Output Type    Output Distribution    Output layer    Loss Function
Binary         Bernoulli              ?               Binary Cross Entropy

Output unit for binary classification

[Diagram: the network maps input X to an output unit that produces the prediction; for binary classification this unit is a sigmoid.]

Output Units

Output Type    Output Distribution    Output layer    Cost Function
Binary         Bernoulli              Sigmoid         Binary Cross Entropy
Discrete       Multinoulli            ?               Cross Entropy

Output unit for multi-class classification

[Diagram: the network maps input X to an output unit that produces class probabilities; for multi-class classification this unit is a softmax.]

SoftMax

[Diagram: the rest of the network produces a raw score for each class (A score, B score, C score); the SoftMax output unit converts these scores into Probability of A, Probability of B, and Probability of C, which sum to one.]

CS109A, Protopapas, Rader, Tanner
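A minimal implementation of the softmax transformation described above, with the standard max-subtraction trick for numerical stability:

```python
import math

def softmax(scores):
    # subtract the max score for numerical stability, then normalize
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])       # raw scores for classes A, B, C
assert abs(sum(probs) - 1.0) < 1e-12   # a valid probability distribution
assert probs[0] == max(probs)          # highest score -> highest probability
```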

Output Units

Output Type    Output Distribution    Output layer    Cost Function
Binary         Bernoulli              Sigmoid         Binary Cross Entropy
Discrete       Multinoulli            Softmax         Cross Entropy
Continuous     Gaussian               ?               MSE

Output unit for regression

[Diagram: the network maps input X to a linear (identity) output unit, producing a real-valued prediction.]

Output Units

Output Type    Output Distribution    Output layer    Cost Function
Binary         Bernoulli              Sigmoid         Binary Cross Entropy
Discrete       Multinoulli            Softmax         Cross Entropy
Continuous     Gaussian               Linear          MSE
Continuous     Arbitrary              -               GANs (Lectures 18-19 in CS109B)
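For the regression case, the linear output unit is paired with the mean squared error; a minimal sketch:

```python
def mse(y_true, y_pred):
    # mean squared error for a linear (identity) output unit
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

assert mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0  # perfect predictions
assert mse([1.0, 2.0], [2.0, 4.0]) == 2.5            # (1 + 4) / 2
```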

Loss Function

Example: sigmoid output + squared loss

Flat surfaces: the sigmoid saturates, so gradients can vanish even when the prediction is wrong.

 


Cost Function

Example: sigmoid output + cross-entropy loss

Saturates only when the model makes correct predictions
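The contrast between the two pairings can be checked directly. For a sigmoid output p = σ(z) with target y, the gradient of the squared loss with respect to z carries an extra σ′(z) factor, while the cross-entropy gradient simplifies to p − y:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    # d/dz of (sigmoid(z) - y)^2: carries an extra sigma'(z) factor
    p = sigmoid(z)
    return 2 * (p - y) * p * (1 - p)

def grad_ce(z, y):
    # d/dz of cross-entropy with a sigmoid output simplifies to p - y
    return sigmoid(z) - y

# a confidently *wrong* prediction: true label 1, but z = -10 (p ~ 0)
assert abs(grad_mse(-10.0, 1.0)) < 1e-3  # nearly flat: learning stalls
assert abs(grad_ce(-10.0, 1.0)) > 0.99   # still a strong learning signal
```

This is why the cross-entropy surface saturates only once the model is already predicting correctly.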

 



NN in action


Universal Approximation Theorem

A feed-forward network with a single hidden layer containing enough nodes can approximate any continuous function on a bounded domain to arbitrary accuracy: capacity can be bought with width (more nodes per layer) or with depth (more layers).

[Diagram: growing a network in width vs. in depth.]

Better Generalization with Depth

(Goodfellow 2017)


Shallow Nets Overfit More

(Goodfellow 2017)

The 3-layer nets perform worse on the test set, even with a similar number of total parameters.

The 11-layer net generalizes better on the test set when controlling for the number of parameters.

Depth helps, and it is not just because of more parameters.

Don’t worry about this word “convolutional”. It’s just a special type of neural network, often used for images.
