Lecture 10: Anatomy of NN, based on the course

CS109A Introduction to Data Science

Pavlos Protopapas, Kevin Rader and Chris Tanner


Outline

Anatomy of a NN

Design choices

    • Activation function
    • Loss function
    • Output units
    • Architecture

CS109A, Protopapas, Rader, Tanner


Anatomy of artificial neural network (ANN)

[Diagram: input X feeds a neuron (node), which applies an affine transformation (weights W) followed by an activation, producing the output Y.]

    • Affine transformation and activation: we will talk later about the choice of activation function. So far we have only talked about the sigmoid, but there are other choices.
    • Layers: nodes are organized into an input layer, hidden layers, and an output layer, followed by an output function and a loss function. We will talk later about the choice of the output layer and the loss function; so far we have considered a sigmoid output with a log-Bernoulli loss.
    • Depth: a network can have several hidden layers (hidden layer 1 through hidden layer n). We will talk later about the choice of the number of layers.
    • Width: each hidden layer has some number of nodes (m nodes). We will talk later about the choice of the number of nodes.
    • Inputs: the number of inputs d is specified by the data.
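The pieces above (an affine transformation Wx + b followed by an element-wise activation, stacked into layers) can be sketched in plain Python. The weights and biases below are made-up toy values, not learned parameters:

```python
import math

def sigmoid(z):
    # logistic activation: squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(x, W, b, activation):
    # affine transformation W x + b, then an element-wise activation
    return [activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

# toy network: 2 inputs -> hidden layer of 3 sigmoid nodes -> 1 sigmoid output
x = [1.0, 2.0]
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]  # hypothetical hidden weights
b1 = [0.0, 0.1, -0.1]
h = dense_layer(x, W1, b1, sigmoid)

W2 = [[0.3, -0.1, 0.2]]                      # hypothetical output weights
b2 = [0.05]
y = dense_layer(h, W2, b2, sigmoid)
```

Here the number of inputs (2) is fixed by the data, while the hidden width (3 nodes) and depth (1 hidden layer) are design choices.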


Why layers? Representation

Representation matters!


Learning Multiple Components


Depth = Repeated Compositions


Neural Networks

Hand-written digit recognition: MNIST data


Beyond Linear Models

 


Traditional ML

 


Deep Learning

 

 

  • Non-convex optimization
  • Can encode prior beliefs, generalizes well


Sigmoid (aka Logistic)

The derivative is close to zero over much of the domain, which leads to “vanishing gradients” in backpropagation.
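A minimal numeric sketch of the vanishing-gradient effect: the sigmoid’s derivative peaks at 0.25 at the origin and is essentially zero away from it:

```python
import math

def sigmoid(x):
    # logistic function: squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative sigma'(x) = sigma(x) * (1 - sigma(x)), maximal at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the peak value
print(sigmoid_grad(10.0))  # ~4.5e-05: the gradient has "vanished"
```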

 


Hyperbolic Tangent (Tanh)

Same problem of “vanishing gradients” as sigmoid.

 


Rectified Linear Unit (ReLU)

Two major advantages:

  1. No vanishing gradient when x > 0
  2. Provides sparsity (regularization) since y = 0 when x < 0
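Both advantages are visible in a direct sketch of the function and its derivative:

```python
def relu(x):
    # rectified linear unit: identity for positive inputs, zero otherwise
    return x if x > 0 else 0.0

def relu_grad(x):
    # gradient is 1 for x > 0 (no vanishing) and 0 for x < 0 (sparsity)
    return 1.0 if x > 0 else 0.0

# active unit: gradient passes through at full strength
assert relu(3.0) == 3.0 and relu_grad(3.0) == 1.0
# inactive unit: output exactly zero, contributing sparsity
assert relu(-2.0) == 0.0 and relu_grad(-2.0) == 0.0
```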

 


Leaky ReLU

  • Tries to fix “dying ReLU” problem: derivative is non-zero everywhere.
  • Some people report success with this form of activation function, but the results are not always consistent.
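A sketch of the leaky variant, assuming the common default slope of 0.01 for negative inputs:

```python
def leaky_relu(x, alpha=0.01):
    # like ReLU, but with a small slope alpha for negative inputs
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # the derivative is non-zero everywhere, so units cannot "die"
    return 1.0 if x > 0 else alpha

assert abs(leaky_relu(-5.0) + 0.05) < 1e-12   # small negative output, not 0
assert leaky_relu_grad(-5.0) == 0.01          # gradient still flows
```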

 


Generalized ReLU

 


softplus

The logistic sigmoid function is a smooth approximation of the derivative of the rectifier; equivalently, the derivative of the softplus is exactly the sigmoid.
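This relationship can be checked numerically with a finite-difference approximation of the softplus derivative:

```python
import math

def softplus(x):
    # smooth approximation of the rectifier: ln(1 + e^x)
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# central-difference check that d/dx softplus(x) == sigmoid(x)
h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)
    assert abs(numeric - sigmoid(x)) < 1e-6
```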

 


Maxout

Takes the max of k linear functions of the input, so the activation function itself is learned directly.
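A sketch with k = 2 made-up pieces; choosing the slopes 0 and 1 makes maxout reproduce the ReLU, which illustrates how it generalizes simpler activations:

```python
# k affine pieces; each entry is a hypothetical (weight, bias) pair
pieces = [(0.0, 0.0), (1.0, 0.0)]

def maxout(x, pieces):
    # max over k learned affine functions of the input
    return max(w * x + b for w, b in pieces)

assert maxout(3.0, pieces) == 3.0   # matches relu(3.0)
assert maxout(-2.0, pieces) == 0.0  # matches relu(-2.0)
```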

 


Swish: A Self-Gated Activation Function

 

Currently, the most successful and widely-used activation function is the ReLU. Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
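The “self-gated” form is x scaled by its own sigmoid, so Swish behaves like ReLU for large |x| but is smooth and slightly negative near zero:

```python
import math

def swish(x, beta=1.0):
    # self-gated: the input x is scaled by sigmoid(beta * x)
    return x / (1.0 + math.exp(-beta * x))

assert abs(swish(10.0) - 10.0) < 1e-3  # ~identity for large positive x
assert abs(swish(-10.0)) < 1e-3        # ~zero for large negative x
assert swish(-1.0) < 0                 # unlike ReLU, allows small negatives
```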


Loss Function

 


 

 

Cross-Entropy
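With a sigmoid output, the cross-entropy loss reduces to binary cross-entropy, the negative log-likelihood under a Bernoulli model. A minimal sketch:

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # average negative log-likelihood under a Bernoulli model
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip for numerical safety
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# confident, correct predictions give a small loss ...
low = binary_cross_entropy([1, 0], [0.99, 0.01])
# ... while confident, wrong predictions are punished heavily
high = binary_cross_entropy([1, 0], [0.01, 0.99])
assert low < 0.1 < high
```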


Design Choices

    • Activation function
    • Loss function
    • Output units
    • Architecture
    • Optimizer


Output Units

Output Type    Output Distribution    Output layer    Loss Function
Binary         Bernoulli              ?               Binary Cross Entropy

Output unit for binary classification

[Diagram: the network maps input X to an output unit that produces the prediction; for binary classification this unit is a sigmoid.]

Output Units

Output Type    Output Distribution    Output layer    Cost Function
Binary         Bernoulli              Sigmoid         Binary Cross Entropy
Discrete       Multinoulli            ?               Cross Entropy

Output unit for multi-class classification

[Diagram: the network maps input X to an output unit that produces class probabilities; for multi-class classification this unit is a softmax.]

SoftMax

[Diagram: the rest of the network produces a raw score for each class (A score, B score, C score); the SoftMax output unit converts these scores into Probability of A, Probability of B, and Probability of C, which sum to one.]

CS109A, Protopapas, Rader, Tanner
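A minimal implementation of the softmax transformation described above, with the standard max-subtraction trick for numerical stability:

```python
import math

def softmax(scores):
    # subtract the max score for numerical stability, then normalize
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])       # raw scores for classes A, B, C
assert abs(sum(probs) - 1.0) < 1e-12   # a valid probability distribution
assert probs[0] == max(probs)          # highest score -> highest probability
```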

Output Units

Output Type    Output Distribution    Output layer    Cost Function
Binary         Bernoulli              Sigmoid         Binary Cross Entropy
Discrete       Multinoulli            Softmax         Cross Entropy
Continuous     Gaussian               ?               MSE

Output unit for regression

[Diagram: the network maps input X to a linear (identity) output unit, producing a real-valued prediction.]

Output Units

Output Type    Output Distribution    Output layer    Cost Function
Binary         Bernoulli              Sigmoid         Binary Cross Entropy
Discrete       Multinoulli            Softmax         Cross Entropy
Continuous     Gaussian               Linear          MSE
Continuous     Arbitrary              -               GANs (Lectures 18-19 in CS109B)
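For the regression case, the linear output unit is paired with the mean squared error; a minimal sketch:

```python
def mse(y_true, y_pred):
    # mean squared error for a linear (identity) output unit
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

assert mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0  # perfect predictions
assert mse([1.0, 2.0], [2.0, 4.0]) == 2.5            # (1 + 4) / 2
```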

Loss Function

Example: sigmoid output + squared loss

Flat surfaces: the sigmoid saturates, so gradients can vanish even when the prediction is wrong.

 


Cost Function

Example: sigmoid output + cross-entropy loss

Saturates only when the model makes correct predictions
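The contrast between the two pairings can be checked directly. For a sigmoid output p = σ(z) with target y, the gradient of the squared loss with respect to z carries an extra σ′(z) factor, while the cross-entropy gradient simplifies to p − y:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    # d/dz of (sigmoid(z) - y)^2: carries an extra sigma'(z) factor
    p = sigmoid(z)
    return 2 * (p - y) * p * (1 - p)

def grad_ce(z, y):
    # d/dz of cross-entropy with a sigmoid output simplifies to p - y
    return sigmoid(z) - y

# a confidently *wrong* prediction: true label 1, but z = -10 (p ~ 0)
assert abs(grad_mse(-10.0, 1.0)) < 1e-3  # nearly flat: learning stalls
assert abs(grad_ce(-10.0, 1.0)) > 0.99   # still a strong learning signal
```

This is why the cross-entropy surface saturates only once the model is already predicting correctly.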

 



NN in action


Universal Approximation Theorem

A feed-forward network with a single hidden layer containing enough nodes can approximate any continuous function on a bounded domain to arbitrary accuracy: capacity can be bought with width (more nodes per layer) or with depth (more layers).

[Diagram: growing a network in width vs. in depth.]

Better Generalization with Depth

(Goodfellow 2017)


Shallow Nets Overfit More

(Goodfellow 2017)

The 3-layer nets perform worse on the test set, even with a similar number of total parameters.

The 11-layer net generalizes better on the test set when controlling for the number of parameters.

Depth helps, and it is not just because of more parameters.

Don’t worry about this word “convolutional”. It’s just a special type of neural network, often used for images.
