Lecture 10: Anatomy of a NN, based on the course
CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader and Chris Tanner
Outline
Anatomy of a NN
Design choices
CS109A, Protopapas, Rader, Tanner
Anatomy of artificial neural network (ANN)
[Figure: a single neuron: input X, weights W, node (neuron), output Y]
Anatomy of artificial neural network (ANN)
[Figure: a single neuron: input X, node, output Y]
Affine transformation
We will talk later about the choice of activation function. So far we have only discussed the sigmoid, but there are other choices.
Activation
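The affine transformation followed by an activation can be sketched in a few lines of NumPy (a minimal illustration; the weights and input below are made-up numbers):

```python
import numpy as np

def neuron(x, W, b):
    """One neuron: affine transformation followed by a sigmoid activation."""
    z = W @ x + b                    # affine transformation: z = Wx + b
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

x = np.array([1.0, 2.0])     # input X (d = 2)
W = np.array([0.5, -0.25])   # weights W
b = 0.1                      # bias
y = neuron(x, W, b)          # output Y, a value in (0, 1)
print(y)
```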
Anatomy of artificial neural network (ANN)
Input layer
hidden layer
output layer
Output function
Loss function
We will talk later about the choice of the output layer and the loss function. So far we have considered the sigmoid as the output and the log-Bernoulli (binary cross-entropy) loss.
Anatomy of artificial neural network (ANN)
Input layer
hidden layer 1
hidden layer 2
output layer
Anatomy of artificial neural network (ANN)
Input layer
hidden layer 1
…
hidden layer n
output layer
We will talk later about the choice of the number of layers.
Anatomy of artificial neural network (ANN)
Input layer
hidden layer 1, 3 nodes
…
hidden layer n, 3 nodes
output layer
Anatomy of artificial neural network (ANN)
Input layer
hidden layer 1, m nodes
…
hidden layer n, m nodes
output layer
We will talk later about the choice of the number of nodes.
Anatomy of artificial neural network (ANN)
Input layer
hidden layer 1, m nodes
…
hidden layer n, m nodes
output layer
The number of inputs d is specified by the data.
Anatomy of artificial neural network (ANN)
hidden layer 1
hidden layer 2
output layer
input layer
Why layers? Representation
Representation matters!
Learning Multiple Components
Depth = Repeated Compositions
Neural Networks
Hand-written digit recognition: MNIST data
Beyond Linear Models
Traditional ML
Deep Learning
Outline
Anatomy of a NN
Design choices
Sigmoid (aka Logistic)
The derivative is near zero for much of the domain. This leads to “vanishing gradients” in backpropagation.
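The vanishing gradient is easy to see numerically; the sigmoid's derivative, σ(z)(1 − σ(z)), peaks at 0.25 and is nearly zero away from the origin (a small sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)     # derivative: sigma(z) * (1 - sigma(z))

print(sigmoid_grad(0.0))     # 0.25, the maximum
print(sigmoid_grad(10.0))    # ~4.5e-05: almost no gradient signal
```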
Hyperbolic Tangent (Tanh)
Tanh has the same “vanishing gradients” problem as the sigmoid, although its outputs are zero-centered.
Rectified Linear Unit (ReLU)
Two major advantages: the gradient is constant (equal to 1) whenever the unit is active, so it does not vanish, and the function is very cheap to compute.
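Both advantages are visible in a minimal NumPy sketch of the ReLU and its gradient:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # a single elementwise max: very cheap

def relu_grad(z):
    return (z > 0).astype(float)     # 1 wherever the unit is active, else 0

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z))        # [0.  0.  0.5 2. ]
print(relu_grad(z))   # [0. 0. 1. 1.]: constant gradient when active
```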
Leaky ReLU
For z < 0, a small fixed slope α (e.g., 0.01) replaces the hard zero, so negative inputs still receive some gradient.
Generalized ReLU
Same form as the Leaky ReLU, but the slope α is a learnable parameter rather than a fixed constant.
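The Leaky and generalized ReLU share one formula; only where α comes from differs (a minimal sketch):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: a small slope alpha for z < 0 keeps gradients alive.
    With alpha fixed this is the Leaky ReLU; with alpha treated as a
    learnable parameter it becomes the generalized (parametric) ReLU."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -1.0, 2.0])
print(leaky_relu(z))          # [-0.03 -0.01  2.  ]
```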
Softplus
Softplus, log(1 + eᶻ), is a smooth approximation of the rectifier; its derivative is the logistic sigmoid.
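That the softplus's derivative is the sigmoid can be checked numerically with a central difference (a small sketch; the evaluation point 1.3 is arbitrary):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))       # log(1 + e^z), a smoothed ReLU

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Numerical derivative of softplus matches the sigmoid.
z, h = 1.3, 1e-6
numeric = (softplus(z + h) - softplus(z - h)) / (2 * h)
print(abs(numeric - sigmoid(z)))     # ~0, up to floating-point error
```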
Maxout
Max of k linear functions. Maxout directly learns the activation function.
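A single maxout unit can be sketched as the max over k affine functions of the input (the weights below are made-up numbers):

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: the max over k affine functions of the input.
    W has shape (k, d), b has shape (k,)."""
    return np.max(W @ x + b)

x = np.array([1.0, -1.0])
W = np.array([[1.0, 0.0],    # k = 3 linear pieces
              [0.0, 1.0],
              [0.5, 0.5]])
b = np.zeros(3)
print(maxout(x, W, b))       # the largest of the three affine outputs
```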
Swish: A Self-Gated Activation Function
Currently, the most successful and widely used activation function is the ReLU. Swish, defined as swish(z) = z · σ(βz), tends to work better than ReLU on deeper models across a number of challenging datasets.
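The “self-gated” form swish(z) = z · σ(βz), with β = 1 in the simplest case, is a one-liner (a minimal sketch):

```python
import numpy as np

def swish(z, beta=1.0):
    """Swish: z * sigmoid(beta * z). Self-gated: the input gates itself."""
    return z / (1.0 + np.exp(-beta * z))

# Unlike ReLU, Swish is smooth, and it approaches the identity
# for large positive inputs.
print(swish(0.0))     # 0.0
print(swish(10.0))    # ~10.0
```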
Outline
Anatomy of a NN
Design choices
Loss Function
Cross-Entropy: for a binary target y and predicted probability p, L(y, p) = −[y log p + (1 − y) log(1 − p)].
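The binary cross-entropy formula above translates directly to NumPy (a minimal sketch; the clipping constant is a common numerical safeguard, not part of the definition):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy between labels y in {0,1} and predictions p in (0,1)."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
good = binary_cross_entropy(y, np.array([0.9, 0.1, 0.8]))
bad = binary_cross_entropy(y, np.array([0.2, 0.9, 0.3]))
print(good < bad)    # confident correct predictions give lower loss
```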
Design Choices
Activation function
Loss function
Output units
Architecture
Optimizer
Output Units
Output Type | Output Distribution | Output layer | Loss Function |
Binary | Bernoulli | ? | Binary Cross Entropy |
Output unit for binary classification
[Figure: network with the output unit highlighted]
Output Units
Output Type | Output Distribution | Output layer | Loss Function |
Binary | Bernoulli | Sigmoid | Binary Cross Entropy |
Output Units
Output Type | Output Distribution | Output layer | Loss Function |
Binary | Bernoulli | Sigmoid | Binary Cross Entropy |
Discrete | Multinoulli | ? | Cross Entropy |
Output unit for multi-class classification
[Figure: network with the output unit highlighted]
SoftMax
[Figure: the rest of the network produces a score for each class A, B, C; the SoftMax output unit converts these scores into the probabilities of A, B, and C]
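The scores-to-probabilities step can be sketched in NumPy; subtracting the max score first is a standard numerical trick, not part of the definition:

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1.
    Subtracting the max score first avoids overflow in exp."""
    shifted = scores - np.max(scores)
    e = np.exp(shifted)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # scores for classes A, B, C
probs = softmax(scores)
print(probs.sum())                   # 1.0
print(probs.argmax())                # class A has the highest probability
```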
Output Units
Output Type | Output Distribution | Output layer | Loss Function |
Binary | Bernoulli | Sigmoid | Binary Cross Entropy |
Discrete | Multinoulli | Softmax | Cross Entropy |
Output Units
Output Type | Output Distribution | Output layer | Loss Function |
Binary | Bernoulli | Sigmoid | Binary Cross Entropy |
Discrete | Multinoulli | Softmax | Cross Entropy |
Continuous | Gaussian | ? | MSE |
Output unit for regression
[Figure: network with the output unit highlighted]
Output Units
Output Type | Output Distribution | Output layer | Loss Function |
Binary | Bernoulli | Sigmoid | Binary Cross Entropy |
Discrete | Multinoulli | Softmax | Cross Entropy |
Continuous | Gaussian | Linear | MSE |
Output Units
Output Type | Output Distribution | Output layer | Loss Function |
Binary | Bernoulli | Sigmoid | Binary Cross Entropy |
Discrete | Multinoulli | Softmax | Cross Entropy |
Continuous | Gaussian | Linear | MSE |
Continuous | Arbitrary | - | GANs |
Lectures 18-19 in CS109B
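The first three rows of the table (the GAN row aside) can be sketched as output-layer/loss pairings in NumPy; the dictionary and names below are illustrative, not a fixed API:

```python
import numpy as np

def sigmoid(z):            # Binary: Bernoulli output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):            # Discrete: Multinoulli output, probabilities over classes
    e = np.exp(z - np.max(z))
    return e / e.sum()

def linear(z):             # Continuous: the mean of a Gaussian
    return z

# Hypothetical lookup table pairing each output type with its
# output-layer activation and matching loss.
OUTPUT_UNITS = {
    "binary":     (sigmoid, "binary cross entropy"),
    "discrete":   (softmax, "cross entropy"),
    "continuous": (linear,  "mse"),
}

z = np.array([0.3, -1.2, 2.0])
for task, (activation, loss) in OUTPUT_UNITS.items():
    print(task, activation(z), loss)
```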
Loss Function
Example: sigmoid output + squared loss
Flat surfaces: the squared loss saturates whenever the sigmoid saturates, even when the prediction is wrong, so the gradient vanishes.
Loss Function
Example: sigmoid output + cross-entropy loss
Saturates only when the model makes correct predictions
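The difference shows up in the gradient with respect to the pre-activation z. With a sigmoid output, the squared loss's gradient vanishes even when the model is confidently wrong, while the cross-entropy gradient stays large (a small check; the gradient formulas are the standard analytic derivatives):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y, z = 1.0, -10.0          # true label 1, but the model is confidently wrong
p = sigmoid(z)             # ~4.5e-05

# Gradient of each loss w.r.t. z, for a sigmoid output:
grad_mse = 2 * (p - y) * p * (1 - p)   # squared loss: vanishes as p saturates
grad_ce = p - y                        # cross-entropy: stays large when wrong

print(abs(grad_mse))   # ~9e-05: almost no learning signal
print(abs(grad_ce))    # ~1.0: strong learning signal
```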
Design Choices
Activation function
Loss function
Output units
Architecture
Optimizer
NN in action
Universal Approximation Theorem
A feed-forward network with a single hidden layer of sufficient width can approximate any continuous function on a compact domain to arbitrary accuracy. In practice, however, increasing depth is often more effective than increasing width.
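The width part of the theorem can be illustrated with a toy fit: a single hidden tanh layer with random hidden weights, where only the linear output layer is fitted by least squares (an illustrative sketch, not a training procedure; the target function and sizes are made-up choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate sin(x) on [-3, 3] with one hidden tanh layer of width 100.
x = np.linspace(-3, 3, 200)[:, None]
y = np.sin(x).ravel()

W = rng.normal(size=(1, 100))          # random hidden weights (the "width")
b = rng.normal(size=100)
H = np.tanh(x @ W + b)                 # hidden activations, shape (200, 100)

# Fit only the linear output layer by least squares.
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)

err = np.max(np.abs(H @ w_out - y))
print(err)                             # small approximation error
```

Even without training the hidden layer, a wide enough hidden layer already yields a close approximation, which is the theorem's point.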
Better Generalization with Depth
(Goodfellow 2017)
Shallow Nets Overfit More
(Goodfellow 2017)
The 3-layer nets perform worse on the test set, even with a similar number of total parameters.
The 11-layer net generalizes better on the test set when controlling for number of parameters.
Depth helps, and it’s not just because of more parameters
Don’t worry about this word “convolutional”. It’s just a special type of neural network, often used for images.