Intro to Deep Learning
Pascal Mettes
University of Amsterdam
Who am I
Deep learning, a “recent revolution”
Deep learning, a “recent revolution”
Deep learning in one slide
Historical perspective on deep learning
1958: Perceptron, Rosenblatt
1960: Adaline, Widrow and Hoff
1969: Perceptrons, Minsky and Papert
1970: Backpropagation, Linnainmaa
1974: Backpropagation, Werbos
1986: Backpropagation, Rumelhart, Hinton and Williams
1997: LSTM, Hochreiter and Schmidhuber
1998: OCR, LeCun, Bottou, Bengio and Haffner
2006: Deep Learning, Hinton, Osindero and Teh
2009: ImageNet, Deng et al.
2012: AlexNet, Krizhevsky, Sutskever and Hinton
2015: ResNet (152 layers), MSRA
Today: Go, DeepMind
The perceptron
Single layer perceptron for binary classification.
Training a perceptron
| Perceptron learning algorithm | Comments |
| --- | --- |
| 1. Initialize the weights | |
| 2. Pick a training sample | New train image, label |
| 3. Compute the score for the sample | |
| 4. If the label is positive but the score is negative | Score too low. Increase weights! |
| 5. If the label is negative but the score is positive | Score too high. Decrease weights! |
| 6. Go to 2 | Repeat till happy ☺ |
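Steps 2-5 boil down to a single signed update. A minimal NumPy sketch of this rule (assuming ±1 labels and a fixed learning rate, which the slide leaves unspecified):

```python
import numpy as np

def train_perceptron(X, y, epochs=10, lr=1.0):
    """Perceptron learning: X has shape (n_samples, n_features), y holds +1/-1 labels."""
    w = np.zeros(X.shape[1])   # 1. initialize the weights
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):          # 2. new training sample
            score = np.dot(w, x_i) + b      # 3. compute the score
            if y_i * score <= 0:            # 4./5. score too low (or too high) for this label
                w += lr * y_i * x_i         #      move the weights toward the correct side
                b += lr * y_i
    return w, b                             # 6. repeat until happy
```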
Problems with the perceptron
Rosenblatt (1958) at US Navy press conference:
“[The perceptron is] the embryo of an electronic computer that [the Navy]
expects will be able to walk, talk, see, write, reproduce itself and be conscious
of its existence.”
Perceptrons turned out to only solve linearly separable problems.
Moravec’s paradox
Reasoning requires little computation, perception from sensors a lot.
Historical perspective on deep learning
1958: Perceptron, Rosenblatt
1960: Adaline, Widrow and Hoff
1969: Perceptrons, Minsky and Papert
1970: Backpropagation, Linnainmaa
1974: Backpropagation, Werbos
1986: Backpropagation, Rumelhart, Hinton and Williams
1997: LSTM, Hochreiter and Schmidhuber
1998: OCR, LeCun, Bottou, Bengio and Haffner
2006: Deep Learning, Hinton, Osindero and Teh
2009: ImageNet, Deng et al.
2012: AlexNet, Krizhevsky, Sutskever and Hinton
2015: ResNet (152 layers), MSRA
Today: Go, DeepMind
Limitations of the perceptron
| Input 1 | Input 2 | XOR |
| --- | --- | --- |
| 1 | 1 | -1 |
| 1 | 0 | +1 |
| 0 | 1 | +1 |
| 0 | 0 | -1 |
(Plot: the four XOR points in the Input 1 / Input 2 plane.)
No line can separate the white points from the black points: adding the constraints for the two +1 cases gives w1 + w2 + 2b > 0, while adding the constraints for the two -1 cases gives w1 + w2 + 2b < 0, so the constraints are inconsistent.
Crossroads in machine learning
Path 1:
Fix perceptrons by making better features.
Path 2:
Fix perceptrons by making them more complex.
World
Data
Test
data
Evaluation
Features
Labels
Optimization
Objective Function
Learning model
Training
data
Better features, easier machine learning
World
Data
Test
data
Evaluation
Features
Labels
Optimization
Objective Function
Learning model
Training
data
Multi-layer perceptrons
Multi-layer perceptrons learn the values of the parameters θ that result in the best function approximation.
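As a concrete example of why extra layers help, here is a small sketch (with hand-chosen weights rather than learned parameters θ) of a two-layer network that computes XOR, which no single perceptron can represent:

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)   # threshold activation

# Hand-chosen weights for a 2-2-1 network computing XOR.
W1 = np.array([[1.0, 1.0],    # hidden unit 1: fires for OR(x1, x2)
               [1.0, 1.0]])   # hidden unit 2: fires for AND(x1, x2)
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -2.0])    # output: OR minus twice AND
b2 = -0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W1 @ np.array(x) + b1)
    y = step(W2 @ h + b2)
    print(x, "->", int(y))     # prints 0, 1, 1, 0
```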
Multi-layer perceptrons
Activation functions
Non-linear activations
Non-linear activations
Which activation function is better?
Pros of sigmoid:
Bounded (usefulness depends on application)
Pleasing math
Pros of ReLU:
Easy to implement
Strong gradient signal
So ReLUs are the final answer?
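A quick sketch contrasting the two activations and their gradients; it illustrates why sigmoid gradients vanish for large inputs while ReLU keeps a constant gradient on its positive side (the specific test values are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25, vanishes for large |z|

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)  # 1 on the positive side, 0 otherwise

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(z))   # roughly [0.00005, 0.20, 0.25, 0.20, 0.00005]
print(relu_grad(z))      # [0, 0, 0, 1, 1]
```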
The end of a network: the loss function
Binary classification
Now we want the output to give a decision by squashing the score between 0 and 1.
We can do so using the sigmoid function.
(Diagram: a network with inputs x0, x1, x2, a first hidden layer h10, h11, h12, a second hidden layer h20, h21, h22, and output ŷ.)
Binary cross-entropy loss
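A minimal sketch of the sigmoid output combined with the binary cross-entropy loss, assuming labels y ∈ {0, 1} and raw scores straight from the last layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -[y log(p) + (1 - y) log(1 - p)] over the batch."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

scores = np.array([2.0, -1.0, 0.5])      # raw network outputs
y_true = np.array([1.0, 0.0, 1.0])
y_prob = sigmoid(scores)                 # squash scores into (0, 1)
print(binary_cross_entropy(y_true, y_prob))
```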
Multi-class classification
Going back: gradient descent
There is no closed-form solution to update all parameters based on the samples.
Best course of action: take ”steps” in the right direction following the laws of calculus.
Start with w_0.
For t = 1, ..., T:
    w_{t+1} = w_t - γ ∇f(w_t),
with γ a small step size.
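The same loop in a few lines of NumPy, here minimizing the toy function f(w) = (w - 3)^2 as a stand-in for a real loss (the function and step size are illustrative choices):

```python
import numpy as np

def f_grad(w):
    return 2.0 * (w - 3.0)       # derivative of f(w) = (w - 3)^2

w = 0.0                          # start with w_0
gamma = 0.1                      # small step size
for t in range(100):             # for t = 1, ..., T
    w = w - gamma * f_grad(w)    # w_{t+1} = w_t - gamma * f'(w_t)
print(w)                         # close to the minimum at w = 3
```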
Backpropagation
The neural network loss is a composite function of modules.
We want the gradient with respect to the parameters of layer l.
Backpropagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.
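A toy illustration of the chain rule that backpropagation organizes, for the composition L = (sigmoid(w·x) - y)^2 (the concrete modules and values are my own example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y, w = 1.5, 1.0, 0.2

# Forward pass: store the intermediate value of each module.
z = w * x              # linear module
a = sigmoid(z)         # activation module
L = (a - y) ** 2       # loss module

# Backward pass: multiply local derivatives, from the loss back to w.
dL_da = 2.0 * (a - y)          # d loss / d activation
da_dz = a * (1.0 - a)          # d activation / d pre-activation
dz_dw = x                      # d pre-activation / d weight
dL_dw = dL_da * da_dz * dz_dw  # chain rule
print(dL_dw)
```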
Forward-backward by example
Credit to hmkcode.github.io for example
Forward-backward by example
Step 1: Initialize parameters with random values.
Credit to hmkcode.github.io for example
Forward-backward by example
Step 2: Forward propagation given training example.
Credit to hmkcode.github.io for example
Forward-backward by example
Step 3: Calculate error at the output.
Credit to hmkcode.github.io for example
Forward-backward by example
Step 4: Backpropagate error.
Credit to hmkcode.github.io for example
Forward-backward by example
Step 4: Backpropagate error.
Credit to hmkcode.github.io for example
Forward-backward by example
Step 4: Backpropagate error.
Credit to hmkcode.github.io for example
Forward-backward by example
Step 5: Update weights.
Credit to hmkcode.github.io for example
Forward-backward by example
Step 6: Repeat.
Credit to hmkcode.github.io for example
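Putting the six steps together, a compact sketch of the whole loop for a small two-layer linear network; the weights, learning rate, and shapes are illustrative rather than the exact numbers from the hmkcode example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: initialize parameters with random values (2 inputs -> 2 hidden -> 1 output, linear units).
W1 = rng.normal(size=(2, 2))
W2 = rng.normal(size=(2, 1))

x = np.array([[2.0, 3.0]])     # one training example
t = np.array([[1.0]])          # its target value

for step in range(100):
    # Step 2: forward propagation.
    h = x @ W1                 # hidden activations
    y = h @ W2                 # network output
    # Step 3: calculate the error at the output (squared error).
    error = 0.5 * np.sum((y - t) ** 2)
    # Step 4: backpropagate the error.
    d_y = y - t                # gradient at the output
    dW2 = h.T @ d_y
    d_h = d_y @ W2.T           # gradient flowing back into the hidden layer
    dW1 = x.T @ d_h
    # Step 5: update the weights.
    lr = 0.01
    W1 -= lr * dW1
    W2 -= lr * dW2
# Step 6: repeat (the loop above); the error shrinks over the iterations.
print("final error:", error)
```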
Gradient descent, a greedy approach
Stochastic gradient descent
Gradient descent: calculate the gradients over the entire dataset and perform a single update.
Stochastic gradient descent: perform a parameter update for each sample (or mini-batch of samples).
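A sketch of the stochastic variant in NumPy, assuming a generic grad_loss(w, X_batch, y_batch) function (a placeholder name) that returns the gradient on whatever data it receives:

```python
import numpy as np

def sgd(w, X, y, grad_loss, lr=0.01, epochs=10, batch_size=32):
    """Update parameters on one mini-batch at a time instead of the full dataset."""
    n = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(n)           # shuffle the samples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # one mini-batch
            w = w - lr * grad_loss(w, X[idx], y[idx])
    return w
```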
Momentum
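A minimal sketch of the classical momentum update; the coefficient beta = 0.9 is a common default rather than a value from the slides:

```python
import numpy as np

def momentum_step(w, velocity, grad, lr=0.01, beta=0.9):
    """Accumulate an exponentially decaying average of past gradients and step along it."""
    velocity = beta * velocity - lr * grad   # keep moving in the previous direction
    w = w + velocity
    return w, velocity
```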
Setting the step-size in gradient descent
When should we stop the gradient descent algorithm?
Visual overview of gradient descent variants
Cyan = gradient descent, magenta = w/ momentum, white = AdaGrad, green = RMSProp, blue = Adam
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
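As one example of the adaptive methods shown in the figure, a sketch of a single Adam-style update with the commonly used default hyperparameters (my choice of presentation, not the exact pseudocode from the slides):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum on the gradient plus a per-parameter adaptive step size."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)                  # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```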
Break
Summary so far
Historical perspective on deep learning
1958: Perceptron, Rosenblatt
1960: Adaline, Widrow and Hoff
1969: Perceptrons, Minsky and Papert
1970: Backpropagation, Linnainmaa
1974: Backpropagation, Werbos
1986: Backpropagation, Rumelhart, Hinton and Williams
1997: LSTM, Hochreiter and Schmidhuber
1998: OCR, LeCun, Bottou, Bengio and Haffner
2006: Deep Learning, Hinton, Osindero and Teh
2009: ImageNet, Deng et al.
2012: AlexNet, Krizhevsky, Sutskever and Hinton
2015: ResNet (152 layers), MSRA
2020s: Transformers, diffusion, foundation models
Why did deep learning work in the end?
Foundational building block: the convolution
Consider an image of size 224x224x3 and a fully connected hidden layer 1 with 1024 units.
How many parameters will layer 1 have?
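Worked out: flattening the image gives 224 × 224 × 3 = 150,528 inputs, so a fully connected layer with 1,024 units needs 150,528 × 1,024 ≈ 154 million weights (plus 1,024 biases). This is exactly the cost that convolutions, with their small shared filters, avoid.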
The convolutional operator
2D convolutions step-by-step
Input image: 7x7
Filter size: 3x3
Do the convolution by sliding the filter over all possible image locations.
What is the size of the output?
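Assuming stride 1 and no padding, the 3x3 filter fits at 7 - 3 + 1 = 5 positions along each axis, so the output is 5x5. In general, the output size is (input - filter)/stride + 1 per dimension (before any padding).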
Let’s test your convolution intuition
Source: D. Lowe
Filter [0 0 0; 0 1 0; 0 0 0]: Original → Filtered (no change).
Let’s test your convolution intuition
Source: D. Lowe
Filter [0 0 0; 1 0 0; 0 0 0]: Original → Filtered (shift left).
Let’s test your convolution intuition
Source: D. Lowe
Filter [1 1 1; 1 1 1; 1 1 1]: Original → Filtered (blur).
Let’s test your convolution intuition
Source: D. Lowe
Filter [0 0 0; 0 2 0; 0 0 0] minus [1 1 1; 1 1 1; 1 1 1]: Original → Filtered (sharpening).
Convolutional networks
Take an image as input, pass it through several layers of convolutional filters, and predict a label.
We want to “learn” the filters that help us recognize classes.
Cow
The convolutional layer
Q1: What does the 3 mean?
RGB
Q2: What is the output size for D filters (with padding)?
32x32xD
Q3: Does the output size depend on the input size or the filter size?
Input size
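A quick shape check of Q1 and Q2 in PyTorch (a convenience choice; the slides are framework-agnostic). The filter size 5x5 and D = 16 are assumptions for the sake of the example:

```python
import torch
import torch.nn as nn

# Assumed sizes: a 32x32 RGB input and D = 16 filters of size 5x5.
x = torch.randn(1, 3, 32, 32)                      # (batch, channels, height, width)
conv = nn.Conv2d(in_channels=3, out_channels=16,   # the 3 is the RGB input depth (Q1)
                 kernel_size=5, padding=2)         # padding=2 keeps the 32x32 spatial size
y = conv(x)
print(y.shape)   # torch.Size([1, 16, 32, 32])  ->  32x32xD (Q2)
```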
Convolutional networks
2015+: Is there a limit to gradient learning?
2020+: The era of scale
Scaling vision: self-supervised learning
Self-supervised learning
Procedure of self-supervised learning
Example proxy tasks:
1. Transform the input image.
2. Pull the image and its transformation together in embedding space; push away the other images in the batch (a sketch of such a contrastive objective follows below).
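A minimal sketch of such a pull/push (contrastive) objective in NumPy; it assumes z and z_aug are L2-normalized embeddings of each image and its transformation, and it simplifies the losses used in practice (e.g. SimCLR-style objectives):

```python
import numpy as np

def contrastive_loss(z, z_aug, temperature=0.1):
    """z, z_aug: (batch, dim) embeddings of images and their transformations, L2-normalized.
    Pull each pair (z[i], z_aug[i]) together, push z[i] away from the other images' embeddings."""
    sim = z @ z_aug.T / temperature            # (batch, batch) cosine similarities
    # Row i should assign the highest similarity to column i (its own transformation).
    log_prob = sim - np.log(np.sum(np.exp(sim), axis=1, keepdims=True))  # log-softmax per row
    return -np.mean(np.diag(log_prob))

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32)); z /= np.linalg.norm(z, axis=1, keepdims=True)
z_aug = z + 0.05 * rng.normal(size=(8, 32)); z_aug /= np.linalg.norm(z_aug, axis=1, keepdims=True)
print(contrastive_loss(z, z_aug))
```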
Scaling text: the web + parameters
Scaling everything: transformers
Where are we now?
Deep learning: stacking layers of neurons and updating their weights with backpropagation.
Early-stage DL: develop the architectures and tricks to make deep learning work.
Late-stage DL: scale parameters, scale data, scale GPU usage.
Result: remarkable outcomes, but what is the limit of scale?