1 of 68

Intro to Deep Learning

Pascal Mettes

University of Amsterdam

2 of 68

Who am I

3 of 68

Deep learning, a “recent revolution”

4 of 68

Deep learning, a “recent revolution”

5 of 68

Deep learning in one slide


6 of 68

Historical perspective on deep learning

1958   Perceptrons, Rosenblatt
1960   Adaline, Widrow and Hoff
1969   Perceptrons, Minsky and Papert
1970   Backpropagation, Linnainmaa
1974   Backpropagation, Werbos
1986   Backpropagation, Rumelhart, Hinton and Williams
1997   LSTM, Hochreiter and Schmidhuber
1998   OCR, LeCun, Bottou, Bengio and Haffner
2006   Deep Learning, Hinton, Osindero and Teh
2009   ImageNet, Deng et al.
2012   AlexNet, Krizhevsky, Sutskever and Hinton
2015   ResNet (152 layers), MSRA
today  Go, DeepMind

7 of 68

8 of 68

The perceptron

Single-layer perceptron for binary classification (a minimal sketch follows below).

  • One weight w_i per input x_i
  • Multiply each input by its weight, sum the results, and add a bias
  • If the result is larger than a threshold, return 1; otherwise return 0
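
A minimal sketch of this in Python, assuming NumPy; the function name perceptron_predict and the AND example are only for illustration, with the threshold absorbed into the bias:

    import numpy as np

    def perceptron_predict(x, w, b):
        """Weighted sum of the inputs plus a bias, thresholded at 0."""
        score = np.dot(w, x) + b
        return 1 if score > 0 else 0

    # Hand-set weights that make the perceptron compute logical AND.
    w = np.array([1.0, 1.0])
    b = -1.5
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, perceptron_predict(np.array(x, dtype=float), w, b))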

9 of 68

Training a perceptron

Perceptron learning algorithm (a minimal sketch follows below):

1. Initialize the weights.
2. Take a new training image and its label.
3. Compute the perceptron score.
4. Score too low? Increase the weights!
5. Score too high? Decrease the weights!
6. Go to 2 and repeat till happy ☺
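
A hedged sketch of this loop, assuming a learning rate eta, targets in {0, 1}, and the classic perceptron rule w += eta * (label - prediction) * x; the exact update on the slide may differ:

    import numpy as np

    def train_perceptron(X, y, epochs=10, eta=0.1):
        """Nudge the weights up when the score is too low, down when too high."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):                 # repeat till happy
            for x_i, y_i in zip(X, y):          # new training example and label
                pred = 1 if np.dot(w, x_i) + b > 0 else 0
                w += eta * (y_i - pred) * x_i   # too low -> increase, too high -> decrease
                b += eta * (y_i - pred)
        return w, b

    # Toy linearly separable problem: logical OR.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 1])
    w, b = train_perceptron(X, y)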

10 of 68

Problems with the perceptron

Rosenblatt (1958) at US Navy press conference:

“[The perceptron is] the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”

Perceptrons turned out to only solve linearly separable problems.

11 of 68

Moravec’s paradox

Reasoning requires little computation; perception from sensors requires a lot.

12 of 68

Historical perspective on deep learning

1958   Perceptrons, Rosenblatt
1960   Adaline, Widrow and Hoff
1969   Perceptrons, Minsky and Papert
1970   Backpropagation, Linnainmaa
1974   Backpropagation, Werbos
1986   Backpropagation, Rumelhart, Hinton and Williams
1997   LSTM, Hochreiter and Schmidhuber
1998   OCR, LeCun, Bottou, Bengio and Haffner
2006   Deep Learning, Hinton, Osindero and Teh
2009   ImageNet, Deng et al.
2012   AlexNet, Krizhevsky, Sutskever and Hinton
2015   ResNet (152 layers), MSRA
today  Go, DeepMind

13 of 68

Limitations of the perceptron

Input 1   Input 2   XOR
   1         1       -1
   1         0       +1
   0         1       +1
   0         0       -1

[Plot: the four XOR points in the (Input 1, Input 2) plane]

No line can separate the white points from the black points: the constraints a single perceptron would have to satisfy are inconsistent, so XOR is not linearly separable.

14 of 68

Crossroads in machine learning

Path 1:

Fix perceptrons by making better features.

Path 2:

Fix perceptrons by making them more complex.

15 of 68

[Diagram: the machine learning pipeline: World, Data (Training data and Test data), Features, Labels, Learning model, Objective Function, Optimization, Evaluation]

16 of 68

Better features, easier machine learning

17 of 68

[Diagram: the same machine learning pipeline as on slide 16]

18 of 68

Multi-layer perceptrons

Multi-layered perceptrons

  • Also called multi-layer perceptrons (MLPs)
  • A series of linear layers with non-linear activation functions in between

Training learns the values of the parameters θ that result in the best function approximation (see the sketch below).
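
A minimal sketch of a two-layer MLP forward pass, assuming NumPy, a ReLU hidden activation, and layer sizes chosen only for illustration:

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    def mlp_forward(x, params):
        """Two linear layers with a non-linear activation in between."""
        W1, b1, W2, b2 = params
        h = relu(W1 @ x + b1)   # hidden layer
        return W2 @ h + b2      # output layer (raw scores)

    rng = np.random.default_rng(0)
    params = (rng.normal(size=(16, 8)), np.zeros(16),   # layer 1: 8 -> 16
              rng.normal(size=(3, 16)), np.zeros(3))    # layer 2: 16 -> 3
    print(mlp_forward(rng.normal(size=8), params))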

19 of 68

Multi-layer perceptrons

20 of 68

Activation functions


21 of 68

Non-linear activations


22 of 68

Non-linear activations

 

23 of 68

Which activation function is better?

Pros of sigmoid:

  • Bounded (usefulness depends on the application)
  • Pleasing math (a simple, smooth derivative)

Pros of ReLU:

  • Easy to implement
  • Strong gradient signal (no saturation for positive inputs)

Both are compared in the sketch below.
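
A small sketch comparing the two activations and their gradients, assuming NumPy; the helper names are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1 - s)            # pleasing math, but at most 0.25 and vanishing for large |z|

    def relu(z):
        return np.maximum(0, z)

    def relu_grad(z):
        return (z > 0).astype(float)  # constant, strong gradient for positive inputs

    z = np.array([-5.0, 0.0, 5.0])
    print(sigmoid_grad(z), relu_grad(z))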

24 of 68

So ReLU’s are the final answer?

25 of 68

The end of a network: the loss function


26 of 68

Binary classification

Now we want the output to give a decision, by squashing it between 0 and 1. We can do so using the sigmoid function.

[Diagram: network with inputs x0, x1, x2, hidden units h10, h11, h12 and h20, h21, h22, and a single sigmoid output]

27 of 68

Binary cross-entropy loss

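A minimal sketch of binary cross-entropy on sigmoid outputs, assuming NumPy; the clipping constant eps is an implementation detail added here only for numerical safety:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def binary_cross_entropy(y_true, y_prob, eps=1e-12):
        """BCE: -[y log p + (1 - y) log(1 - p)], averaged over the batch."""
        y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
        return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

    scores = np.array([2.0, -1.0, 0.5])   # raw network outputs
    probs = sigmoid(scores)               # squashed into (0, 1)
    labels = np.array([1.0, 0.0, 1.0])
    print(binary_cross_entropy(labels, probs))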

28 of 68

Multi-class classification


29 of 68

Going back: gradient descent

There is no closed-form solution for updating all parameters based on the training samples.

Best course of action: take "steps" in the right direction, following the rules of calculus.

Start with w_0.
For t = 1, ..., T:
    w_{t+1} = w_t - γ · ∇f(w_t)
with γ a small step size (the learning rate).
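
A minimal sketch of this loop in plain Python; the toy objective f(w) = (w - 3)^2 and the values of gamma and T are chosen only for illustration:

    def gradient_descent(grad_f, w0, gamma=0.1, T=100):
        """w_{t+1} = w_t - gamma * grad f(w_t), repeated for T steps."""
        w = w0
        for _ in range(T):
            w = w - gamma * grad_f(w)
        return w

    # Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the result approaches 3.
    print(gradient_descent(lambda w: 2 * (w - 3.0), w0=0.0))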

30 of 68

Backpropagation

The neural network loss is a composite function of modules.

We want the gradient of the loss with respect to the parameters of layer l.

Backpropagation is an algorithm that applies the chain rule with a specific order of operations that is highly efficient.
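
To make that order of operations concrete, here is a hand-worked forward and backward pass on a tiny two-parameter network; it is a generic example, not the exact numbers of the worked example that follows:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Tiny network: h = sigmoid(w1 * x), y_hat = w2 * h, loss = 0.5 * (y_hat - y)^2
    x, y = 1.5, 0.8
    w1, w2 = 0.4, -0.3

    # Forward pass
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: chain rule from the loss back to each parameter
    dL_dyhat = y_hat - y
    dL_dw2 = dL_dyhat * h             # d y_hat / d w2 = h
    dL_dh = dL_dyhat * w2             # d y_hat / d h  = w2
    dL_dw1 = dL_dh * h * (1 - h) * x  # d h / d w1 = sigmoid'(w1 * x) * x

    print(loss, dL_dw1, dL_dw2)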

31 of 68

Forward-backward by example

Credit to hmkcode.github.io for the worked example in this and the following slides.

32 of 68

Forward-backward by example

Step 1: Initialize parameters with random values.


33 of 68

Forward-backward by example

Step 2: Forward propagation given training example.


34 of 68

Forward-backward by example

Step 3: Calculate error at the output.


35 of 68

Forward-backward by example

Step 4: Backpropagate error.


36 of 68

Forward-backward by example

Step 4: Backpropagate error.


37 of 68

Forward-backward by example

Step 4: Backpropagate error.


38 of 68

Forward-backward by example

Step 5: Update weights.


39 of 68

Forward-backward by example

Step 6: Repeat.


40 of 68

Gradient descent, a greedy approach


41 of 68

Stochastic gradient descent

Gradient descent: calculate the gradient over the entire dataset and perform a single update per pass.

Stochastic gradient descent: perform a parameter update for each sample (or mini-batch of samples), as in the sketch below.
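
A minimal sketch of a mini-batch SGD loop, assuming NumPy and a least-squares toy problem; the batch size, learning rate, and gradient function are illustrative choices:

    import numpy as np

    def sgd(X, y, grad_fn, w0, gamma=0.01, epochs=10, batch_size=32):
        """Update the parameters after every mini-batch instead of once per full pass."""
        w, n = w0, len(X)
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            order = rng.permutation(n)                      # reshuffle each epoch
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                w = w - gamma * grad_fn(w, X[idx], y[idx])  # step on the mini-batch gradient
        return w

    # Toy least-squares problem; the mini-batch gradient is 2/m * X^T (Xw - y).
    X = np.random.default_rng(1).normal(size=(256, 4))
    true_w = np.array([1.0, -2.0, 0.5, 3.0])
    y = X @ true_w
    grad = lambda w, Xb, yb: 2.0 / len(Xb) * Xb.T @ (Xb @ w - yb)
    print(sgd(X, y, grad, w0=np.zeros(4), gamma=0.1, epochs=50))  # approaches true_w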

42 of 68

Momentum

43 of 68

Setting the step-size in gradient descent

When should we stop the gradient descent algorithm?

44 of 68

Visual overview of gradient descent variants

Cyan = gradient descent, magenta = w/ momentum, white = AdaGrad, green = RMSProp, blue = Adam

https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c

45 of 68

Break

46 of 68

Summary so far

47 of 68

Historical perspective on deep learning

1958   Perceptrons, Rosenblatt
1960   Adaline, Widrow and Hoff
1969   Perceptrons, Minsky and Papert
1970   Backpropagation, Linnainmaa
1974   Backpropagation, Werbos
1986   Backpropagation, Rumelhart, Hinton and Williams
1997   LSTM, Hochreiter and Schmidhuber
1998   OCR, LeCun, Bottou, Bengio and Haffner
2006   Deep Learning, Hinton, Osindero and Teh
2009   ImageNet, Deng et al.
2012   AlexNet, Krizhevsky, Sutskever and Hinton
2015   ResNet (152 layers), MSRA
2020s  Transformers, diffusion, foundation models

48 of 68

Why did deep learning work in the end?

  1. Data
  2. Hardware
  3. Open-source software
  4. Tricks
  5. Deep learning beyond recognition

49 of 68

Foundational building block: the convolution

Consider an image of size 224x224x3 and a first hidden layer with 1024 dimensions.

How many parameters will layer 1 have?
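
A back-of-the-envelope count for a single fully connected layer, assuming the image is flattened and every input connects to every hidden unit:

    inputs = 224 * 224 * 3       # flattened image: 150,528 values
    hidden = 1024
    weights = inputs * hidden    # 154,140,672 weights
    biases = hidden
    print(weights + biases)      # 154,141,696 parameters for this single layer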

50 of 68

The convolutional operator


51 of 68

2D convolutions step-by-step

Input image: 7x7

Filter size: 3x3

Do the convolution by sliding the filter over all possible image locations.

What is the size of the output?
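
A minimal sketch of the sliding-window computation, assuming NumPy, no padding, and stride 1; like deep learning libraries, it does not flip the kernel:

    import numpy as np

    def conv2d(image, kernel):
        """Slide the kernel over every valid position (no padding, stride 1)."""
        H, W = image.shape
        k = kernel.shape[0]
        out = np.zeros((H - k + 1, W - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
        return out

    image = np.arange(49, dtype=float).reshape(7, 7)   # 7x7 input
    kernel = np.ones((3, 3)) / 9.0                     # 3x3 box (blur) filter
    print(conv2d(image, kernel).shape)                 # (5, 5): the output is 5x5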

52 of 68

Let’s test your convolution intuition

Source: D. Lowe

Filter:

    0  0  0
    0  1  0
    0  0  0

Original -> Filtered: no change

53 of 68

Let’s test your convolution intuition

Original -> Filtered: shift left

Filter:

    0  0  0
    1  0  0
    0  0  0

54 of 68

Let’s test your convolution intuition

Original -> Filtered: blur

Filter (3x3 box filter):

    1  1  1
    1  1  1
    1  1  1

55 of 68

Let’s test your convolution intuition

Original -> Filtered: sharpening

Filter:

    0  0  0       1  1  1
    0  2  0   -   1  1  1
    0  0  0       1  1  1

56 of 68

Convolutional networks

Take an image as input, pass it through several layers of convolutional filters, and predict a label (e.g. "cow").

We want to "learn" the filters that help us recognize the classes.

57 of 68

The convolutional layer

Q1: What does the 3 mean?
A: The three RGB color channels.

Q2: What is the output size for D filters (with padding)?
A: 32x32xD.

Q3: Does the output size depend on the input size or the filter size?
A: The input size (see the sketch below).
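
A small helper for the standard output-size formula (N - F + 2P) / S + 1; the 5x5 filter with padding 2 below is an assumed example, not necessarily the one on the slide:

    def conv_output_size(n_in, filter_size, padding, stride=1):
        """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
        return (n_in - filter_size + 2 * padding) // stride + 1

    # A 32x32 input with 5x5 filters and padding 2 keeps its spatial size,
    # so D such filters produce a 32 x 32 x D output volume.
    print(conv_output_size(32, 5, 2))  # 32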

58 of 68

Convolutional networks

59 of 68

2015+: Is there a limit to gradient learning?

60 of 68

2020+: The era of scale

61 of 68

62 of 68

Scaling vision: self-supervised learning

63 of 68

Self-supervised learning

64 of 68

Procedure of self-supervised learning

Example proxy task:

1. Transform the input image.

2. Pull the image and its transformation together in embedding space, and push away the other images in the batch (see the sketch below).
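
A simplified, NumPy-only sketch of such a contrastive objective (an InfoNCE-style loss over one batch); it is not the exact loss of any particular method, and the temperature tau and the toy embeddings are illustrative:

    import numpy as np

    def contrastive_loss(z, z_aug, tau=0.1):
        """Pull each image toward its own transformed view, push it from the rest of the batch."""
        z = z / np.linalg.norm(z, axis=1, keepdims=True)
        z_aug = z_aug / np.linalg.norm(z_aug, axis=1, keepdims=True)
        logits = z @ z_aug.T / tau                   # pairwise similarities, scaled by a temperature
        log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))          # the diagonal holds the positive pairs

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(8, 32))                 # embeddings of 8 images
    views = embeddings + 0.05 * rng.normal(size=(8, 32))  # embeddings of their transformed views
    print(contrastive_loss(embeddings, views))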

65 of 68

Scaling text: the web + parameters

66 of 68

67 of 68

Scaling everything: transformers

68 of 68

Where are we now?

Deep learning: updating the weights of neurons across layers with backpropagation.

Early-stage DL: develop the architectures and tricks to make deep learning work.

Late-stage DL: scale parameters, scale data, scale GPU usage.

Result: remarkable outcomes, but what is the limit of scale?