1 of 68

Intro to Deep Learning

Pascal Mettes

University of Amsterdam

2 of 68

Who am I

3 of 68

Deep learning, a “recent revolution”

4 of 68

Deep learning, a “recent revolution”

5 of 68

Deep learning in one slide


6 of 68

Historical perspective on deep learning

1958   Perceptrons, Rosenblatt
1960   Adaline, Widrow and Hoff
1969   Perceptrons, Minsky and Papert
1970   Backpropagation, Linnainmaa
1974   Backpropagation, Werbos
1986   Backpropagation, Rumelhart, Hinton and Williams
1997   LSTM, Hochreiter and Schmidhuber
1998   OCR, LeCun, Bottou, Bengio and Haffner
2006   Deep Learning, Hinton, Osindero and Teh
2009   ImageNet, Deng et al.
2012   AlexNet, Krizhevsky, Sutskever and Hinton
2015   ResNet (152 layers), MSRA
today  Go, DeepMind

7 of 68

8 of 68

The perceptron

Single-layer perceptron for binary classification (a minimal sketch follows below).

  • One weight w_i per input x_i
  • Multiply each input by its weight, sum the results, and add a bias
  • If the result is larger than a threshold, return 1; otherwise return 0
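
A minimal sketch of this in Python, assuming NumPy; the function name perceptron_predict and the AND example are only for illustration, with the threshold absorbed into the bias:

    import numpy as np

    def perceptron_predict(x, w, b):
        """Weighted sum of the inputs plus a bias, thresholded at 0."""
        score = np.dot(w, x) + b
        return 1 if score > 0 else 0

    # Hand-set weights that make the perceptron compute logical AND.
    w = np.array([1.0, 1.0])
    b = -1.5
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, perceptron_predict(np.array(x, dtype=float), w, b))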

9 of 68

Training a perceptron

Perceptron learning algorithm (a minimal sketch follows below):

1. Initialize the weights.
2. Take a new training image and its label.
3. Compute the perceptron score.
4. Score too low? Increase the weights!
5. Score too high? Decrease the weights!
6. Go to 2 and repeat till happy ☺
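
A hedged sketch of this loop, assuming a learning rate eta, targets in {0, 1}, and the classic perceptron rule w += eta * (label - prediction) * x; the exact update on the slide may differ:

    import numpy as np

    def train_perceptron(X, y, epochs=10, eta=0.1):
        """Nudge the weights up when the score is too low, down when too high."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):                 # repeat till happy
            for x_i, y_i in zip(X, y):          # new training example and label
                pred = 1 if np.dot(w, x_i) + b > 0 else 0
                w += eta * (y_i - pred) * x_i   # too low -> increase, too high -> decrease
                b += eta * (y_i - pred)
        return w, b

    # Toy linearly separable problem: logical OR.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 1])
    w, b = train_perceptron(X, y)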

10 of 68

Problems with the perceptron

Rosenblatt (1958) at US Navy press conference:

“[The perceptron is] the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”

Perceptrons turned out to only solve linearly separable problems.

11 of 68

Moravec’s paradox

Reasoning requires little computation; perception from sensors requires a lot.

12 of 68

Historical perspective on deep learning

1958   Perceptrons, Rosenblatt
1960   Adaline, Widrow and Hoff
1969   Perceptrons, Minsky and Papert
1970   Backpropagation, Linnainmaa
1974   Backpropagation, Werbos
1986   Backpropagation, Rumelhart, Hinton and Williams
1997   LSTM, Hochreiter and Schmidhuber
1998   OCR, LeCun, Bottou, Bengio and Haffner
2006   Deep Learning, Hinton, Osindero and Teh
2009   ImageNet, Deng et al.
2012   AlexNet, Krizhevsky, Sutskever and Hinton
2015   ResNet (152 layers), MSRA
today  Go, DeepMind

13 of 68

Limitations of the perceptron

Input 1   Input 2   XOR
   1         1       -1
   1         0       +1
   0         1       +1
   0         0       -1

[Plot: the four XOR points in the (Input 1, Input 2) plane]

No line can separate the white points from the black points: the constraints a single perceptron would have to satisfy are inconsistent, so XOR is not linearly separable.

14 of 68

Crossroads in machine learning

Path 1:

Fix perceptrons by making better features.

Path 2:

Fix perceptrons by making them more complex.

15 of 68

[Diagram: the machine learning pipeline: World, Data (Training data and Test data), Features, Labels, Learning model, Objective Function, Optimization, Evaluation]

16 of 68

Better features, easier machine learning

17 of 68

[Diagram: the same machine learning pipeline as on slide 16]

18 of 68

Multi-layer perceptrons

Multi-layered perceptrons

  • Also called multi-layer perceptrons (MLPs)
  • A series of linear layers with non-linear activation functions in between

Training learns the values of the parameters θ that result in the best function approximation (see the sketch below).
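
A minimal sketch of a two-layer MLP forward pass, assuming NumPy, a ReLU hidden activation, and layer sizes chosen only for illustration:

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    def mlp_forward(x, params):
        """Two linear layers with a non-linear activation in between."""
        W1, b1, W2, b2 = params
        h = relu(W1 @ x + b1)   # hidden layer
        return W2 @ h + b2      # output layer (raw scores)

    rng = np.random.default_rng(0)
    params = (rng.normal(size=(16, 8)), np.zeros(16),   # layer 1: 8 -> 16
              rng.normal(size=(3, 16)), np.zeros(3))    # layer 2: 16 -> 3
    print(mlp_forward(rng.normal(size=8), params))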

19 of 68

Multi-layer perceptrons

20 of 68

Activation functions


21 of 68

Non-linear activations


22 of 68

Non-linear activations

 

23 of 68

Which activation function is better?

Pros of sigmoid:

  • Bounded (usefulness depends on the application)
  • Pleasing math (a simple, smooth derivative)

Pros of ReLU:

  • Easy to implement
  • Strong gradient signal (no saturation for positive inputs)

Both are compared in the sketch below.
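
A small sketch comparing the two activations and their gradients, assuming NumPy; the helper names are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1 - s)            # pleasing math, but at most 0.25 and vanishing for large |z|

    def relu(z):
        return np.maximum(0, z)

    def relu_grad(z):
        return (z > 0).astype(float)  # constant, strong gradient for positive inputs

    z = np.array([-5.0, 0.0, 5.0])
    print(sigmoid_grad(z), relu_grad(z))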

24 of 68

So ReLU’s are the final answer?

25 of 68

The end of a network: the loss function


26 of 68

Binary classification

Now we want the output to give a decision, by squashing it between 0 and 1. We can do so using the sigmoid function.

[Diagram: network with inputs x0, x1, x2, hidden units h10, h11, h12 and h20, h21, h22, and a single sigmoid output]

27 of 68

Binary cross-entropy loss

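A minimal sketch of binary cross-entropy on sigmoid outputs, assuming NumPy; the clipping constant eps is an implementation detail added here only for numerical safety:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def binary_cross_entropy(y_true, y_prob, eps=1e-12):
        """BCE: -[y log p + (1 - y) log(1 - p)], averaged over the batch."""
        y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
        return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

    scores = np.array([2.0, -1.0, 0.5])   # raw network outputs
    probs = sigmoid(scores)               # squashed into (0, 1)
    labels = np.array([1.0, 0.0, 1.0])
    print(binary_cross_entropy(labels, probs))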

28 of 68

Multi-class classification


29 of 68

Going back: gradient descent

There is no closed-form solution for updating all parameters based on the training samples.

Best course of action: take "steps" in the right direction, following the rules of calculus.

Start with w_0.
For t = 1, ..., T:
    w_{t+1} = w_t - γ · ∇f(w_t)
with γ a small step size (the learning rate).
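
A minimal sketch of this loop in plain Python; the toy objective f(w) = (w - 3)^2 and the values of gamma and T are chosen only for illustration:

    def gradient_descent(grad_f, w0, gamma=0.1, T=100):
        """w_{t+1} = w_t - gamma * grad f(w_t), repeated for T steps."""
        w = w0
        for _ in range(T):
            w = w - gamma * grad_f(w)
        return w

    # Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the result approaches 3.
    print(gradient_descent(lambda w: 2 * (w - 3.0), w0=0.0))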

30 of 68

Backpropagation

The neural network loss is a composite function of modules.

We want the gradient of the loss with respect to the parameters of layer l.

Backpropagation is an algorithm that applies the chain rule with a specific order of operations that is highly efficient.
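
To make that order of operations concrete, here is a hand-worked forward and backward pass on a tiny two-parameter network; it is a generic example, not the exact numbers of the worked example that follows:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Tiny network: h = sigmoid(w1 * x), y_hat = w2 * h, loss = 0.5 * (y_hat - y)^2
    x, y = 1.5, 0.8
    w1, w2 = 0.4, -0.3

    # Forward pass
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: chain rule from the loss back to each parameter
    dL_dyhat = y_hat - y
    dL_dw2 = dL_dyhat * h             # d y_hat / d w2 = h
    dL_dh = dL_dyhat * w2             # d y_hat / d h  = w2
    dL_dw1 = dL_dh * h * (1 - h) * x  # d h / d w1 = sigmoid'(w1 * x) * x

    print(loss, dL_dw1, dL_dw2)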

31 of 68

Forward-backward by example

Credit to hmkcode.github.io for the worked example in this and the following slides.

32 of 68

Forward-backward by example

Step 1: Initialize parameters with random values.


33 of 68

Forward-backward by example

Step 2: Forward propagation given training example.


34 of 68

Forward-backward by example

Step 3: Calculate error at the output.


35 of 68

Forward-backward by example

Step 4: Backpropagate error.


36 of 68

Forward-backward by example

Step 4: Backpropagate error.


37 of 68

Forward-backward by example

Step 4: Backpropagate error.


38 of 68

Forward-backward by example

Step 5: Update weights.


39 of 68

Forward-backward by example

Step 6: Repeat.


40 of 68

Gradient descent, a greedy approach


41 of 68

Stochastic gradient descent

Gradient descent: calculate the gradient over the entire dataset and perform a single update per pass.

Stochastic gradient descent: perform a parameter update for each sample (or mini-batch of samples), as in the sketch below.
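
A minimal sketch of a mini-batch SGD loop, assuming NumPy and a least-squares toy problem; the batch size, learning rate, and gradient function are illustrative choices:

    import numpy as np

    def sgd(X, y, grad_fn, w0, gamma=0.01, epochs=10, batch_size=32):
        """Update the parameters after every mini-batch instead of once per full pass."""
        w, n = w0, len(X)
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            order = rng.permutation(n)                      # reshuffle each epoch
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                w = w - gamma * grad_fn(w, X[idx], y[idx])  # step on the mini-batch gradient
        return w

    # Toy least-squares problem; the mini-batch gradient is 2/m * X^T (Xw - y).
    X = np.random.default_rng(1).normal(size=(256, 4))
    true_w = np.array([1.0, -2.0, 0.5, 3.0])
    y = X @ true_w
    grad = lambda w, Xb, yb: 2.0 / len(Xb) * Xb.T @ (Xb @ w - yb)
    print(sgd(X, y, grad, w0=np.zeros(4), gamma=0.1, epochs=50))  # approaches true_w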

42 of 68

Momentum

43 of 68

Setting the step-size in gradient descent

When should we stop the gradient descent algorithm?

44 of 68

Visual overview of gradient descent variants

Cyan = gradient descent, magenta = w/ momentum, white = AdaGrad, green = RMSProp, blue = Adam

https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c

45 of 68

Break

46 of 68

Summary so far

47 of 68

Historical perspective on deep learning

1958   Perceptrons, Rosenblatt
1960   Adaline, Widrow and Hoff
1969   Perceptrons, Minsky and Papert
1970   Backpropagation, Linnainmaa
1974   Backpropagation, Werbos
1986   Backpropagation, Rumelhart, Hinton and Williams
1997   LSTM, Hochreiter and Schmidhuber
1998   OCR, LeCun, Bottou, Bengio and Haffner
2006   Deep Learning, Hinton, Osindero and Teh
2009   ImageNet, Deng et al.
2012   AlexNet, Krizhevsky, Sutskever and Hinton
2015   ResNet (152 layers), MSRA
2020s  Transformers, diffusion, foundation models

48 of 68

Why did deep learning work in the end?

  1. Data
  2. Hardware
  3. Open-source software
  4. Tricks
  5. Deep learning beyond recognition

49 of 68

Foundational building block: the convolution

Consider an image of size 224x224x3 and a first hidden layer with 1024 dimensions.

How many parameters will layer 1 have?
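
A back-of-the-envelope count for a single fully connected layer, assuming the image is flattened and every input connects to every hidden unit:

    inputs = 224 * 224 * 3       # flattened image: 150,528 values
    hidden = 1024
    weights = inputs * hidden    # 154,140,672 weights
    biases = hidden
    print(weights + biases)      # 154,141,696 parameters for this single layer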

50 of 68

The convolutional operator


51 of 68

2D convolutions step-by-step

Input image: 7x7

Filter size: 3x3

Do the convolution by sliding the filter over all possible image locations.

What is the size of the output?
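
A minimal sketch of the sliding-window computation, assuming NumPy, no padding, and stride 1; like deep learning libraries, it does not flip the kernel:

    import numpy as np

    def conv2d(image, kernel):
        """Slide the kernel over every valid position (no padding, stride 1)."""
        H, W = image.shape
        k = kernel.shape[0]
        out = np.zeros((H - k + 1, W - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
        return out

    image = np.arange(49, dtype=float).reshape(7, 7)   # 7x7 input
    kernel = np.ones((3, 3)) / 9.0                     # 3x3 box (blur) filter
    print(conv2d(image, kernel).shape)                 # (5, 5): the output is 5x5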

52 of 68

Let’s test your convolution intuition

Source: D. Lowe

Filter:

    0  0  0
    0  1  0
    0  0  0

Original -> Filtered: no change

53 of 68

Let’s test your convolution intuition

Original -> Filtered: shift left

Filter:

    0  0  0
    1  0  0
    0  0  0

54 of 68

Let’s test your convolution intuition

Original -> Filtered: blur

Filter (3x3 box filter):

    1  1  1
    1  1  1
    1  1  1

55 of 68

Let’s test your convolution intuition

Original -> Filtered: sharpening

Filter:

    0  0  0       1  1  1
    0  2  0   -   1  1  1
    0  0  0       1  1  1

56 of 68

Convolutional networks

Take an image as input, pass it through several layers of convolutional filters, and predict a label (e.g. "cow").

We want to "learn" the filters that help us recognize the classes.

57 of 68

The convolutional layer

Q1: What does the 3 mean?
A: The three RGB color channels.

Q2: What is the output size for D filters (with padding)?
A: 32x32xD.

Q3: Does the output size depend on the input size or the filter size?
A: The input size (see the sketch below).
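
A small helper for the standard output-size formula (N - F + 2P) / S + 1; the 5x5 filter with padding 2 below is an assumed example, not necessarily the one on the slide:

    def conv_output_size(n_in, filter_size, padding, stride=1):
        """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
        return (n_in - filter_size + 2 * padding) // stride + 1

    # A 32x32 input with 5x5 filters and padding 2 keeps its spatial size,
    # so D such filters produce a 32 x 32 x D output volume.
    print(conv_output_size(32, 5, 2))  # 32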

58 of 68

Convolutional networks

59 of 68

2015+: Is there a limit to gradient learning?

60 of 68

2020+: The era of scale

61 of 68

62 of 68

Scaling vision: self-supervised learning

63 of 68

Self-supervised learning

64 of 68

Procedure of self-supervised learning

Example proxy task:

1. Transform the input image.

2. Pull the image and its transformation together in embedding space, and push away the other images in the batch (see the sketch below).
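
A simplified, NumPy-only sketch of such a contrastive objective (an InfoNCE-style loss over one batch); it is not the exact loss of any particular method, and the temperature tau and the toy embeddings are illustrative:

    import numpy as np

    def contrastive_loss(z, z_aug, tau=0.1):
        """Pull each image toward its own transformed view, push it from the rest of the batch."""
        z = z / np.linalg.norm(z, axis=1, keepdims=True)
        z_aug = z_aug / np.linalg.norm(z_aug, axis=1, keepdims=True)
        logits = z @ z_aug.T / tau                   # pairwise similarities, scaled by a temperature
        log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))          # the diagonal holds the positive pairs

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(8, 32))                 # embeddings of 8 images
    views = embeddings + 0.05 * rng.normal(size=(8, 32))  # embeddings of their transformed views
    print(contrastive_loss(embeddings, views))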

65 of 68

Scaling text: the web + parameters

66 of 68

67 of 68

Scaling everything: transformers

68 of 68

Where are we now?

Deep learning: updating the weights of neurons across layers with backpropagation.

Early-stage DL: develop the architectures and tricks to make deep learning work.

Late-stage DL: scale parameters, scale data, scale GPU usage.

Result: remarkable outcomes, but what is the limit of scale?