1 of 69

Convolutional neural networks

CS5670: Computer Vision

Slides from Fei-Fei Li, Justin Johnson, Serena Yeung

http://vision.stanford.edu/teaching/cs231n/


Readings

  • Convolutional neural networks


Slide credit: Fei-Fei Li & Andrej Karpathy & Serena Yeung

(Cornell University)


Hinton and Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 2006.


Fast-forward to today: ConvNets* are everywhere

* and other recent architectures, like Transformers


Fast-forward to today: ConvNets are everywhere

Self-driving cars (video courtesy Tesla)

https://www.tesla.com/AI


Text-to-image


“A computer vision class watching a cool lecture, crayon drawing”

“A computer vision class watching a cool lecture, album cover”


What is a ConvNet?

  • A version of deep neural networks designed for signals:
    • 1D signals (e.g., speech waveforms)

    • 2D signals (e.g., images)


Motivation – Feature Learning


Life Before Deep Learning

Input Pixels → Extract Hand-Crafted Features → Concatenate into a vector x → Linear Classifier (e.g., SVM) → Ans

Figure: Karpathy 2016
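That pipeline can be sketched end-to-end. A minimal illustration, where the histogram feature and the classifier weights are stand-ins chosen for this sketch, not the slides' actual choices:

```python
import numpy as np

# Hand-crafted feature: an 8-bin intensity histogram per color channel,
# concatenated into a vector x, then scored by a linear classifier
# (the role the SVM plays in the pre-deep-learning pipeline).
def color_histogram(image, bins=8):
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 1))[0]
             for c in range(3)]
    return np.concatenate(feats).astype(float)       # x: (3 * bins,)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                      # stand-in 32x32 RGB image
x = color_histogram(image)                           # x.shape == (24,)
w, b = rng.standard_normal(x.shape[0]), 0.0          # stand-in linear weights
score = float(x @ w + b)                             # one class score
```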


Why use features? Why not pixels?

Slide from Karpathy 2016

Q: What would be a very hard set of classes for a linear classifier to distinguish? (assuming x = pixels)


Goal: linearly separable classes


Aside: Image Features



Image Features: Motivation




Last layer of many CNNs is a linear classifier

Input Pixels → [a big neural network, trained end-to-end] → Ans

The final piece of the network is just a linear classifier (figure: GoogLeNet).

Key: perform enough processing so that by the time you get to the end of the network, the classes are linearly separable


Visualizing AlexNet in 2D with t-SNE

[Donahue et al., “DeCAF: A Deep Convolutional …”, arXiv 2013]



Convolutional neural networks

Layer types:

  • Convolutional layer
  • Pooling layer
  • Fully-connected layer
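The first two layer types can be sketched in a few lines. A minimal sketch, assuming a single-channel input, stride 1 and no padding for the convolution, and non-overlapping 2x2 windows for the pooling:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' cross-correlation -- what deep-learning libraries call convolution."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool_2x2(x):
    """Downsample by taking the max over non-overlapping 2x2 windows."""
    H, W = x.shape
    return x[:H//2*2, :W//2*2].reshape(H//2, 2, W//2, 2).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0            # 3x3 averaging filter
feat = conv2d(image, kernel)              # (4, 4) feature map
pooled = max_pool_2x2(feat)               # (2, 2) after pooling
```

Fully-connected layers are ordinary matrix multiplies on the flattened activations, so they need no special sketch here.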


Number of weights: 5 x 5 x 3 + 1 = 76

(vs. 3072 for a fully-connected layer)

(+1 for bias term)

Adapted from Fei-Fei Li & Andrej Karpathy & Serena Yeung


(total number of parameters to learn: 6 x (75 + 1) = 456)
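These counts follow mechanically from the filter shape. A small check, where `conv_layer_params` is a helper name of my own, not from the slides:

```python
def conv_layer_params(num_filters, kh, kw, in_channels):
    """Parameters in a conv layer: each filter has kh*kw*in_channels weights + 1 bias."""
    return num_filters * (kh * kw * in_channels + 1)

print(conv_layer_params(1, 5, 5, 3))   # 76: one 5x5x3 filter
print(conv_layer_params(6, 5, 5, 3))   # 456: six 5x5x3 filters
```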


How many parameters are in a convolution layer consisting of three 3x3x1 filters (each with a bias term)?
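The same weights-plus-bias count from the previous slide answers this:

```python
# three 3x3x1 filters, each with its own bias term
num_filters, kh, kw, cin = 3, 3, 3, 1
params = num_filters * (kh * kw * cin + 1)
print(params)  # 30
```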



“1x1 convolutions”
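A 1x1 convolution is just a linear map over channels applied independently at every pixel. A NumPy sketch with illustrative shapes (the sizes are made up for this example):

```python
import numpy as np

# A 1x1 convolution applies the same linear map across channels at every
# spatial position: (H, W, Cin) -> (H, W, Cout), with Cin*Cout + Cout parameters.
H, W, Cin, Cout = 4, 4, 64, 32
x = np.random.randn(H, W, Cin)
weights = np.random.randn(Cin, Cout)   # one 1x1xCin filter per output channel
bias = np.zeros(Cout)

y = x @ weights + bias                 # matmul over the channel axis only
print(y.shape)                         # (4, 4, 32)
```

This is why 1x1 convolutions are commonly used to shrink or expand the channel dimension without touching spatial structure.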


Convolutional layer—properties

  • Small number of parameters to learn compared to a fully connected layer
  • Preserves spatial structure—output of a convolutional layer is shaped like an image
  • Translation equivariant: passing a translated image through a convolutional layer is (almost) equivalent to translating the convolution output (but be careful of image boundaries)
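The equivariance claim in the last bullet can be checked numerically. A sketch assuming a 'valid' 3x3 convolution and a one-pixel horizontal shift:

```python
import numpy as np

def conv2d(img, k):
    """'Valid' 3x3 cross-correlation."""
    out = np.zeros((img.shape[0] - 2, img.shape[1] - 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+3, j:j+3] * k)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))

shift_then_conv = conv2d(np.roll(img, 1, axis=1), k)
conv_then_shift = np.roll(conv2d(img, k), 1, axis=1)

# Identical away from the column affected by np.roll's wrap-around boundary,
# which is exactly the "be careful of image boundaries" caveat above.
print(np.allclose(shift_then_conv[:, 1:], conv_then_shift[:, 1:]))  # True
```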



AlexNet (2012)

Output: 1,000-D vector (probabilities over 1,000 ImageNet categories)

~60M parameters in total


Big picture

  • A convolutional neural network can be thought of as a function from images to class scores
    • With millions of adjustable weights…
    • … leading to a very non-linear mapping from images to features / class scores.
    • We will set these weights based on classification accuracy on training data…
    • … and hopefully our network will generalize to new images at test time
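The recipe in miniature, with a plain linear model standing in for the (far more non-linear) network, and made-up data in place of real images:

```python
import numpy as np

# Toy version of the recipe: a "network" (here just linear) mapping inputs to
# class scores, with weights fit to training data by gradient descent.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3072))        # 100 fake 32x32x3 images, flattened
y = rng.integers(0, 10, size=100)           # 10 fake class labels
W = np.zeros((3072, 10))

for step in range(50):
    scores = X @ W                                        # images -> class scores
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                     # softmax probabilities
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1                       # d(cross-entropy)/d(scores)
    W -= 0.01 * (X.T @ grad) / len(y)                     # gradient descent step

train_acc = ((X @ W).argmax(axis=1) == y).mean()
print(train_acc)   # well above the 0.1 chance level on the training set
```

Generalization to held-out images is the part this toy example cannot show; that is what the validation and test sets are for.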


Data is key—enter ImageNet

  • ImageNet (and the ImageNet Large-Scale Visual Recognition Challenge, aka ILSVRC) has been key to training deep learning methods
    • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database. CVPR, 2009.
  • ILSVRC: 1,000 object categories, each with ~700-1,300 training images. The test set has 100 images per category (100,000 total).
  • Standard ILSVRC error metric: top-5 error
    • if the correct answer for a given test image is among the top 5 predicted categories, the answer is judged correct
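Top-5 error can be computed directly from a matrix of class scores. A small sketch, with a 7-class toy score matrix made up for illustration:

```python
import numpy as np

def top5_error(scores, labels):
    """scores: (N, C) class scores; labels: (N,) ground-truth class indices."""
    top5 = np.argsort(-scores, axis=1)[:, :5]        # 5 highest-scoring classes
    correct = (top5 == labels[:, None]).any(axis=1)  # true label among them?
    return 1.0 - correct.mean()

# toy example: the first label is in its image's top 5, the second is not
scores = np.array([[0.1, 0.9, 0.3, 0.2, 0.5, 0.4, 0.0],
                   [0.8, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])
labels = np.array([2, 1])
print(top5_error(scores, labels))  # 0.5
```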


Performance improvements on ILSVRC

  • ImageNet Large-Scale Visual Recognition Challenge
  • Held from 2010-2017
  • 1000 categories, 1000 training images per category
  • Test performance on held-out test set of images

[Figure: ILSVRC error by year, spanning the pre-deep-learning era and the deep learning era, with AlexNet marking the transition]


Image credit: Zaid Alyafeai, Lahouari Ghouti


Questions?