1 of 69

Convolutional neural networks

CS5670: Computer Vision

Slides from Fei-Fei Li, Justin Johnson, Serena Yeung

http://vision.stanford.edu/teaching/cs231n/


Readings

  • Convolutional neural networks


Slide credit: Fei-Fei Li & Andrej Karpathy & Serena Yeung

(Cornell University)


Hinton and Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 2006.


Fast-forward to today: ConvNets* are everywhere

* and other recent architectures, like Transformers


Fast-forward to today: ConvNets are everywhere

Self-driving cars (video courtesy Tesla)

https://www.tesla.com/AI


Text-to-image


“A computer vision class watching a cool lecture, crayon drawing”

“A computer vision class watching a cool lecture, album cover”


What is a ConvNet?

  • A version of deep neural networks designed for signals:
    • 1D signals (e.g., speech waveforms)

    • 2D signals (e.g., images)


Motivation – Feature Learning


Life Before Deep Learning

Input Pixels → Extract Hand-Crafted Features → Concatenate into a vector x → Linear Classifier (e.g., SVM) → Ans

Figure: Karpathy 2016
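That pipeline can be sketched end-to-end. A minimal illustration, where the histogram feature and the classifier weights are stand-ins chosen for this sketch, not the slides' actual choices:

```python
import numpy as np

# Hand-crafted feature: an 8-bin intensity histogram per color channel,
# concatenated into a vector x, then scored by a linear classifier
# (the role the SVM plays in the pre-deep-learning pipeline).
def color_histogram(image, bins=8):
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 1))[0]
             for c in range(3)]
    return np.concatenate(feats).astype(float)       # x: (3 * bins,)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                      # stand-in 32x32 RGB image
x = color_histogram(image)                           # x.shape == (24,)
w, b = rng.standard_normal(x.shape[0]), 0.0          # stand-in linear weights
score = float(x @ w + b)                             # one class score
```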


Why use features? Why not pixels?

Slide from Karpathy 2016

Q: What would be a very hard set of classes for a linear classifier to distinguish? (assuming x = pixels)


Goal: linearly separable classes


Aside: Image Features



Image Features: Motivation




Last layer of many CNNs is a linear classifier

Input Pixels → [a big neural network, trained end-to-end] → Ans

The final piece of the network is just a linear classifier (figure: GoogLeNet).

Key: perform enough processing so that by the time you get to the end of the network, the classes are linearly separable


Visualizing AlexNet in 2D with t-SNE

[Donahue et al., “DeCAF: A Deep Convolutional …”, arXiv 2013]



Convolutional neural networks

Layer types:

  • Convolutional layer
  • Pooling layer
  • Fully-connected layer
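The first two layer types can be sketched in a few lines. A minimal sketch, assuming a single-channel input, stride 1 and no padding for the convolution, and non-overlapping 2x2 windows for the pooling:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' cross-correlation -- what deep-learning libraries call convolution."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool_2x2(x):
    """Downsample by taking the max over non-overlapping 2x2 windows."""
    H, W = x.shape
    return x[:H//2*2, :W//2*2].reshape(H//2, 2, W//2, 2).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0            # 3x3 averaging filter
feat = conv2d(image, kernel)              # (4, 4) feature map
pooled = max_pool_2x2(feat)               # (2, 2) after pooling
```

Fully-connected layers are ordinary matrix multiplies on the flattened activations, so they need no special sketch here.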


Number of weights: 5 x 5 x 3 + 1 = 76

(vs. 3072 for a fully-connected layer)

(+1 for bias term)

Adapted from Fei-Fei Li & Andrej Karpathy & Serena Yeung


(total number of parameters to learn: 6 x (75 + 1) = 456)
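These counts follow mechanically from the filter shape. A small check, where `conv_layer_params` is a helper name of my own, not from the slides:

```python
def conv_layer_params(num_filters, kh, kw, in_channels):
    """Parameters in a conv layer: each filter has kh*kw*in_channels weights + 1 bias."""
    return num_filters * (kh * kw * in_channels + 1)

print(conv_layer_params(1, 5, 5, 3))   # 76: one 5x5x3 filter
print(conv_layer_params(6, 5, 5, 3))   # 456: six 5x5x3 filters
```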


How many parameters are in a convolution layer consisting of three 3x3x1 filters (each with a bias term)?
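The same weights-plus-bias count from the previous slide answers this:

```python
# three 3x3x1 filters, each with its own bias term
num_filters, kh, kw, cin = 3, 3, 3, 1
params = num_filters * (kh * kw * cin + 1)
print(params)  # 30
```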



“1x1 convolutions”
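A 1x1 convolution is just a linear map over channels applied independently at every pixel. A NumPy sketch with illustrative shapes (the sizes are made up for this example):

```python
import numpy as np

# A 1x1 convolution applies the same linear map across channels at every
# spatial position: (H, W, Cin) -> (H, W, Cout), with Cin*Cout + Cout parameters.
H, W, Cin, Cout = 4, 4, 64, 32
x = np.random.randn(H, W, Cin)
weights = np.random.randn(Cin, Cout)   # one 1x1xCin filter per output channel
bias = np.zeros(Cout)

y = x @ weights + bias                 # matmul over the channel axis only
print(y.shape)                         # (4, 4, 32)
```

This is why 1x1 convolutions are commonly used to shrink or expand the channel dimension without touching spatial structure.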


Convolutional layer—properties

  • Small number of parameters to learn compared to a fully connected layer
  • Preserves spatial structure—output of a convolutional layer is shaped like an image
  • Translation equivariant: passing a translated image through a convolutional layer is (almost) equivalent to translating the convolution output (but be careful of image boundaries)
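The equivariance claim in the last bullet can be checked numerically. A sketch assuming a 'valid' 3x3 convolution and a one-pixel horizontal shift:

```python
import numpy as np

def conv2d(img, k):
    """'Valid' 3x3 cross-correlation."""
    out = np.zeros((img.shape[0] - 2, img.shape[1] - 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+3, j:j+3] * k)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))

shift_then_conv = conv2d(np.roll(img, 1, axis=1), k)
conv_then_shift = np.roll(conv2d(img, k), 1, axis=1)

# Identical away from the column affected by np.roll's wrap-around boundary,
# which is exactly the "be careful of image boundaries" caveat above.
print(np.allclose(shift_then_conv[:, 1:], conv_then_shift[:, 1:]))  # True
```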



AlexNet (2012)

Output: 1,000-D vector (probabilities over 1,000 ImageNet categories)

~60M parameters in total


Big picture

  • A convolutional neural network can be thought of as a function from images to class scores
    • With millions of adjustable weights…
    • … leading to a very non-linear mapping from images to features / class scores.
    • We will set these weights based on classification accuracy on training data…
    • … and hopefully our network will generalize to new images at test time
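The recipe in miniature, with a plain linear model standing in for the (far more non-linear) network, and made-up data in place of real images:

```python
import numpy as np

# Toy version of the recipe: a "network" (here just linear) mapping inputs to
# class scores, with weights fit to training data by gradient descent.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3072))        # 100 fake 32x32x3 images, flattened
y = rng.integers(0, 10, size=100)           # 10 fake class labels
W = np.zeros((3072, 10))

for step in range(50):
    scores = X @ W                                        # images -> class scores
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                     # softmax probabilities
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1                       # d(cross-entropy)/d(scores)
    W -= 0.01 * (X.T @ grad) / len(y)                     # gradient descent step

train_acc = ((X @ W).argmax(axis=1) == y).mean()
print(train_acc)   # well above the 0.1 chance level on the training set
```

Generalization to held-out images is the part this toy example cannot show; that is what the validation and test sets are for.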


Data is key—enter ImageNet

  • ImageNet (and the ImageNet Large-Scale Visual Recognition Challenge, aka ILSVRC) has been key to training deep learning methods
    • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database. CVPR, 2009.
  • ILSVRC: 1,000 object categories, each with ~700-1,300 training images. The test set has 100 images per category (100,000 total).
  • Standard ILSVRC error metric: top-5 error
    • if the correct answer for a given test image is among the top 5 predicted categories, the answer is judged correct
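Top-5 error can be computed directly from a matrix of class scores. A small sketch, with a 7-class toy score matrix made up for illustration:

```python
import numpy as np

def top5_error(scores, labels):
    """scores: (N, C) class scores; labels: (N,) ground-truth class indices."""
    top5 = np.argsort(-scores, axis=1)[:, :5]        # 5 highest-scoring classes
    correct = (top5 == labels[:, None]).any(axis=1)  # true label among them?
    return 1.0 - correct.mean()

# toy example: the first label is in its image's top 5, the second is not
scores = np.array([[0.1, 0.9, 0.3, 0.2, 0.5, 0.4, 0.0],
                   [0.8, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])
labels = np.array([2, 1])
print(top5_error(scores, labels))  # 0.5
```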


Performance improvements on ILSVRC

  • ImageNet Large-Scale Visual Recognition Challenge
  • Held from 2010-2017
  • 1000 categories, 1000 training images per category
  • Test performance on held-out test set of images

[Figure: ILSVRC error by year, spanning the pre-deep-learning era and the deep learning era, with AlexNet marking the transition]


Image credit: Zaid Alyafeai, Lahouari Ghouti


Questions?