1 of 62

CSE 5524: Foundations of learning - 3

2 of 62

Homework assignment - 1

  • Will be released on 2/2 or 2/3

  • Homework assignment – 2 will very likely be released two weeks after that date

3 of 62

Today

  • Recap
  • Generalization
  • Neural networks overview
  • Neural networks


4 of 62

Recap: Key ingredients

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

5 of 62

Case study – 1: regression

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

6 of 62

Case study – 2: classification

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

7 of 62

Gradient-based learning algorithm

  • The goal:

8 of 62

Gradient-based learning algorithm

  • The goal:

  • Illustration:

[Figure: error curve; stepping against the derivative (the negative direction) moves toward the GOAL: minimum error.]

9 of 62

Basic gradient descent

  • The goal:

  • Pseudo-code:

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
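As a concrete sketch of the pseudo-code above: minimal gradient descent in plain Python. The quadratic loss and learning rate are illustrative assumptions, not the slide's exact example.

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Basic gradient descent: repeatedly step against the gradient."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # move downhill
    return theta

# Example: minimize f(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

With this learning rate the error shrinks by a constant factor per step, so 100 steps land essentially at the minimizer theta = 3.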

10 of 62

Today

  • Recap
  • Generalization
  • Neural networks overview
  • Neural networks


11 of 62

Training vs. testing

  • Models learned from the training data should be applicable to test data


[Figure: training data and candidate fits in the (x, y) plane.]

In training, we only see training data!

Choosing a more complicated hypothesis class does not necessarily lead to lower test errors!

12 of 62

Under-fitting vs. over-fitting

  • Background: polynomial regression

13 of 62

Under-fitting vs. over-fitting

  • Training data are generated from a sine function + noise
    • Under-fitting

K too small: simple

    • Over-fitting

K too large: complicated

training error = 0


[Slides: from USC CSCI567]


14 of 62

Under-fitting vs. over-fitting

  • Training data are generated from a sine function + noise
    • Just good enough


[Slides: from USC CSCI567]

 

15 of 62

Under-fitting vs. over-fitting

  • Over-fitting:
    • Small training error
    • Large test error (even larger than some other “simpler” models)

  • How to quantify a hypothesis class’s complexity?
    • Polynomial regression: larger K, larger complexity

  • How to choose K?

 

16 of 62

 

Training data

Train

Val

Treating “Val” as the “pseudo” test data!

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
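The train/validation split above can be sketched directly: hold out part of the training data as "pseudo" test data and pick whichever model does best on it. The split fraction and the two candidate models are illustrative assumptions.

```python
def split(data, val_fraction=0.25):
    """Hold out the tail of the training data as a validation set."""
    n_val = max(1, int(len(data) * val_fraction))
    return data[:-n_val], data[-n_val:]

def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def select_model(candidates, data):
    """Pick the candidate with the lowest error on the held-out split."""
    train, val = split(data)
    # (In practice each candidate is first fit on `train`; here they are fixed.)
    return min(candidates, key=lambda m: mse(m, val))

# Toy data from y = 2x, plus two candidate models.
data = [(x, 2 * x) for x in range(8)]
best = select_model([lambda x: 2 * x, lambda x: x ** 2], data)
```

Here the linear candidate has zero validation error, so it is selected over the quadratic one.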

17 of 62

Questions?

18 of 62

Regularization

  • Searching for a “simpler” solution

 

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
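One common way to search for a "simpler" solution is an L2 (ridge) penalty on the weights. This sketch assumes a 1-D linear model with squared-error loss, an illustrative choice rather than the slide's exact setup.

```python
def ridge_loss(w, data, lam):
    """Squared error plus an L2 penalty that favors 'simpler' (smaller) weights."""
    fit = sum((w * x - y) ** 2 for x, y in data)
    return fit + lam * w ** 2

def ridge_grad_step(w, data, lam, lr=0.01):
    """One gradient-descent step on the regularized loss."""
    grad = sum(2 * (w * x - y) * x for x, y in data) + 2 * lam * w
    return w - lr * grad

# The unregularized optimum here is w = 2; the penalty shrinks it toward 0.
data = [(1.0, 2.0), (2.0, 4.0)]
w = 1.0
for _ in range(500):
    w = ridge_grad_step(w, data, lam=1.0)
```

With lam = 1 the closed-form minimizer is w = 5/3, smaller than the unregularized w = 2: the regularizer trades a little training error for a simpler model.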

19 of 62

Finding a good regularizer is not always easy

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

20 of 62

Occam’s razor principle


All things being equal, the simplest explanation is usually the best!

21 of 62

Three tools in the search for Truth

  • Data: what we observe

  • Prior: what we prefer & believe

  • Hypotheses: what the true function may be

22 of 62

Data (likelihood) & Prior

23 of 62

Three tools in the search for Truth

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

24 of 62

Effect of data

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

25 of 62

More data, less over-fitting


[Figure: Bishop, PRML]

Green: true data distribution

Blue: training data

Red: learned model

26 of 62

Effect of priors

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

27 of 62

Effect of hypothesis space

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

28 of 62

Remark & Plan

  • We have finished Section 11

  • Section 12: Neural networks

  • Section 13: Neural networks as distribution transformers

  • Section 14: Back-propagation (skip in class!)

29 of 62

Today

  • Recap
  • Generalization
  • Neural networks overview
  • Neural networks


30 of 62

The progress of deep learning for classification


ImageNet-1K (ILSVRC)

  • 1,000 object classes
  • 1,000 training images/class
  • Each image contains just one class of object!

Metric: Top-k accuracy

  • For each image, return a list of the top-k most likely classes
  • If the true class is within the list, the classification is correct
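The top-k metric can be written directly; the scores and labels below are toy values, not ImageNet outputs.

```python
def topk_accuracy(score_lists, true_labels, k=5):
    """Fraction of examples whose true class is among the k highest-scored classes."""
    correct = 0
    for scores, label in zip(score_lists, true_labels):
        topk = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        correct += label in topk
    return correct / len(true_labels)

scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [2, 1]
acc1 = topk_accuracy(scores, labels, k=1)  # only the argmax counts
acc2 = topk_accuracy(scores, labels, k=2)  # true class may be runner-up
```

Both toy examples miss at k = 1 but hit at k = 2, which is why top-5 error is always at most the top-1 error.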

31 of 62

The progress of deep learning for classification

[Simonyan et al., 2015]

[Szegedy et al., 2015]

[Huang et al., 2017]

[He et al., 2016]

[Krizhevsky et al., 2012]

Top-5 error rate

32 of 62

General formulation for all these variants


Image (pixels)

33 of 62

Recap: classification

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]


34 of 62

Deep neural networks (DNN)

35 of 62

Convolution

A special computation between layers

  • A node in the current layer is affected only by a local neighborhood of nodes in the previous layer, not by all of them
  • The network “weights” on the edges are “re-used” (shared) across locations


36 of 62

Convolution

[Figure: a binary feature map (nodes) at layer t is convolved with 3-by-3 “filter” weights to produce the feature map at layer t+1. Each output value is the inner product (element-wise multiplication and sum) of the filter with one 3-by-3 patch.]

37 of 62

Convolution

[Figure: the same feature map at layer t; at this position the inner product of the 3-by-3 “filter” weights with the patch is 6, written into the feature map at layer t+1.]

38 of 62

Convolution

[Figure: the 3-by-3 “filter” centered at a border position of the feature map at layer t; the inner product is 1.]

Zero-padding: set the missing values to be 0.
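A minimal sketch of the zero-padded ("same") convolution described above, in plain Python. Like most deep-learning libraries, it computes the correlation form (no filter flip); the toy map and summing filter are illustrative assumptions.

```python
def conv2d(fmap, filt):
    """'Same' convolution of a 2-D feature map with a small filter.
    Each output value is the inner product of the filter with one patch;
    positions outside the map are zero-padded."""
    H, W = len(fmap), len(fmap[0])
    fh, fw = len(filt), len(filt[0])
    ph, pw = fh // 2, fw // 2
    out = [[0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0
            for a in range(fh):
                for b in range(fw):
                    r, c = i + a - ph, j + b - pw
                    if 0 <= r < H and 0 <= c < W:  # missing values treated as 0
                        s += filt[a][b] * fmap[r][c]
            out[i][j] = s
    return out

fmap = [[0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]]
ones = [[1, 1, 1]] * 3  # a 3-by-3 summing filter
out = conv2d(fmap, ones)
```

At the center the filter sees the whole map (sum 5); at a corner the padded zeros shrink the effective patch, so the output is smaller there.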

39 of 62

Convolution example

[Figure: a worked convolution example with binary 3-by-3 patterns (alternating rows of 0s and 1s).]

40 of 62

Convolution

[Figure: with a two-channel feature map at layer t, the “filter” weights become 3-by-3-by-“2”; the inner product is taken over all 3-by-3-by-2 values to produce one entry of the feature map at layer t+1.]

41 of 62

Convolution

[Figure: a second 3-by-3-by-“2” filter applied to the same feature map at layer t fills a second output channel of the feature map at layer t+1.]

One filter for one output “channel” to capture a different “pattern” (e.g., edges, circles, eyes, etc.)

42 of 62

Convolution: properties

  • Processes nearby pixels together
  • Translation equivariant: the same “local patterns” can be detected at different pixel locations
  • Can process images of arbitrary size


Top-left, top-right: has ears

Middle: has eyes

43 of 62

Convolutional neural networks (CNN)


Shared weights

Vectorization + FC layers

Max pooling + down-sampling

  • Remove redundancy
  • Translation-invariant
  • Enlarge receptive field
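A sketch of non-overlapping max pooling with the stride equal to the window size (an assumption; other strides are possible):

```python
def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the strongest response in each window,
    shrinking the map (and enlarging the receptive field of later layers)."""
    H, W = len(fmap), len(fmap[0])
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, W - size + 1, size)]
            for i in range(0, H - size + 1, size)]

pooled = max_pool([[1, 3, 2, 0],
                   [4, 2, 1, 1],
                   [0, 0, 5, 6],
                   [0, 0, 7, 8]])
```

Each 2-by-2 window collapses to its maximum, so the 4-by-4 map becomes 2-by-2; small shifts of a strong response within a window leave the output unchanged, which is the local translation invariance mentioned above.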

44 of 62

Representative CNN networks

  • AlexNet

[Krizhevsky et al., 2012]

  • VGGnet

[Simonyan et al., 2015]


  • A block: computation
  • Edge: nodes/tensors

45 of 62

Representative CNN networks

  • GoogLeNet [Szegedy et al., 2014]
  • Inception

46 of 62

Representative CNN networks

  • ResNet

[He et al., 2016]

  • DenseNet

[Huang et al., 2017]


  • A block: computation
  • Edge: nodes/tensors


Advantages:

  • Easier optimization (shortcut connections help gradients flow)
  • Collect more information from earlier layers

47 of 62

Representative CNN networks

A general architecture involves

  • Multiple layers of convolutions + ReLU (nonlinearity) + pooling + striding
  • These result in a (final) feature map
    • Positions on the map correspond to locations in the image
  • The map goes through FC layers (MLP)
  • Usually, we keep the network only up to the feature map
    • For feature extraction
    • For down-stream tasks
    • For image-to-image search


48 of 62

Training a DNN for classification

100: elephant

Minimize the empirical risk
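For classification, the empirical risk is typically the average softmax cross-entropy over the training set; a minimal sketch with toy scores (the two-class examples are assumptions, not a real network's outputs):

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def empirical_risk(score_lists, labels):
    """Average cross-entropy loss: -log probability of the true class."""
    losses = [-math.log(softmax(scores)[y])
              for scores, y in zip(score_lists, labels)]
    return sum(losses) / len(losses)

# Two examples whose true classes already get the highest score, so risk is low.
risk = empirical_risk([[2.0, 0.0], [0.0, 3.0]], [0, 1])
```

Training drives this average down by adjusting the network parameters that produce the scores.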

49 of 62

Four factors behind deep learning developments

  • Data

  • Neural network architecture

  • Powerful “learning” algorithms and loss

  • Computational resource

50 of 62

Access to large amounts of data


51 of 62

Flexible neural networks for modeling

Vision transformers

[Liu et al., 2021]

[Battaglia et al., 2018]

Graph neural networks

[Qi et al., 2017]

PointNet

ConvNet

[Huang et al., 2017]

[Gu et al., 2024]

Recurrent neural networks

52 of 62

Powerful algorithms + losses to learn from data

Bi-level optimization

[Finn et al., 2017]

Adversarial learning

[Ganin et al., 2016]

[He et al., 2020]

Contrastive learning

Diffusion (denoising)

[Ho et al., 2020]

Autoregressive

[El-Nouby et al., 2024]

Preference learning

[Rafailov et al., 2023]

53 of 62

Computational resource

54 of 62

Today

  • Recap
  • Generalization
  • Neural networks overview
  • Neural networks


55 of 62

Deep neural networks (DNN)

56 of 62

Re-introduction

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

57 of 62

Perceptron

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

58 of 62

Perceptron as classifiers

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

59 of 62

Learning a classifier

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
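The classic perceptron learning rule can be sketched as follows; the toy dataset is an assumption, chosen to be linearly separable so the rule converges:

```python
def perceptron_train(data, dim, epochs=10):
    """Perceptron rule: on each mistake, nudge the weights toward
    (label +1) or away from (label -1) the misclassified input."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:  # y is +1 or -1
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

# Linearly separable toy data: the label is the sign of the first coordinate.
data = [([1.0, 0.0], 1), ([2.0, 1.0], 1), ([-1.0, 0.5], -1), ([-2.0, -1.0], -1)]
w, b = perceptron_train(data, dim=2)
```

After training, the learned hyperplane classifies every training point correctly.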

60 of 62

Multi-layer perceptron (MLP)

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
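A minimal MLP forward pass, alternating linear layers (the slowly learned parameters) with ReLU nonlinearities (the fast-changing activations). The 2-2-1 architecture and hand-set weights are hypothetical, purely for illustration:

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]

def linear(W, b, x):
    """Affine map: W x + b, with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def mlp_forward(x, layers):
    """Apply each (W, b) layer, with ReLU between layers but not after the last."""
    for i, (W, b) in enumerate(layers):
        x = linear(W, b, x)
        if i < len(layers) - 1:
            x = relu(x)
    return x

# Tiny 2-2-1 network with hypothetical hand-set weights.
layers = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
          ([[1.0, 1.0]], [0.0])]
y = mlp_forward([2.0, 0.5], layers)
```

For this input the hidden pre-activations are [1.5, -1.5]; ReLU zeroes the negative one, and the output layer sums what remains.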

61 of 62

Activations vs. Parameters

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

62 of 62

Fast activation and slow parameters

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]