1 of 62

CSE 5524: Foundations of learning - 3

2 of 62

Homework assignment - 1

  • Will be released on 2/2 or 2/3

  • Homework assignment – 2 will very likely be released two weeks after that date

3 of 62

Today

  • Recap
  • Generalization
  • Neural networks overview
  • Neural networks


4 of 62

Recap: Key ingredients

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

5 of 62

Case study – 1: regression

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

6 of 62

Case study – 2: classification

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

7 of 62

Gradient-based learning algorithm

  • The goal:

8 of 62

Gradient-based learning algorithm

  • The goal:

  • Illustration:

[Figure: error curve; stepping against the derivative (the negative direction) moves toward the GOAL: minimum error.]

9 of 62

Basic gradient descent

  • The goal:

  • Pseudo-code:

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
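As a concrete sketch of the pseudo-code above: minimal gradient descent in plain Python. The quadratic loss and learning rate are illustrative assumptions, not the slide's exact example.

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Basic gradient descent: repeatedly step against the gradient."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # move downhill
    return theta

# Example: minimize f(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

With this learning rate the error shrinks by a constant factor per step, so 100 steps land essentially at the minimizer theta = 3.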

10 of 62

Today

  • Recap
  • Generalization
  • Neural networks overview
  • Neural networks


11 of 62

Training vs. testing

  • Models learned from the training data should be applicable to test data


[Figure: training data and candidate fits in the (x, y) plane.]

In training, we only see training data!

Choosing a more complicated hypothesis class does not necessarily lead to lower test errors!

12 of 62

Under-fitting vs. over-fitting

  • Background: polynomial regression

13 of 62

Under-fitting vs. over-fitting

  • Training data are generated from a sine function + noise
    • Under-fitting

K too small: simple

    • Over-fitting

K too large: complicated

training error = 0


[Slides: from USC CSCI567]


14 of 62

Under-fitting vs. over-fitting

  • Training data are generated from a sine function + noise
    • Just good enough


[Slides: from USC CSCI567]

 

15 of 62

Under-fitting vs. over-fitting

  • Over-fitting:
    • Small training error
    • Large test error (even larger than some other “simpler” models)

  • How to quantify a hypothesis class’s complexity?
    • Polynomial regression: larger K, larger complexity

  • How to choose K?

 

16 of 62

 

Training data

Train

Val

Treating “Val” as the “pseudo” test data!

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
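The train/validation split above can be sketched directly: hold out part of the training data as "pseudo" test data and pick whichever model does best on it. The split fraction and the two candidate models are illustrative assumptions.

```python
def split(data, val_fraction=0.25):
    """Hold out the tail of the training data as a validation set."""
    n_val = max(1, int(len(data) * val_fraction))
    return data[:-n_val], data[-n_val:]

def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def select_model(candidates, data):
    """Pick the candidate with the lowest error on the held-out split."""
    train, val = split(data)
    # (In practice each candidate is first fit on `train`; here they are fixed.)
    return min(candidates, key=lambda m: mse(m, val))

# Toy data from y = 2x, plus two candidate models.
data = [(x, 2 * x) for x in range(8)]
best = select_model([lambda x: 2 * x, lambda x: x ** 2], data)
```

Here the linear candidate has zero validation error, so it is selected over the quadratic one.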

17 of 62

Questions?

18 of 62

Regularization

  • Searching for a “simpler” solution

 

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
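One common way to search for a "simpler" solution is an L2 (ridge) penalty on the weights. This sketch assumes a 1-D linear model with squared-error loss, an illustrative choice rather than the slide's exact setup.

```python
def ridge_loss(w, data, lam):
    """Squared error plus an L2 penalty that favors 'simpler' (smaller) weights."""
    fit = sum((w * x - y) ** 2 for x, y in data)
    return fit + lam * w ** 2

def ridge_grad_step(w, data, lam, lr=0.01):
    """One gradient-descent step on the regularized loss."""
    grad = sum(2 * (w * x - y) * x for x, y in data) + 2 * lam * w
    return w - lr * grad

# The unregularized optimum here is w = 2; the penalty shrinks it toward 0.
data = [(1.0, 2.0), (2.0, 4.0)]
w = 1.0
for _ in range(500):
    w = ridge_grad_step(w, data, lam=1.0)
```

With lam = 1 the closed-form minimizer is w = 5/3, smaller than the unregularized w = 2: the regularizer trades a little training error for a simpler model.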

19 of 62

Finding a good regularizer is not always easy

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

20 of 62

Occam’s razor principle


All things being equal, the simplest explanation is usually the best!

21 of 62

Three tools in the search for Truth

  • Data: what we observe

  • Prior: what we prefer & believe

  • Hypotheses: what the true function may be

22 of 62

Data (likelihood) & Prior

23 of 62

Three tools in the search for Truth

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

24 of 62

Effect of data

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

25 of 62

More data, less over-fitting


[Figure: Bishop, PRML]

Green: true data distribution

Blue: training data

Red: learned model

26 of 62

Effect of priors

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

27 of 62

Effect of hypothesis space

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

28 of 62

Remark & Plan

  • We have finished Section 11

  • Section 12: Neural networks

  • Section 13: Neural networks as distribution transformers

  • Section 14: Back-propagation (skip in class!)

29 of 62

Today

  • Recap
  • Generalization
  • Neural networks overview
  • Neural networks


30 of 62

The progress of deep learning for classification


ImageNet-1K (ILSVRC)

  • 1,000 object classes
  • 1,000 training images/class
  • Each image contains just one class of object!

Metric: Top-k accuracy

  • For each image, return a list of the top-k most likely classes
  • If the true class is within the list, the classification is correct
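The top-k metric can be written directly; the scores and labels below are toy values, not ImageNet outputs.

```python
def topk_accuracy(score_lists, true_labels, k=5):
    """Fraction of examples whose true class is among the k highest-scored classes."""
    correct = 0
    for scores, label in zip(score_lists, true_labels):
        topk = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        correct += label in topk
    return correct / len(true_labels)

scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [2, 1]
acc1 = topk_accuracy(scores, labels, k=1)  # only the argmax counts
acc2 = topk_accuracy(scores, labels, k=2)  # true class may be runner-up
```

Both toy examples miss at k = 1 but hit at k = 2, which is why top-5 error is always at most the top-1 error.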

31 of 62

The progress of deep learning for classification

[Simonyan et al., 2015]

[Szegedy et al., 2015]

[Huang et al., 2017]

[He et al., 2016]

[Krizhevsky et al., 2012]

Top-5 error rate

32 of 62

General formulation for all these variants


Image (pixels)

33 of 62

Recap: classification

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]


34 of 62

Deep neural networks (DNN)

35 of 62

Convolution

A special computation between layers

  • A node in the current layer is affected only by a local neighborhood of nodes in the previous layer, not by all of them
  • The network “weights” on the edges are “re-used” (shared) across locations


36 of 62

Convolution

[Figure: a binary feature map (nodes) at layer t is convolved with 3-by-3 “filter” weights to produce the feature map at layer t+1. Each output value is the inner product (element-wise multiplication and sum) of the filter with one 3-by-3 patch.]

37 of 62

Convolution

[Figure: the same feature map at layer t; at this position the inner product of the 3-by-3 “filter” weights with the patch is 6, written into the feature map at layer t+1.]

38 of 62

Convolution

[Figure: the 3-by-3 “filter” centered at a border position of the feature map at layer t; the inner product is 1.]

Zero-padding: set the missing values to be 0.
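A minimal sketch of the zero-padded ("same") convolution described above, in plain Python. Like most deep-learning libraries, it computes the correlation form (no filter flip); the toy map and summing filter are illustrative assumptions.

```python
def conv2d(fmap, filt):
    """'Same' convolution of a 2-D feature map with a small filter.
    Each output value is the inner product of the filter with one patch;
    positions outside the map are zero-padded."""
    H, W = len(fmap), len(fmap[0])
    fh, fw = len(filt), len(filt[0])
    ph, pw = fh // 2, fw // 2
    out = [[0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0
            for a in range(fh):
                for b in range(fw):
                    r, c = i + a - ph, j + b - pw
                    if 0 <= r < H and 0 <= c < W:  # missing values treated as 0
                        s += filt[a][b] * fmap[r][c]
            out[i][j] = s
    return out

fmap = [[0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]]
ones = [[1, 1, 1]] * 3  # a 3-by-3 summing filter
out = conv2d(fmap, ones)
```

At the center the filter sees the whole map (sum 5); at a corner the padded zeros shrink the effective patch, so the output is smaller there.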

39 of 62

Convolution example

[Figure: a worked convolution example with binary 3-by-3 patterns (alternating rows of 0s and 1s).]

40 of 62

Convolution

[Figure: with a two-channel feature map at layer t, the “filter” weights become 3-by-3-by-“2”; the inner product is taken over all 3-by-3-by-2 values to produce one entry of the feature map at layer t+1.]

41 of 62

Convolution

[Figure: a second 3-by-3-by-“2” filter applied to the same feature map at layer t fills a second output channel of the feature map at layer t+1.]

One filter for one output “channel” to capture a different “pattern” (e.g., edges, circles, eyes, etc.)

42 of 62

Convolution: properties

  • Processes nearby pixels together
  • Translation equivariant: the same “local patterns” can be detected at different pixel locations
  • Can process images of arbitrary size


Top-left, top-right: has ears

Middle: has eyes

43 of 62

Convolutional neural networks (CNN)


Shared weights

Vectorization + FC layers

Max pooling + down-sampling

  • Remove redundancy
  • Translation-invariant
  • Enlarge receptive field
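A sketch of non-overlapping max pooling with the stride equal to the window size (an assumption; other strides are possible):

```python
def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the strongest response in each window,
    shrinking the map (and enlarging the receptive field of later layers)."""
    H, W = len(fmap), len(fmap[0])
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, W - size + 1, size)]
            for i in range(0, H - size + 1, size)]

pooled = max_pool([[1, 3, 2, 0],
                   [4, 2, 1, 1],
                   [0, 0, 5, 6],
                   [0, 0, 7, 8]])
```

Each 2-by-2 window collapses to its maximum, so the 4-by-4 map becomes 2-by-2; small shifts of a strong response within a window leave the output unchanged, which is the local translation invariance mentioned above.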

44 of 62

Representative CNN networks

  • AlexNet

[Krizhevsky et al., 2012]

  • VGGnet

[Simonyan et al., 2015]


  • A block: computation
  • Edge: nodes/tensors

45 of 62

Representative CNN networks

  • GoogLeNet [Szegedy et al., 2014]
  • Inception

46 of 62

Representative CNN networks

  • ResNet

[He et al., 2016]

  • DenseNet

[Huang et al., 2017]


  • A block: computation
  • Edge: nodes/tensors


Advantages:

  • Easier optimization (shortcut connections help gradients flow)
  • Collect more information from earlier layers

47 of 62

Representative CNN networks

A general architecture involves

  • Multiple layers of convolutions + ReLU (nonlinearity) + pooling + striding
  • These result in a (final) feature map
    • Positions on the map correspond to locations in the image
  • The map goes through FC layers (MLP)
  • Usually, we keep the network only up to the feature map
    • For feature extraction
    • For down-stream tasks
    • For image-to-image search


48 of 62

Training a DNN for classification

100: elephant

Minimize the empirical risk
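For classification, the empirical risk is typically the average softmax cross-entropy over the training set; a minimal sketch with toy scores (the two-class examples are assumptions, not a real network's outputs):

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def empirical_risk(score_lists, labels):
    """Average cross-entropy loss: -log probability of the true class."""
    losses = [-math.log(softmax(scores)[y])
              for scores, y in zip(score_lists, labels)]
    return sum(losses) / len(losses)

# Two examples whose true classes already get the highest score, so risk is low.
risk = empirical_risk([[2.0, 0.0], [0.0, 3.0]], [0, 1])
```

Training drives this average down by adjusting the network parameters that produce the scores.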

49 of 62

Four factors behind deep learning developments

  • Data

  • Neural network architecture

  • Powerful “learning” algorithms and loss

  • Computational resource

50 of 62

Access to large amounts of data


51 of 62

Flexible neural networks for modeling

Vision transformers

[Liu et al., 2021]

[Battaglia et al., 2018]

Graph neural networks

[Qi et al., 2017]

PointNet

ConvNet

[Huang et al., 2017]

[Gu et al., 2024]

Recurrent neural networks

52 of 62

Powerful algorithms + losses to learn from data

Bi-level optimization

[Finn et al., 2017]

Adversarial learning

[Ganin et al., 2016]

[He et al., 2020]

Contrastive learning

Diffusion (denoising)

[Ho et al., 2020]

Autoregressive

[El-Nouby et al., 2024]

Preference learning

[Rafailov et al., 2023]

53 of 62

Computational resource

54 of 62

Today

  • Recap
  • Generalization
  • Neural networks overview
  • Neural networks


55 of 62

Deep neural networks (DNN)

56 of 62

Re-introduction

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

57 of 62

Perceptron

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

58 of 62

Perceptron as classifiers

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

59 of 62

Learning a classifier

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
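The classic perceptron learning rule can be sketched as follows; the toy dataset is an assumption, chosen to be linearly separable so the rule converges:

```python
def perceptron_train(data, dim, epochs=10):
    """Perceptron rule: on each mistake, nudge the weights toward
    (label +1) or away from (label -1) the misclassified input."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:  # y is +1 or -1
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

# Linearly separable toy data: the label is the sign of the first coordinate.
data = [([1.0, 0.0], 1), ([2.0, 1.0], 1), ([-1.0, 0.5], -1), ([-2.0, -1.0], -1)]
w, b = perceptron_train(data, dim=2)
```

After training, the learned hyperplane classifies every training point correctly.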

60 of 62

Multi-layer perceptron (MLP)

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
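A minimal MLP forward pass, alternating linear layers (the slowly learned parameters) with ReLU nonlinearities (the fast-changing activations). The 2-2-1 architecture and hand-set weights are hypothetical, purely for illustration:

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]

def linear(W, b, x):
    """Affine map: W x + b, with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def mlp_forward(x, layers):
    """Apply each (W, b) layer, with ReLU between layers but not after the last."""
    for i, (W, b) in enumerate(layers):
        x = linear(W, b, x)
        if i < len(layers) - 1:
            x = relu(x)
    return x

# Tiny 2-2-1 network with hypothetical hand-set weights.
layers = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
          ([[1.0, 1.0]], [0.0])]
y = mlp_forward([2.0, 0.5], layers)
```

For this input the hidden pre-activations are [1.5, -1.5]; ReLU zeroes the negative one, and the output layer sums what remains.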

61 of 62

Activations vs. Parameters

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

62 of 62

Fast activation and slow parameters

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]