1 of 66

CSE 5524: Transfer learning & Stereo


2 of 66

HW 3 & HW 4 & quizzes

  • HW 3
    • Caution: Please re-download the data
    • Due: 4/11/2025

  • HW 4
    • Plan: A lighter homework
    • Due: 4/21/2025

  • Quizzes:
    • Two quizzes will be released in the next two weeks --- True/False and multiple-choice questions, unlimited tries

3 of 66

Today (37 & 40)

  • Recap
  • Transfer learning & adaptation
  • Stereo vision


4 of 66

Recap: Domain gap

Domain gap

5 of 66

Recap: Domain gap

6 of 66

Recap: Data augmentation

  • Increase the amount, diversity, and coverage of the data
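A minimal torchvision sketch of on-the-fly augmentation; the specific transforms and magnitudes below are illustrative choices, not from the slides:

```python
# A minimal sketch of common image augmentations using torchvision.
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.5, 1.0)),                  # random crop + rescale
    T.RandomHorizontalFlip(p=0.5),                                # mirror the image
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # photometric jitter
    T.ToTensor(),
])

# Applied on the fly, each epoch sees a different variant of every image,
# increasing the effective amount and diversity of training data.
```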

7 of 66

Recap: Data augmentation

8 of 66

Today (37 & 40)

  • Recap
  • Transfer learning & adaptation
  • Stereo vision


9 of 66

Data is important …

The existence of domain gaps implies that we need to “re-collect” training data

product images

ImageNet

web images

10 of 66

Data is important …

The existence of new tasks implies that we need to “re-collect” training data

Bird species

Dog breeds

Car brands & styles

11 of 66

Neural networks are data hungry…

Sufficient labeled data

ImageNet-1K (ILSVRC)

1,000 object classes

1,000 training images per class

12 of 66

Humans do not re-learn from scratch …

  • Once we learn certain skills (e.g., recognizing bird species), those skills can usually be “transferred” to other related tasks (e.g., recognizing dog breeds).

  • We typically need only a few examples (e.g., some images of different dog breeds) to pick up the new task

  • Can neural networks do so?

13 of 66

Transfer learning

  • Transfer knowledge learned from prior tasks to new tasks

14 of 66

Transfer learning

  • Transfer knowledge learned from prior tasks to new tasks

15 of 66

Transfer learning

  • Transfer knowledge learned from prior tasks to new tasks

How to achieve “adaptation”?

Different “distributions”

Different “labels”

16 of 66

Pre-training and fine-tuning

Probably don’t need to change a lot!

17 of 66

CNN revisit


18 of 66

Pre-training and fine-tuning

Probably don’t need to change a lot!

19 of 66

Pre-training and fine-tuning

20 of 66

Pre-training and fine-tuning

21 of 66

Algorithm
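A minimal PyTorch sketch of the pre-training + fine-tuning recipe; the ImageNet-pretrained ResNet-18 backbone and the 10-class target task are illustrative assumptions:

```python
# Pre-training + fine-tuning, minimal sketch.
import torch
import torch.nn as nn
import torchvision

# 1) Start from a model pre-trained on a large source dataset (e.g., ImageNet).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# 2) Replace the prediction head to match the new task's label space.
num_new_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_new_classes)

# 3) Fine-tune on the (smaller) target dataset, typically with a small learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```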

22 of 66

Questions?


23 of 66

This paradigm “pre-training + fine-tuning” is everywhere

24 of 66

This paradigm “pre-training + fine-tuning” is everywhere

25 of 66

This paradigm “pre-training + fine-tuning” is everywhere

Rank      Onoclea sensibilis   Onoclea hintonii   Ficus auriculata
Kingdom   Plantae              Plantae            Plantae
Phylum    Tracheophyta         Tracheophyta       Tracheophyta
Class     Polypodiopsida       Polypodiopsida     Rosids
Order     Polypodiales         Polypodiales       Rosales
Family    Onocleaceae          Onocleaceae        Moraceae
Genus     Onoclea              Onoclea            Ficus
Species   sensibilis           hintonii           F. auriculata

[Figure: a vision encoder paired with an autoregressive text representation embeds images and class descriptions]

CLIP (Contrastive Language-Image Pre-training) matches images to text, which can

  • capture hierarchical relationships between classes
  • enable few-shot or even zero-shot learning
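A sketch of zero-shot classification with CLIP using the open-source `clip` package from OpenAI; the image path, class names, and prompt template are illustrative:

```python
# Zero-shot classification with a CLIP-style model: score an image against
# text prompts built from the class names; no task-specific training needed.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["Onoclea sensibilis", "Onoclea hintonii", "Ficus auriculata"]
text = clip.tokenize([f"a photo of {c}" for c in class_names]).to(device)
image = preprocess(Image.open("plant.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    # Cosine similarity between the image embedding and each class prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```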

26 of 66

This paradigm “pre-training + fine-tuning” is everywhere

27 of 66

Fine-tuning a subset of neural networks

[Figure: Image → Feature encoder → Feature vector → Prediction head → Label]

28 of 66

Fine-tuning a subset of neural networks

[Figure: Image → Feature encoder → Feature vector → Prediction head → Label]

29 of 66

Fine-tuning a subset of neural networks

[Figure: Image → Feature encoder → Feature vector → Prediction head → Label]

30 of 66

Fine-tuning a subset of neural networks

[Figure: Image → Feature encoder → Feature vector → Prediction head → Label]

  • Full fine-tuning: Update everything

  • (Linear) probing: Only update the (linear) prediction head

  • Parameter-efficient fine-tuning: Update a subset of parameters (see the sketch below)
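A minimal sketch contrasting the three regimes, assuming a torchvision ResNet-18 whose `fc` layer plays the role of the prediction head; which layers count as the "subset" for parameter-efficient fine-tuning is an illustrative choice:

```python
# Full fine-tuning vs. linear probing vs. a simple parameter-efficient variant.
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 10)   # new prediction head for the target task

def set_trainable(model, mode):
    if mode == "full":                 # full fine-tuning: update everything
        for p in model.parameters():
            p.requires_grad = True
    elif mode == "linear_probe":       # only update the (linear) prediction head
        for p in model.parameters():
            p.requires_grad = False
        for p in model.fc.parameters():
            p.requires_grad = True
    elif mode == "peft":               # update a chosen subset, e.g., last block + head
        for p in model.parameters():
            p.requires_grad = False
        for p in list(model.layer4.parameters()) + list(model.fc.parameters()):
            p.requires_grad = True

set_trainable(model, "linear_probe")
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```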

31 of 66

Parameter efficient fine-tuning

[Figure: a pre-trained network's MLP blocks, each with a newly added "MLP - new" module alongside it]

32 of 66

Parameter efficient fine-tuning

[Figure: an attention layer's key, query, and value projections (K, Q, V) are "learnable" matrices; LoRA forms K' from the frozen K by adding a low-rank update, and likewise adds new modules next to the MLPs]

LoRA: Low-Rank Adaptation
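A minimal sketch of the LoRA idea: freeze a pre-trained projection W and learn only a low-rank update, so the adapted weight is W + (alpha/r)·BA. Layer sizes, rank, and scaling below are illustrative:

```python
# LoRA (Low-Rank Adaptation) applied to a single linear projection.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = linear                          # frozen pre-trained projection (e.g., W_K)
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(linear.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Original output plus the low-rank correction (x A^T) B^T.
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

# Example: wrap the key projection of an attention layer; only A and B are trained.
key_proj = nn.Linear(768, 768)
key_proj_lora = LoRALinear(key_proj, r=8)
print(sum(p.numel() for p in key_proj_lora.parameters() if p.requires_grad))
```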

33 of 66

Parameter efficient fine-tuning

34 of 66

Questions?


35 of 66

Learning from a teacher

  • Typically: learning from data

  • What if we have a very powerful but computationally expensive pre-trained model?

36 of 66

Learning from a teacher: knowledge distillation

37 of 66

Learning from a teacher: knowledge distillation

38 of 66

Learning from a teacher: knowledge distillation

39 of 66

Learning from a teacher: knowledge distillation

A teacher can:

  • convey semantic relatedness between classes (via soft labels)
  • denoise the training signal
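A minimal sketch of a standard distillation loss, where the student matches the teacher's temperature-softened probabilities in addition to the hard labels; the temperature and mixing weight are illustrative:

```python
# Knowledge distillation loss: soft targets from a frozen teacher + hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets from the teacher, softened by temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                         soft_targets, reduction="batchmean") * (T * T)
    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```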

40 of 66

Questions?


41 of 66

Prompting

42 of 66

Prompting

43 of 66

Prompting

44 of 66

Prompting

45 of 66

Prompting

46 of 66

Prompting

47 of 66

Prompting

  • Provide the “context of the task” in the input data!

48 of 66

Visual Prompt Tuning


Visual Prompt Tuning [Menglin Jia et al.]
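A rough sketch of the visual-prompt-tuning idea: learnable prompt tokens are prepended to the patch tokens of a frozen transformer, and only the prompts and the head are trained. This is a simplified illustration, not the exact architecture of Jia et al.:

```python
# Learnable prompt tokens prepended to a frozen ViT-style encoder.
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, vit_blocks: nn.Module, embed_dim=768, num_prompts=10, num_classes=10):
        super().__init__()
        self.blocks = vit_blocks                        # frozen transformer blocks
        for p in self.blocks.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))  # learnable
        self.head = nn.Linear(embed_dim, num_classes)                        # learnable

    def forward(self, patch_tokens):                    # patch_tokens: (B, N, D)
        B = patch_tokens.shape[0]
        tokens = torch.cat([self.prompts.expand(B, -1, -1), patch_tokens], dim=1)
        tokens = self.blocks(tokens)                    # frozen encoder processes all tokens
        return self.head(tokens.mean(dim=1))            # pool and classify
```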

49 of 66

Questions?


50 of 66

3D reconstruction

  • How can we reconstruct 3D from image(s)?

Depth estimation and 3D reconstruction

51 of 66

3D reconstruction

  • Making “assumptions”

    • Parallel projections
    • Flat surfaces

52 of 66

3D reconstruction

  • How do humans do it?

    • Learn from “experience” – monocular depth estimation
    • Leverage “two eyes” – stereo depth estimation

53 of 66

Stereo depth estimation

A 3D point seen by two rectified cameras projects to x_l in the left image I_l and x_r in the right image I_r.

  Disparity:  D = x_l − x_r
  Depth:      Z = f · B / D   (f: focal length, B: baseline)

Depth is inversely proportional to disparity.
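A tiny worked example of Z = f · B / D; the focal length, baseline, and disparity values below are made up:

```python
# Converting a disparity map to a depth map with Z = f * B / D.
import numpy as np

f = 700.0        # focal length in pixels
B = 0.12         # baseline in meters
disparity = np.array([[35.0, 70.0], [14.0, 7.0]])    # disparity map in pixels

depth = f * B / np.maximum(disparity, 1e-6)           # avoid division by zero
print(depth)     # 35 px -> 2.4 m; doubling the disparity halves the depth
```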

54 of 66

Stereo depth estimation

[Figure: left and right input images, the estimated disparity map, and the corresponding depth map]

55 of 66

Stereo depth estimation

[Figure: a window from the left image is compared against windows in the right image at different horizontal shifts; similarity is plotted as a function of disparity, and the best-matching shift gives the disparity]
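A minimal NumPy sketch of classical block matching along these lines: for each pixel, compare a window in the left image against shifted windows in the right image and keep the disparity with the lowest SSD cost. Window size and disparity range are illustrative:

```python
# Winner-take-all block matching for a rectified stereo pair (grayscale images).
import numpy as np

def block_matching(left, right, max_disp=64, win=5):
    H, W = left.shape
    half = win // 2
    disp = np.zeros((H, W), dtype=np.float32)
    for y in range(half, H - half):
        for x in range(half, W - half):
            patch_l = left[y-half:y+half+1, x-half:x+half+1]
            best_d, best_cost = 0, np.inf
            for d in range(0, min(max_disp, x - half) + 1):
                patch_r = right[y-half:y+half+1, x-d-half:x-d+half+1]
                cost = np.sum((patch_l - patch_r) ** 2)     # SSD as the (dis)similarity
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d                              # best-matching shift
    return disp
```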

56 of 66

Stereo depth estimation

[Figure: a neural network scores the match between left and right patches, producing a similarity/probability over candidate disparities]

57 of 66

Pyramid stereo matching network (PSMNet)


[Chang et al., Pyramid stereo matching network, 2018]

58 of 66

General architecture

Cost volume

Left feature map

Right feature map

59 of 66

General architecture

Left feature map

Right feature map

Cost volume
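A sketch of one common way (used in PSMNet-style networks) to build the cost volume: concatenate the left feature map with the right feature map shifted by each candidate disparity. Shapes and the disparity range are illustrative:

```python
# Concatenation-based cost volume from left/right feature maps.
import torch

def build_cost_volume(feat_l, feat_r, max_disp=48):
    # feat_l, feat_r: (B, C, H, W) feature maps from a shared encoder
    B, C, H, W = feat_l.shape
    cost = feat_l.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            cost[:, :C, d] = feat_l
            cost[:, C:, d] = feat_r
        else:
            cost[:, :C, d, :, d:] = feat_l[:, :, :, d:]
            cost[:, C:, d, :, d:] = feat_r[:, :, :, :-d]
    return cost   # later processed by 3D convolutions into per-disparity scores
```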

60 of 66

Continuous Disparity Network

[Figure: a neural network takes the left and right images and outputs, for each pixel, a probability distribution over candidate disparities]

[Garg et al., Wasserstein Distances for Stereo Disparity Estimation, 2020]

61 of 66

Continuous Disparity Network

[Figure: the network also predicts a continuous offset for each candidate disparity; the output disparity is the shifted mode of the probability distribution]

[Garg et al., Wasserstein Distances for Stereo Disparity Estimation, 2020]
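A sketch of how a per-pixel probability over discrete disparities can be turned into the output disparity, with and without a predicted offset; this illustrates the "mode" vs. "shifted mode" idea, not the exact formulation of Garg et al.:

```python
# From a disparity probability volume to a (continuous) disparity map.
import torch

def disparity_from_prob(prob, offset=None):
    # prob:   (B, D, H, W) probability over D candidate (integer) disparities
    # offset: (B, D, H, W) predicted continuous offset per candidate, or None
    mode_idx = prob.argmax(dim=1, keepdim=True)      # index of the mode, (B, 1, H, W)
    disp = mode_idx.float().squeeze(1)               # "output disparity = mode"
    if offset is not None:
        # "output disparity = shifted mode": add the offset predicted at the mode
        disp = disp + offset.gather(1, mode_idx).squeeze(1)
    return disp
```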

62 of 66

Without offset: the output disparity is the mode of the probability distribution over discrete disparities.

With offset: the output disparity is the shifted mode, i.e., the discrete mode adjusted by the predicted continuous offset.

[Figure: probability vs. disparity, without and with the offset]

63 of 66

Questions?


64 of 66

What if we have more images?

65 of 66

What if we have more images?

66 of 66

Can we synthesize images from other views?