1 of 66

CSE 5524: Transfer learning & Stereo


2 of 66

HW 3 & HW 4 & quizzes

  • HW 3
    • Caution: Please re-download the data
    • Due: 4/11/2025

  • HW 4
    • Plan: A lighter homework
    • Due: 4/21/2025

  • Quizzes:
    • Two quizzes will be released in the next two weeks --- True/False and multiple-choice questions, unlimited tries

3 of 66

Today (37 & 40)

  • Recap
  • Transfer learning & adaptation
  • Stereo vision


4 of 66

Recap: Domain gap

Domain gap

5 of 66

Recap: Domain gap

6 of 66

Recap: Data augmentation

  • Increase the amount, diversity, and coverage of the data
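A minimal torchvision sketch of on-the-fly augmentation; the specific transforms and magnitudes below are illustrative choices, not from the slides:

```python
# A minimal sketch of common image augmentations using torchvision.
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.5, 1.0)),                  # random crop + rescale
    T.RandomHorizontalFlip(p=0.5),                                # mirror the image
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # photometric jitter
    T.ToTensor(),
])

# Applied on the fly, each epoch sees a different variant of every image,
# increasing the effective amount and diversity of training data.
```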

7 of 66

Recap: Data augmentation

8 of 66

Today (37 & 40)

  • Recap
  • Transfer learning & adaptation
  • Stereo vision


9 of 66

Data is important …

The existence of domain gaps implies that we need to “re-collect” training data

product images

ImageNet

web images

10 of 66

Data is important …

The existence of new tasks implies that we need to “re-collect” training data

Bird species

Dog breeds

Car brands & styles

11 of 66

Neural networks are data hungry…

Sufficient labeled data

ImageNet-1K (ILSVRC)

1,000 object classes

1,000 training images per class

12 of 66

Humans do not re-learn from scratch …

  • Once we learn certain skills (e.g., recognizing bird species), those skills can usually be “transferred” to other related tasks (e.g., recognizing dog breeds).

  • We typically need only a few examples (e.g., some images of different dog breeds) to pick up the new task

  • Can neural networks do so?

13 of 66

Transfer learning

  • Transfer knowledge learned from prior tasks to new tasks

14 of 66

Transfer learning

  • Transfer knowledge learned from prior tasks to new tasks

15 of 66

Transfer learning

  • Transfer knowledge learned from prior tasks to new tasks

How to achieve “adaptation”?

Different “distributions”

Different “labels”

16 of 66

Pre-training and fine-tuning

Probably don’t need to change a lot!

17 of 66

CNN revisit


18 of 66

Pre-training and fine-tuning

Probably don’t need to change a lot!

19 of 66

Pre-training and fine-tuning

20 of 66

Pre-training and fine-tuning

21 of 66

Algorithm
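A minimal PyTorch sketch of the pre-training + fine-tuning recipe; the ImageNet-pretrained ResNet-18 backbone and the 10-class target task are illustrative assumptions:

```python
# Pre-training + fine-tuning, minimal sketch.
import torch
import torch.nn as nn
import torchvision

# 1) Start from a model pre-trained on a large source dataset (e.g., ImageNet).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# 2) Replace the prediction head to match the new task's label space.
num_new_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_new_classes)

# 3) Fine-tune on the (smaller) target dataset, typically with a small learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```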

22 of 66

Questions?


23 of 66

This paradigm “pre-training + fine-tuning” is everywhere

24 of 66

This paradigm “pre-training + fine-tuning” is everywhere

25 of 66

This paradigm “pre-training + fine-tuning” is everywhere

Rank      Onoclea sensibilis   Onoclea hintonii   Ficus auriculata
Kingdom   Plantae              Plantae            Plantae
Phylum    Tracheophyta         Tracheophyta       Tracheophyta
Class     Polypodiopsida       Polypodiopsida     Rosids
Order     Polypodiales         Polypodiales       Rosales
Family    Onocleaceae          Onocleaceae        Moraceae
Genus     Onoclea              Onoclea            Ficus
Species   sensibilis           hintonii           F. auriculata

[Figure: a vision encoder paired with an autoregressive text representation embeds images and class descriptions]

CLIP (Contrastive Language-Image Pre-training) matches images to text, which can

  • capture hierarchical relationships between classes
  • enable few-shot or even zero-shot learning
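A sketch of zero-shot classification with CLIP using the open-source `clip` package from OpenAI; the image path, class names, and prompt template are illustrative:

```python
# Zero-shot classification with a CLIP-style model: score an image against
# text prompts built from the class names; no task-specific training needed.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["Onoclea sensibilis", "Onoclea hintonii", "Ficus auriculata"]
text = clip.tokenize([f"a photo of {c}" for c in class_names]).to(device)
image = preprocess(Image.open("plant.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    # Cosine similarity between the image embedding and each class prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```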

26 of 66

This paradigm “pre-training + fine-tuning” is everywhere

27 of 66

Fine-tuning a subset of neural networks

[Figure: Image → Feature encoder → Feature vector → Prediction head → Label]

28 of 66

Fine-tuning a subset of neural networks

[Figure: Image → Feature encoder → Feature vector → Prediction head → Label]

29 of 66

Fine-tuning a subset of neural networks

[Figure: Image → Feature encoder → Feature vector → Prediction head → Label]

30 of 66

Fine-tuning a subset of neural networks

[Figure: Image → Feature encoder → Feature vector → Prediction head → Label]

  • Full fine-tuning: Update everything

  • (Linear) probing: Only update the (linear) prediction head

  • Parameter-efficient fine-tuning: Update a subset of parameters (see the sketch below)
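A minimal sketch contrasting the three regimes, assuming a torchvision ResNet-18 whose `fc` layer plays the role of the prediction head; which layers count as the "subset" for parameter-efficient fine-tuning is an illustrative choice:

```python
# Full fine-tuning vs. linear probing vs. a simple parameter-efficient variant.
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 10)   # new prediction head for the target task

def set_trainable(model, mode):
    if mode == "full":                 # full fine-tuning: update everything
        for p in model.parameters():
            p.requires_grad = True
    elif mode == "linear_probe":       # only update the (linear) prediction head
        for p in model.parameters():
            p.requires_grad = False
        for p in model.fc.parameters():
            p.requires_grad = True
    elif mode == "peft":               # update a chosen subset, e.g., last block + head
        for p in model.parameters():
            p.requires_grad = False
        for p in list(model.layer4.parameters()) + list(model.fc.parameters()):
            p.requires_grad = True

set_trainable(model, "linear_probe")
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```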

31 of 66

Parameter efficient fine-tuning

[Figure: a pre-trained network's MLP blocks, each with a newly added "MLP - new" module alongside it]

32 of 66

Parameter efficient fine-tuning

[Figure: an attention layer's key, query, and value projections (K, Q, V) are "learnable" matrices; LoRA forms K' from the frozen K by adding a low-rank update, and likewise adds new modules next to the MLPs]

LoRA: Low-Rank Adaptation
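A minimal sketch of the LoRA idea: freeze a pre-trained projection W and learn only a low-rank update, so the adapted weight is W + (alpha/r)·BA. Layer sizes, rank, and scaling below are illustrative:

```python
# LoRA (Low-Rank Adaptation) applied to a single linear projection.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = linear                          # frozen pre-trained projection (e.g., W_K)
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(linear.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Original output plus the low-rank correction (x A^T) B^T.
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

# Example: wrap the key projection of an attention layer; only A and B are trained.
key_proj = nn.Linear(768, 768)
key_proj_lora = LoRALinear(key_proj, r=8)
print(sum(p.numel() for p in key_proj_lora.parameters() if p.requires_grad))
```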

33 of 66

Parameter efficient fine-tuning

34 of 66

Questions?


35 of 66

Learning from a teacher

  • Typically: learning from data

  • What if we have a very powerful but computationally expensive pre-trained model?

36 of 66

Learning from a teacher: knowledge distillation

37 of 66

Learning from a teacher: knowledge distillation

38 of 66

Learning from a teacher: knowledge distillation

39 of 66

Learning from a teacher: knowledge distillation

A teacher can:

  • convey semantic relatedness between classes (via soft labels)
  • denoise the training signal
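A minimal sketch of a standard distillation loss, where the student matches the teacher's temperature-softened probabilities in addition to the hard labels; the temperature and mixing weight are illustrative:

```python
# Knowledge distillation loss: soft targets from a frozen teacher + hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets from the teacher, softened by temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                         soft_targets, reduction="batchmean") * (T * T)
    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```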

40 of 66

Questions?


41 of 66

Prompting

42 of 66

Prompting

43 of 66

Prompting

44 of 66

Prompting

45 of 66

Prompting

46 of 66

Prompting

47 of 66

Prompting

  • Provide the “context of the task” in the input data!

48 of 66

Visual Prompt Tuning


Visual Prompt Tuning [Menglin Jia et al.]
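A rough sketch of the visual-prompt-tuning idea: learnable prompt tokens are prepended to the patch tokens of a frozen transformer, and only the prompts and the head are trained. This is a simplified illustration, not the exact architecture of Jia et al.:

```python
# Learnable prompt tokens prepended to a frozen ViT-style encoder.
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, vit_blocks: nn.Module, embed_dim=768, num_prompts=10, num_classes=10):
        super().__init__()
        self.blocks = vit_blocks                        # frozen transformer blocks
        for p in self.blocks.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))  # learnable
        self.head = nn.Linear(embed_dim, num_classes)                        # learnable

    def forward(self, patch_tokens):                    # patch_tokens: (B, N, D)
        B = patch_tokens.shape[0]
        tokens = torch.cat([self.prompts.expand(B, -1, -1), patch_tokens], dim=1)
        tokens = self.blocks(tokens)                    # frozen encoder processes all tokens
        return self.head(tokens.mean(dim=1))            # pool and classify
```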

49 of 66

Questions?


50 of 66

3D reconstruction

  • How can we reconstruct 3D from image(s)?

Depth estimation and 3D reconstruction

51 of 66

3D reconstruction

  • Making “assumptions”

    • Parallel projections
    • Flat surfaces

52 of 66

3D reconstruction

  • How do humans do it?

    • Learn from “experience” – monocular depth estimation
    • Leverage “two eyes” – stereo depth estimation

53 of 66

Stereo depth estimation

A 3D point seen by two rectified cameras projects to x_l in the left image I_l and x_r in the right image I_r.

  Disparity:  D = x_l − x_r
  Depth:      Z = f · B / D   (f: focal length, B: baseline)

Depth is inversely proportional to disparity.
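A tiny worked example of Z = f · B / D; the focal length, baseline, and disparity values below are made up:

```python
# Converting a disparity map to a depth map with Z = f * B / D.
import numpy as np

f = 700.0        # focal length in pixels
B = 0.12         # baseline in meters
disparity = np.array([[35.0, 70.0], [14.0, 7.0]])    # disparity map in pixels

depth = f * B / np.maximum(disparity, 1e-6)           # avoid division by zero
print(depth)     # 35 px -> 2.4 m; doubling the disparity halves the depth
```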

54 of 66

Stereo depth estimation

[Figure: left and right input images, the estimated disparity map, and the corresponding depth map]

55 of 66

Stereo depth estimation

[Figure: a window from the left image is compared against windows in the right image at different horizontal shifts; similarity is plotted as a function of disparity, and the best-matching shift gives the disparity]
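A minimal NumPy sketch of classical block matching along these lines: for each pixel, compare a window in the left image against shifted windows in the right image and keep the disparity with the lowest SSD cost. Window size and disparity range are illustrative:

```python
# Winner-take-all block matching for a rectified stereo pair (grayscale images).
import numpy as np

def block_matching(left, right, max_disp=64, win=5):
    H, W = left.shape
    half = win // 2
    disp = np.zeros((H, W), dtype=np.float32)
    for y in range(half, H - half):
        for x in range(half, W - half):
            patch_l = left[y-half:y+half+1, x-half:x+half+1]
            best_d, best_cost = 0, np.inf
            for d in range(0, min(max_disp, x - half) + 1):
                patch_r = right[y-half:y+half+1, x-d-half:x-d+half+1]
                cost = np.sum((patch_l - patch_r) ** 2)     # SSD as the (dis)similarity
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d                              # best-matching shift
    return disp
```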

56 of 66

Stereo depth estimation

[Figure: a neural network scores the match between left and right patches, producing a similarity/probability over candidate disparities]

57 of 66

Pyramid stereo matching network (PSMNet)


[Chang et al., Pyramid stereo matching network, 2018]

58 of 66

General architecture

Cost volume

Left feature map

Right feature map

59 of 66

General architecture

Left feature map

Right feature map

Cost volume
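A sketch of one common way (used in PSMNet-style networks) to build the cost volume: concatenate the left feature map with the right feature map shifted by each candidate disparity. Shapes and the disparity range are illustrative:

```python
# Concatenation-based cost volume from left/right feature maps.
import torch

def build_cost_volume(feat_l, feat_r, max_disp=48):
    # feat_l, feat_r: (B, C, H, W) feature maps from a shared encoder
    B, C, H, W = feat_l.shape
    cost = feat_l.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            cost[:, :C, d] = feat_l
            cost[:, C:, d] = feat_r
        else:
            cost[:, :C, d, :, d:] = feat_l[:, :, :, d:]
            cost[:, C:, d, :, d:] = feat_r[:, :, :, :-d]
    return cost   # later processed by 3D convolutions into per-disparity scores
```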

60 of 66

Continuous Disparity Network

[Figure: a neural network takes the left and right images and outputs, for each pixel, a probability distribution over candidate disparities]

[Garg et al., Wasserstein Distances for Stereo Disparity Estimation, 2020]

61 of 66

Continuous Disparity Network

[Figure: the network also predicts a continuous offset for each candidate disparity; the output disparity is the shifted mode of the probability distribution]

[Garg et al., Wasserstein Distances for Stereo Disparity Estimation, 2020]
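A sketch of how a per-pixel probability over discrete disparities can be turned into the output disparity, with and without a predicted offset; this illustrates the "mode" vs. "shifted mode" idea, not the exact formulation of Garg et al.:

```python
# From a disparity probability volume to a (continuous) disparity map.
import torch

def disparity_from_prob(prob, offset=None):
    # prob:   (B, D, H, W) probability over D candidate (integer) disparities
    # offset: (B, D, H, W) predicted continuous offset per candidate, or None
    mode_idx = prob.argmax(dim=1, keepdim=True)      # index of the mode, (B, 1, H, W)
    disp = mode_idx.float().squeeze(1)               # "output disparity = mode"
    if offset is not None:
        # "output disparity = shifted mode": add the offset predicted at the mode
        disp = disp + offset.gather(1, mode_idx).squeeze(1)
    return disp
```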

62 of 66

Without offset: the output disparity is the mode of the probability distribution over discrete disparities.

With offset: the output disparity is the shifted mode, i.e., the discrete mode adjusted by the predicted continuous offset.

[Figure: probability vs. disparity, without and with the offset]

63 of 66

Questions?


64 of 66

What if we have more images?

65 of 66

What if we have more images?

66 of 66

Can we synthesize images from other views?