1 of 42

CSE 5524: ConvNet & Transformer

2 of 42

Midterm exam and HW # 2

  • Midterm
    • 3/4, in class
    • Closed book, but you may bring 5 sheets of paper (10 pages, double-sided) as cheat sheets. Printed sheets are fine.
    • This is a chance to practice writing concise notes for yourself!
    • More information will be announced via Carmen!

  • HW # 2
    • Due: 3/2

3 of 42

Homework and quiz plan

  • HW 3: (10%)
    • Release: 3/18
    • Due: 4/1

  • HW 4: (10 – 20%)
    • Release: 4/1
    • Due 1: 4/15
    • Due 2: 4/22

  • Quiz (6%): coming soon

4 of 42

Final project (20 – 30%)

  • Team forming:
    • 2 – 3 students per team; the expectation is the same regardless of team size

  • Tentative plan:
    • Team forming: 2/28
    • Project sketch: 3/7 (2%)
    • Project proposal: 3/20 (3%)
    • Project presentation: 4/22 & 4/23 (10%)
    • Project report & code release: 4/25 (15%)

5 of 42

What will be in the midterm

  • Chapters 2, 5, 6, 7, 8
  • Chapters 9, 10, 11, 12, 13 (14 is skipped)
  • Chapters 15, 16, 17, 18 (partially) (19 is skipped)
  • Chapters 20, 21, 22, 23 (partially)
  • Chapter 24

6 of 42

What is this? How do you tell?

7 of 42

Filters to capture these “frequency” components

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

Fourier domain

Spatial domain
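As a small illustration of how a spatial pattern maps to the Fourier domain, the sketch below builds a sinusoidal grating and locates the peaks of its 2D spectrum. The image size (64) and the number of cycles (8) are arbitrary values chosen for this example, not taken from the slides.

```python
import numpy as np

# Build a 64x64 sinusoidal grating with 8 cycles across the image width
# (both numbers are illustrative assumptions).
size, cycles = 64, 8
x = np.arange(size)
grating = np.tile(np.cos(2 * np.pi * cycles * x / size), (size, 1))

# 2D Fourier transform; fftshift moves the zero frequency to the center.
spectrum = np.abs(np.fft.fftshift(np.fft.fft2(grating)))

# The energy concentrates at two symmetric points in the Fourier domain.
center = size // 2
peak_rows, peak_cols = np.nonzero(spectrum > 0.5 * spectrum.max())
peaks = sorted((int(r) - center, int(c) - center)
               for r, c in zip(peak_rows, peak_cols))
print(peaks)  # (vertical frequency, horizontal frequency) pairs
```

A single spatial frequency becomes a pair of points, which is why a filter that is localized in the Fourier domain can pick out one "frequency" component.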

8 of 42

Gabor filters

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]
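A Gabor filter can be written as a Gaussian envelope multiplied by a sinusoidal carrier. The following is a minimal NumPy sketch of a real-valued (cosine-phase) Gabor kernel and a small bank of orientations; the function name `gabor_kernel` and all parameter values (`size`, `sigma`, `wavelength`) are assumptions chosen for illustration.

```python
import numpy as np

def gabor_kernel(size=21, sigma=4.0, wavelength=8.0, theta=0.0, phase=0.0):
    """Real-valued Gabor filter: a Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the carrier oscillates along direction theta.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_rot / wavelength + phase)
    return envelope * carrier

# A small bank: 4 orientations at a single scale.
bank = [gabor_kernel(theta=k * np.pi / 4) for k in range(4)]

# The theta=0 filter responds strongly to a grating that varies along x
# (vertical stripes) and only weakly to one that varies along y.
y, x = np.mgrid[-10:11, -10:11]
response_x = np.sum(bank[0] * np.cos(2 * np.pi * x / 8.0))
response_y = np.sum(bank[0] * np.cos(2 * np.pi * y / 8.0))
print(response_x, response_y)
```

Each filter in the bank is tuned to one orientation and one wavelength, which hints at why covering many orientations and scales requires many filters.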

9 of 42

[Figure credit: A. Torralba, P. Isola, and W. T. Freeman, Foundations of Computer Vision.]

10 of 42

How many filters do I need? How far apart do they need to be?

11 of 42

It seems we need many!

12 of 42

What if we resize the image?

13 of 42

Questions?

14 of 42

Besides translation, scale is another factor to be invariant to

15 of 42

Typically, a single filter can only detect a pattern at one size

16 of 42

However, we can resize the image

17 of 42

Image pyramid
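An image pyramid can be sketched as repeated down-sampling, so that a fixed-size filter sees the same object at several apparent sizes. The version below uses 2x2 block averaging as the down-sampling step; this is a simplifying assumption (pyramids are often built with a Gaussian blur before subsampling), and the function names are illustrative.

```python
import numpy as np

def downsample_2x(image):
    """Halve each side by averaging non-overlapping 2x2 blocks (a box filter)."""
    # Drop an odd last row/column so the image splits evenly into 2x2 blocks.
    h, w = image.shape[0] // 2 * 2, image.shape[1] // 2 * 2
    blocks = image[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

def image_pyramid(image, num_levels=4):
    """Return [full resolution, 1/2, 1/4, ...] as a list of 2D arrays."""
    levels = [image]
    for _ in range(num_levels - 1):
        levels.append(downsample_2x(levels[-1]))
    return levels

image = np.arange(64 * 48, dtype=float).reshape(64, 48)  # toy grayscale image
pyramid = image_pyramid(image)
print([level.shape for level in pyramid])
```

Running the same filter on every level is one way to detect a pattern at several scales without designing a separate filter per size.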

18 of 42

Feature pyramid networks

[Feature Pyramid Networks for Object Detection, CVPR 2017]

19 of 42

Questions?

20 of 42

Revisit the bird image

How to apply?

21 of 42

Convolutional neural networks (CNN) for detection

22 of 42

Convolutional neural networks (CNN) for segmentation

23 of 42

CNN revisit

Shared weights

Vectorization + FC layers

Max pooling + down-sampling
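To make "max pooling + down-sampling" concrete, here is a minimal sketch of 2x2 max pooling with stride 2 on a single-channel feature map. The window size and the toy input values are assumptions for illustration.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on an (H, W) map; H and W are assumed even."""
    h, w = feature_map.shape
    # Group pixels into non-overlapping 2x2 blocks, then keep each block's maximum.
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])
pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4 2]
#  [2 8]]
```

The output has half the height and width of the input, and each output value reports only the strongest response in its 2x2 window.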

24 of 42

Convolution revisit

“Filter” weights (3-by-3-by-“2”)

[Figure: feature map (nodes) at layer t → inner product with the filter weights → feature map at layer t+1]

One filter for one output “channel” to capture a different “pattern” (e.g., edges, circles, eyes, etc.)
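The inner-product view of convolution can be sketched directly in NumPy: each output value is the inner product between one filter and the patch it covers, and each filter produces one output channel. The shapes below (a 5x5 map with 2 channels, four 3-by-3-by-2 filters) and the function names are assumptions for illustration; there is no padding and the stride is 1.

```python
import numpy as np

def conv2d_single_filter(feature_map, filt):
    """feature_map: (H, W, C_in); filt: (k, k, C_in). Returns (H-k+1, W-k+1).

    No padding, stride 1. As in most deep-learning libraries, the filter is
    not flipped, so this is technically cross-correlation.
    """
    H, W, _ = feature_map.shape
    k = filt.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = feature_map[i:i + k, j:j + k, :]
            out[i, j] = np.sum(patch * filt)  # inner product of patch and filter
    return out

def conv2d(feature_map, filters):
    """filters: (C_out, k, k, C_in). One filter per output channel."""
    return np.stack([conv2d_single_filter(feature_map, f) for f in filters],
                    axis=-1)

rng = np.random.default_rng(0)
layer_t = rng.integers(0, 2, size=(5, 5, 2)).astype(float)     # toy binary map
filters = rng.integers(0, 2, size=(4, 3, 3, 2)).astype(float)  # four filters
layer_t_plus_1 = conv2d(layer_t, filters)
print(layer_t_plus_1.shape)  # (3, 3, 4)
```

The same filter weights are reused at every spatial location, which is the "shared weights" property of a CNN.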

25 of 42

Convolution with [?] filters

26 of 42

Strided (down-sampled) convolution

27 of 42

Strided (down-sampled) convolution

28 of 42

Dilated convolution
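Stride and dilation are easiest to compare in one dimension. The sketch below is a hypothetical `conv1d` helper (no padding, no kernel flip) in which `stride` controls how far the window moves between outputs and `dilation` controls the spacing between the input samples that one output looks at. The signal and kernel values are toy choices.

```python
import numpy as np

def conv1d(signal, kernel, stride=1, dilation=1):
    """1D convolution (no padding, no kernel flip) with stride and dilation."""
    k = len(kernel)
    span = dilation * (k - 1) + 1              # input positions covered per output
    out_len = (len(signal) - span) // stride + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        start = i * stride
        taps = signal[start:start + span:dilation]  # every `dilation`-th sample
        out[i] = np.dot(taps, kernel)
    return out

signal = np.arange(10, dtype=float)   # 0, 1, ..., 9
kernel = np.array([1.0, 1.0, 1.0])

dense = conv1d(signal, kernel)                 # 8 outputs
strided = conv1d(signal, kernel, stride=2)     # 4 outputs
dilated = conv1d(signal, kernel, dilation=2)   # 6 outputs, each spanning 5 inputs
print(dense, strided, dilated, sep="\n")
```

With stride 2 the result equals every other output of the dense convolution (hence "down-sampled"), while dilation 2 keeps the number of weights at 3 but widens the receptive field from 3 to 5 inputs.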

29 of 42

CNN revisit

Shared weights

Vectorization + FC layers

Max pooling + down-sampling

30 of 42

A simple CNN classifier

31 of 42

Spatial output

32 of 42

Spatial output

33 of 42

Popular CNN architectures

  • Encoder, decoder

34 of 42

Popular CNN architectures

  • Encoder + decoder for segmentation

35 of 42

Popular CNN architectures

  • U-Net: Encoder + decoder + skip links
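The data flow of an encoder + decoder with skip links can be sketched with shapes alone. In the toy NumPy version below, `mix` is a stand-in for a learned convolution layer (a random 1x1 convolution followed by ReLU), `down` is 2x2 average pooling, and `up` is nearest-neighbor up-sampling; all of these, along with the channel counts, are assumptions for illustration rather than the actual U-Net layers. The skip link is implemented as channel-wise concatenation.

```python
import numpy as np

def down(x):
    """Encoder step: 2x2 average pooling on an (H, W, C) map (H, W even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def up(x):
    """Decoder step: nearest-neighbor 2x up-sampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def mix(x, c_out, rng):
    """Stand-in for a learned conv layer: random 1x1 convolution + ReLU."""
    weights = rng.standard_normal((x.shape[-1], c_out))
    return np.maximum(x @ weights, 0.0)

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))

# Encoder: resolution goes down, channels go up.
e1 = mix(image, 8, rng)                # (32, 32, 8)
e2 = mix(down(e1), 16, rng)            # (16, 16, 16)
bottleneck = mix(down(e2), 32, rng)    # (8, 8, 32)

# Decoder: up-sample, then concatenate the same-resolution encoder map (skip link).
d2 = mix(np.concatenate([up(bottleneck), e2], axis=-1), 16, rng)  # (16, 16, 16)
d1 = mix(np.concatenate([up(d2), e1], axis=-1), 8, rng)           # (32, 32, 8)
logits = d1 @ rng.standard_normal((8, 2))   # per-pixel scores for 2 classes
print(logits.shape)  # (32, 32, 2)
```

The skip links hand the decoder the high-resolution encoder features that were lost in down-sampling, which helps produce sharp per-pixel outputs for segmentation.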

36 of 42

Questions?

37 of 42

Vision transformer

Image (pixels)

38 of 42

CNN vs. Vision transformer

CNN

Convolution

Vision transformer

Transformer

39 of 42

Vision transformer

(1) Split an image into patches

(2) Vectorize each of them
  + encode each with a shared MLP
  + “spatial” encoding

1-layer of Transformer Encoder
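Steps (1) and (2) can be sketched in a few lines of NumPy. The example assumes a 224x224 RGB image and 16x16 patches (as in the title of the cited paper); the embedding width `d_model = 64`, the random weights, and the use of a single linear layer in place of the shared MLP are simplifying assumptions for illustration.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping patches and flatten each.

    Returns shape (num_patches, patch_size * patch_size * C), row-major order.
    """
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image size must be divisible by patch size"
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
tokens = patchify(image, patch_size=16)   # (196, 768): 14 x 14 patches

# Shared encoder: the same weights are applied to every patch.
d_model = 64                              # hypothetical embedding width
W_embed = rng.standard_normal((tokens.shape[1], d_model)) * 0.02
# One "spatial" encoding vector per patch position (random here, learned in practice).
pos_embed = rng.standard_normal((tokens.shape[0], d_model)) * 0.02
x = tokens @ W_embed + pos_embed          # (196, 64): input to the Transformer encoder
print(tokens.shape, x.shape)
```

Without the position term, the encoder would see the patches as an unordered set, since the shared projection treats every patch identically.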

[Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021]

40 of 42

Position encoding

[https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers]
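One common choice of position encoding is the fixed sinusoidal scheme from the original Transformer, sketched below. Whether a given vision transformer uses this or a learned position embedding is a design choice; the function name, the number of positions (196), and `d_model = 64` are assumptions for illustration.

```python
import numpy as np

def sinusoidal_position_encoding(num_positions, d_model, base=10000.0):
    """PE[pos, 2i] = sin(pos / base^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    assert d_model % 2 == 0, "d_model is assumed even"
    positions = np.arange(num_positions)[:, None]    # (num_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2), values 2i
    angles = positions / base ** (even_dims / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_position_encoding(num_positions=196, d_model=64)
print(pe.shape)  # (196, 64)
```

Low dimensions oscillate quickly with position and high dimensions slowly, so each position receives a distinct vector that is added to the corresponding patch embedding.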

41 of 42

1-layer of transformer encoder

K, Q, V: key, query, value

“learnable” matrices

Relatedness of patch-5 to others (after softmax)

Weighted value vectors

Single-head case
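The single-head case can be sketched as follows: each patch embedding is projected to a query, key, and value; the softmax over query-key scores gives the relatedness of one patch to all others; and the output is the correspondingly weighted sum of value vectors. The scaling by the square root of the key dimension follows the standard scaled dot-product form. The sizes (9 patches, `d_model = 16`, `d_head = 8`) and the random matrices are assumptions for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def single_head_self_attention(x, W_q, W_k, W_v):
    """x: (num_patches, d_model). Returns (output, attention weights)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v       # "learnable" projection matrices
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (num_patches, num_patches)
    weights = softmax(scores)                 # row i: relatedness of patch i to all
    return weights @ V, weights               # weighted sums of value vectors

rng = np.random.default_rng(0)
num_patches, d_model, d_head = 9, 16, 8
x = rng.standard_normal((num_patches, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))

out, weights = single_head_self_attention(x, W_q, W_k, W_v)
patch_5_weights = weights[4]   # "patch-5" in 1-based numbering
print(patch_5_weights.sum())   # the weights for one patch sum to 1
```

Unlike a convolution, whose receptive field is a fixed local window, every patch here can attend to every other patch in a single layer.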

42 of 42

CNN vs. Vision transformer

CNN

Convolutions

Vision transformer

Transformer