1 of 54

Introduction to computer vision (lecture 12)

Jean Ponce

jean.ponce@ens.fr

Zuhaib Akhtar za2023@nyu.edu

Ayush Jain aj3152@nyu.edu

Slides will be available after classes

2 of 54

Visual recognition

  • A bit of history

  • And where we stand

3 of 54

Variability:

Camera position

Illumination

Internal parameters

Within-class variations

4 of 54

Variability:

Camera position

Illumination

Internal parameters

Roberts (1963); Lowe (1987); Faugeras & Hebert (1986); Grimson & Lozano-Perez (1986); Huttenlocher & Ullman (1987)

5 of 54

Origins of computer vision

L. G. Roberts, Machine Perception of Three Dimensional Solids, Ph.D. thesis, MIT Department of Electrical Engineering, 1963.

photo credit: Joe Mundy

6 of 54

Matching as tree search

[Figure: interpretation tree over assignments (1,1), (1,2), …, (1,n), (1,nil), then (2,2), (2,3), …, (2,n), (2,nil)]

7 of 54

Matching as tree search

a priori factorial cost

[Figure: interpretation tree over assignments (1,1), (1,2), …, (1,n), (1,nil), then (2,2), (2,3), …, (2,n), (2,nil)]

8 of 54

Matching as tree search

a priori factorial cost

But geometric consistency helps!

[Figure: interpretation tree over assignments (1,1), (1,2), …, (1,n), (1,nil), then (2,2), (2,3), …, (2,n), (2,nil)]

(Ayache & Faugeras, 1982; Faugeras & Hebert 1986)
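The pruning idea above can be sketched in code. A minimal interpretation-tree search, assuming 2D point features and a rigid-motion (distance-preserving) pairwise consistency test chosen purely for illustration — not the exact formulation of any of the cited papers:

```python
def interpretation_tree(model, scene, consistent, partial=()):
    """Enumerate assignments of model features to scene features (or
    None = nil), depth-first. A branch is pruned as soon as one pair of
    assigned features fails the geometric consistency test, which is
    what tames the a priori factorial cost. (A full system would also
    forbid reusing a scene feature; omitted for brevity.)"""
    i = len(partial)
    if i == len(model):
        yield partial
        return
    for s in list(range(len(scene))) + [None]:
        if s is not None and any(
            p is not None and not consistent(model[j], model[i], scene[p], scene[s])
            for j, p in enumerate(partial)
        ):
            continue  # inconsistent pair: skip the whole subtree
        yield from interpretation_tree(model, scene, consistent, partial + (s,))


def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


# Example consistency test: pairwise distances must be preserved
# (true for rigid 2D motions; an assumption made for illustration).
def rigid(m1, m2, s1, s2):
    return abs(dist(m1, m2) - dist(s1, s2)) < 1e-6
```

For a three-point model and a scene containing a translated copy of the model plus a distractor point, the search recovers the correct assignment while pruning every branch that pairs a model point with the distractor.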

9 of 54

Huttenlocher & Ullman (1987)

10 of 54

From variability to invariance:

Camera position

Illumination

Internal parameters

Duda & Hart (1972); Weiss (1987); Mundy et al. (1992-94); Rothwell et al. (1992); Burns et al. (1993)

11 of 54

BUT: True 3D objects do not admit monocular viewpoint invariants (Burns et al., 1993)!

Projective invariants (Rothwell et al., 1992):

Example: affine invariants of coplanar points

[Figure: configurations of coplanar points with affine-invariant ratios 2/3 and 1/2]
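The ratios on this slide can be made concrete: for coplanar points, ratios of (signed) triangle areas are unchanged by any affine map, since an affine map with linear part A scales every area by det A. A small sketch (the specific points and map below are made up for illustration):

```python
def tri_area(p, q, r):
    # signed area of the triangle p, q, r
    return 0.5 * ((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1]))

def area_ratio(pts):
    # ratio of the areas of two triangles over four coplanar points:
    # an affine invariant, since both areas scale by det A
    a, b, c, d = pts
    return tri_area(a, b, c) / tri_area(a, b, d)

def affine(pts, A, t):
    # apply x -> A x + t to every point
    return [(A[0][0] * x + A[0][1] * y + t[0],
             A[1][0] * x + A[1][1] * y + t[1]) for x, y in pts]
```

Applying any invertible affine map to the four points leaves `area_ratio` unchanged, which is exactly what makes such ratios usable as viewpoint-invariant signatures for planar shapes.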

12 of 54

Empirical models of image variability:

Appearance-based techniques

Turk & Pentland (1991); Murase & Nayar (1995); etc.

13 of 54

Eigenfaces (Turk & Pentland, 1991)
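Eigenfaces are the principal components of a training set of face images treated as long vectors. A minimal PCA sketch (synthetic random "faces" stand in for real images; the variable names are mine, not Turk & Pentland's):

```python
import numpy as np

def eigenfaces(faces, k):
    """faces: (n, d) array, one flattened face image per row.
    Returns the mean face and the top-k principal directions
    ('eigenfaces'), computed via SVD of the centered data."""
    mean = faces.mean(axis=0)
    X = faces - mean
    # rows of Vt are the eigenvectors of X^T X, sorted by variance
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return mean, Vt[:k]

def project(face, mean, basis):
    # low-dimensional code: coordinates in the eigenface basis
    return basis @ (face - mean)

def reconstruct(code, mean, basis):
    # approximate the face from its code
    return mean + basis.T @ code
```

A new face is recognized by projecting it into this low-dimensional "face space" and comparing codes, rather than comparing raw pixels.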

14 of 54

Appearance manifolds

(Murase & Nayar, 1995)

15 of 54

Appearance manifolds

(Murase & Nayar, 1995)

16 of 54

Correlation-based template matching (60s)

Ballard & Brown (1980, Fig. 3.3). Courtesy Bob Fisher and Ballard & Brown on-line.

• Automated target recognition

• Industrial inspection

• Optical character recognition

• Stereo matching

• Pattern recognition
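The applications above all reduce to the same operation: slide a template over the image and score each window. A sketch of 1960s-style matching with normalized cross-correlation (the brute-force loop is for clarity, not efficiency):

```python
import numpy as np

def ncc_match(image, template):
    """Slide the template over the image and return the position and
    score of the best normalized cross-correlation match. Both window
    and template are mean-centered, so the score is invariant to
    additive and multiplicative brightness changes."""
    th, tw = template.shape
    t = template - template.mean()
    tn = np.linalg.norm(t)
    best, best_pos = -np.inf, None
    for y in range(image.shape[0] - th + 1):
        for x in range(image.shape[1] - tw + 1):
            w = image[y:y+th, x:x+tw]
            wc = w - w.mean()
            denom = np.linalg.norm(wc) * tn
            score = (wc * t).sum() / denom if denom > 0 else 0.0
            if score > best:
                best, best_pos = score, (y, x)
    return best_pos, best
```

This is exactly the kind of matcher that works "in the right conditions" (next slide) and fails under pose, illumination, and within-class variation.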

17 of 54

It may work in the right conditions

Slide credit: A. Torralba

18 of 54

But then again it may not

Slide credit: A. Torralba

19 of 54

In the late 1990s, a new approach emerges: combining local appearance, spatial constraints, invariants, and classification techniques from machine learning.

(Schmid & Mohr’97; Lowe’02; Mahamud & Hebert’03)

[Figure: query image and retrieved match, 10° off]

20 of 54

Late 1990s: Local appearance models

(Image courtesy of C. Schmid)

21 of 54

Late 1990s: Local appearance models

(Image courtesy of C. Schmid)

  • Find covariant features (interest points).

22 of 54

Late 1990s: Local appearance models

(Image courtesy of C. Schmid)

  • Find covariant features (interest points).
  • Match them using local invariant descriptors (jets, SIFT).

(Lowe 2004)

23 of 54

Late 1990s: Local appearance models

(Image courtesy of C. Schmid)

  • Find covariant features (interest points).
  • Match them using local invariant descriptors (jets, SIFT).
  • Optional: Filter out outliers using geometric consistency.

24 of 54

Late 1990s: Local appearance models

(Image courtesy of C. Schmid)

  • Find covariant features (interest points).
  • Match them using local invariant descriptors (jets, SIFT).
  • Optional: Filter out outliers using geometric consistency.
  • Vote.

See, for example, Schmid & Mohr (1996); Lowe (1999); Tuytelaars & Van Gool (2002); Rothganger et al. (2003); Ferrari et al. (2004).
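The descriptor-matching step of this pipeline is commonly implemented with Lowe's ratio test: accept a nearest-neighbour match only when it is clearly better than the second-nearest. A sketch (toy 2D "descriptors" stand in for 128-D SIFT vectors):

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Match each descriptor in desc1 to its nearest neighbour in
    desc2, keeping only matches that pass the ratio test: the nearest
    distance must be clearly smaller than the second-nearest, which
    rejects ambiguous matches."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j, k = np.argsort(dists)[:2]          # nearest and second-nearest
        if dists[j] < ratio * dists[k]:
            matches.append((i, int(j)))
    return matches
```

The surviving matches would then be filtered by geometric consistency and fed to the voting stage.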

25 of 54

Image retrieval in videos

Bags of words:

Visual “Google”

(Sivic & Zisserman, ICCV’03)

“Visual word” clusters

Vector quantization into histogram

(the “bag of words”)
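The quantization step can be sketched directly: assign each local descriptor to its nearest "visual word" and count. (In the real system the vocabulary comes from k-means on training descriptors; here it is just given.)

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Vector-quantize local descriptors against a vocabulary of
    'visual word' cluster centres and return the normalized histogram
    of word counts -- the bag-of-words representation of the image."""
    # pairwise distances, shape (n_descriptors, n_words)
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)                  # nearest word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()                  # L1-normalized histogram
```

Two images (or an image and a query region) are then compared by comparing these histograms, exactly as in text retrieval.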

26 of 54

Modeling Texton Distributions

[Figure: training image → filter responses → texton map]

Model = histogram of textons in the image

27 of 54

Analogy with Text Analysis

“The ZH-20 unit is a 200 Gigahertz processor with 2 Gigabyte memory. Its strength is its bus and high-speed memory…”

[Figure: word-frequency histograms over a vocabulary (Political, Government, Gigabyte, Observers, Election, Memory, Gigahertz, Bus). The histogram from the input fragment is compared with histograms from training “computer” and “political” fragments.]

28 of 54

Bags of words: Visual “Google” (Sivic & Zisserman, ICCV’03)

[Figure: select a region in a keyframe; retrieved shots]

29 of 54

Bags of words: Visual “Google” (Sivic & Zisserman, ICCV’03)

[Figure: select a region]

Interesting question: why does it work? Objects are not textures. There is no good reason (that I know of, at least) why the distribution of features in the region should match that of the features in the whole image.

30 of 54

Image categorization is harder

31 of 54

Structural part-based models

(aka “generalized cylinders”)

(Binford, 1971; Marr & Nishihara, 1978)

(Nevatia & Binford, 1972)

32 of 54

Idea: GCs should (roughly) project onto “ribbons”

33 of 54

Zhu and Yuille (1996)

Ponce et al. (1989)

Ioffe and Forsyth (2000)

Alas, this is hard to operationalize

34 of 54

Ultimate GCs: ACRONYM

(Brooks & Binford, 1981)

35 of 54

Categorization as supervised classification

[Figure: labelled training examples (beavers, chairs, trees) and a test image marked “??”]

36 of 54

Categorization as supervised classification

[Diagram: image “space” → (functional?) feature space ℝⁿ → prediction function → label, shown on a training datum]

37 of 54

Categorization as supervised classification

[Diagram: image “space” → (functional?) feature space ℝⁿ; the prediction function is linear (well, affine) in the features]

38 of 54

Categorization as supervised classification

[Diagram: image “space” → (functional?) feature space ℝⁿ; linear (well, affine) prediction, trained by minimizing a loss function plus a regularizer]
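The loss-plus-regularizer setup sketched on these slides is, in its standard form (written here from the usual conventions, not copied from the slide):

```latex
\min_{f \in \mathcal{F}} \;\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i, f(x_i)\bigr) \;+\; \lambda\,\Omega(f),
\qquad f(x) = w^{\top}\varphi(x) + b,
```

where $\ell$ is the loss function, $\Omega$ the regularizer, $\lambda > 0$ trades the two off, and the prediction function $f$ is linear (well, affine) in the features $\varphi(x)$.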

39 of 54

A linear classifier: Support vector machines (Boser, Guyon, Vapnik, 1992)

[Figure: maximum-margin separating hyperplane; linear (well, affine) decision function]

40 of 54

A linear classifier: Support vector machines (Boser, Guyon, Vapnik, 1992)

41 of 54

Other linear classifiers

Prefer convex losses!
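A convex-loss linear classifier in the SVM family can be sketched with the regularized hinge loss. This primal (sub)gradient version is a pedagogical stand-in, not the dual maximum-margin formulation of Boser, Guyon & Vapnik (1992); the hyperparameters below are arbitrary illustrative choices:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize (1/n) sum_i max(0, 1 - y_i (w.x_i + b)) + lam ||w||^2
    by (sub)gradient descent, with labels y in {-1, +1}. The hinge
    loss is convex, so plain gradient descent is well behaved."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                  # points violating the margin
        gw = 2 * lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        gb = -y[active].sum() / n
        w -= lr * gw
        b -= lr * gb
    return w, b
```

On linearly separable data the learned hyperplane separates the two classes; with a non-convex loss, the same descent procedure could get stuck, which is the point of "prefer convex losses".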

42 of 54

Spatial pyramids (Lazebnik et al., 2006)

(Swain & Ballard’91; Grauman & Darrell’05; Zhang et al.’06; Felzenszwalb’08)

[Figure: BoW (Csurka et al.’04) and HOG (Dalal & Triggs’05) features pooled over a pyramid of grids (Koenderink & van Doorn’99; Dalal & Triggs’05; Lazebnik, Schmid, Ponce’06; Chum & Zisserman’07)]

  • Bags of words = orderless models = histograms of visual words
  • Spatial pyramids = locally orderless models
  • Classifier: support vector machine = a linear classifier
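The "locally orderless" idea can be sketched as code: concatenate bag-of-words histograms computed over a pyramid of grids (1×1, 2×2, 4×4, …). This is a simplified sketch in the spirit of Lazebnik et al. (2006), without their per-level weighting:

```python
import numpy as np

def spatial_pyramid(positions, words, n_words, levels=2):
    """positions: feature locations in [0,1)^2; words: the visual-word
    index of each feature. Returns the concatenation of per-cell
    word histograms over grids of 1x1, 2x2, ..., (2^levels)^2 cells."""
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level
        for cy in range(cells):
            for cx in range(cells):
                in_cell = [w for (x, y), w in zip(positions, words)
                           if int(x * cells) == cx and int(y * cells) == cy]
                hist = (np.bincount(in_cell, minlength=n_words)
                        if in_cell else np.zeros(n_words, dtype=int))
                feats.append(hist)
    return np.concatenate(feats)
```

The level-0 block is exactly the orderless bag of words; the finer grids add coarse spatial layout while staying orderless within each cell.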

43 of 54

Discriminatively trained part-based models

(Felzenszwalb, Girshick, McAllester, Ramanan, 2008)

44 of 54

Detection by Classification

  • Detect objects in clutter by search

Car/non-car classifier

  • Sliding window: exhaustive search over position and scale

45 of 54

Detection by Classification

  • Detect objects in clutter by search

Car/non-car classifier

  • Sliding window: exhaustive search over position and scale

46 of 54

Detection by Classification

  • Detect objects in clutter by search

Car/non-car classifier

  • Sliding window: exhaustive search over position and scale

(can use same size window over a spatial pyramid of images)
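The exhaustive search over position and scale reduces to a few nested loops. A sketch, where `classify` is a placeholder for the window classifier (e.g. an SVM over HOG features) and the window size, stride, and scales are arbitrary illustrative values:

```python
def sliding_window_detect(image_w, image_h, classify,
                          win=32, stride=8, scales=(1.0, 0.5)):
    """Exhaustive search over position and scale: visit every window
    of every rescaled image. `classify` maps a window, given as
    (x, y, size) in original-image coordinates, to a score; windows
    with positive score are reported as detections."""
    detections = []
    for s in scales:
        w, h = int(image_w * s), int(image_h * s)
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                # a fixed-size window at scale s covers win/s original pixels
                bx, by, bsize = int(x / s), int(y / s), int(win / s)
                score = classify(bx, by, bsize)
                if score > 0:
                    detections.append((bx, by, bsize, score))
    return detections
```

Running the same fixed-size window over a pyramid of rescaled images (the parenthetical on this slide) is what makes a single-scale classifier detect objects at multiple sizes.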

47 of 54

The “revolution” of deep learning in 2012

(Krizhevsky, Sutskever, Hinton, 2012)

Take with a grain of salt

(And ResNets, GANs, RNNs, LSTMs, Transformers, etc.)

48 of 54

The “revolution” of deep learning in 2012

(Krizhevsky, Sutskever, Hinton, 2012)

(And ResNets, GANs, RNNs, LSTMs, Transformers, etc.)

Convolutional nets early 90s (LeCun et al.’98)

(And one should not forget Pomerleau 1980s.)

49 of 54

A common architecture for image classification

             Bags of words        HOG                  Spatial pyramids
Filtering    SIFT at keypoints    dense gradients      dense SIFT
Coding       vector quantization  vector quantization  sparse coding
Pooling      whole image, mean    coarse grid, mean    pyramid, max

(Sivic & Zisserman, 2003; Dalal & Triggs, 2005; Lazebnik et al., 2006; Boureau et al., 2010)

50 of 54

A common architecture for image classification

             Bags of words        HOG                  CNNs
Filtering    SIFT at keypoints    dense gradients      convolutions
Coding       vector quantization  vector quantization  nonlinearities
Pooling      whole image, mean    coarse grid, mean    multi-scale pooling

(Sivic & Zisserman, 2003; Dalal & Triggs, 2005; Lazebnik et al., 2006; Boureau et al., 2010)
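One CNN-style stage of this filtering → coding → pooling architecture can be written in a few lines; this is a bare illustration of the three operations, not a trainable network:

```python
import numpy as np

def conv2d(img, kernel):
    """Filtering: valid-mode 2D cross-correlation of img with kernel."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = (img[y:y+kh, x:x+kw] * kernel).sum()
    return out

def relu(x):
    """Coding: a pointwise nonlinearity."""
    return np.maximum(x, 0.0)

def max_pool(x, k=2):
    """Pooling: max over non-overlapping k x k blocks."""
    h, w = x.shape[0] // k, x.shape[1] // k
    return x[:h*k, :w*k].reshape(h, k, w, k).max(axis=(1, 3))

def stage(img, kernel):
    """One filtering -> coding -> pooling stage, CNN-style."""
    return max_pool(relu(conv2d(img, kernel)))
```

Stacking such stages, with learned filter banks instead of hand-designed SIFT or HOG, is the structural link between the classical pipelines on the left of the table and CNNs on the right.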

51 of 54

Deep learning

  • Representation learning

  • History of neural networks

  • Training

  • CNNs

52 of 54

Image categorization as representation learning

[Diagram: image “space” → feature (Hilbert) space ℝⁿ, with learned parameters θ at each stage of the mapping]

53 of 54

Deep learning

54 of 54

Deep learning

[Diagram: Layer 1 → Layer 2 → … → Layer n → linear head; the stack of layers is the learned representation]