Introduction to computer vision 12
Jean Ponce
Zuhaib Akhtar za2023@nyu.edu
Ayush Jain aj3152@nyu.edu
Slides will be available after classes
Visual recognition
Variability:
Camera position
Illumination
Internal parameters
Within-class variations
Roberts (1963); Lowe (1987); Faugeras & Hebert (1986); Grimson &
Lozano-Perez (1986); Huttenlocher & Ullman (1987)
Origins of computer vision
L. G. Roberts, Machine Perception of Three Dimensional Solids, Ph.D. thesis, MIT Department of Electrical Engineering, 1963.
photo credit: Joe Mundy
Matching as tree search
[Figure: interpretation tree with nodes (1,1), (1,2), …, (1,n), (1,nil) at the first level and (2,2), (2,3), …, (2,n), (2,nil) below]
A priori: factorial cost
But geometric consistency helps!
(Ayache & Faugeras, 1982; Faugeras & Hebert, 1986)
Huttenlocher & Ullman (1987)
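The interpretation-tree idea above can be sketched as a depth-first search: each model feature is assigned a scene feature (or the "nil" branch), and branches violating pairwise geometric consistency are pruned. A minimal sketch, assuming 2D point features and rigid (distance-preserving) consistency; the point sets and tolerance are made up for illustration:

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def consistent(pairs, m_new, s_new, tol=1e-6):
    # new assignment must preserve pairwise distances w.r.t. earlier ones
    if s_new is None:
        return True
    for m, s in pairs:
        if s is not None and abs(dist(m, m_new) - dist(s, s_new)) > tol:
            return False
    return True

def match(model, scene, pairs=()):
    # Each tree level assigns the next model feature to a scene feature
    # or to "nil" (unmatched); inconsistent branches are pruned.
    if len(pairs) == len(model):
        return pairs if any(s is not None for _, s in pairs) else None
    m = model[len(pairs)]
    used = {s for _, s in pairs if s is not None}
    for s in list(scene) + [None]:
        if s in used:
            continue                      # each scene point used at most once
        if consistent(pairs, m, s):
            found = match(model, scene, pairs + ((m, s),))
            if found is not None:
                return found
    return None

model = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
scene = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]   # the model translated by (5, 5)
result = match(model, scene)
```

Without the consistency test, the search would explore the full factorial set of assignments; with it, most branches die after one or two levels.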
Handling variability: invariance to
Camera position
Illumination
Internal parameters
Duda & Hart ( 1972); Weiss (1987); Mundy et al. (1992-94);
Rothwell et al. (1992); Burns et al. (1993)
BUT: True 3D objects do not admit monocular viewpoint invariants (Burns et al., 1993)!
Projective invariants (Rothwell et al., 1992):
Example: affine invariants of coplanar points
[Figure: two affine views of coplanar points; the area ratios 2/3 and 1/2 are the same in both views]
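One such invariant is easy to check numerically: for coplanar points, ratios of triangle areas are preserved by any affine map. A small sketch with made-up points and a made-up (invertible) affine transform:

```python
import numpy as np

def tri_area(p, q, r):
    # 2D triangle area via the determinant of the edge vectors
    u, v = q - p, r - p
    return 0.5 * abs(u[0] * v[1] - u[1] * v[0])

pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
A = np.array([[1.3, 0.4], [-0.2, 0.9]])    # linear part, det != 0
t = np.array([5.0, -1.0])                  # translation
mapped = pts @ A.T + t

# each area is scaled by |det A|, so the ratio is unchanged
r1 = tri_area(pts[0], pts[1], pts[2]) / tri_area(pts[0], pts[1], pts[3])
r2 = tri_area(mapped[0], mapped[1], mapped[2]) / tri_area(mapped[0], mapped[1], mapped[3])
```

The same ratio computed in two "views" agrees, which is what lets such quantities index coplanar point configurations.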
Empirical models of image variability:
Appearance-based techniques
Turk & Pentland (1991); Murase & Nayar (1995); etc.
Eigenfaces (Turk & Pentland, 1991)
Appearance manifolds
(Murase & Nayar, 1995)
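The eigenfaces recipe reduces to principal component analysis on flattened images: subtract the mean face and keep the top singular directions; projections onto them are the low-dimensional face descriptors. A minimal sketch on tiny synthetic "images" (the data and the choice of 5 components are illustrative, not from the original work):

```python
import numpy as np

rng = np.random.default_rng(0)
faces = rng.normal(size=(20, 64))          # 20 flattened 8x8 "face images"

mean = faces.mean(axis=0)                  # the "mean face"
# SVD of the centered data: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(faces - mean, full_matrices=False)
eigenfaces = Vt[:5]                        # top-5 "eigenfaces"
coords = (faces - mean) @ eigenfaces.T     # 5-D descriptor per image
```

Recognition then compares descriptors (e.g. by nearest neighbor) in this 5-D space instead of the raw 64-D pixel space.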
Correlation-based template matching (60s)
Ballard & Brown (1980, Fig. 3.3). Courtesy Bob Fisher
and Ballard & Brown on-line.
• Automated target recognition
• Industrial inspection
• Optical character recognition
• Stereo matching
• Pattern recognition
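The correlation-based approach can be sketched as brute-force normalized cross-correlation: slide the template over the image and score each window by its correlation with the template after mean subtraction. The image, template, and paste location below are made up for illustration:

```python
import numpy as np

def ncc_match(image, template):
    """Brute-force normalized cross-correlation; returns best (row, col), score."""
    th, tw = template.shape
    t = template - template.mean()
    tnorm = np.linalg.norm(t)
    best, best_pos = -2.0, None
    for i in range(image.shape[0] - th + 1):
        for j in range(image.shape[1] - tw + 1):
            w = image[i:i + th, j:j + tw]
            w = w - w.mean()
            denom = np.linalg.norm(w) * tnorm
            score = (w * t).sum() / denom if denom > 0 else 0.0
            if score > best:
                best, best_pos = score, (i, j)
    return best_pos, best

rng = np.random.default_rng(0)
template = rng.normal(size=(3, 3))
image = np.zeros((10, 12))
image[3:6, 4:7] = template                 # paste the pattern at (3, 4)
best_pos, best = ncc_match(image, template)
```

An exact copy of the template scores 1.0 at its true location; the fragility shown on the next slides comes from the fact that any geometric or photometric change breaks this pixelwise agreement.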
It may work in the right conditions
Slide credit: A. Torralba
But then again it may not
Lowe’02
Mahamud & Hebert’03
In the late 1990s, a new approach emerges:
Combining local appearance, spatial constraints, invariants,
and classification techniques from machine learning.
Query
Retrieved (10° off)
Schmid & Mohr’97
Late 1990s: Local appearance models
(Image courtesy of C. Schmid)
(Lowe 2004)
See, for example, Schmid & Mohr (1996); Lowe (1999); Tuytelaars & Van Gool (2002); Rothganger et al. (2003); Ferrari et al. (2004).
Late 1990s: Local appearance models
Image retrieval in videos
Bags of words:
Visual “Google”
(Sivic & Zisserman, ICCV’03)
“Visual word” clusters
Vector quantization into histogram
(the “bag of words”)
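The quantization step above can be sketched in a few lines: assign each local descriptor to its nearest "visual word" in a vocabulary (in practice learned by k-means; here a hypothetical precomputed set of centers) and count occurrences. The dimensions and data are illustrative:

```python
import numpy as np

def bag_of_words(descriptors, vocabulary):
    # assign each descriptor to its nearest visual word (Euclidean distance)
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    # count word occurrences and normalize into a histogram
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(1)
vocab = rng.normal(size=(8, 16))           # 8 visual words, 16-D descriptors
desc = rng.normal(size=(100, 16))          # 100 descriptors from one image
h = bag_of_words(desc, vocab)
```

The resulting fixed-length histogram is what gets indexed and compared, regardless of how many local features the image produced.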
Modeling Texton Distributions
Training
image
Filter Responses
Texton Map
Model = Histogram of textons in the image
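The texton pipeline follows the same pattern at the pixel level: compute filter responses everywhere, quantize each pixel's response vector against a texton codebook, and pool the labels into a histogram. A crude sketch with two difference filters and a hypothetical 4-texton codebook (a real system would use a proper filter bank and learned textons):

```python
import numpy as np

rng = np.random.default_rng(2)
img = rng.normal(size=(32, 32))            # stand-in for a training image

# two crude filter responses: horizontal and vertical differences
fx = np.zeros_like(img); fx[:, 1:] = img[:, 1:] - img[:, :-1]
fy = np.zeros_like(img); fy[1:, :] = img[1:, :] - img[:-1, :]
resp = np.stack([fx, fy], axis=-1).reshape(-1, 2)   # 2-D response per pixel

textons = rng.normal(size=(4, 2))          # hypothetical 4-texton codebook
labels = ((resp[:, None, :] - textons[None]) ** 2).sum(-1).argmin(1)
texton_map = labels.reshape(img.shape)     # per-pixel texton label
hist = np.bincount(labels, minlength=4) / labels.size
```

`texton_map` is the intermediate "Texton Map" of the slide; `hist` is the final model, a histogram of textons over the image.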
Analogy with Text Analysis
Input fragment: "The ZH-20 unit is a 200 Gigahertz processor with 2 Gigabyte memory. Its strength is its bus and high-speed memory……"
Vocabulary: Political, Government, Gigabyte, Observers, Election, Memory, Gigahertz, Bus
[Histogram: frequency of occurrence over the vocabulary, from the input fragment]
[Histogram: frequency of occurrence over the vocabulary, from training “computer” fragments]
[Histogram: frequency of occurrence over the vocabulary, from training “political” fragments]
Compare the input histogram with each class histogram
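The "compare" step can be sketched with the chi-squared distance between normalized word histograms, a common choice for bag-of-words comparison. The counts below are illustrative stand-ins for the slide's figure, not its actual data:

```python
import numpy as np

def chi2(h, g, eps=1e-12):
    # chi-squared distance between two normalized histograms
    return 0.5 * ((h - g) ** 2 / (h + g + eps)).sum()

# vocabulary order: Political, Government, Gigabyte, Observers,
#                   Election, Memory, Gigahertz, Bus
doc      = np.array([0., 0., 3., 0., 0., 2., 1., 1.])  # input fragment
computer = np.array([0., 0., 5., 0., 0., 4., 3., 2.])  # "computer" class
politics = np.array([4., 5., 0., 3., 4., 0., 0., 0.])  # "political" class
doc, computer, politics = (v / v.sum() for v in (doc, computer, politics))
```

The fragment's histogram sits much closer to the "computer" class than to the "political" one, so it is labeled accordingly.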
Bags of words:
Visual “Google”
(Sivic & Zisserman, ICCV’03)
Retrieved shots
Select a region
Interesting question: why does it work? Objects are not textures.
There is no good reason (that I know of, at least) why the distribution of features in the region should match that of the features in the whole image.
Image categorization is harder
Structural part-based models
(aka “generalized cylinders”)
(Binford, 1971; Marr & Nishihara, 1978)
(Nevatia & Binford, 1972)
Idea: GCs should (roughly) project onto “ribbons”
Zhu and Yuille (1996)
Ponce et al. (1989)
Ioffe and Forsyth (2000)
Alas, this is hard to operationalize
Ultimate GCs: ACRONYM
(Brooks & Binford, 1981)
Labelled training examples
Beavers
Chairs
Trees
Test image
??
Categorization as supervised classification
Image “space” (ℝⁿ) → (functional?) feature space → prediction function → label
A training datum is an (image, label) pair.
The prediction function is linear (well, affine) in the features, and is learned by minimizing a loss function plus a regularizer.
A linear classifier: support vector machines (Boser, Guyon, Vapnik, 1992)
linear (well, affine)
Other linear classifiers
Prefer convex losses!
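The loss-plus-regularizer recipe can be sketched for a linear (well, affine) SVM: subgradient descent on the regularized hinge loss, mean(max(0, 1 − y(w·x + b))) + (λ/2)‖w‖². The synthetic two-blob data and step sizes are made up for illustration; this is not any library's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
# two well-separated 2-D classes, labels in {-1, +1}
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)

w, b = np.zeros(2), 0.0
lam, lr = 1e-2, 0.1
for _ in range(200):
    active = y * (X @ w + b) < 1.0         # margin violators (hinge is active)
    # subgradient of mean hinge loss + L2 regularizer
    gw = lam * w - (y[active, None] * X[active]).sum(axis=0) / len(X)
    gb = -y[active].sum() / len(X)
    w -= lr * gw
    b -= lr * gb

accuracy = (np.sign(X @ w + b) == y).mean()
```

The hinge loss is convex, so this simple descent reliably finds a good separator; that is precisely the "prefer convex losses" point.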
Spatial pyramids (Lazebnik et al., 2006)
(Swain & Ballard’91, Grauman & Darrell’05, Zhang et al.’06, Felzenszwalb’08)
(Koenderink & Van Doorn’99; Dalal
& Triggs’05; Lazebnik, Schmid,
Ponce’06; Chum & Zisserman’07)
(Koenderink & van Doorn’99)
BoW (Csurka et al.’04)
HOG (Dalal & Triggs’05)
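Spatial pyramid pooling can be sketched as histograms at several spatial resolutions concatenated into one vector; here just levels 0 (whole image) and 1 (2×2 cells), over a made-up map of per-location visual-word labels:

```python
import numpy as np

def spatial_pyramid(word_map, n_words):
    # level 0: one histogram over the whole map
    feats = [np.bincount(word_map.ravel(), minlength=n_words)]
    # level 1: one histogram per 2x2 spatial cell
    h, w = word_map.shape
    for i in (0, 1):
        for j in (0, 1):
            cell = word_map[i * h // 2:(i + 1) * h // 2,
                            j * w // 2:(j + 1) * w // 2]
            feats.append(np.bincount(cell.ravel(), minlength=n_words))
    return np.concatenate(feats).astype(float)

rng = np.random.default_rng(4)
word_map = rng.integers(0, 10, size=(16, 16))   # per-location visual words
f = spatial_pyramid(word_map, 10)
```

Unlike a plain bag of words, the concatenated vector records roughly *where* in the image each word occurs, which is what helps categorization.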
(Felzenszwalb, Girshick, McAllester, Ramanan, 2008)
Discriminatively trained part-based models
Detection by Classification
Car/non-car classifier
(can use same size window over a spatial pyramid of images)
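The detection-by-classification loop is simple to sketch: slide a fixed-size window over the image, score each window with the classifier, and keep the best. The image, "object", and scoring function below are toy stand-ins for a trained car/non-car classifier:

```python
import numpy as np

def detect(image, win, score_fn, step=1):
    # slide a win x win window over the image; keep the best-scoring one
    best, best_box = -np.inf, None
    for i in range(0, image.shape[0] - win + 1, step):
        for j in range(0, image.shape[1] - win + 1, step):
            s = score_fn(image[i:i + win, j:j + win])
            if s > best:
                best, best_box = s, (i, j, win, win)
    return best_box, best

image = np.zeros((12, 12))
image[4:8, 6:10] = 1.0                     # a bright 4x4 "object"
box, score = detect(image, 4, lambda w: w.sum())
```

Scale is handled as the slide says: keep the window size fixed and run the same loop over a spatial pyramid of resized images.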
The “revolution” of deep learning in 2012
(Krizhevsky, Sutskever, Hinton, 2012)
Take with a grain of salt
(And ResNets, GANs, RNNs, LSTMs, Transformers, etc.)
Convolutional nets, early 1990s (LeCun et al., 1998)
(And one should not forget Pomerleau in the 1980s.)
A common architecture for image classification

          | Bags of words       | HOG                 | Spatial pyramids
Filtering | SIFT at keypoints   | dense gradients     | dense SIFT
Coding    | vector quantization | vector quantization | sparse coding
Pooling   | whole image, mean   | coarse grid, mean   | pyramid, max

(Sivic & Zisserman, 2003; Dalal & Triggs, 2005; Lazebnik et al., 2006; Boureau et al., 2010)
A common architecture for image classification

          | Bags of words       | HOG                 | CNNs
Filtering | SIFT at keypoints   | dense gradients     | convolutions
Coding    | vector quantization | vector quantization | nonlinearities
Pooling   | whole image, mean   | coarse grid, mean   | multi-scale pooling

(Sivic & Zisserman, 2003; Dalal & Triggs, 2005; Lazebnik et al., 2006; Boureau et al., 2010)
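One filtering → coding → pooling stage, the shared skeleton of both columns, can be sketched in CNN form: a convolution, a ReLU nonlinearity, and 2×2 max pooling. The kernel and input are arbitrary placeholders:

```python
import numpy as np

def conv2d_valid(img, k):
    # naive "valid" 2-D convolution (really cross-correlation, as in CNNs)
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * k).sum()
    return out

def stage(img, kernel):
    filtered = conv2d_valid(img, kernel)   # filtering
    coded = np.maximum(filtered, 0.0)      # coding: ReLU nonlinearity
    h, w = coded.shape                     # pooling: 2x2 max
    pooled = coded[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
    return pooled

rng = np.random.default_rng(5)
out = stage(rng.normal(size=(9, 9)), np.ones((2, 2)) / 4)
```

Swap the convolution for SIFT, the ReLU for vector quantization, and the max pool for a mean over the grid, and the same three-step skeleton describes the classical pipelines in the table.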
Deep learning: image categorization as representation learning

Image “space” (ℝⁿ) → Layer 1 (θ₁) → Layer 2 (θ₂) → … → Layer n (θₙ) → linear head → label
The stacked layers map the image into a feature (Hilbert) space: the learned representation. A linear head then makes the final prediction.
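This picture can be sketched in a few lines: a stack of nonlinear layers computes the representation, and a linear head maps it to class scores. Weights here are random placeholders, not trained, and biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(6)
dims = [64, 32, 16, 8]                     # input width and three layer widths
Ws = [rng.normal(0, 0.1, (a, b)) for a, b in zip(dims[:-1], dims[1:])]

def representation(x):
    # each layer: linear map followed by a ReLU nonlinearity
    for W in Ws:
        x = np.maximum(x @ W, 0.0)
    return x                               # the learned representation (8-D)

head = rng.normal(0, 0.1, (8, 3))          # linear head: 3 class scores
x = rng.normal(size=64)
scores = representation(x) @ head
```

In the shallow pipelines earlier, the feature map was hand-designed and only the linear classifier was learned; here the parameters θ of every layer are trained end to end.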