1 of 69

Lecture 22

Deployable CV

6.8300/1 Advances in Computer Vision

Spring 2024

Sara Beery, Kaiming He, Vincent Sitzmann, Mina Konaković Luković

2 of 69

Deployable Computer Vision

  • Computer vision systems - what’s important?
  • Human-in-the-loop CV
    • Evaluation and calibration
    • Selective prediction
    • Active learning
    • In-context “learning”
  • Efficiency
    • Distillation
    • Compression

3 of 69

Impactful CV systems are:

  • Useful (not perfect!)
  • Accessible
  • Collaborative
  • Well-communicated

4 of 69

Let’s look at this with a model that has positive impact in ecology

Efficient Pipeline for Camera Trap Image Review, Beery, et al., DMAIC @ KDD

gif by Aaron Greenville

5 of 69

6 of 69

Useful: Used to process data for hundreds of conservation organizations globally

7 of 69

“In our application, MegaDetector detected human and animal images with 99% and 82% precision, and 95% and 92% recall respectively, at a confidence threshold of 90%. The overall time required to process the dataset was reduced by over 500%, and the manual processing component was reduced by 840%. The index of human detection events from MegaDetector matched the output from manual classification, with a mean 0.45% difference in estimated human detections across site-weeks.”

8 of 69

Accessible: Hosted via an open-source API and integrated into existing widely-used data tools

Cloud (Camelot, Zamba Cloud)

It’s complicated (Zooniverse, eMammal)

9 of 69

Collaborative: Clear path for feedback, iterative improvements

10 of 69

Well-communicated: Define risks, known failure modes, and best practices for validation and use

  • Remove salient & static false positives
  • Use humans to catch errors efficiently

11 of 69

Good performance on realistic benchmarks ≠ impact

[Diagram: people sit at the center of a loop connecting data collection, data infrastructure, the model, decision support, verification and correction, and model retraining, all supported by software tooling]

12 of 69

Bringing humans into the “loop”

Diagrams courtesy of industry PR: SalesForce, HumanLoop, Humans in the Loop

13 of 69

How to incorporate human input?

  • Evaluation and calibration

14 of 69

Curating evaluation sets and selecting metrics

15 of 69

Diagnosing bias and errors

Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation, Vendrow et al., 2023

16 of 69

Model calibration

https://arxiv.org/abs/1706.04599
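Modern networks tend to be overconfident; the Guo et al. paper linked above proposes temperature scaling, a single scalar T fit on held-out validation data. A minimal numpy sketch (illustrative, not the paper's code; the grid search stands in for their NLL optimization):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; T > 1 softens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    p = softmax(logits, T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature minimizing validation NLL (simple grid search)."""
    return min(grid, key=lambda T: nll(logits, labels, T))
```

A fitted T > 1 softens overconfident outputs without changing the arg-max, so accuracy is untouched while the confidence scores become more trustworthy.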

17 of 69

Why calibrate?

A continuous-score occupancy model that incorporates uncertain machine learning output from autonomous biodiversity surveys, Rhinehardt et al., Methods in Ecology and Evolution, 2022

18 of 69

No model is perfect: Selecting an operating point

[Plot: precision vs. recall, both axes from 0 to 1, with candidate operating points marked at 0.9 and 0.6]
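Selecting an operating point means choosing the score threshold that trades precision against recall for your application. A small numpy sketch (an illustration, not code from the lecture):

```python
import numpy as np

def precision_recall_at(scores, labels, thresh):
    """Precision and recall for binary labels when we predict positive
    whenever score >= thresh."""
    scores = np.asarray(scores)
    labels = np.asarray(labels).astype(bool)
    pred = scores >= thresh
    tp = np.sum(pred & labels)                 # true positives
    precision = tp / max(pred.sum(), 1)        # of what we flagged, how much is right
    recall = tp / max(labels.sum(), 1)         # of what exists, how much we found
    return precision, recall
```

Sweeping `thresh` over [0, 1] traces out the curve above; a wildlife survey that cannot afford to miss animals might pick the threshold reaching its target recall and accept the resulting precision.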

19 of 69

How to incorporate human input?

  • Evaluation and calibration
  • Selective prediction

20 of 69

[Figure: bar chart of per-class scores from 0 to 1 for dolphin, cat, grizzly bear, angel fish, chameleon, iguana, elephant, clown fish; the prediction is the top-scoring class, “clown fish”]

21 of 69

Selective Prediction

[Figure: the same per-class score chart with a confidence threshold at 0.7; when no class clears it, the model outputs “I don’t know”]

Selective prediction adds an abstain option: instead of forcing a decision on every input, it takes model confidence into account.

In practice, a human then reviews the images on which the model abstains.
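The abstain rule is just a threshold on the top class probability. A numpy sketch (illustrative; the 0.7 threshold matches the slide):

```python
import numpy as np

def predict_or_abstain(probs, classes, thresh=0.7):
    """Return the arg-max class if its probability clears `thresh`,
    otherwise abstain so a human can review the image."""
    probs = np.asarray(probs)
    i = int(np.argmax(probs))
    if probs[i] >= thresh:
        return classes[i]
    return "I don't know"  # routed to a human reviewer
```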

22 of 69

Accuracy vs human effort in selective prediction

  • A low threshold means the model is trusted more: less human effort is needed to process all the data, but more errors slip through
  • A high threshold means the model is trusted less: humans ID more of the data, but quality is easier to guarantee
  • Threshold selection is an active area of research; calibrated models make it easier

[Plot: counts of correct images and human-labeled images vs. the selective prediction threshold from 0 to 1, compared against baseline accuracy]
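The tradeoff described above can be quantified by sweeping the threshold and measuring coverage (the fraction of images the model keeps) against accuracy on the kept subset; 1 − coverage is the share sent to humans. A hedged numpy sketch:

```python
import numpy as np

def coverage_accuracy_curve(probs, labels, thresholds):
    """For each threshold, report (threshold, coverage, accuracy-on-kept).
    probs: (n, c) class probabilities; labels: (n,) ground-truth classes."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    out = []
    for t in thresholds:
        keep = conf >= t                       # images the model handles itself
        cov = keep.mean()
        acc = (pred[keep] == labels[keep]).mean() if keep.any() else 1.0
        out.append((t, cov, acc))
    return out
```

Raising the threshold typically raises accuracy on the kept images while shrinking coverage, which is exactly the human-effort curve on the slide.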

23 of 69

How to incorporate human input?

  • Evaluation and calibration
  • Selective prediction
  • Active learning

24 of 69

Active learning

Automatically choose which data a human should label next, optimizing performance while minimizing human effort

Human-in-the-loop machine learning: a state of the art, Mosqueira-Rey et al., Artificial Intelligence Review 2022

Human-in-the-loop machine learning, Munro, Manning Publishing 2020

Sampling criteria:

  • Random
  • Uncertainty (Exploit)
  • Diversity (Explore)
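The first two sampling criteria fit in a few lines; diversity sampling additionally needs embeddings to cluster, so it is omitted here. An illustrative numpy sketch, not code from the cited references:

```python
import numpy as np

def select_batch(probs, k, strategy="uncertainty", rng=None):
    """Choose k unlabeled examples to send to human annotators.
    probs: (n, c) predicted class probabilities for the unlabeled pool."""
    probs = np.asarray(probs)
    n = len(probs)
    if strategy == "random":
        rng = rng or np.random.default_rng()
        return rng.choice(n, size=k, replace=False)
    if strategy == "uncertainty":
        # Exploit: label the examples the model is least confident about.
        conf = probs.max(axis=1)
        return np.argsort(conf)[:k]
    raise ValueError(f"unknown strategy: {strategy}")
```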

25 of 69

Active Learning via Selective Prediction

Iterative human and automated identification of wildlife images, Miao et al., Nature Machine Intelligence 2021

26 of 69

Active Learning via Clustering

A deep active learning system for species identification and counting in camera trap images, Norouzzadeh, Morris, Beery, et al., Methods in Ecology and Evolution 2021

  • Uses the MegaDetector to crop detected animals
  • Clusters animals by visual similarity at new camera sites
  • Humans ID examples from each cluster (the active learning criterion)
  • Achieves the same accuracy with 99.5% fewer labels

27 of 69

Open Challenges

  • Long-tailed distributions
  • Fine-grained categories
  • High-confidence mistakes

28 of 69

Human input to foundation models

29 of 69

[Diagram: a Learner turns tons of data into foundation models (e.g. CLIP, GPT, AlexNet, SimCLR, BERT, DALL-E, VQGAN, StyleGAN, BigGAN, WaveNet); an Adaptor then uses/adapts those foundations, with little or no data, to solve new problems (Apps)]

30 of 69

New paradigm: “Deep Learning” with No Data

“Just ask”

(of course, this isn’t really machine learning anymore)

https://evjang.com/2021/10/23/generalization.html

31 of 69

GPT-3

[Brown et al., 2020]

https://arxiv.org/pdf/2005.14165.pdf

1. Learner: Generative Pretraining

[Diagram: “Colorless green ideas sleep ____” → language model → “furiously” (predict next characters)]

32 of 69

GPT-3

[Brown et al., 2020]

https://arxiv.org/pdf/2005.14165.pdf

2. Adaptor: Prompt engineering

[Diagram: [Review] + “The sentiment in this review is ____” → language model → “positive” (predict next characters)]

33 of 69

New capabilities by just asking

34 of 69

New capabilities by just asking

35 of 69

Few-shot learning ability

36 of 69

Few-shot learning ability: extrapolation

37 of 69

GPT-3

[Brown et al., 2020]

https://arxiv.org/pdf/2005.14165.pdf

38 of 69

Prompting

39 of 69

Hand-crafting good prompts is tricky…

Slide credit: Hyojin Bahng

40 of 69

Soft Prompts (Prompt Tuning)

  • Optimizes a small continuous task-specific vector while having language model parameters frozen
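Mechanically, prompt tuning prepends a small learned matrix to the frozen model's input embeddings and backpropagates only into that matrix. A shape-only numpy sketch (all sizes are made up; a real setup would use the LM's embedding table and an optimizer):

```python
import numpy as np

# Hypothetical dimensions for illustration only.
d_model, n_prompt, seq_len = 16, 4, 10

prompt = np.random.randn(n_prompt, d_model) * 0.02   # the ONLY trainable params
token_embeds = np.random.randn(seq_len, d_model)     # frozen LM embeddings

def with_soft_prompt(prompt, token_embeds):
    """Prepend the learned continuous prompt to the input embeddings;
    the language model itself stays frozen."""
    return np.concatenate([prompt, token_embeds], axis=0)

x = with_soft_prompt(prompt, token_embeds)
assert x.shape == (n_prompt + seq_len, d_model)
```

Because only `prompt` receives gradients, each downstream task costs `n_prompt * d_model` extra parameters instead of a full fine-tuned copy of the model.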

Prefix-tuning: Optimizing continuous prompts for generation. ACL 2021.

The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021.

Slide credit: Hyojin Bahng

41 of 69

Chain-of-thought prompting

[Wei, Wang, Schuurmans et al. 2022]

42 of 69

Chain-of-thought prompting

[Wei, Wang, Schuurmans et al. 2022]

43 of 69

CLIP

[Radford et al., 2021]

https://arxiv.org/pdf/2103.00020.pdf

1. Learner: Contrastive Pretraining

44 of 69

CLIP

[Radford et al., 2021]

https://arxiv.org/pdf/2103.00020.pdf

2. Adaptor: Linear classifier on top of image encodings

45 of 69

Where pre-training fails

Pre-training can help with extrapolation, but does not address failures that stem from dataset biases

Ask Your Distribution Shift if Pre-Training is Right for You, B. Cohen-Wang, J. Vendrow, A. Madry, 2024

46 of 69

CLIP

[Radford et al., 2021]

https://arxiv.org/pdf/2103.00020.pdf

2. Adaptor: Just ask
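“Just ask” for CLIP means comparing the image embedding against text embeddings of prompts like “a photo of a {class}” and picking the most cosine-similar class. A numpy sketch with stand-in vectors (a real system would call CLIP's image and text encoders):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names):
    """CLIP-style zero-shot classification: return the class whose prompt
    embedding is most cosine-similar to the image embedding."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = norm(np.asarray(text_embs)) @ norm(np.asarray(image_emb))
    return class_names[int(np.argmax(sims))]
```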

47 of 69

Where CLIP zero-shot fails

“A California condor tagged with a green 26” → no relevant examples in top 50 retrievals

“A bird displaying allofeeding behavior” → some relevant results, with low precision

Slide courtesy of Edward Vendrow

48 of 69

Segment Anything Model

https://ai.meta.com/blog/segment-anything-foundation-model-image-segmentation/

49 of 69

Segment Anything Model

https://ai.meta.com/blog/segment-anything-foundation-model-image-segmentation/

https://learnopencv.com/segment-anything/

50 of 69

Segment Anything Model

https://ai.meta.com/blog/segment-anything-foundation-model-image-segmentation/

51 of 69

Efficiency: Making models smaller

52 of 69

https://openai.com/blog/ai-and-compute/

53 of 69

Ahmed and Thompson (2022)

54 of 69

Climate cost

Making AI Less "Thirsty": Uncovering and Addressing the Secret Water Footprint of AI Models, Li et al., 2023

55 of 69

Not all users have equal resources

https://github.com/visipedia/caltech-fish-counting

56 of 69

Recall: training a classifier with cross-entropy

[Figure: a large “Teacher” network and a small “Student” network, each trained with one-hot targets (score 1.0 on the true class, 0 elsewhere) over the classes elephant, monkey, cat, giraffe, dog, tiger, duck, seal]
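As a refresher, the loss on this slide in numpy (a sketch, not lecture code):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Standard classification loss: -log p(true class), averaged over the
    batch. The target is a 'hard' one-hot label: all mass on one class."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
```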

57 of 69

Knowledge Distillation

[Figure: the Teacher's full soft output distribution over the eight classes becomes the training target for the Student, replacing the one-hot label]

[Hinton, Vinyals, Dean, 2015]

58 of 69

Knowledge Distillation

  • The “label” you get for each input is more informative than just the ground truth class
  • The “label” (soft target) says not just “this is a dog” but “this is a dog that looks a bit like a cat and very different from a bear” (so maybe it’s a small dog).

[Hinton, Vinyals, Dean, 2015]

The Hinton et al. paper shows lower train accuracy but higher test accuracy on speech recognition with soft targets.
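The soft-target loss from Hinton et al. in numpy (a sketch; in practice it is combined with the ordinary cross-entropy on the hard label via a weighted sum):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the teacher's softened distribution and the
    student's. T > 1 exposes the teacher's knowledge about non-target
    classes ("a dog that looks a bit like a cat"). The T*T factor keeps
    gradient magnitudes comparable to the hard-label term."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1)) * T * T
```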

59 of 69

Cross-Modal Distillation

[Gupta, Hoffman, Malik, 2015]

[Figure: a Teacher network on one modality supervises a Student network on another]

Useful if you have labeled data for one domain but not the other.

60 of 69

Cross-Modal Distillation: SoundNet

[Aytar*, Vondrick*, Torralba, 2016]

61 of 69

Distilling an ensemble

[Figure: several Teacher networks distilled into a single Student]

  • Ensembling model outputs creates more accurate estimates.
  • Why? Each model is imperfect in different random ways. Errors cancel out, truth is shared.
  • Distill ensemble into a single model for fast inference.
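Forming the student's soft targets from an ensemble is just an average of the members' probabilities (an illustrative sketch):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_soft_targets(logit_list):
    """Average the ensemble members' probability outputs to form the soft
    targets a single student is trained to match; independent errors tend
    to cancel in the average while shared (true) structure survives."""
    return np.mean([softmax(l) for l in logit_list], axis=0)
```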

62 of 69


63 of 69

Contrastive Representation Distillation

[Figure: the Teacher's and Student's embeddings of the same input are trained to agree]

[Tian, Krishnan, Isola, 2020]

  • Contrastive learning supervises embeddings to be invariant to some viewing transformation.
  • The “viewing transformation” can be the teacher’s view (big net) vs. the student’s view (small net).
  • The student then outputs embeddings that match the teacher’s.
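This objective can be sketched as an InfoNCE-style loss where the positive pair is the (student, teacher) embedding of the same image and other images act as negatives; this is a simplification of CRD's actual estimator:

```python
import numpy as np

def contrastive_distill_loss(student_emb, teacher_emb, tau=0.1):
    """For each image i, pull the student's embedding toward the teacher's
    embedding of the SAME image (the diagonal) and away from the teacher's
    embeddings of other images in the batch."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    s = norm(np.asarray(student_emb))
    t = norm(np.asarray(teacher_emb))
    logits = s @ t.T / tau                        # (n, n) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    return -np.mean(np.log(np.diag(p) + 1e-12))   # positives on the diagonal
```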

64 of 69

Contrastive Representation Distillation

[Tian, Krishnan, Isola, 2020]

[Figure: correlation coefficients between class output logits, student's minus teacher's, for KD vs. CRD (CIFAR-100: each plot is 100 classes by 100 classes)]

65 of 69

Compression vs. Distillation

Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning, Frantar et al., 2023

  • Both give us smaller models with similar accuracy to large ones
  • Distillation uses SGD to train a smaller model to match characteristics of a larger one
  • Compression uses pruning or quantization to reduce the computational cost of a large model
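Two minimal numpy sketches of the compression side, far simpler than Optimal Brain Compression (which also corrects the surviving weights):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (ties at the
    cutoff may prune slightly more than requested)."""
    w = np.asarray(weights, dtype=float)
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    cutoff = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= cutoff] = 0.0
    return out

def quantize(weights, n_bits=8):
    """Uniform symmetric quantization: round weights to a low-bit grid
    (assumes at least one nonzero weight)."""
    w = np.asarray(weights, dtype=float)
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale
```

Both transformations shrink compute/memory without any retraining, which is what distinguishes them from distillation's SGD-based approach.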

66 of 69

We almost never need a truly generalist model in practice. Specialization can enable much higher compression rates at the same accuracy.

Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression, Kuznedelev et al., 2023

67 of 69

Bringing it full circle:

Human input on the edge

Figure developed while brainstorming with Pietro Perona

68 of 69

Accessibility of CV solutions

  • Requires knowledge
  • Requires resources
  • Requires sustained maintenance

69 of 69

Deployable Computer Vision

  • Computer vision systems - what’s important?
  • Human-in-the-loop CV
    • Evaluation and calibration
    • Selective prediction
    • Active learning
    • In-context “learning”
  • Efficiency
    • Distillation
    • Compression