Lecture 22
Deployable CV
6.8300/1 Advances in Computer Vision
Spring 2024
Sara Beery, Kaiming He, Vincent Sitzmann, Mina Konaković Luković
Deployable Computer Vision
Impactful CV systems are:
Let’s look at this with a model that has positive impact in ecology
Efficient Pipeline for Camera Trap Image Review, Beery, et al., DMAIC @ KDD
gif by Aaron Greenville
Useful: Used to process data for hundreds of conservation organizations globally
“In our application, MegaDetector detected human and animal images with 99% and 82% precision, and 95% and 92% recall respectively, at a confidence threshold of 90%. The overall time required to process the dataset was reduced by over 500%, and the manual processing component was reduced by 840%. The index of human detection events from MegaDetector matched the output from manual classification, with a mean 0.45% difference in estimated human detections across site-weeks.”
Accessible: Hosted via an open-source API and integrated into existing widely-used data tools
Cloud (Camelot, Zamba Cloud)
It’s complicated (Zooniverse, eMammal)
Collaborative: Clear path for feedback, iterative improvements
Well-communicated: Define risks, known failure modes, and best practices for validation and use
Remove salient & static false positives
Use humans to catch errors efficiently
Good performance on realistic benchmarks ≠ impact
People
Model
Model Retraining
Decision Support
Verification and Correction
Data Infrastructure
Data Collection
Software Tooling
Bringing humans into the “loop”
Diagrams courtesy of industry PR:
Salesforce
HumanLoop
Humans in the Loop
How to incorporate human input?
Curating evaluation sets and selecting metrics
Diagnosing bias and errors
Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation, Vendrow et al., 2023
Model calibration
https://arxiv.org/abs/1706.04599
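The standard recipe from Guo et al. (temperature scaling) is a one-parameter fix: divide the logits by a learned temperature T before the softmax. A minimal sketch, with invented logit values for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by T before normalizing; T > 1 softens the
    # distribution, pulling overconfident probabilities toward uniform.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]                  # hypothetical class logits
p_raw = softmax(logits)                   # overconfident top probability
p_cal = softmax(logits, temperature=2.0)  # softened after calibration
```

In practice T is fit on a held-out validation set; here it is just set by hand to show the softening effect.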
Why calibrate?
A continuous-score occupancy model that incorporates uncertain machine learning output from autonomous biodiversity surveys, Rhinehart et al., Methods in Ecology and Evolution, 2022
No model is perfect: Selecting an operating point
[Precision-recall curve, both axes 0 to 1, with an example operating point at precision 0.9, recall 0.6]
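Selecting an operating point amounts to sweeping a confidence threshold and reading off precision and recall at each setting. A toy sketch (the scores and labels below are invented):

```python
def precision_recall(scores, labels, threshold):
    # Predict positive whenever score >= threshold.
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy detector scores: raising the threshold trades recall for precision.
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1,    1,   0,   1,   0,   0]
```

Sweeping `threshold` over (0, 1) traces out the precision-recall curve; the deployment context decides which point on it to use.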
How to incorporate human input?
[Bar chart of prediction scores (0 to 1) over classes: dolphin, cat, grizzly bear, angel fish, chameleon, iguana, elephant, clown fish, …; the top-scoring prediction is “clown fish”]
Selective Prediction
[Same class-score bar chart, now with a confidence threshold of 0.7; predictions below the threshold return “I don’t know”]
Selective prediction adds an abstain option: it does not force a decision, but instead takes model confidence into account.
In practice, a human then reviews the images the model abstains on.
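The abstain rule is just a thresholded argmax. A minimal sketch, with made-up class names and probabilities:

```python
def selective_predict(probs, classes, threshold=0.7):
    # Output the top class only when its probability clears the
    # threshold; otherwise abstain so a human can review the image.
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] >= threshold:
        return classes[best]
    return "I don't know"

classes = ["clown fish", "angel fish", "dolphin"]
```

Raising `threshold` sends more images to humans but makes the automated predictions more reliable.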
Accuracy vs human effort in selective prediction
[Plot: number of images vs. selective prediction threshold (0 to 1), showing baseline accuracy, correct images, and human-labeled images]
How to incorporate human input?
Active learning
Learn to sample next data for human labeling automatically to optimize performance while minimizing human effort
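One classic sampling criterion is least-confidence: label the examples whose top-class probability is lowest. A toy sketch (the probability table is invented):

```python
def least_confident_batch(unlabeled_probs, k):
    # Rank unlabeled examples by the model's top-class probability
    # (lower = more uncertain) and return the k indices to label next.
    order = sorted(range(len(unlabeled_probs)),
                   key=lambda i: max(unlabeled_probs[i]))
    return order[:k]

probs = [[0.95, 0.05], [0.55, 0.45], [0.80, 0.20], [0.51, 0.49]]
batch = least_confident_batch(probs, 2)
```

Other criteria (margin, entropy, diversity/clustering) slot into the same loop by swapping the sort key.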
Human-in-the-loop machine learning: a state of the art, Mosqueira-Rey et al., Artificial Intelligence Review 2022
Human-in-the-loop machine learning, Munro, Manning Publishing 2020
Sampling criteria:
Active Learning via Selective Prediction
Iterative human and automated identification of wildlife images, Miao et al., Nature Machine Intelligence 2021
Active Learning via Clustering
A deep active learning system for species identification and counting in camera trap images, Norouzzadeh, Morris, Beery, et al., Methods in Ecology and Evolution 2021
Open Challenges
Human input to foundation models
Learn foundation models from tons of data:
Learner → CLIP, GPT, AlexNet, SimCLR, BERT, DALL-E, VQGAN, StyleGAN, BigGAN, WaveNet
Use/adapt foundations to solve new problems with little or no data:
Adaptor → App
New paradigm: “Deep Learning” with No Data
“Just ask”
(of course, this isn’t really machine learning anymore)
https://evjang.com/2021/10/23/generalization.html
GPT-3
[Brown et al., 2020]
https://arxiv.org/pdf/2005.14165.pdf
1. Learner: Generative Pretraining
“Colorless green ideas sleep ____”
Language model
“furiously”
(Predict next characters)
GPT-3
[Brown et al., 2020]
https://arxiv.org/pdf/2005.14165.pdf
1. Adaptor: Prompt engineering
[Review] + “The sentiment in this review is ____”
Language model
“positive”
(Predict next characters)
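Prompt engineering reduces to string construction: frame the task so that next-token prediction yields the label. A minimal sketch (the wording of the cue is just one illustrative choice):

```python
def sentiment_prompt(review):
    # Cast classification as next-token prediction: append a cue so the
    # language model naturally completes with "positive" or "negative".
    return review + "\nThe sentiment in this review is"

prompt = sentiment_prompt("The food was wonderful and the staff friendly.")
```

The completion the model produces for this string is then read off as the predicted label.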
New capabilities by just asking
New capabilities by just asking
Few-shot learning ability
Few-shot learning ability: extrapolation
GPT-3
[Brown et al., 2020]
https://arxiv.org/pdf/2005.14165.pdf
Prompting
Hand-crafting good prompts is tricky…
Slide credit: Hyojin Bahng
Soft Prompts (Prompt Tuning)
Prefix-tuning: Optimizing continuous prompts for generation. ACL 2021.
The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021.
Slide credit: Hyojin Bahng
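Mechanically, a soft prompt is a small matrix of learned vectors prepended to the input embeddings; only those vectors get gradients while the language model stays frozen. A minimal NumPy sketch with toy dimensions:

```python
import numpy as np

def prepend_soft_prompt(token_embeddings, soft_prompt):
    # Soft prompts are learned continuous vectors prepended to the input
    # embeddings; during tuning only these vectors are optimized.
    return np.concatenate([soft_prompt, token_embeddings], axis=0)

d = 8                            # toy embedding dimension
soft_prompt = np.zeros((4, d))   # 4 trainable prompt vectors (init here: zeros)
tokens = np.ones((10, d))        # embeddings of a 10-token input
x = prepend_soft_prompt(tokens, soft_prompt)
```

The frozen model then consumes the length-14 sequence exactly as it would any other input.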
Chain-of-thought prompting
[Wei, Wang, Schuurmans et al. 2022]
Chain-of-thought prompting
[Wei, Wang, Schuurmans et al. 2022]
CLIP
[Radford et al., 2021]
https://arxiv.org/pdf/2103.00020.pdf
1. Learner: Contrastive Pretraining
CLIP
[Radford et al., 2021]
https://arxiv.org/pdf/2103.00020.pdf
2. Adaptor: Linear classifier on top of image encodings
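A linear probe fits one linear layer on frozen image encodings. The sketch below uses least squares on one-hot targets as a lightweight stand-in for the logistic regression used in practice, with invented toy “encodings”:

```python
import numpy as np

def fit_linear_probe(features, labels, n_classes):
    # Least-squares linear classifier on frozen image encodings.
    targets = np.eye(n_classes)[labels]           # one-hot targets
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights

# Toy, linearly separable "encodings" for two classes
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
W = fit_linear_probe(feats, labels, 2)
preds = np.argmax(feats @ W, axis=1)
```

Only `W` is trained; the encoder that produced `feats` never sees a gradient.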
Where pre-training fails
Pre-training can help with extrapolation, but it does not address failures that stem from dataset biases
Ask Your Distribution Shift if Pre-Training is Right for You, B. Cohen-Wang, J. Vendrow, A. Madry, 2024
CLIP
[Radford et al., 2021]
https://arxiv.org/pdf/2103.00020.pdf
2. Adaptor: Just ask
“A California condor tagged with a green 26”
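“Just asking” CLIP means embedding the image and a text prompt per class, then picking the class with the highest cosine similarity. A sketch with invented 4-d stand-ins for the encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names):
    # Normalize both sides and take the class whose text-prompt
    # embedding has the highest cosine similarity to the image.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return class_names[int(np.argmax(txt @ img))]

# Invented embeddings standing in for CLIP's image and text encoders
image = np.array([0.9, 0.1, 0.0, 0.1])
prompts = np.array([[1.0, 0.0, 0.0, 0.0],   # e.g. "a photo of a condor"
                    [0.0, 1.0, 0.0, 0.0]])  # e.g. "a photo of a gull"
label = zero_shot_classify(image, prompts, ["condor", "gull"])
```

No task-specific training is involved; all the work is in how the prompts are phrased.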
Where CLIP zero-shot fails
“A bird displaying allofeeding behavior”
No relevant examples in top 50 retrievals
Some relevant results, with low precision
Slide courtesy of Edward Vendrow
Segment Anything Model
https://ai.meta.com/blog/segment-anything-foundation-model-image-segmentation/
Segment Anything Model
https://ai.meta.com/blog/segment-anything-foundation-model-image-segmentation/
https://learnopencv.com/segment-anything/
Segment Anything Model
https://ai.meta.com/blog/segment-anything-foundation-model-image-segmentation/
Efficiency: Making models smaller
https://openai.com/blog/ai-and-compute/
Ahmed and Thompson (2022)
Climate cost
Making AI Less "Thirsty": Uncovering and Addressing the Secret Water Footprint of AI Models, Li et al., 2023
Not all users have equal resources
https://github.com/visipedia/caltech-fish-counting
Recall: training a classifier with cross-entropy
[Bar charts over classes (elephant, monkey, cat, giraffe, dog, tiger, duck, seal), values 0 to 1.0: a one-hot target distribution alongside the model’s predicted distribution]
“Teacher”
“Student”
Knowledge Distillation
Teacher → Student
[Soft class-probability distributions over (elephant, monkey, cat, giraffe, dog, tiger, duck, seal), values 0 to 1.0; the teacher’s softened outputs serve as the student’s training targets]
[Hinton, Vinyals, Dean, 2015]
Knowledge Distillation
[Hinton, Vinyals, Dean, 2015]
Hinton et al. report lower train accuracy but higher test accuracy on speech recognition when training with soft targets
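The core of knowledge distillation is a cross-entropy between temperature-softened teacher and student distributions. A minimal sketch (toy logits, and the full recipe also mixes in the hard-label loss):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax (toy logits, so no max-subtraction needed).
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # Cross-entropy between the softened teacher and student
    # distributions; T > 1 exposes the teacher's "dark knowledge"
    # about the relative probabilities of wrong classes.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
```

The loss is minimized when the student’s softened distribution matches the teacher’s.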
Cross-Modal Distillation
Teacher
[Gupta, Hoffman, Malik, 2015]
Student
Useful if you have labeled data for one domain but not the other.
Cross-Modal Distillation: SoundNet
[Aytar*, Vondrick*, Torralba, 2016]
Distilling an ensemble
Teachers
Student
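One simple way to distill an ensemble is to average the teachers’ predicted distributions into a single soft target for the student. A sketch with invented probabilities:

```python
def ensemble_soft_target(teacher_probs):
    # Average the teachers' predicted distributions to form one
    # soft target distribution for the student to match.
    n_teachers = len(teacher_probs)
    n_classes = len(teacher_probs[0])
    return [sum(p[i] for p in teacher_probs) / n_teachers
            for i in range(n_classes)]

teachers = [[0.7, 0.2, 0.1],   # each row: one teacher's class probabilities
            [0.6, 0.3, 0.1],
            [0.8, 0.1, 0.1]]
target = ensemble_soft_target(teachers)
```

The student is then trained against `target` with the usual distillation loss, collapsing several teachers into one compact model.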
Contrastive Representation Distillation
Teacher
Student
[Tian, Krishnan, Isola, 2020]
Contrastive Representation Distillation
Teacher
Student
[Tian, Krishnan, Isola, 2020]
[Correlation coefficients between class output logits, student’s minus teacher’s (CIFAR-100: plots are 100 classes by 100 classes), comparing KD and CRD]
Compression vs. Distillation
Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning, Frantar et al., 2023
We almost never need a truly generalist model in practice. Specialization can enable much higher rates of compression with the same accuracy.
Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression, Kuznedelev et al., 2023
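To make the compression idea concrete, here is a minimal sketch of symmetric per-tensor int8 weight quantization, one of the basic ingredients behind post-training compression (the methods cited above are considerably more sophisticated):

```python
def quantize_int8(weights):
    # Symmetric per-tensor int8 quantization: store weights as integers
    # in [-127, 127] plus one float scale, ~4x smaller than float32.
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid div-by-zero
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for inference.
    return [qi * scale for qi in q]

q, scale = quantize_int8([0.5, -1.27, 0.0])
recovered = dequantize(q, scale)
```

The rounding error per weight is bounded by half the scale, which is why narrower weight ranges (e.g. after specialization) quantize more accurately.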
Bringing it full circle:
Human input on the edge
Figure developed while brainstorming with Pietro Perona
Accessibility of CV solutions
Deployable Computer Vision