The Need for Language in Vision
Arka Sadhu
22nd Jan 2020
About Me
Associate (Ground) Language in Images and Videos
What this Talk is (not) About
~ 20 mins Talk
~ 10 mins Q&A
Interactive: Feel Free to Chip in
Let us start with Language!
What is Language?
Natural Language
Is Natural Language easy?
Maybe
Is Natural Language easy for computers?
Source: http://nlpprogress.com/
Quiz: Are we using Language in Vision Here?
Let’s Find Out!
What is (Computer) Vision?
Machine Vision
Human Vision
What is (Computer) Vision at a High Level?
(Spoiler Alert: No formal consensus and is Highly Debated!)
Siamese Cat
Golden Retriever
Image Classification!
Siamese Cat
Golden Retriever
Truck
Car
Object Detection!
Instance Segmentation!
And Many More!
See http://cvpr2020.thecvf.com/submission/main-conference/author-guidelines#call-for-papers for a more comprehensive list of topics.
Revisiting “Understanding”
What would a “Human” say?
A cycle marathon
People on bicycle on a cloudy day
A man on bicycle high-fiving a kid.
Image Captioning!
Revisiting “Understanding”
Yellow Taxi
Referring Expressions
A woman sings on stage as a man plays an instrument in unison with her vocals.
Source: http://bryanplummer.com/Flickr30kEntities/browse.php#
(IMAGE 4821054372)
Phrase Grounding
Disclaimer: My Work (under review)!
Video Object Grounding
Revisiting “Understanding”
How is the weather?
What is going on here?
What are the people at the back doing?
Visual Question Answering!
What are these two people doing?
Common-sense Reasoning!
Are there an equal number of large things and metal spheres?
How many objects are either small cylinders or red things?
Who is the maker of the coffee machine?
Keurig
Textual Visual Question Answering
Revisiting “Understanding”
Source: https://askforalfred.com/
Instruction Following
Applications in the Wild!
HealthCare
EHR + Image-Scans
Check Diagnosis
Instructional Video
Sources:
End-to-End Learning of Visual Representations from Uncurated Instructional Videos (https://arxiv.org/pdf/1912.06430.pdf)
A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions
Image Retrieval with Language
Movie-Script To Video
And Many More!
Any Questions?
Feel free to email asadhu@usc.edu with [USC-Wise] in the subject line.