1 of 51

The Need for Language in Vision

Arka Sadhu

22nd Jan 2020

2 of 51

About Me

  • Second-year PhD student in the CS department

  • My Research Work:
    • Associate (Ground) Language in Images and Videos

3 of 51

What this Talk is (not) About

  • Tasks, not Algorithms

  • (Very) High-Level overview of “current” state of research

  • Broad but not Exhaustive

4 of 51

~ 20 mins Talk

~ 10 mins Q&A

Interactive: Feel Free to Chip in

5 of 51

Let us start with Language!

What is Language?

6 of 51

Natural Language

7 of 51

Is Natural Language easy?

Maybe

8 of 51

Is Natural Language easy for computers?

  • No! It is a beast in its own right!

9 of 51

Quiz: Are we using Language in Vision Here?

10 of 51

  • Do we need Natural Language in Computer Vision?
    • Spoiler Alert: YES!

  • Are there any applications?
    • Spoiler Alert: YES!

Let’s Find Out!

11 of 51

What is (Computer) Vision?

12 of 51

What is (Computer) Vision?

Machine Vision

Human Vision

13 of 51

What is (Computer) Vision at a High Level?

  • Want the Machine to Understand the Image(s) / Video(s)

  • What does it mean to “Understand”?

  • How do we quantify “Understanding”?

(Spoiler Alert: There is no formal consensus, and it is highly debated!)

14 of 51

15 of 51

Siamese Cat

Golden Retriever

Image Classification!

16 of 51

Image Classification!

Siamese Cat

Golden Retriever

Truck

Car

17 of 51

18 of 51

Object Detection!

Instance Segmentation!

19 of 51

And Many More!

  • Object Tracking
  • Face Recognition
  • Scene Parsing
  • 3D Reconstruction
  • Activity Classification
  • Media Forensics
  • Generation
  • Retrieval

20 of 51

Revisiting “Understanding”

What would a “Human” say?

21 of 51

Revisiting “Understanding”

What would a “Human” say?

A cycle marathon

People on bicycles on a cloudy day

A man on a bicycle high-fiving a kid.

Image Captioning!

22 of 51

Revisiting “Understanding”

  • Finding Objects is Necessary but not Sufficient
  • Use “Natural Language” to get a more “Holistic” idea
    • Image Captioning
    • Video Description
  • How about “Fine-Grained” understanding?
    • Ground The Caption!

23 of 51

Yellow Taxi

24 of 51

Referring Expressions

Yellow Taxi

25 of 51

A woman sings on stage as a man plays an instrument in unison with her vocals.

26 of 51

A woman sings on stage as a man plays an instrument in unison with her vocals.

Phrase Grounding

27 of 51

Disclaimer: My Work (under review)!

28 of 51

Video Object Grounding

30 of 51

Revisiting “Understanding”

  • Finding Objects is Necessary but not Sufficient
  • Use “Natural Language” to get a more “Holistic” idea
    • Image Captioning
    • Video Description
  • How about “Fine-Grained” understanding?
    • Ground The Caption!
    • Can we probe into what the model knows?

31 of 51

Revisiting “Understanding”

  • Finding Objects is Necessary but not Sufficient
  • Use “Natural Language” to get a more “Holistic” idea
    • Image Captioning
    • Video Description
  • How about “Fine-Grained” understanding?
    • Ground The Caption!
    • Can we probe into what the model knows?
      • Ask Questions!

32 of 51

33 of 51

How is the weather?

What is going on here?

What are the people at the back doing?

Visual Question Answering!

34 of 51

What are these two people doing?

Common-sense Reasoning!

35 of 51

36 of 51

Are there an equal number of large things and metal spheres?

How many objects are either small cylinders or red things?
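The two questions above are compositional: each one chains simple filters and counts over the objects in a scene. A minimal sketch of that idea follows, assuming a hand-written symbolic scene. The `Obj` fields, the `SCENE` contents, and the helper names are all hypothetical and chosen only for illustration; a real system would have to recover these attributes from the image itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    size: str      # "small" or "large"
    color: str     # e.g. "red", "blue"
    material: str  # "metal" or "rubber"
    shape: str     # "cube", "sphere", or "cylinder"


# Hypothetical scene, invented for illustration (not taken from the slides).
SCENE = [
    Obj("large", "red", "metal", "sphere"),
    Obj("small", "blue", "rubber", "cylinder"),
    Obj("small", "red", "metal", "cylinder"),
    Obj("large", "green", "rubber", "cube"),
    Obj("small", "gray", "metal", "sphere"),
]


def count(scene, **attrs):
    """Count the objects whose attributes match every given key/value pair."""
    return sum(all(getattr(o, k) == v for k, v in attrs.items()) for o in scene)


def equal_large_and_metal_spheres(scene):
    """Are there an equal number of large things and metal spheres?"""
    return count(scene, size="large") == count(scene, material="metal", shape="sphere")


def count_small_cylinders_or_red(scene):
    """How many objects are either small cylinders or red things?"""
    return sum(
        (o.size == "small" and o.shape == "cylinder") or o.color == "red"
        for o in scene
    )


if __name__ == "__main__":
    print(equal_large_and_metal_spheres(SCENE))
    print(count_small_cylinders_or_red(SCENE))
```

Answering from a symbolic scene is the easy part. The hard part, and the reason such questions are a useful probe, is that the model has to parse the question and perceive the attributes from pixels.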

37 of 51

Who is the maker of the coffee machine?

38 of 51

Who is the maker of the coffee machine?

Keurig

Textual Visual Question Answering

39 of 51

41 of 51

Revisiting “Understanding”

  • Finding Objects is Necessary but not Sufficient
  • Use “Natural Language” to get a more “Holistic” idea
    • Image Captioning
    • Video Description
  • How about “Fine-Grained” understanding?
    • Ground The Caption!
    • Can we probe into what the model knows?
      • Ask Questions!
    • Interact with the Environment!

42 of 51

Instruction Following

43 of 51

Applications in the Wild!

44 of 51

HealthCare

EHR + Image-Scans

Check Diagnosis

45 of 51

Instructional Video

Source:

  • End-to-End Learning of Visual Representations from Uncurated Instructional Videos (https://arxiv.org/pdf/1912.06430.pdf)
  • A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions (https://arxiv.org/pdf/1910.02930.pdf)

46 of 51

Image Retrieval with Language

47 of 51

Movie-Script To Video

50 of 51

Applications in the Wild!

  • HealthCare
  • Instructional Video Alignment
  • Retrieval
  • Movie Script Understanding

And Many More!

  • Self-driving cars
  • News aggregation
  • Memes!
  • Personal Robots
  • Fashion Analysis
  • Games!

51 of 51

Any Questions?

Feel free to email asadhu@usc.edu with [USC-Wise] in the subject line.