1 of 35

Generation and Comprehension of Unambiguous Object Descriptions

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy

The dataset in this work is available at https://github.com/mjhucla/Google_Refexp_toolbox


2 of 35

Unambiguous Object Descriptions

A man.

A man in blue.

A man in a blue sweater.

A man who is touching his head.

3 of 35

The man who is touching his head

Unambiguous Object Descriptions

(Referring Expressions [Kazemzadeh et al. 2014]):

Uniquely describes the relevant object or region within its context.

4 of 35

Two men are sitting next to each other.

It is hard to evaluate image captions.

Two men are sitting next to each other in front of a desk watching something from a laptop.

5 of 35

“The man who is touching his head.”

Our Model

Speaker:
  • Input: whole frame image & object bounding box
  • Output: referring expression

Listener:
  • Input: whole frame image & region proposals, plus a referring expression
  • Output: chosen region (shown in red)

6 of 35

Adapting an LSTM image captioning model

The Baseline Model

The LSTM is unrolled over the description: it reads “<bos> the girl in pink” and is trained to predict “the girl in pink <eos>”, conditioned on the image feature.

Training Objective: maximize the likelihood p(S|R, I) of the description given the region and image

Two CNNs extract features, one for the object region and one for the whole frame; the box corners (x_tl, y_tl) and (x_br, y_br) provide location features.

Feature Extractor (2005-dimensional feature)
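As a sketch, the feature could be assembled as follows (assumptions: 1000-d CNN outputs for the region and for the whole image, plus 5 normalized location/size values, matching the 1000 + 1000 + 5 = 2005 breakdown):

```python
import numpy as np

def region_feature(region_feat, image_feat, box, W, H):
    """Concatenate the region CNN feature, the whole-image CNN feature,
    and 5 normalized bounding-box location/size features."""
    x_tl, y_tl, x_br, y_br = box
    loc = np.array([
        x_tl / W, y_tl / H,                        # top-left corner, normalized
        x_br / W, y_br / H,                        # bottom-right corner, normalized
        (x_br - x_tl) * (y_br - y_tl) / (W * H),   # relative box area
    ])
    return np.concatenate([region_feat, image_feat, loc])
```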

VGGNet [Simonyan et al. 2015]

[Mao et al. 2015]

[Vinyals et al. 2015]

[Donahue et al. 2015]
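Written out, the baseline's maximum-likelihood training objective is (reconstructed from the slide's description; θ denotes the LSTM/CNN parameters):

```latex
J(\theta) = -\sum_{n} \log p(S_n \mid R_n, I_n; \theta)
```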

7 of 35

The Speaker-Listener Pipeline

Speaker module:

  • Decode with beam search
  • Hard to evaluate by itself?
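A minimal sketch of beam-search decoding; `score_next` stands in for the model's per-token log-probability and is a hypothetical caller-supplied function, not the paper's code:

```python
import heapq

def beam_search(score_next, vocab, beam=3, max_len=5, eos="<eos>"):
    """Keep the `beam` highest-scoring partial sequences at each step;
    sequences that already emitted <eos> are carried over unchanged."""
    beams = [(0.0, ["<bos>"])]
    for _ in range(max_len):
        cand = []
        for logp, seq in beams:
            if seq[-1] == eos:
                cand.append((logp, seq))     # finished hypothesis
                continue
            for tok in vocab:
                cand.append((logp + score_next(seq, tok), seq + [tok]))
        beams = heapq.nlargest(beam, cand, key=lambda c: c[0])
    return beams[0][1]                        # best-scoring sequence
```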

8 of 35

The Speaker-Listener Pipeline

Speaker module:

  • Decode with beam search
  • Hard to evaluate by itself?

Listener module:

“A girl in pink”

The listener scores p(S|R_n, I) for each Multibox proposal [Erhan et al. 2014] (e.g. 0.89, 0.32, 0.05, …); the highest-scoring proposal (0.89) is chosen.

9 of 35

The Speaker-Listener Pipeline

Speaker module:

  • Decode with beam search
  • Hard to evaluate by itself?

Listener module:

“A girl in pink”

  • Easy to objectively evaluate.
    • Precision@1
  • Evaluate the whole system in an end-to-end way
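Precision@1 is simple to sketch: the top-ranked box counts as a hit when its IoU with the ground-truth box exceeds a threshold (0.5 is the usual choice):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x_tl, y_tl, x_br, y_br)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def precision_at_1(top_boxes, gt_boxes, thresh=0.5):
    """Fraction of cases whose top-ranked box matches the ground truth."""
    hits = sum(iou(p, g) > thresh for p, g in zip(top_boxes, gt_boxes))
    return hits / len(gt_boxes)
```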

10 of 35


Speaker needs to consider the listener


The baseline model wants to maximize p(S|R, I)

  • Good at generating generic descriptions
  • Not discriminative

A better, more discriminative model:

  • Consider all possible regions
  • Maximize the gap between p(S|R, I) and p(S|R’, I)

R’: regions that serve as negatives for R

“A smiling girl”?
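One way to sketch “maximize the gap” is a max-margin loss on log-probabilities (the `margin` and `lam` values here are illustrative, not the paper's hyperparameters):

```python
import numpy as np

def max_margin_loss(logp_pos, logp_negs, margin=0.1, lam=1.0):
    """Penalize the model when some negative region R' scores within
    `margin` of the true region R under log p(S|., I)."""
    gap = margin - logp_pos + np.max(logp_negs)   # hardest negative
    return -logp_pos + lam * max(0.0, gap)
```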

11 of 35

Our Full Model

The same LSTM and feature extractor are run on the true region R and on negative regions R’1, …, R’m; the loss compares p(S|R, I) against each p(S|R’, I).

Training Objective:
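A reconstruction of the max-margin training objective the slide describes (λ weighs the margin term, M is the margin, and R’_1, …, R’_m are the negative regions; the notation is assumed, not verbatim from the slide):

```latex
J'(\theta) = -\sum_{n} \Big[ \log p(S_n \mid R_n, I_n; \theta)
  - \lambda \max\!\big(0,\; M - \log p(S_n \mid R_n, I_n; \theta)
  + \max_{j} \log p(S_n \mid R'_j, I_n; \theta)\big) \Big]
```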

12 of 35

Dataset

The black and yellow backpack sitting on top of a suitcase.

A yellow and black back pack sitting on top of a blue suitcase.

An apple desktop computer.

The white IMac computer that is also turned on.

A boy brushing his hair while looking at his reflection.

A young male child in pajamas shaking around a hairbrush in the mirror.

Zebra looking towards the camera.

A zebra third from the left.

26,711 images (from MS COCO [Lin et al. 2015]), 54,822 objects, and 104,560 referring expressions.

13 of 35

Listener Results

14 of 35

A dark brown horse with a white stripe wearing a black studded harness.

15 of 35

A dark brown horse with a white stripe wearing a black studded harness.

16 of 35

A white horse carrying a man.

17 of 35

A white horse carrying a man.

18 of 35

A dark horse carrying a woman

19 of 35

A dark horse carrying a woman

20 of 35

A woman on the dark horse.

21 of 35

A woman on the dark horse.

22 of 35

A cat laying on the left.

A black cat laying on the right.

Our Full Model

A cat laying on a bed.

A black and white cat.

Baseline

Speaker Results

23 of 35

Experiments: 4-point improvement in Precision@1

                         Baseline   Our full model   Improvement
Listener Task              40.6          44.6            4.0
End-to-End
(Speaker & Listener)       48.5          51.3            2.8
Human Evaluation
(Speaker Task)             15.9          20.4            4.5

24 of 35

Related work

  • Image Captioning
    • This year (“caption” in the title): [Hendricks et al. 2016], [Pan et al. 2016], [Yu et al. 2016], [You et al. 2016]
  • Region Descriptions
    • Concurrent work: [Hu et al. 2016], [Johnson et al. 2016]
  • Visual Question Answering (VQA)
    • This year (“question” in the title): [Yang et al. 2016], [Noh et al. 2016], [Shih et al. 2016], [Wu et al. 2016], [Tapaswi et al. 2016], [Kafle et al. 2016], [Zhu et al. 2016], [Zhang et al. 2016]
    • Related to but different from our task

Which fruit is under the apple?

Give me the fruit under the apple.

VQA

Referring Expression

25 of 35

Take Home Message

To be a better communicator, you need to be a better listener!

26 of 35

Backup Slides

27 of 35

Comparison to a Concurrent Dataset

Bottom left apple.

Bottom left.

The bottom apple.

Green apple on the bottom-left corner, under the lemon and on the left of the orange.

A green apple on the left of a orange.

Goalie.

Right dude.

Orange shirt.

The goalie wearing an orange and black shirt.

A male soccer goalkeeper wearing an orange jersey in front of a player ready to score.

UNC-Ref-COCO (UNC-Ref)

Google Refexp (G-Ref)

  • UNC-Ref (extended from [Kazemzadeh et al. 2014]): more concise, less flowery
  • Our G-Ref: richer

28 of 35

Dataset Construction

Examples of good descriptions:

"A man wearing a blue sweater whose hand is touching his head."

"A man in a blue sweater."

Examples of bad descriptions:

"A guy." -- this description is too short.

"A man in blue." -- not precise enough; there are two men in blue.

Two phases

  1. Speaker phase

Ask Turkers to describe objects in an image.

29 of 35

Dataset Construction

Two phases

  • Speaker phase
  • Listener phase: verify the descriptions from the first phase.

30 of 35

Semi-Supervised Training

  1. Train Model G on the fully supervised set Dbb+txt (images with boxes and descriptions, e.g. “The girl in pink.”).
  2. Use Model G to generate descriptions (e.g. “The woman in blue.”) for the box-only set Dbb, giving Dbb+auto.
  3. Train a comprehension model C for verification; keep only the generated descriptions it verifies, giving Dfiltered.
  4. Re-train on Dbb+txt together with Dfiltered.
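The loop can be sketched end to end; `train`, `generate`, and `verify` are hypothetical callables standing in for Models G and C:

```python
def semi_supervised_train(d_bb_txt, d_bb, train, generate, verify):
    """Bootstrap: caption the box-only set with Model G, keep only the
    descriptions a comprehension model C verifies, then re-train."""
    model_g = train(d_bb_txt)                        # Model G on Dbb+txt
    d_bb_auto = [(box, generate(model_g, box)) for box in d_bb]
    model_c = train(d_bb_txt)                        # Model C for verification
    d_filtered = [(b, s) for b, s in d_bb_auto if verify(model_c, b, s)]
    return train(d_bb_txt + d_filtered)              # re-train on the union
```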

31 of 35

Semi-Supervised Training

Proposals              GroundTruth                Multibox Post-classification
Descriptions         Generated   GroundTruth     Generated   GroundTruth

G-Ref
  Dbb+txt              0.791       0.561           0.489       0.417
  Dbb+txt ∪ Dbb        0.793       0.577           0.489       0.424

UNC-Ref
  Dbb+txt              0.826       0.655           0.588       0.483
  Dbb+txt ∪ Dbb        0.811       0.660           0.591       0.486

32 of 35

Full model variants

Proposals                 GT               Multibox
Descriptions            GEN     GT        GEN     GT

ML (baseline)          0.803   0.654     0.564   0.478
MMI-SoftMax            0.848   0.689     0.591   0.502
MMI-MM-easy-GT-neg     0.851   0.677     0.590   0.492
MMI-MM-hard-GT-neg     0.857   0.699     0.591   0.503
MMI-MM-multibox-neg    0.848   0.695     0.604   0.511

33 of 35

Addressing the two tasks

The generation task:

  • Compute argmax_S p(S|R, I), where S is a sentence, R a region, I an image
  • Model p(S|R, I) using Recurrent Neural Networks (RNNs)

The comprehension task:

  • Compute argmax_{R∈C} p(R|S, I), where C denotes a set of region proposals

  • By Bayes’ rule, p(R|S, I) ∝ p(S|R, I) p(R|I); the denominator p(S|I) is constant. Assuming a uniform prior for p(R|I):

Select R* = argmax_{R∈C} p(S|R, I)
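Under the uniform-prior assumption, comprehension reduces to a one-line argmax over proposals; `logp_s_given_r` is a hypothetical scorer exposing the speaker model:

```python
def comprehend(sentence, proposals, logp_s_given_r):
    """Pick the proposal R* maximizing p(S|R, I) over the proposal set."""
    return max(proposals, key=lambda r: logp_s_given_r(sentence, r))
```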

34 of 35

A cat laying on the left.

A black cat laying on the right.

A cat laying on a bed.

A black and white cat.

A brown horse in the right.

A white horse.

A brown horse.

A white horse.

A baseball catcher.

A baseball player swing a bat.

The umpire in the black shirt.

The catcher.

The baseball player swing a bat.

An umpire.

A zebra standing behind another zebra.

A zebra in front of another zebra.

A zebra in the middle.

A zebra in front of another zebra.

Upper: Our Full Model

Lower: Baseline

35 of 35

Guy with dark short hair in a white shirt.

A woman with curly hair playing Wii.

The controller in the woman's hand.

*The woman in white.

A dark brown horse with a white stripe wearing a black studded harness.

A white horse carrying a man.

A woman on the dark horse.

A dark horse carrying a woman.

The giraffe behind the zebra that is looking up.

The giraffe with its back to the camera.

The giraffe on the right.

A zebra.