Generation and Comprehension of Unambiguous Object Descriptions
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy
The dataset in this work is available at https://github.com/mjhucla/Google_Refexp_toolbox
Unambiguous Object Descriptions
(Figure: two men at a desk; candidate descriptions are marked ✔ if they pick out one man unambiguously, ❌ if not.)
- A man.
- A man in blue.
- A man in a blue sweater.
- A man who is touching his head.
"The man who is touching his head" uniquely identifies him.
Unambiguous Object Descriptions
(Referring Expressions [Kazemzadeh et al. 2014]):
Uniquely describes the relevant object or region within its context.
It is hard to evaluate image captions:
- "Two men are sitting next to each other."
- "Two men are sitting next to each other in front of a desk watching something from a laptop."
A referring expression is easy to check against the image:
“The man who is touching his head.”
Our Model
- Speaker. Input: whole frame image + object bounding box. Output: referring expression.
- Listener. Input: whole frame image & region proposals, plus a referring expression. Output: chosen region (shown in red).
Adapting an LSTM image captioning model ( … )
The Baseline Model
(Figure: a CNN Feature Extractor feeds an unrolled LSTM; inputs "<bos> the girl in pink", targets "the girl in pink <eos>".)
Training Objective: maximize log p(S|R, I), i.e. minimize J(θ) = −∑n log p(Sn|Rn, In; θ)
Feature Extractor (2005-dimensional feature):
region CNN features (VGGNet [Simonyan et al. 2015]) ⊕ whole-frame CNN features ⊕ bounding-box geometry (xtl, ytl), (xbr, ybr)
LSTM captioning model as in [Mao et al. 2015], [Vinyals et al. 2015], [Donahue et al. 2015]
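As an illustration of the feature extractor above, a minimal sketch follows. The split into 1000-d region features, 1000-d whole-image features, and 5 normalized box-geometry values is an assumption consistent with the 2005-d total, and `extract_features` is a hypothetical name:

```python
import numpy as np

def extract_features(region_feat, img_feat, box, img_w, img_h):
    """Concatenate region CNN features, whole-image CNN features,
    and normalized bounding-box geometry into one vector."""
    xtl, ytl, xbr, ybr = box
    # Normalized corners plus relative box area (assumed geometry encoding).
    geom = np.array([
        xtl / img_w, ytl / img_h,
        xbr / img_w, ybr / img_h,
        (xbr - xtl) * (ybr - ytl) / (img_w * img_h),
    ])
    return np.concatenate([region_feat, img_feat, geom])

# Toy example with random stand-ins for the two VGGNet feature vectors.
rng = np.random.default_rng(0)
feat = extract_features(rng.normal(size=1000), rng.normal(size=1000),
                        box=(10, 20, 110, 220), img_w=640, img_h=480)
print(feat.shape)  # (2005,)
```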
The Speaker-Listener Pipeline
- Speaker module: computes p(S|Rn, I) and generates the description, e.g. "A girl in pink".
- Listener module: scores each Multibox proposal [Erhan et al. 2014] against the description (e.g. 0.89, 0.32, 0.05, ...) and picks the highest-scoring region (0.89).
Speaker needs to consider the listener
The baseline model only maximizes p(S|R, I).
A better, more discriminative model also considers R’: regions that serve as negatives for R.
“A smiling girl”?
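One standard way to write this discriminative objective is a softmax over competing regions (reconstructed notation; C(In) denotes the proposal set for image In):

```latex
J'(\theta) = -\sum_{n=1}^{N} \log
  \frac{p(S_n \mid R_n, I_n, \theta)}
       {\sum_{R' \in \mathcal{C}(I_n)} p(S_n \mid R', I_n, \theta)}
```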
Our Full Model
(Figure: the same unrolled LSTM as the baseline, with inputs "<bos> the girl in pink" and targets "the girl in pink <eos>", but the loss is now computed from the Feature Extractor outputs of the true region R and of negative regions R’1, ..., R’m.)
Training Objective: maximize p(S|R, I) while keeping it above p(S|R’, I) for every negative region R’.
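A minimal sketch of one such discriminative loss, a hinge-style max-margin variant over negative regions (the function name and margin value are assumptions, not the paper's exact hyperparameters):

```python
def mmi_max_margin_loss(logp_true, logp_negs, margin=0.1):
    """Hinge loss: the description's log-likelihood under the true region
    must beat every negative region's log-likelihood by at least `margin`."""
    return sum(max(0.0, margin - (logp_true - lp)) for lp in logp_negs)

# True region explains the description far better than the negatives: zero loss.
print(mmi_max_margin_loss(-2.0, [-5.0, -7.0]))  # 0.0
# A negative region is almost as good: positive loss pushes the scores apart.
print(mmi_max_margin_loss(-2.0, [-2.05]))       # ~0.05
```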
Dataset
The black and yellow backpack sitting on top of a suitcase.
A yellow and black back pack sitting on top of a blue suitcase.
An apple desktop computer.
The white IMac computer that is also turned on.
A boy brushing his hair while looking at his reflection.
A young male child in pajamas shaking around a hairbrush in the mirror.
Zebra looking towards the camera.
A zebra third from the left.
26,711 images (from MS COCO [Lin et al. 2015]), 54,822 objects and 104,560 referring expressions.
Available at https://github.com/mjhucla/Google_Refexp_toolbox
Listener Results
A dark brown horse with a white stripe wearing a black studded harness.
A white horse carrying a man.
A dark horse carrying a woman.
A woman on the dark horse.
Our Full Model: "A cat laying on the left." / "A black cat laying on the right."
Baseline: "A cat laying on a bed." / "A black and white cat."
Speaker Results
Experiments: 4% improvement for precision@1

| | Baseline | Our full model | Improvement |
| Listener Task | 40.6 | 44.6 | 4.0 |
| End-to-End (Speaker & Listener) | 48.5 | 51.3 | 2.8 |
| Human Evaluation (Speaker Task) | 15.9 | 20.4 | 4.5 |
Related work
VQA: "Which fruit is under the apple?"
Referring Expression: "Give me the fruit under the apple."
Take Home Message
To be a better communicator, need to be a better listener!!
Backup Slides
Paper available at http://arxiv.org/abs/1511.02283
Comparisons to Concurrent Dataset
UNC-Ref-COCO (UNC-Ref):
- "Bottom left apple." / "Bottom left." / "The bottom apple."
- "Goalie." / "Right dude." / "Orange shirt."
Google Refexp (G-Ref):
- "Green apple on the bottom-left corner, under the lemon and on the left of the orange." / "A green apple on the left of a orange."
- "The goalie wearing an orange and black shirt." / "A male soccer goalkeeper wearing an orange jersey in front of a player ready to score."
Dataset Construction
Examples of good descriptions:
"A man wearing a blue sweater whose hand is touching his head."
"A man in a blue sweater."
Examples of bad descriptions:
"A guy." -- this description is too short.
"A man in blue." -- not precise enough, there are 2 men in blue.
Two phases:
1. Description: ask Turkers to describe objects in an image.
2. Verification: other Turkers locate the described object given only the description; descriptions that fail are discarded.
Semi-Supervised Training
(Figure: bootstrap pipeline.)
- Dbb+txt: fully supervised images with bounding boxes and descriptions (e.g. "The girl in pink.").
- Dbb: images with only bounding boxes.
- Train Model G on Dbb+txt, then generate descriptions for Dbb, giving Dbb+auto (e.g. "The woman in blue.").
- Verification: Model C filters the generated descriptions, yielding Dfiltered.
- Re-train Model G on Dbb+txt ∪ Dfiltered.
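The filtering step can be sketched as follows (a toy stand-in: `comprehension_score` plays the role of Model C, and the 0.5 threshold is an assumption):

```python
def filter_generated(descriptions, comprehension_score, threshold=0.5):
    """Keep a generated description only when the comprehension model
    can locate its object with confidence above the threshold."""
    return [d for d in descriptions if comprehension_score(d) > threshold]

# Stand-in comprehension scores for three generated descriptions.
scores = {"the woman in blue": 0.9, "a person": 0.2, "the girl in pink": 0.8}
kept = filter_generated(list(scores), scores.get)
print(kept)  # ['the woman in blue', 'the girl in pink']
```

Here `kept` would play the role of Dfiltered used to re-train Model G.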
Semi-Supervised Training
| Proposals | GroundTruth | | Multibox Post-classification | |
| Descriptions | Generated | GroundTruth | Generated | GroundTruth |
| G-Ref | | | | |
| Dbb+txt | 0.791 | 0.561 | 0.489 | 0.417 |
| Dbb+txt ∪ Dbb | 0.793 | 0.577 | 0.489 | 0.424 |
| UNC-Ref | | | | |
| Dbb+txt | 0.826 | 0.655 | 0.588 | 0.483 |
| Dbb+txt ∪ Dbb | 0.811 | 0.660 | 0.591 | 0.486 |
Full model variants
| Proposals | GT | | Multibox | |
| Descriptions | GEN | GT | GEN | GT |
ML (baseline) | 0.803 | 0.654 | 0.564 | 0.478 |
MMI-SoftMax | 0.848 | 0.689 | 0.591 | 0.502 |
MMI-MM-easy-GT-neg | 0.851 | 0.677 | 0.590 | 0.492 |
MMI-MM-hard-GT-neg | 0.857 | 0.699 | 0.591 | 0.503 |
MMI-MM-multibox-neg | 0.848 | 0.695 | 0.604 | 0.511 |
Addressing the two tasks
The generation task: generate S* = argmaxS p(S|R, I).
The comprehension task: select R* = argmaxR∈C p(S|R, I). This equals argmaxR p(R|S, I), because the prior p(R|I) is constant over the proposal set C.
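The comprehension rule amounts to one argmax over proposals; a minimal sketch, where the log-likelihood function is a stand-in for the trained speaker:

```python
def comprehend(expression, regions, log_likelihood):
    """Return the proposal region that maximizes p(S|R, I) for the expression."""
    return max(regions, key=lambda r: log_likelihood(expression, r))

# Stand-in values of log p(S|R, I) for three proposal boxes.
table = {"left box": -4.0, "middle box": -1.5, "right box": -3.2}
best = comprehend("the girl in pink", list(table), lambda s, r: table[r])
print(best)  # middle box
```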
Upper: Our Full Model; Lower: Baseline (full-model descriptions listed first for each image).
- Cats. Full model: "A cat laying on the left." / "A black cat laying on the right." Baseline: "A cat laying on a bed." / "A black and white cat."
- Horses. Full model: "A brown horse in the right." / "A white horse." Baseline: "A brown horse." / "A white horse."
- Baseball. Full model: "A baseball catcher." / "A baseball player swing a bat." / "The umpire in the black shirt." Baseline: "The catcher." / "The baseball player swing a bat." / "An umpire."
- Zebras. Full model: "A zebra standing behind another zebra." / "A zebra in front of another zebra." Baseline: "A zebra in the middle." / "A zebra in front of another zebra."
Guy with dark short hair in a white shirt.
A woman with curly hair playing Wii.
The controller in the woman's hand.
*The woman in white.
A dark brown horse with a white stripe wearing a black studded harness.
A white horse carrying a man.
A woman on the dark horse.
A dark horse carrying a woman.
The giraffe behind the zebra that is looking up.
The giraffe with its back to the camera.
The giraffe on the right.
A zebra.