Generation and Comprehension
of Unambiguous Object Descriptions

Junhua Mao1, Jonathan Huang2, Alexander Toshev2, Oana Camburu3, Alan Yuille1,4, and Kevin Murphy2

The dataset in this work is available at https://github.com/mjhucla/Google_Refexp_toolbox

Hi everyone, I am Junhua Mao from UCLA. Today I am going to present our work on how to generate and comprehend unambiguous object descriptions. This is joint work with UCLA, ….

Unambiguous Object Descriptions

A man.

A man in blue.

A man in blue sweater.

A man who is touching his head.

Let’s first discuss what an unambiguous object description is. We will do a very simple test here. Here is the image. I will give you some descriptions, and you should tell me whether you know which object in the image I am talking about. The first description is “a man”. The answer is “no” because there are two men in the image. The same thing happens for “a man in blue”, even though we offer more detail. Then what about “a man in blue sweater”? This time we know it should be the man on the right, since the other man is wearing a blue shirt. The same holds for “a man who is touching his head.” The last two descriptions are unambiguous because we know clearly that it is the man on the right we are talking about.

A listener who sees the image and the description will know which object you are talking about.

You guys just nod if you know what object I am talking about.

The man who is touching his head

Unambiguous Object Descriptions

(Referring Expressions [Kazemzadeh et al. 2014]):

Uniquely describes the relevant object or region within its context.

More formally, an unambiguous object description, also known as a referring expression, is a description that uniquely describes the relevant object or region within its context, such that a listener can recover the location of the original object. This task has a number of potential applications that use natural language interfaces, such as controlling a robot, e.g. you can ask your robot to fetch you the white cup with tea.

Two men are sitting next to each other.

It is hard to evaluate image captions.

Two men are sitting next to each other in front of a desk watching something from a laptop.

This task is closely related to the task of image captioning, which provides a free-form description of an image.

This is a very exciting topic, and lots of papers have shown very impressive results over the last one or two years.

However, with so many valid ways to describe any given image, it is really hard to decide which description is better.

E.g. for the image on the left, we can generate a description of “...”, which is correct but lacks detail.

It can actually describe lots of images with quite different contents, as shown on the screen.

Then what about this one: …

It is more detailed but can still describe tons of images with different contents.

In fact, there are countless details in an image. The image captioning task itself does not define to what level we should add these details to the descriptions, because there is no listener involved in this task.

Our Model

[Figure: the Speaker takes the whole frame image and an object bounding box as input, and outputs a referring expression, e.g. “The man who is touching his head.” The Listener takes the whole frame image, region proposals, and the referring expression as input, and outputs the chosen region (shown in red).]

So image captioning is hard to evaluate. But our referring expressions task turns out to be much easier to evaluate.

It turns out that there are actually two subtasks. First there is a speaker task: the speaker must take an image and a region, and produce a description for that region; in this case, “the man who is touching his head”. By itself, this is almost like image captioning.

However, we have a second task, the listener task. The listener is given the description as well as the whole frame image, and he needs to be able to decode the speaker's expression and identify which region the speaker was trying to talk about. To simplify things a bit, we have our listener always select from a finite set of proposals.

In our paper, we propose models that jointly learn to do both of these tasks.

-------

Clearly, an effective model for our unambiguous object description task should address the two tasks jointly: the speaker task and the listener task. For the speaker task, the inputs are the whole frame image and an object bounding box. Our model will generate the referring expression of the object. For the listener task, the input will be the referring expression, the image and some region proposals. Our model will choose the original region that the referring expression describes.

Adapting an LSTM image captioning model

The Baseline Model

[Figure: the baseline model. A Feature Extractor (a VGGNet [Simonyan et al. 2015] CNN on the whole frame image, a CNN on the region crop, and the bounding box corners (x_tl, y_tl), (x_br, y_br), concatenated into a 2005-dimensional feature) feeds an LSTM that is unrolled over the expression “the girl in pink” from <bos> to <eos>. Training objective below.]

[Mao et al. 2015]

[Vinyals et al. 2015]

[Donahue et al. 2015]

We use an LSTM-CNN model as our baseline model. This baseline is similar to an image captioning model, except that we replace the image feature extractor with a region feature extractor. This new region feature extractor contains a CNN whole-frame-image feature extractor, a CNN image-region feature extractor, and a bbox coordinate and size feature extractor. We concatenate these three sub-features to get the final region representation. An LSTM model is trained to decode these region representations into sentences, and we use the negative log-likelihood as our training objective. Basically, it tries to maximize the probability of the groundtruth sentence given the groundtruth region and image.

[\frac{x_{tl}}{W}, \frac{y_{tl}}{H}, \frac{x_{br}}{W}, \frac{y_{br}}{H}, \frac{S_{bbox}}{S_{image}}]

J(\theta) = -\sum_{n=1}^N \log p(S_n | R_n, I_n,\theta)
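To make the wiring concrete, here is a minimal sketch of this region feature extractor. It is not the authors' released code: torchvision's VGG16 stands in for the VGGNet feature extractor, its 1000-d outputs are used for the whole image and the region crop, and the five normalized box values from the formula above are appended to give the 2005-d vector fed to the LSTM.

```python
# A hedged sketch of the baseline's region feature extractor (module and
# argument names are assumptions, not the paper's implementation).
import torch
import torch.nn as nn
from torchvision import models

class RegionFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # torchvision's VGG16 stands in for VGGNet; its 1000-d output is used
        # as the feature for both the whole image and the region crop.
        self.cnn = models.vgg16(weights=None)

    def forward(self, image, region_crop, bbox, image_size):
        # image, region_crop: (B, 3, 224, 224); bbox: (B, 4) as (x_tl, y_tl, x_br, y_br)
        # image_size: (B, 2) as (W, H)
        whole_feat = self.cnn(image)          # (B, 1000)
        region_feat = self.cnn(region_crop)   # (B, 1000)
        W, H = image_size[:, 0], image_size[:, 1]
        x_tl, y_tl, x_br, y_br = bbox.unbind(dim=1)
        # Normalized corners and relative box area, as in the formula above.
        bbox_feat = torch.stack(
            [x_tl / W, y_tl / H, x_br / W, y_br / H,
             (x_br - x_tl) * (y_br - y_tl) / (W * H)], dim=1)            # (B, 5)
        return torch.cat([whole_feat, region_feat, bbox_feat], dim=1)    # (B, 2005)
```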

The Speaker-Listener Pipeline

Speaker module:

  • Decode with beam search
  • Hard to evaluate by itself?

It is straightforward to use the baseline model to address the speaker task. We can use the trained LSTM model to decode the region features using beam search. Similar to image captioning, the speaker task itself is hard to evaluate. We will address this issue later.
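For readers who want to see what this decoding step looks like, below is a compact beam-search sketch. The `step` callable is an assumption standing in for one LSTM step (it maps a decoder state and the last token to a new state and per-token log-probabilities); it is an illustration, not the paper's code.

```python
# Generic beam search over a one-step decoder (a hedged illustration).
def beam_search(step, init_state, bos, eos, beam_size=3, max_len=20):
    # Each beam is (cumulative log-prob, token list, decoder state, finished?).
    beams = [(0.0, [bos], init_state, False)]
    for _ in range(max_len):
        candidates = []
        for logp, tokens, state, done in beams:
            if done:
                candidates.append((logp, tokens, state, True))
                continue
            new_state, token_logps = step(state, tokens[-1])  # dict: token -> log p
            for tok, lp in token_logps.items():
                candidates.append((logp + lp, tokens + [tok], new_state, tok == eos))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_size]
        if all(done for _, _, _, done in beams):
            break
    return beams[0][1]  # highest-scoring token sequence (starts with <bos>)
```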


Listener module:

“A girl in pink”

[Figure: Multibox proposals [Erhan et al. 2014], each scored by p(S|R_n, I): 0.89, 0.32, 0.05, ...; the highest-scoring box (0.89) is selected.]

For the listener task, the input is a sentence description, such as “a girl in pink”, and several region candidates extracted by a region proposal method such as Multibox. We then use the trained model to calculate the probability of the sentence given each of these regions and the whole frame image. E.g. we can first calculate the probability of the sentence given the red region, which is ...

We do the calculation for all the regions and just pick the one with the maximum probability.
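In code, the listener step reduces to scoring every proposal with the same trained speaker model and taking the argmax. In the sketch below, `score_fn` is a placeholder for a function that returns log p(S | R, I) from the trained LSTM-CNN model; it is an assumption, not part of the released toolbox.

```python
# A minimal sketch of the listener: rank proposals by the log-probability the
# speaker model assigns to the query sentence for each region.
def comprehend(score_fn, image, proposals, sentence_tokens):
    """score_fn(image, box, sentence_tokens) -> log p(S | R, I) for one box."""
    scores = [score_fn(image, box, sentence_tokens) for box in proposals]
    best = max(range(len(proposals)), key=lambda i: scores[i])
    return proposals[best], scores[best]
```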


  • Easy to objectively evaluate.
    • Precision@1
  • Evaluate the whole system in an end-to-end way

The listener task is very similar to the object detection task. The only difference is that for object detection, the input is an object class while in the listener task, the input is a sentence description. Therefore we can directly use the evaluation metrics of object detection, such as precision@1, to objectively evaluate our model.

We can also evaluate the whole speaker-listener system in an end-to-end way, by feeding the generated description from the speaker module to the listener module and then calculating the same precision@1 score. This gives an objective evaluation for the speaker module, and thus for the whole system.
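As a concrete illustration of the metric, here is a small precision@1 sketch. The IoU threshold of 0.5 follows common detection practice and should be read as an assumption rather than a detail taken from this talk.

```python
# precision@1: the fraction of test cases whose top-ranked box overlaps the
# ground-truth box with IoU above a threshold.
def iou(a, b):
    # boxes as (x_tl, y_tl, x_br, y_br)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_at_1(top_boxes, gt_boxes, thresh=0.5):
    hits = sum(iou(p, g) > thresh for p, g in zip(top_boxes, gt_boxes))
    return hits / len(gt_boxes)
```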

Speaker needs to consider the listener

The baseline model wants to maximize p(S|R, I)

  • Tends to generate generic descriptions
  • Not discriminative

A better, more discriminative model:

  • Consider all possible regions
  • maximize the gap between p(S|R, I) and p(S|R’, I)

R’: regions that serve as negatives for R

“A smiling girl”?

Let’s revisit our baseline model. This model only focuses on the groundtruth image region and tries to maximize the probability of the sentence given this region. So it is totally possible that this model will generate a description such as “a smiling girl”, <pause> which is a perfectly fine description for the region. But it is ambiguous for the listener since there is actually another smiling girl in the image.

To be a better communicator, our speaker model needs to consider the listener and think about how the listener might interpret the generated description. In practice, what this means is that the speaker needs to consider all the possible regions and try to maximize the gap between the probability of the sentence given the true region and the probability of the sentence given all other regions. In other words, we want the sentence to be really good for the groundtruth region and really bad for all the negative regions.

Our Full Model

[Figure: the full model. The same Feature Extractor and LSTM score the expression “the girl in pink” for the true region R and for negative regions R'1, ..., R'm; a loss compares p(S|R, I) against the p(S|R', I) terms. Training objective below.]

More specifically, in our full model, we feed the groundtruth region and other regions into the model, and calculate the probability of the sentence given each of these regions. We then use a softmax loss to maximize the probability for the groundtruth region and suppress the probabilities for all the negative regions.

This model needs to feed many regions into the network at the same time.

In practice, we find that we can simplify the training by sampling negative regions and using a max-margin loss. It achieves similar results to the softmax version, but makes the training much easier.

J'(\theta) = -\sum_{n=1}^N \log p(R_n | S_n, I_n, \theta) = -\sum_{n=1}^N \log \frac{p(S_n | R_n, I_n, \theta)}{\sum_{R' \in \mathcal{C}(I_n)} p(S_n | R', I_n, \theta)}

J''(\theta) = -\sum_{n=1}^N \Big\{ \log p(S_n | R_n, I_n, \theta) - \lambda \max\big(0,\, M - \log p(S_n | R_n, I_n, \theta) + \log p(S_n | R'_n, I_n, \theta)\big) \Big\}
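To make the two objectives above concrete, here is a hedged PyTorch-style sketch written against per-example log-probabilities. The function names, batching, and the use of several sampled negatives per example are assumptions for illustration; the formulation above uses a single sampled negative R'_n.

```python
# Sketch of the softmax (MMI) and max-margin training objectives above.
# logp_pos:  (B,)   log p(S_n | R_n, I_n) for the true region
# logp_negs: (B, M) log p(S_n | R', I_n) for sampled negative regions
import torch
import torch.nn.functional as F

def softmax_mmi_loss(logp_pos, logp_negs):
    # -log p(R_n | S_n, I_n), with the candidate set {R_n} plus the negatives.
    all_logp = torch.cat([logp_pos.unsqueeze(1), logp_negs], dim=1)  # (B, 1+M)
    return -(logp_pos - torch.logsumexp(all_logp, dim=1)).mean()

def max_margin_mmi_loss(logp_pos, logp_negs, margin=1.0, lam=1.0):
    # NLL term plus a hinge pushing the true region's log-prob above each
    # negative's by at least `margin`.
    hinge = F.relu(margin - logp_pos.unsqueeze(1) + logp_negs)       # (B, M)
    return (-logp_pos + lam * hinge.sum(dim=1)).mean()
```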

Dataset

The black and yellow backpack sitting on top of a suitcase.

A yellow and black back pack sitting on top of a blue suitcase.

An apple desktop computer.

The white IMac computer that is also turned on.

A boy brushing his hair while looking at his reflection.

A young male child in pajamas shaking around a hairbrush in the mirror.

Zebra looking towards the camera.

A zebra third from the left.

26,711 images (from MS COCO [Lin et al. 2015]), 54,822 objects and 104,560 referring expressions.

<PAUSE> To train our big deep neural network models, we construct a large-scale dataset based on the MS COCO dataset. To make sure that all the descriptions in this dataset are unambiguous, we adopt a two-stage annotation process. In the first stage, we ask annotators to play the role of the speaker and write unambiguous descriptions as best they can. In the second stage, we ask another set of annotators to play the role of the listener and verify these descriptions. We iterate these two stages until we get a clean dataset. To facilitate research in this area, we have released the dataset.

Considering the listener makes the dataset clean.

Listener Results

Let’s see how our model performs.

A dark brown horse with a white stripe wearing a black studded harness.


This is a complex sentence from our dataset, and our model makes a reasonable choice.

By the way, the blue dotted rectangles denote the top alternatives generated by our system.

A white horse carrying a man.


To fully diagnose our model, we also input some descriptions that do not appear in our dataset, e.g. the sentence shown above.

A dark horse carrying a woman




Now let us do something funny by changing the order of the words.

A woman on the dark horse.

Then what about this one? It looks very similar to the previous sentence, with only the word order changed, but its meaning is totally different. Our system still makes a good guess.

What is really remarkable is that our system knows that now it is the woman, not the horse, that should be identified.

A cat laying on the left.

A black cat laying on the right.

Our Full Model

A cat laying on a bed.

A black and white cat.

Baseline

Speaker Results


Experiments: 4% improvement for precision@1

                                  Baseline   Our full model   Improvement
Listener Task                     40.6       44.6             4.0
End-to-End (Speaker & Listener)   48.5       51.3             2.8
Human Evaluation (Speaker Task)   15.9       20.4             4.5

Let’s see how our model performs quantitatively. If we evaluate our model on the listener task, the baseline model gets a precision@1 of 40.6%, which means that in 40.6% of the cases the top-ranked region chosen by the model is actually the groundtruth region. Our full model gets 44.6%, an absolute 4% improvement over the baseline. If we test the whole system in an end-to-end way, where we feed the generated description from the speaker module to the listener module, our full model also gives a 2.8% improvement over the baseline. We also ran a human evaluation for the speaker task, where the human evaluators judge whether a generated description is better than or equal to the groundtruth description. In 20.4% of cases, the descriptions generated by our full model are considered better than or equal to the groundtruth description, outperforming the baseline by an absolute 4.5%.

In the same human evaluation, 15.9% of the sentences generated by the baseline model are considered to be better than, or equally good to, the groundtruth description.

Related work

  • Image Captioning
    • This year (“caption” in the title): [Hendricks et al. 2016], [Pan et al. 2016], [Yu et al. 2016], [You et al. 2016]
  • Region Descriptions
    • Concurrent work: [Hu et al. 2016], [Johnson et al. 2016]
  • Visual Question Answering (VQA)
    • This year (“question” in the title): [Yang et al. 2016], [Noh et al. 2016], [Shih et al. 2016], [Wu et al. 2016], [Tapaswi et al. 2016], [Kafle et al. 2016], [Zhu et al. 2016], [Zhang et al. 2016]
    • Related to but different from our task

Which fruit is under the apple?

Give me the fruit under the apple.

VQA

Referring Expression

There are quite a lot of exciting related topics in the vision and language field, and quite a lot of inspiring papers appearing at this conference. Our task and the VQA task are actually two ways in which we can communicate with machines. In VQA we ask questions to get information; in our task, on the other hand, we give the machine an instruction to understand and carry out. Because of the time limit, I will not go into the details. Please read our paper if you are interested.

Take Home Message

To be a better communicator, need to be a better listener!!

Please see the paper for more details… I just have one very simple point that I hope people take home.

<pause>

That to be a better communicator, you need to be a better listener.

<pause>

This is not only good general life advice, but is also true if you want to train an effective vision and language model. Thank you very much for listening. If you are interested, please scan the QR code to access our released dataset.

Backup Slides

The listener task is very similar to a detection task. Instead of finding regions of an object class, we try to find a region of a natural language description. Therefore, the objective evaluation metrics for object detection, such as precision@1, can be used for the listener module.


Comparisons to Concurrent Dataset

Bottom left apple.

Bottom left.

The bottom apple.

Green apple on the bottom-left corner, under the lemon and on the left of the orange.

A green apple on the left of a orange.

Goalie.

Right dude.

Orange shirt.

The goalie wearing an orange and black shirt.

A male soccer goalkeeper wearing an orange jersey in front of a player ready to score.

UNC-Ref-COCO (UNC-Ref)

Google Refexp (G-Ref)

  • UNC-Ref (extended from [Kazemzadeh et al. 2014]): More concise, less flowery
  • Our G-Ref: Richer

Dataset Construction

Examples of good descriptions:

"A man wearing a blue sweater whose hand is touching his head."

"A man in a blue sweater."

Examples of bad descriptions:

"A guy." -- this description is too short.

"A man in blue." -- not precise enough, there are 2 men in blue.

Two phases

  • Speaker phase

Ask Turkers to describe objects in an image.

Dataset Construction

Two phases

  • Speaker phase
  • Listener phase: Verify the descriptions from the first phase.


Semi-Supervised Training

[Figure: semi-supervised training pipeline. Model G is trained on the fully supervised set Dbb+txt (images with bounding boxes and descriptions, e.g. “The girl in pink.”). G then generates descriptions (e.g. “The woman in blue.”) for the bounding-box-only set Dbb, giving Dbb+auto; a verification model C filters these into Dfiltered, and the model is re-trained on Dbb+txt together with Dfiltered.]
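The figure above is essentially a bootstrapping loop; the sketch below spells it out in Python, with `train`, `generate`, and `verifies` passed in as callables because they are placeholders rather than functions from the released toolbox.

```python
# A hedged sketch of the semi-supervised pipeline depicted above.
def semi_supervised_training(train, generate, verifies, D_bb_txt, D_bb):
    # train(dataset) -> model; generate(model, image, box) -> description
    # verifies(model, image, box, description) -> bool (listener check)
    model_G = train(D_bb_txt)                       # speaker trained on labeled data
    model_C = train(D_bb_txt)                       # verification (listener) model
    # Describe the box-only images automatically: Dbb -> Dbb+auto.
    D_bb_auto = [(img, box, generate(model_G, img, box)) for img, box in D_bb]
    # Keep only descriptions the verifier can decode back to the right box.
    D_filtered = [ex for ex in D_bb_auto if verifies(model_C, *ex)]
    # Re-train on the labeled data plus the filtered auto-labeled data.
    return train(D_bb_txt + D_filtered)
```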

Semi-Supervised Training

Proposals:                 GroundTruth               Multibox (post-classification)
Descriptions:              Generated   GroundTruth   Generated   GroundTruth

G-Ref      Dbb+txt         0.791       0.561         0.489       0.417
           Dbb+txt ∪ Dbb   0.793       0.577         0.489       0.424
UNC-Ref    Dbb+txt         0.826       0.655         0.588       0.483
           Dbb+txt ∪ Dbb   0.811       0.660         0.591       0.486

Full model variants

Proposals:                 GT                 Multibox
Descriptions:              GEN       GT       GEN       GT

ML (baseline)              0.803     0.654    0.564     0.478
MMI-SoftMax                0.848     0.689    0.591     0.502
MMI-MM-easy-GT-neg         0.851     0.677    0.590     0.492
MMI-MM-hard-GT-neg         0.857     0.699    0.591     0.503
MMI-MM-multibox-neg        0.848     0.695    0.604     0.511

Addressing the two tasks

The generation task:

  • Compute argmax_S p(S|R, I), where S: sentence, R: region, I: image
  • Model p(S|R, I) using Recurrent Neural Networks (RNNs)

The comprehension task:

  • Compute argmax_{R ∈ C} p(R|S, I), where C denotes a set of region proposals

  • Assuming a uniform prior for p(R|I), the denominator below is constant, so select R* = argmax_{R ∈ C} p(S|R, I)


p(R|S,I) = \frac{p(S|R,I) p(R|I)}{\sum_{R' \in \mathcal{C}} p(S|R',I) p(R'|I)}

A cat laying on the left.

A black cat laying on the right.

A cat laying on a bed.

A black and white cat.

A brown horse in the right.

A white horse.

A brown horse.

A white horse.

A baseball catcher.

A baseball player swing a bat.

The umpire in the black shirt.

The catcher.

The baseball player swing a bat.

An umpire.

A zebra standing behind another zebra.

A zebra in front of another zebra.

A zebra in the middle.

A zebra in front of another zebra.

Upper: Our Full Model

Lower: Baseline

Guy with dark short hair in a white shirt.

A woman with curly hair playing Wii.

The controller in the woman's hand.

*The woman in white.

A dark brown horse with a white stripe wearing a black studded harness.

A white horse carrying a man.

A woman on the dark horse.

A dark horse carrying a woman.

The giraffe behind the zebra that is looking up.

The giraffe with its back to the camera.

The giraffe on the right.

A zebra.
