1 of 36

Text matters?!

Statistical analysis on COCO-Text dataset

2 of 36

Overview

  • Motivation
  • Experiment Design
    • Methodology: Ablation Study
    • Data Partition
  • Comprehensive Experiment Results
  • Case Study on False Correlation
  • Problems and Challenges
  • Q & A

3 of 36

Motivation

  • Intuitively, in-image text should tell us a lot about an image's semantics...
    • Civil (urban) settings rather than scenes in the wild
    • Street names, billboards, and stop signs highlight the salient content of an image
  • We want to know how, and how much, in-image text matters to image semantics, and in particular how captioning networks interpret it
  • Perhaps we can then use more sophisticated OCR techniques to improve captioning performance

4 of 36

Experiment Design (Methodology)

  • Apply ablation to selected parts of the image.
  • Compare the captions of ablated images with those of the originals.
  • Ablation methods:
    • Gaussian blur, black-out, median filter, background black-out, background median.
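As a rough sketch, the three region filters can be implemented with NumPy and SciPy; the function name, the (x, y, w, h) box format, and the single-channel input are our assumptions, not details from the slides.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter


def ablate(img, bbox, method):
    """Ablate the text region bbox = (x, y, w, h) of a grayscale image."""
    x, y, w, h = bbox
    out = img.copy()
    patch = out[y:y + h, x:x + w]
    if method == "blackout":
        out[y:y + h, x:x + w] = 0                            # erase the region
    elif method == "gaussian":
        out[y:y + h, x:x + w] = gaussian_filter(patch, sigma=3)
    elif method == "median":
        out[y:y + h, x:x + w] = median_filter(patch, size=7)
    else:
        raise ValueError(f"unknown method: {method}")
    return out
```

Background black-out/median would invert the mask: apply the operation everywhere except the annotated regions.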

5 of 36

Experiment Design (Data Partition)

  • Split the COCO dataset into the following groups:
    • Group A: images with no annotated text instances.
    • Group B: images with only irrelevant text instances.
    • Group C: images with some relevant text instances.
  • We call a text instance "relevant" if it appears in captions from the training set.
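The grouping rule can be sketched as follows; the data layout (per-image lists of text transcriptions and captions) and the substring relevance test are our simplifications.

```python
def is_relevant(text, captions):
    """Our simplified relevance test: the transcription occurs in a caption."""
    return any(text.lower() in cap.lower() for cap in captions)


def partition(images):
    """Assign each image id to Group A, B, or C.

    `images` maps image id -> (list of text transcriptions, list of captions).
    """
    groups = {"A": [], "B": [], "C": []}
    for img_id, (texts, captions) in images.items():
        if not texts:
            groups["A"].append(img_id)        # no annotated text
        elif any(is_relevant(t, captions) for t in texts):
            groups["C"].append(img_id)        # some relevant text
        else:
            groups["B"].append(img_id)        # only irrelevant text
    return groups
```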

Captions (the in-image text "hammer time" appears in the captions, so the instance is relevant):

  • a stop sign at an intersection, with "hammer time" printed on it
  • a stop sign with graffiti reading "hammer time."
  • street sign on post with graffiti near green shrubs.
  • a stop sign where someone wrote "hammer time."
  • a stop sign at a corner with the words hammer time below stop.

Captions (no in-image text appears in the captions):

  • a boy is swinging at a ball during a baseball game.
  • a boy that is hitting a ball with a baseball bat.
  • a person with a bat and helmet hits a ball.
  • a boy in a yellow and white uniform hitting a baseball
  • a young baseball player is hitting the ball.

6 of 36

Experiment Design (Group B)

  • Contains all images with only irrelevant text instances.
  • Ablation Methods:
    • Gaussian
    • Black-out
    • Median
  • Metrics:
    • Jaccard Index
    • Tree LSTM
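The Jaccard index here is the intersection-over-union of the two captions' word sets; the tokenization (lowercased whitespace split) is our assumption.

```python
def jaccard(caption_a, caption_b):
    """Jaccard index between the word sets of two captions."""
    a = set(caption_a.lower().split())
    b = set(caption_b.lower().split())
    if not a and not b:
        return 1.0          # two empty captions are identical
    return len(a & b) / len(a | b)
```

Identical captions score 1.0 and fully disjoint ones 0.0, which also explains why this metric's distribution is discrete for short captions.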

7 of 36

Experiment Design (Group C)

  • Contains images with relevant text instance(s)
  • Ablation Methods:
    • Gaussian
    • Black-out
    • Median
  • Metrics:
    • Jaccard Index
    • Tree LSTM

8 of 36

Experiment Design (Group B, C)

  • Hypotheses:
    • Group C (images with relevant text) should experience more caption change than Group B (images without relevant text).
    • Black-out should cause the most disruptive changes, while the median filter should cause the least.

9 of 36

Experiment Result (Group B)

  • Disruptive Degree: Blackout > Gaussian > Median
  • Most captions don't change

Ablation   Metric      Mean       Std. Dev.  Median     Max        Min
Gaussian   Tree LSTM   4.473455   0.730141   4.779640   4.879636   1.131421
           Jaccard     0.902979   0.227583   1.000000   1.000000   0.000000
Blackout   Tree LSTM   4.236897   0.896967   4.777700   4.881594   1.164003
           Jaccard     0.809745   0.299980   1.000000   1.000000   0.000000
Median     Tree LSTM   4.517174   0.686364   4.780784   4.878270   1.126073
           Jaccard     0.913922   0.216759   1.000000   1.000000   0.000000

10 of 36

Experiment Result (Group C)

  • Disruptive Degree: Blackout > Gaussian > Median
  • Most captions don't change

Ablation   Metric      Mean       Std. Dev.  Median     Max        Min
Gaussian   Tree LSTM   4.323064   0.810277   4.782078   4.873451   1.174542
           Jaccard     0.829445   0.288410   1.000000   1.000000   0.000000
Blackout   Tree LSTM   3.911840   0.978031   4.286312   4.867795   1.148584
           Jaccard     0.673284   0.351129   0.800000   1.000000   0.000000
Median     Tree LSTM   4.313300   0.760557   4.696978   4.799298   1.190687
           Jaccard     0.852893   0.271832   1.000000   1.000000   0.000000

11 of 36

Experiment Result (Group B + C)

  • Group C experiences more caption change than Group B
  • The difference is not very significant.

Group      Metric      Mean       Std. Dev.  Median     Max        Min
Group B    Tree LSTM   4.517174   0.686364   4.780784   4.878270   1.126073
           Jaccard     0.913922   0.216759   1.000000   1.000000   0.000000
Group C    Tree LSTM   4.313300   0.760557   4.696978   4.799298   1.190687
           Jaccard     0.852893   0.271832   1.000000   1.000000   0.000000

Table: Data of median ablation on both Group B and Group C

Group      Metric      Mean       Std. Dev.  Median     Max        Min
Group B    Tree LSTM   4.236897   0.896967   4.777700   4.881594   1.164003
           Jaccard     0.809745   0.299980   1.000000   1.000000   0.000000
Group C    Tree LSTM   3.911840   0.978031   4.286312   4.867795   1.148584
           Jaccard     0.673284   0.351129   0.800000   1.000000   0.000000

Table: Data of blackout ablation on both Group B and Group C

12 of 36

Experiment Result (Group B + C)

  • Jaccard index distributions are discrete.
  • Tree LSTM similarity scores are more continuous.
    • E.g., data for Group B with Gaussian ablation: normalized Jaccard indexes (left) vs. Tree LSTM scores (right).

13 of 36

Experiment Result (Group B + C)

  • The distribution of Tree LSTM similarity scores is highly similar to that of the Jaccard scores.
    • E.g., data for Group C with median ablation: norm. Jaccard (left); norm. Tree LSTM scores (middle); difference (right).

14 of 36

Experiment Design (Group A)

  • Contains all images without text instances
  • Ablation methods:
    • Background black-out
    • Background median
  • Hypothesis:
    • The background should contribute object-scene context; as a result, removing the scene completely should create high caption-semantic disruption.
  • Semantic distance metrics:
    • Jaccard Index
    • Tree LSTM Similarity Score
    • BLEU Score Difference
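The BLEU difference metric scores the original and the ablated caption against the reference captions and subtracts. As a sketch we use unigram BLEU with the standard brevity penalty, since the slides do not specify the n-gram order; the function names are ours.

```python
import math
from collections import Counter


def bleu1(references, candidate):
    """Unigram BLEU with brevity penalty (a simplified sketch)."""
    cand = candidate.split()
    if not cand:
        return 0.0
    max_ref = Counter()                      # clipped reference word counts
    for ref in references:
        for w, c in Counter(ref.split()).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    ref_len = min((len(r.split()) for r in references),
                  key=lambda n: abs(n - len(cand)))   # closest reference length
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision


def bleu_diff(references, original, ablated):
    """Positive when ablation makes the caption score worse."""
    return bleu1(references, original) - bleu1(references, ablated)
```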

15 of 36

Experiment Result (Group A)

Ablation             Metric                  Mean       Std. Dev.  Median     Max        Min
Background Blackout  Tree LSTM               2.960662   1.056050   2.751745   4.878211   1.118678
                     Jaccard                 0.380484   0.344731   0.250000   1.000000   0.000000
                     BLEU Diff. (orig-ablt)  0.113305   0.181888   0.072560   0.993355   -0.591098
Background Median    Tree LSTM               (coming soon)
                     Jaccard                 (coming soon)
                     BLEU Diff. (orig-ablt)  (coming soon)

16 of 36

Interesting examples

17 of 36

[Slides 17-21: example images]

22 of 36

False Correlation Experiment (from this week)

  • In previous anecdotal experiments, we noticed that with background black-out ablation, the captioning network produces stable word pairs even with minimal visual cues, e.g., "toilet and sink", "red rose".
  • Perhaps the RNN does not see the image at all (or extracts very limited information from the image context)?
  • This hints at a way to improve captioning accuracy.

23 of 36

Hypothesis

  • Word context dominates image context during caption generation.

24 of 36

Procedures

  1. Calculate word-word conditional probabilities from the training captions.
  2. Conduct categorical studies of how ablation affects the appearance of the most probable co-occurring words.
  3. We chose the "toilet" category.
  4. The most probable co-occurring words are "BATHROOM", "SINK", "WHITE", "SITTING", and "NEXT".
  5. Extreme ablation: leave only the toilet.
  6. This category is interesting both experimentally and aesthetically.
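Step 1 can be sketched as caption-level co-occurrence counting: P(w | "toilet") is the fraction of captions containing "toilet" that also contain w. Counting at the caption level is our assumption.

```python
from collections import Counter


def cond_probs(captions, word):
    """P(another word appears in a caption | `word` appears in it)."""
    n_word = 0
    co = Counter()
    for cap in captions:
        words = set(cap.lower().split())
        if word in words:
            n_word += 1
            co.update(words - {word})
    if n_word == 0:
        return {}
    return {w: c / n_word for w, c in co.items()}
```

Sorting the result by probability gives the "highly probable" co-occurring words used in step 2.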

25 of 36

Results

  • The co-appearing words generally have high priors
    • Over 10%, while the mean probability over all word pairs is 0.0003%.
  • Captions closely follow the prior probability distribution.
  • Some examples:
    • Toilet -> [BATHROOM, SINK, WHITE, SITTING, NEXT]
    • Rose -> [VASE, PINK, RED, TABLE, WHITE]
    • Clock -> [TOWER, BUILDING, LARGE, TOP, TALL]

  • Caveat: the pairs we spotted are not the highest ranked.

26 of 36

Results -- Ablation

  • Total # of 'toilet' images: 2318
  • The decrease in column 3 might be due to random guessing by the network, since some images are hardly recognizable after ablation.

Median filter width   # original captions with the 5 words   # ablated captions with the 5 words   # captions with 'SINK' (orig + ablt)
7                     2234                                   2229                                  1588 + 1576
71                    2234                                   1939                                  1588 + 915

27 of 36

[Slides 27-29: example images]

30 of 36

Results -- Toilet only

  • Median filter width = 71; total number of images = 1135
  • The vocabulary of the original + ablated captions contains merely 374 words, compared to 22644 words in the entire training vocabulary.

# original captions with the 5 words   # ablated captions with the 5 words   # captions with 'SINK' (orig + ablt)
1073                                   966                                   514 + 333

31 of 36

Results -- Toilet only

  • Saturated conditional probabilities due to the shrunken vocabulary size

Word      Original Cond. Prob.   Ablated Cond. Prob.   Training Cond. Prob.
Bathroom  0.864                  0.884                 0.541
Sink      0.483                  0.342                 0.284
White     0.236                  0.589                 0.202
Sitting   0.357                  0.470                 0.132
Next      0.194                  0.446                 0.117

32 of 36

[Slides 32-34: example images]

35 of 36

Further Research Ideas

  • Adversarial performance metrics for captioning networks
  • Separately train the RNN on a richer corpus to obtain a stronger language model
  • Try to use simple mechanisms to reproduce the NeuralTalk benchmark
    • Start with a word, then discard the image context altogether and rely solely on word context.

36 of 36

Q & A