1 of 36

Text matters?!

Statistical analysis on COCO-Text dataset

2 of 36

Overview

  • Motivation
  • Experiment Design
    • Methodology: Ablation Study
    • Data Partition
  • Comprehensive Experiment Results
  • Case Study on False Correlation
  • Problems and Challenges
  • Q & A

3 of 36

Motivation

  • Intuitively, in-image text should tell us a lot about an image's semantics...
    • Civil (urban) settings rather than scenes in the wild
    • Street names, billboards, and stop signs highlight the salient content of an image
  • We want to know how, and how much, in-image text matters to image semantics, and in particular how captioning networks interpret it
  • Perhaps we can then use more sophisticated OCR techniques to improve captioning performance

4 of 36

Experiment Design (Methodology)

  • Apply ablation to selected parts of the image.
  • Compare the captions of ablated images with those of the originals.
  • Ablation methods:
    • Gaussian blur, black-out, median filter, background black-out, background median.
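As a rough sketch, the three region filters can be implemented with NumPy and SciPy; the function name, the (x, y, w, h) box format, and the single-channel input are our assumptions, not details from the slides.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter


def ablate(img, bbox, method):
    """Ablate the text region bbox = (x, y, w, h) of a grayscale image."""
    x, y, w, h = bbox
    out = img.copy()
    patch = out[y:y + h, x:x + w]
    if method == "blackout":
        out[y:y + h, x:x + w] = 0                            # erase the region
    elif method == "gaussian":
        out[y:y + h, x:x + w] = gaussian_filter(patch, sigma=3)
    elif method == "median":
        out[y:y + h, x:x + w] = median_filter(patch, size=7)
    else:
        raise ValueError(f"unknown method: {method}")
    return out
```

Background black-out/median would invert the mask: apply the operation everywhere except the annotated regions.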

5 of 36

Experiment Design (Data Partition)

  • Split the COCO dataset into the following groups:
    • Group A: images with no annotated text instances.
    • Group B: images with only irrelevant text instances.
    • Group C: images with some relevant text instances.
  • We call a text instance "relevant" if it appears in captions from the training set.
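The grouping rule can be sketched as follows; the data layout (per-image lists of text transcriptions and captions) and the substring relevance test are our simplifications.

```python
def is_relevant(text, captions):
    """Our simplified relevance test: the transcription occurs in a caption."""
    return any(text.lower() in cap.lower() for cap in captions)


def partition(images):
    """Assign each image id to Group A, B, or C.

    `images` maps image id -> (list of text transcriptions, list of captions).
    """
    groups = {"A": [], "B": [], "C": []}
    for img_id, (texts, captions) in images.items():
        if not texts:
            groups["A"].append(img_id)        # no annotated text
        elif any(is_relevant(t, captions) for t in texts):
            groups["C"].append(img_id)        # some relevant text
        else:
            groups["B"].append(img_id)        # only irrelevant text
    return groups
```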

Captions (the in-image text "hammer time" appears in the captions, so the instance is relevant):

  • a stop sign at an intersection, with "hammer time" printed on it
  • a stop sign with graffiti reading "hammer time."
  • street sign on post with graffiti near green shrubs.
  • a stop sign where someone wrote "hammer time."
  • a stop sign at a corner with the words hammer time below stop.

Captions (no in-image text appears in the captions):

  • a boy is swinging at a ball during a baseball game.
  • a boy that is hitting a ball with a baseball bat.
  • a person with a bat and helmet hits a ball.
  • a boy in a yellow and white uniform hitting a baseball
  • a young baseball player is hitting the ball.

6 of 36

Experiment Design (Group B)

  • Contains all images with only irrelevant text instances.
  • Ablation Methods:
    • Gaussian
    • Black-out
    • Median
  • Metrics:
    • Jaccard Index
    • Tree LSTM
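The Jaccard index here is the intersection-over-union of the two captions' word sets; the tokenization (lowercased whitespace split) is our assumption.

```python
def jaccard(caption_a, caption_b):
    """Jaccard index between the word sets of two captions."""
    a = set(caption_a.lower().split())
    b = set(caption_b.lower().split())
    if not a and not b:
        return 1.0          # two empty captions are identical
    return len(a & b) / len(a | b)
```

Identical captions score 1.0 and fully disjoint ones 0.0, which also explains why this metric's distribution is discrete for short captions.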

7 of 36

Experiment Design (Group C)

  • Contains images with relevant text instance(s)
  • Ablation Methods:
    • Gaussian
    • Black-out
    • Median
  • Metrics:
    • Jaccard Index
    • Tree LSTM

8 of 36

Experiment Design (Group B, C)

  • Hypotheses:
    • Group C (images with relevant text) should experience more caption change than Group B (images without relevant text).
    • Black-out should cause the most disruptive changes, while the median filter should cause the least.

9 of 36

Experiment Result (Group B)

  • Disruptive Degree: Blackout > Gaussian > Median
  • Most captions don't change

Ablation   Metric      Mean       Std. Dev.  Median     Max        Min
Gaussian   Tree LSTM   4.473455   0.730141   4.779640   4.879636   1.131421
           Jaccard     0.902979   0.227583   1.000000   1.000000   0.000000
Blackout   Tree LSTM   4.236897   0.896967   4.777700   4.881594   1.164003
           Jaccard     0.809745   0.299980   1.000000   1.000000   0.000000
Median     Tree LSTM   4.517174   0.686364   4.780784   4.878270   1.126073
           Jaccard     0.913922   0.216759   1.000000   1.000000   0.000000

10 of 36

Experiment Result (Group C)

  • Disruptive Degree: Blackout > Gaussian > Median
  • Most captions don't change

Ablation   Metric      Mean       Std. Dev.  Median     Max        Min
Gaussian   Tree LSTM   4.323064   0.810277   4.782078   4.873451   1.174542
           Jaccard     0.829445   0.288410   1.000000   1.000000   0.000000
Blackout   Tree LSTM   3.911840   0.978031   4.286312   4.867795   1.148584
           Jaccard     0.673284   0.351129   0.800000   1.000000   0.000000
Median     Tree LSTM   4.313300   0.760557   4.696978   4.799298   1.190687
           Jaccard     0.852893   0.271832   1.000000   1.000000   0.000000

11 of 36

Experiment Result (Group B + C)

  • Group C experiences more caption change than Group B
  • The difference is not very significant.

Group      Metric      Mean       Std. Dev.  Median     Max        Min
Group B    Tree LSTM   4.517174   0.686364   4.780784   4.878270   1.126073
           Jaccard     0.913922   0.216759   1.000000   1.000000   0.000000
Group C    Tree LSTM   4.313300   0.760557   4.696978   4.799298   1.190687
           Jaccard     0.852893   0.271832   1.000000   1.000000   0.000000

Table: Data of median ablation on both Group B and Group C

Group      Metric      Mean       Std. Dev.  Median     Max        Min
Group B    Tree LSTM   4.236897   0.896967   4.777700   4.881594   1.164003
           Jaccard     0.809745   0.299980   1.000000   1.000000   0.000000
Group C    Tree LSTM   3.911840   0.978031   4.286312   4.867795   1.148584
           Jaccard     0.673284   0.351129   0.800000   1.000000   0.000000

Table: Data of blackout ablation on both Group B and Group C

12 of 36

Experiment Result (Group B + C)

  • Jaccard index distributions are discrete.
  • Tree LSTM similarity scores are more continuous.
    • E.g., data for Group B with Gaussian ablation: normalized Jaccard indexes (left) vs. Tree LSTM scores (right).

13 of 36

Experiment Result (Group B + C)

  • The distribution of Tree LSTM similarity scores is highly similar to that of the Jaccard scores.
    • E.g., data for Group C with median ablation: norm. Jaccard (left); norm. Tree LSTM scores (middle); difference (right).

14 of 36

Experiment Design (Group A)

  • Contains all images without text instances
  • Ablation methods:
    • Background black-out
    • Background median
  • Hypothesis:
    • The background should contribute object-scene context; as a result, removing the scene completely should create high caption-semantic disruption.
  • Semantic distance metrics:
    • Jaccard Index
    • Tree LSTM Similarity Score
    • BLEU Score Difference
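The BLEU difference metric scores the original and the ablated caption against the reference captions and subtracts. As a sketch we use unigram BLEU with the standard brevity penalty, since the slides do not specify the n-gram order; the function names are ours.

```python
import math
from collections import Counter


def bleu1(references, candidate):
    """Unigram BLEU with brevity penalty (a simplified sketch)."""
    cand = candidate.split()
    if not cand:
        return 0.0
    max_ref = Counter()                      # clipped reference word counts
    for ref in references:
        for w, c in Counter(ref.split()).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    ref_len = min((len(r.split()) for r in references),
                  key=lambda n: abs(n - len(cand)))   # closest reference length
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision


def bleu_diff(references, original, ablated):
    """Positive when ablation makes the caption score worse."""
    return bleu1(references, original) - bleu1(references, ablated)
```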

15 of 36

Experiment Result (Group A)

Ablation             Metric                  Mean       Std. Dev.  Median     Max        Min
Background Blackout  Tree LSTM               2.960662   1.056050   2.751745   4.878211   1.118678
                     Jaccard                 0.380484   0.344731   0.250000   1.000000   0.000000
                     BLEU Diff. (orig-ablt)  0.113305   0.181888   0.072560   0.993355   -0.591098
Background Median    Tree LSTM               (coming soon)
                     Jaccard                 (coming soon)
                     BLEU Diff. (orig-ablt)  (coming soon)

16 of 36

Interesting examples

17 of 36

[Slides 17-21: example images]

22 of 36

False Correlation Experiment (from this week)

  • In previous anecdotal experiments, we noticed that with background black-out ablation, the captioning network produces stable word pairs even with minimal visual cues, e.g., "toilet and sink", "red rose".
  • Perhaps the RNN does not see the image at all (or extracts very limited information from the image context)?
  • This hints at a way to improve captioning accuracy.

23 of 36

Hypothesis

  • Word context dominates image context during caption generation.

24 of 36

Procedures

  1. Calculate word-word conditional probabilities from the training captions.
  2. Conduct categorical studies of how ablation affects the appearance of the most probable co-occurring words.
  3. We chose the "toilet" category.
  4. The most probable co-occurring words are "BATHROOM", "SINK", "WHITE", "SITTING", and "NEXT".
  5. Extreme ablation: leave only the toilet.
  6. This category is interesting both experimentally and aesthetically.
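Step 1 can be sketched as caption-level co-occurrence counting: P(w | "toilet") is the fraction of captions containing "toilet" that also contain w. Counting at the caption level is our assumption.

```python
from collections import Counter


def cond_probs(captions, word):
    """P(another word appears in a caption | `word` appears in it)."""
    n_word = 0
    co = Counter()
    for cap in captions:
        words = set(cap.lower().split())
        if word in words:
            n_word += 1
            co.update(words - {word})
    if n_word == 0:
        return {}
    return {w: c / n_word for w, c in co.items()}
```

Sorting the result by probability gives the "highly probable" co-occurring words used in step 2.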

25 of 36

Results

  • The co-appearing words generally have high priors
    • Over 10%, while the mean probability over all word pairs is 0.0003%.
  • Captions closely follow the prior probability distribution.
  • Some examples:
    • Toilet -> [BATHROOM, SINK, WHITE, SITTING, NEXT]
    • Rose -> [VASE, PINK, RED, TABLE, WHITE]
    • Clock -> [TOWER, BUILDING, LARGE, TOP, TALL]

  • Caveat: the pairs we spotted are not the highest ranked.

26 of 36

Results -- Ablation

  • Total # of 'toilet' images: 2318
  • The decrease in column 3 might be due to random guessing by the network, since some images are hardly recognizable after ablation.

Median filter width   # original captions with the 5 words   # ablated captions with the 5 words   # captions with 'SINK' (orig + ablt)
7                     2234                                   2229                                  1588 + 1576
71                    2234                                   1939                                  1588 + 915

27 of 36

[Slides 27-29: example images]

30 of 36

Results -- Toilet only

  • Median filter width = 71; total number of images = 1135
  • The vocabulary of the original + ablated captions contains merely 374 words, compared to 22644 words in the entire training vocabulary.

# original captions with the 5 words   # ablated captions with the 5 words   # captions with 'SINK' (orig + ablt)
1073                                   966                                   514 + 333

31 of 36

Results -- Toilet only

  • Saturated conditional probabilities due to the shrunken vocabulary size

Word      Original Cond. Prob.   Ablated Cond. Prob.   Training Cond. Prob.
Bathroom  0.864                  0.884                 0.541
Sink      0.483                  0.342                 0.284
White     0.236                  0.589                 0.202
Sitting   0.357                  0.470                 0.132
Next      0.194                  0.446                 0.117

32 of 36

[Slides 32-34: example images]

35 of 36

Further Research Ideas

  • Adversarial performance metrics for captioning networks
  • Separately train the RNN on a richer corpus to obtain a stronger language model
  • Try to use simple mechanisms to reproduce the NeuralTalk benchmark
    • Start with a word, then discard the image context altogether and rely solely on word context.

36 of 36

Q & A