1 of 14

Captioning Images to Assist People Who Are Blind

Wuao Liu, Sihang Wei, Hao-Hsiang Hsu
EECS 442 Final Project

2 of 14

Motivation

  • Automatically generate a natural-language description of an input image
    • Understand the scene in the image
    • Connect computer vision and natural language processing
  • Our training datasets: VizWiz and COCO
    • VizWiz aims to help people who are blind overcome real, everyday visual challenges
    • VizWiz consists of 39,181 images, each paired with 5 captions
    • COCO is a large-scale object detection, segmentation, and captioning dataset

3 of 14

Current Methods of Image Captioning

  • Encoder-Decoder Architecture
  • RNN/LSTM
  • RNN with Attention
  • Transformers
  • …

4 of 14

Baseline Method (CNN + RNN/LSTM)
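
As a rough sketch of how a CNN + RNN/LSTM baseline of this kind typically produces a caption: the CNN feature seeds the recurrent state, and the decoder emits one word per step until an end token appears. The `step` callable, the `<start>`/`<end>` token names, and the toy lookup table are hypothetical stand-ins for a trained decoder, not the project's code.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[float, ...]
StepFn = Callable[[State, str], Tuple[State, Dict[str, float]]]


def greedy_decode(image_feature: State, step: StepFn, max_len: int = 20) -> List[str]:
    """Generate a caption word by word, always taking the most likely word."""
    state = image_feature              # the CNN feature seeds the recurrent state
    word = "<start>"                   # assumed start-of-caption token
    caption: List[str] = []
    for _ in range(max_len):
        state, probs = step(state, word)       # one RNN/LSTM step
        word = max(probs, key=probs.get)       # greedy choice
        if word == "<end>":                    # assumed end-of-caption token
            break
        caption.append(word)
    return caption


def toy_step(state: State, prev_word: str) -> Tuple[State, Dict[str, float]]:
    """Hypothetical stand-in for a trained decoder: a fixed lookup table."""
    table = {
        "<start>": {"a": 0.9, "<end>": 0.1},
        "a": {"hand": 0.7, "bottle": 0.3},
        "hand": {"<end>": 0.8, "a": 0.2},
    }
    return state, table[prev_word]


print(greedy_decode((0.0,), toy_step))
```

Greedy decoding is the simplest choice; beam search is a common alternative when caption quality matters more than speed.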

5 of 14

Attention Methodology

  • Visual feature extraction
  • Soft attention
  • Caption generator

[Figure: Feature Map, Attention Mechanism, and Recurrent Networks, shown for Cell t and Cell t+1]

6 of 14

  • Visual feature extraction
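
A minimal sketch of the usual hand-off from encoder to attention, assuming the CNN's last convolutional feature map has shape H x W x D and is flattened into H*W annotation vectors, one per image region. The shapes and the function name are illustrative assumptions, not taken from the project.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]


def to_annotation_vectors(feature_map: FeatureMap) -> List[List[float]]:
    """Flatten an H x W x D feature map into H*W vectors of length D."""
    return [list(cell) for row in feature_map for cell in row]


# Toy 2 x 2 x 3 feature map; real CNN feature maps are far larger.
fmap = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]],
]
annotations = to_annotation_vectors(fmap)
print(len(annotations), len(annotations[0]))
```

Keeping one vector per spatial location is what lets the attention step weight image regions differently at each word.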


7 of 14

  • Soft attention
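
A minimal sketch of soft attention, assuming the common formulation: score each annotation vector against the previous hidden state, normalize the scores with a softmax, and take the weighted average as the context vector. The dot-product score is a simplification chosen for brevity; soft-attention captioning models usually score with a small learned network.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(
    annotations: List[List[float]], hidden: List[float]
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for one decoding step."""
    # Assumption: dot-product scoring stands in for a learned scoring network.
    scores = [sum(a_k * h_k for a_k, h_k in zip(a, hidden)) for a in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    # Context = expectation of the annotation vectors under the weights.
    context = [sum(w * a[d] for w, a in zip(weights, annotations)) for d in range(dim)]
    return weights, context


annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy image regions
hidden = [2.0, 0.0]                                 # toy decoder state
weights, context = soft_attention(annotations, hidden)
print(weights, context)
```

Because the context is a smooth weighted sum rather than a hard selection, the whole step is differentiable and can be trained with ordinary backpropagation.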

 

 

 

8 of 14

  • Caption generator

[Slide derived from Chris Olah: https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
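
For reference, one step of a standard LSTM cell can be sketched as follows. In an attention captioner the input `x` would typically be the previous word's embedding concatenated with the attention context vector; that wiring, the parameter layout, and the all-zero toy weights are assumptions for illustration only.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, b: Vector, v: Vector) -> Vector:
    """Compute W v + b."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(
    x: Vector, h_prev: Vector, c_prev: Vector,
    params: Dict[str, Tuple[Matrix, Vector]],
) -> Tuple[Vector, Vector]:
    """One LSTM step; params maps gate name ('f','i','o','g') to (W, b)."""
    v = x + h_prev                                        # concatenate [x, h_prev]
    f = [sigmoid(z) for z in affine(*params["f"], v)]     # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], v)]     # input gate
    o = [sigmoid(z) for z in affine(*params["o"], v)]     # output gate
    g = [math.tanh(z) for z in affine(*params["g"], v)]   # candidate cell values
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# Toy cell: input size 1, hidden size 1, all weights and biases zero.
zero_params = {gate: ([[0.0, 0.0]], [0.0]) for gate in "fiog"}
h, c = lstm_step([0.0], [0.0], [1.0], zero_params)
print(h, c)
```

With all-zero weights every gate sits at 0.5 and the candidate is 0, so the cell state is simply halved; this makes the toy case easy to check by hand.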

9 of 14

Results

We reproduce the algorithm described in the paper and show that the attention mechanism improves model performance.

              Dataset  Model     BLEU1  BLEU2  BLEU3  BLEU4
Paper result  COCO     Soft-Att   70.7   49.2   34.4   24.3
Ours          COCO     Soft-Att   60.4   19.6    9.6    7.0
Ours          VizWiz   CNN+RNN    57.9   17.8    3.1    1.4
Ours          VizWiz   Soft-Att   59.2   20.3    6.5    5.1
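
For readers unfamiliar with the metric, BLEU-n combines clipped n-gram precisions for n = 1..N with a brevity penalty. The sketch below is a sentence-level version returning values in [0, 1]; the scores above appear to be on a 0-100 scale and are normally computed at corpus level, so this is an illustration of the metric, not the evaluation script used for these results.

```python
import math
from collections import Counter
from typing import List

Tokens = List[str]


def ngrams(tokens: Tokens, n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: Tokens, references: List[Tokens], n: int) -> float:
    """Clipped n-gram precision: each n-gram counts at most as often as in any one reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def bleu(candidate: Tokens, references: List[Tokens], max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_avg)


candidate = "a man sitting at a table".split()
references = ["a man sitting at a table with a laptop".split()]
print(bleu(candidate, references, max_n=1), bleu(candidate, references, max_n=4))
```

Without smoothing, a single missing 4-gram drives sentence-level BLEU4 to zero, which is one reason higher-order scores drop so sharply for short or noisy captions.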

10 of 14

Contributions

  • Data pre-processing: dataset loading and vocabulary generation for MSCOCO and VizWiz
  • A CNN+RNN baseline method to generate captions
  • Reproduced the paper and implemented the soft-attention method
  • Model training and parameter tuning
  • Validated and analyzed the effectiveness of the attention model
  • Looked deeper into the visual transformer

Green text: open-source code found online

Black text: our implementation
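
The vocabulary-generation step can be sketched roughly as follows, assuming whitespace tokenization and a frequency threshold. The `<unk>` token matches the one visible in the demo captions; the other special tokens, the `min_count` parameter, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# "<unk>" appears in the demo captions; the other three tokens are assumed.
SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]


def build_vocab(captions: List[str], min_count: int = 2) -> Dict[str, int]:
    """Map every word seen at least min_count times to an integer id."""
    counts = Counter(word for cap in captions for word in cap.lower().split())
    vocab = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    for word, count in sorted(counts.items()):   # sorted for reproducible ids
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(caption: str, vocab: Dict[str, int]) -> List[int]:
    """Convert a caption to ids, mapping out-of-vocabulary words to <unk>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(word, unk) for word in caption.lower().split()]
    return [vocab["<start>"]] + ids + [vocab["<end>"]]


captions = ["a hand holding a bottle", "a hand holding a card"]
vocab = build_vocab(captions, min_count=2)
print(encode("a hand holding a keyboard", vocab))
```

Rare words fall below the threshold and are replaced by `<unk>`, which is why text-heavy images such as keyboards or labels tend to produce `<unk>` in generated captions.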

11 of 14

Future Work

  • Use a transformer to replace the attention model
  • Transformers can achieve better BLEU scores with fewer training epochs

a hand holding a bottle of something that looks like a blurry white surface.

A person is cutting a cake with a knife.

12 of 14

More Demos

a hand holding a cd card with the words “ <unk> ” near it .

the left corner of a keyboard showing the letters `` <unk> '' , `` keys '' , the numbers `` , '' `` <unk> '' above the letters ``

quality issues are too severe to recognize visual content , it 's too blurry to read , what looks like a white and black object that has a metal trim

someone is holding a white plastic container with a blue lid , and the lid is sitting on a tan surface.

13 of 14

More Demos

A group of people standing around a truck.

A man sitting at a table with a laptop.

A couple of people walking along a beach next to the ocean.

14 of 14

Thanks for listening!