1 of 14

Captioning Images to Assist People Who Are Blind

Wuao Liu, Sihang Wei, Hao-Hsiang Hsu
EECS 442 Final Project

2 of 14

Motivation

Automatically generate a natural-language description of an input image

Understand the scene of the image
Connect computer vision and natural language processing

Our training datasets: VizWiz and COCO

VizWiz aims to help people who are blind overcome real daily visual challenges
VizWiz consists of 39,181 images, each paired with 5 captions
COCO is a large-scale object detection, segmentation, and captioning dataset

3 of 14

Current Methods of Image Captioning

Encoder-Decoder Architecture
RNN/LSTM
RNN with Attention
Transformers
…

4 of 14

Baseline Method (CNN + RNN/LSTM)

5 of 14

Attention Methodology

Visual feature extraction
Soft attention
Caption generator

[Figure: Feature Map, Attention Mechanism, Recurrent Networks (Cell t, Cell t+1)]

6 of 14

Visual feature extraction

…

7 of 14

Soft attention
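
The soft-attention step can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the names `soft_attention`, `features`, and `hidden` are hypothetical, and a plain dot product stands in for the learned scoring network that an actual model would train.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, hidden):
    """Return (weights, context) for one decoding step.

    features: list of L location vectors from the feature map, each of length D.
    hidden:   decoder state, assumed here to also have length D.
    """
    # Assumption: a dot product replaces the learned scoring function.
    scores = [sum(f * h for f, h in zip(feat, hidden)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    # Context vector: attention-weighted sum of the location features.
    context = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(dim)]
    return weights, context

# Three image locations with 2-dimensional features (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
weights, context = soft_attention(features, hidden)
```

Because the weights sum to 1 and every location contributes, the whole step is differentiable, which is what makes the attention "soft" and trainable with ordinary backpropagation.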

8 of 14

Caption generator

[Slide derived from Chris Olah: https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
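
A single LSTM step, following the standard gate equations in the post cited above, might look like the sketch below. The function names and the `params` layout are hypothetical choices for illustration. In an attention-based captioner the input `x` would typically combine the previous word embedding with the attention context vector, but that wiring is an assumption and is not shown here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Compute W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps gate name -> (W, b) acting on [h_prev, x]."""
    z = h_prev + x  # list concatenation of previous hidden state and input
    f = [sigmoid(v) for v in affine(*params["forget"], z)]       # forget gate
    i = [sigmoid(v) for v in affine(*params["input"], z)]        # input gate
    o = [sigmoid(v) for v in affine(*params["output"], z)]       # output gate
    g = [math.tanh(v) for v in affine(*params["candidate"], z)]  # candidate values
    # New cell state: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed view of the cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Toy check with hidden size 1, input size 1, and all-zero weights:
# every sigmoid gate is 0.5 and the candidate is 0.
zero = ([[0.0, 0.0]], [0.0])
params = {name: zero for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step([0.0], [0.0], [1.0], params)
```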

9 of 14

Results

We reproduce the algorithm described in the paper and show that the attention mechanism can improve model performance.

Source	Dataset	Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4
Paper result	COCO	Soft-Att	70.7	49.2	34.4	24.3
Ours	COCO	Soft-Att	60.4	19.6	9.6	7.0
Ours	VizWiz	CNN+RNN	57.9	17.8	3.1	1.4
Ours	VizWiz	Soft-Att	59.2	20.3	6.5	5.1
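
For reference, BLEU-n combines clipped n-gram precisions for orders 1 to n with a brevity penalty. The sketch below is a simplified sentence-level version without smoothing, written only to show how the metric is put together. It is not the evaluation script used for the table above, and reported scores are usually computed at corpus level and multiplied by 100.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against all references."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity * math.exp(log_mean)

candidate = "a man sitting at a table".split()
references = ["a man sitting at a table with a laptop".split()]
score = bleu(candidate, references, 4)
```

Because the score is a geometric mean over n-gram orders, a caption with the right words in the wrong order keeps a reasonable BLEU-1 while BLEU-3 and BLEU-4 fall sharply, which is the pattern visible in the table.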

10 of 14

Contributions

Data pre-processing: dataset and vocabulary generation for MSCOCO and VizWiz
A CNN+RNN baseline method to generate captions
Reproduced the paper and implemented the soft-attention method
Model training and parameter tuning
Validated and analyzed the effectiveness of the attention model
Took a deeper look at visual transformers

Green words: Open source code online

Black words: Our implementation
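
Vocabulary generation, as listed in the contributions above, can be sketched as follows. This is a minimal illustration rather than the project's code: the `min_count` threshold and the `<pad>`, `<start>`, and `<end>` tokens are assumptions, while `<unk>` is the out-of-vocabulary token that shows up in the generated captions.

```python
from collections import Counter

def build_vocab(captions, min_count=2):
    """Map each sufficiently frequent word to an integer id."""
    counts = Counter(word for caption in captions
                     for word in caption.lower().split())
    # Assumed special tokens; only <unk> is attested in the demo captions.
    specials = ["<pad>", "<start>", "<end>", "<unk>"]
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {token: idx for idx, token in enumerate(specials + words)}

def encode(caption, vocab):
    """Turn a caption into ids, mapping rare or unseen words to <unk>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(word, unk) for word in caption.lower().split()]
    return [vocab["<start>"]] + ids + [vocab["<end>"]]

captions = ["a hand holding a bottle", "a hand holding a card"]
vocab = build_vocab(captions, min_count=2)
ids = encode("a hand holding a keyboard", vocab)
```

Words below the frequency threshold, such as text printed on objects, collapse to `<unk>`, which would explain why captions about labels or keyboards contain that token.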

11 of 14

Future Work

Use a transformer to replace the attention model

Transformers can achieve better BLEU scores with fewer training epochs

a hand holding a bottle of something that looks like a blurry white surface.

A person is cutting a cake with a knife.

12 of 14

More Demos

a hand holding a cd card with the words “ <unk> ” near it .

the left corner of a keyboard showing the letters `` <unk> '' , `` keys '' , the numbers `` , '' `` <unk> '' above the letters ``

quality issues are too severe to recognize visual content , it 's too blurry to read , what looks like a white and black object that has a metal trim

someone is holding a white plastic container with a blue lid , and the lid is sitting on a tan surface.

13 of 14

More Demos

A group of people standing around a truck.

A man sitting at a table with a laptop.

A couple of people walking along a beach next to the ocean.

14 of 14

Thanks for listening!