
Neural Image Caption Generator

Lila Zimbalist + Udayveer Sodhi

Introduction

We implemented an existing paper that tackles structured prediction on images. Its objective is to use a neural network that processes an image to generate a caption for it. Who hasn’t taken a picture and wished that it was automatically given an accurate caption! This automation can also save a lot of time for anyone who needs to caption their pictures on websites or in presentations; they don’t have to delegate that task to someone else or spend time out of their day captioning arbitrarily many images.

Methodology

Our model is composed of two main parts:

  • Encoder CNN
  • Decoder RNN with LSTM

The encoder is made up of convolutional, max pooling, flattening, and dense layers. Our decoder is an RNN that uses an LSTM and a dense layer.
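
A minimal sketch of how such an encoder-decoder could be wired together in Keras (the layer sizes, image resolution, vocabulary size, and maximum caption length below are illustrative assumptions, not the exact values from our code):

    from tensorflow.keras import layers, Model

    vocab_size = 5000   # assumed vocabulary size
    max_len = 34        # assumed maximum caption length
    embed_dim = 256

    # Encoder: convolution + max pooling + flatten + dense -> image embedding.
    image_in = layers.Input(shape=(128, 128, 3))
    x = layers.Conv2D(32, 3, activation="relu")(image_in)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    image_feat = layers.Dense(embed_dim, activation="relu")(x)

    # Decoder: embed the partial caption, run an LSTM over it,
    # combine with the image features, and predict the next word.
    caption_in = layers.Input(shape=(max_len,))
    e = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(caption_in)
    h = layers.LSTM(embed_dim)(e)
    merged = layers.add([image_feat, h])
    next_word = layers.Dense(vocab_size, activation="softmax")(merged)

    model = Model(inputs=[image_in, caption_in], outputs=next_word)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()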

We used the Flickr8k dataset to train our model. This is a dataset of 8,000 images divided into 6,000 training images, 1,000 testing images, and 1,000 "dev" images. Each image also has 5 captions associated with it, which provides variety and more accuracy in the generated captions.
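
For reference, the caption annotations ship as a single text file; a rough sketch of parsing it into an image-to-captions mapping (the Flickr8k.token.txt name and tab-separated format follow the common distribution of the dataset and may differ from our exact setup):

    from collections import defaultdict

    # Sketch: parse the Flickr8k caption file into {image_id: [caption, ...]}.
    # Each line of the common Flickr8k.token.txt looks like:
    #   1000268201_693b08cb0e.jpg#0<TAB>A child in a pink dress is ...
    captions_by_image = defaultdict(list)
    with open("Flickr8k.token.txt") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_tag, caption = line.split("\t", 1)
            image_id = image_tag.split("#")[0]   # drop the "#0".."#4" caption index
            captions_by_image[image_id].append(caption.lower())

    print(len(captions_by_image), "images,",
          sum(len(c) for c in captions_by_image.values()), "captions")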

Results

Unfortunately, we don’t have very many results to report. We were able to train the model successfully, with our loss decreasing from ~2.5 to ~1.5; we would expect the loss to start higher, but we are encouraged by the fact that it decreases over training.

As can be seen in the generated captions below, our model was able to generate varied captions with different pieces of vocabulary, but those captions were not tied to a specific image and tended not to make much sense.
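
For context, captions are produced one word at a time. Below is a hedged sketch of a greedy decoding loop, assuming a trained model shaped like the sketch in Methodology, a fitted Keras Tokenizer, and "startseq"/"endseq" boundary tokens (all of these names are illustrative assumptions):

    import numpy as np
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Sketch: greedy decoding. Feed the image plus the caption-so-far, take the
    # most probable next word, and repeat until "endseq" or the length limit.
    def generate_caption(model, tokenizer, image, max_len=34):
        caption = "startseq"
        for _ in range(max_len):
            seq = tokenizer.texts_to_sequences([caption])[0]
            seq = pad_sequences([seq], maxlen=max_len)
            probs = model.predict([image[np.newaxis], seq], verbose=0)[0]
            word = tokenizer.index_word.get(int(np.argmax(probs)))
            if word is None or word == "endseq":
                break
            caption += " " + word
        return caption.replace("startseq", "").strip()

With greedy decoding like this, a next-word distribution that collapses toward very frequent tokens produces repeated words such as "a", which may help explain the generated captions shown below.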

Our model architecture: a CNN encoder with an RNN decoder.

Image from the paper we implemented, “Show and Tell: A Neural Image Caption Generator”.

Discussion

Lessons learned

Through this implementation, we learned about many useful Keras functions that helped us preprocess the data more efficiently.
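
For example, Keras utilities such as Tokenizer and pad_sequences cover most of the caption preprocessing. A small sketch (the startseq/endseq boundary tokens are an assumed convention, and the example captions are adapted from our dataset examples):

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Sketch: turn raw captions into padded integer sequences with Keras utilities.
    captions = [
        "startseq a person hike down a snowy mountain endseq",
        "startseq a group of person walk through a shop mall endseq",
    ]

    tokenizer = Tokenizer()                 # builds the word -> index vocabulary
    tokenizer.fit_on_texts(captions)
    sequences = tokenizer.texts_to_sequences(captions)
    max_len = max(len(s) for s in sequences)
    padded = pad_sequences(sequences, maxlen=max_len, padding="post")

    print(len(tokenizer.word_index), "words in the vocabulary")
    print(padded)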

Lingering problems/limitations

We believe the greatest limitation to our model’s success is an issue with calculating the loss, which in turn causes issues with optimizing the model. That’s the main issue we tried to solve when debugging our model, and it’s definitely one thing we’d like to continue working on in the future.
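
One idea we would try is checking how padded positions enter the loss: if the decoder is scored over a padded caption sequence, padding tokens should be masked out. A hedged sketch of that idea (a direction to explore, not a diagnosis of our specific bug):

    import tensorflow as tf

    # Sketch: per-token cross-entropy that ignores padded positions (token id 0),
    # so padding does not pull the model toward predicting filler words.
    def masked_loss(y_true, y_pred):
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction="none")
        per_token = loss_fn(y_true, y_pred)                       # shape: (batch, time)
        mask = tf.cast(tf.not_equal(y_true, 0), per_token.dtype)  # 1 where a real word is
        per_token *= mask
        return tf.reduce_sum(per_token) / tf.maximum(tf.reduce_sum(mask), 1.0)

    # Hypothetical usage: model.compile(optimizer="adam", loss=masked_loss)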

Future Work

In addition to figuring out the problems with our loss, we would like to make further use of the data in our dataset. The Flickr8k dataset that we used gives each image 5 different, but related, captions. Our model only uses one of those captions to train/test on, but in the future we would like to find a way to use all 5 captions to make even better predictions (a rough sketch of that idea follows below).
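
A straightforward way to do that would be to repeat each image once per caption when building training pairs, roughly like this (captions_by_image follows the parsing sketch in Methodology, and the startseq/endseq tokens are the same assumed convention):

    # Sketch: expand the dataset so every (image, caption) pair becomes a training
    # example, giving each image five examples instead of one.
    def expand_pairs(captions_by_image, image_ids):
        pairs = []
        for image_id in image_ids:
            for caption in captions_by_image.get(image_id, []):
                pairs.append((image_id, "startseq " + caption + " endseq"))
        return pairs

    # Hypothetical scale: ~6000 training images * 5 captions -> ~30000 pairs.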

1. A person hike down a snowy mountain

2. A group of person walk through a shop mall

Examples of images/captions from our dataset

Examples of Generated Captions

“a man in a a a a a a a a a his a watch field of in a field the a purple uniform green shirt smile and watch base shelter golden”

“a black dog be dog be a a a in a a snow in the background mouth field them a a house the a grassy field rider pass behind compete float skateboard”

Note: while it was our greatest hope and intention to provide more visuals and data surrounding our results, the issues with our model meant we were unable to collect any such data.

This project is based on the paper “Show and Tell: A Neural Image Caption Generator” by Vinyals, Toshev, et al.