
CSE641 - Deep Learning Winter 2022

Group 35


Lightweight Image Captioning based on CLIP and GPT2 encodings

Ansh Arora (2019022), Daksh Thapar (2018137), Hardik Garg (2019040)


Problem definition

  • Image Captioning is one of the most fundamental vision-language tasks: the model must generate a natural-language description for a given image.
  • State-of-the-art (SoTA) approaches in the field, such as OSCAR and LEMON, while very powerful, are built on a huge number of parameters and are therefore expensive to run.
  • We propose a method that uses a fraction of these parameters and achieves performance in the same ballpark as the aforementioned models, by combining OpenAI’s CLIP visual encodings with OpenAI’s GPT2 textual context encodings.


Related works

Deep Captioning with Multimodal Recurrent Neural Networks (M-RNN)

  • It was one of the earliest methods to use a multimodal recurrent neural network for captioning.
  • It consists of two sub-networks: a deep recurrent neural network for sentence generation and a deep convolutional network for images.
  • The RNN learns a dense feature embedding for each word in the dictionary and stores the semantic and temporal context in its recurrent layers; the CNN generates the image representation. A multimodal part connects the language and vision models through a single-layer representation.
  • The activations of the three components (word embedding, recurrent layer, and image feature) are projected into the same multimodal feature space and added together to obtain the activation of the multimodal layer, as sketched below.
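A minimal sketch of that multimodal layer; the layer sizes below are placeholders, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class MultimodalLayer(nn.Module):
        """Project the word embedding, recurrent state, and CNN image feature
        into one shared space and add them (m-RNN-style multimodal layer)."""
        def __init__(self, embed_dim=256, rnn_dim=256, img_dim=4096, mm_dim=512):
            super().__init__()
            self.proj_word = nn.Linear(embed_dim, mm_dim)  # word-embedding branch
            self.proj_rnn = nn.Linear(rnn_dim, mm_dim)     # recurrent-layer branch
            self.proj_img = nn.Linear(img_dim, mm_dim)     # CNN image-feature branch

        def forward(self, word_emb, rnn_state, img_feat):
            # Element-wise addition in the shared multimodal space, then a nonlinearity.
            m = self.proj_word(word_emb) + self.proj_rnn(rnn_state) + self.proj_img(img_feat)
            return torch.tanh(m)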


Related works

Show and Tell (2014) and Show, Attend and Tell (2015)

  • In the Show and Tell paper, a deep CNN generates a fixed-length embedding of the image, which is fed into an RNN decoder that generates the sentence. The LSTM-based decoder is fed the words of the sentence generated so far and predicts the caption word by word.
  • The Show, Attend and Tell paper introduced attention to make sure that the image’s most salient features are considered for captioning. Features extracted from lower convolutional layers are used to create annotation vectors for the image, which are then fed into the decoder; a small attention sketch follows.
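An illustrative sketch of that soft-attention step; the dimensions and names are ours, not the paper's exact ones.

    import torch
    import torch.nn as nn

    class SoftAttention(nn.Module):
        """Additive ('soft') attention over spatial CNN features."""
        def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
            super().__init__()
            self.feat_proj = nn.Linear(feat_dim, attn_dim)
            self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
            self.score = nn.Linear(attn_dim, 1)

        def forward(self, feats, hidden):
            # feats: (batch, regions, feat_dim) annotation vectors from a lower conv layer
            # hidden: (batch, hidden_dim) current LSTM decoder state
            scores = self.score(torch.tanh(self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)      # attention weights over regions
            context = (alpha * feats).sum(dim=1)      # weighted sum -> context vector
            return context, alpha.squeeze(-1)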


Related works

OSCAR: Object-Semantics Aligned Pre-training for Vision-Language Tasks

  • Foundation: salient objects in an image are often mentioned in the paired text.
  • Input: a triple of ⟨word tokens, object tags, region features⟩.


Related works

CLIP - Learning Transferable Visual Models From Natural Language Supervision

  • The CLIP model is trained with a contrastive loss to learn the relationship between a whole sentence and the image it describes (a sketch of the objective follows). Given its extensive training on captioned data, it also acts as a strong classifier, with a large amount of pre-learned features.
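A minimal sketch of the symmetric contrastive objective CLIP is trained with; the shapes and temperature value are illustrative.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        # image_emb, text_emb: (batch, dim) embeddings of paired images and captions
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature   # pairwise cosine similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        # Matching pairs sit on the diagonal; cross-entropy is applied in both directions.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))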


Related works

GPT2 - Language Models are Unsupervised Multitask Learners

  • GPT2 is a large transformer-based language model with 1.5B parameters, trained on nearly 40 GB of text data.
  • It was trained to predict the next word, given all of the previous words within some text; a minimal usage sketch follows.
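For reference, a short sketch of GPT2's next-word prediction using the Hugging Face transformers API.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    inputs = tokenizer("A dog is playing with a", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits               # (1, seq_len, vocab_size)
    next_token = logits[0, -1].argmax().item()
    print(tokenizer.decode(next_token))               # most likely next word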


Methodology

Our proposed model:


Methodology

  • First, to extract features from the image, we run the pre-trained CLIP model on a given training image. This generates an embedding for the training image.
  • The embedding is then passed through a mapping network, which maps the CLIP-generated embedding to embedding vectors that can be fed into the language model.
  • These mapped visual embeddings are concatenated with the caption embeddings, and the concatenated sequence is fed into the language model.
  • The mapping network, which forms the only trainable part of our architecture, is trained using a cross-entropy loss (a condensed training-step sketch follows this list).
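A condensed sketch of one training step under this setup, assuming OpenAI's clip package and Hugging Face GPT2; the mapping-network interface, prefix length, and variable names are our simplifications, not the exact training code.

    import torch
    import torch.nn.functional as F
    import clip                                   # OpenAI CLIP package
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, preprocess = clip.load("ViT-B/32", device=device)
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    def train_step(image, caption, mapper, optimizer, prefix_len=10):
        # 1) Frozen CLIP encodes the image.
        with torch.no_grad():
            clip_emb = clip_model.encode_image(preprocess(image).unsqueeze(0).to(device)).float()
        # 2) The trainable mapping network turns it into a prefix of GPT2-sized embeddings
        #    (assumed output shape: (1, prefix_len, 768)).
        prefix = mapper(clip_emb)
        # 3) Concatenate the prefix with the caption's token embeddings.
        tokens = tokenizer(caption, return_tensors="pt").input_ids.to(device)
        cap_emb = gpt2.transformer.wte(tokens)                 # (1, seq_len, 768)
        inputs_embeds = torch.cat([prefix, cap_emb], dim=1)
        # 4) Cross-entropy on the caption tokens only.
        logits = gpt2(inputs_embeds=inputs_embeds).logits
        logits = logits[:, prefix_len - 1:-1]                  # positions that predict caption tokens
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()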


Methodology

  • The architecture we use for our mapping network is a Transformer, which allows for better global attention between the input tokens.
  • This transformer network takes the CLIP-generated visual encodings as input; a sketch is given below.
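A hedged sketch of such a transformer mapper; the prefix length, layer count, and head count below are placeholders, not necessarily the values we trained with.

    import torch
    import torch.nn as nn

    class TransformerMapper(nn.Module):
        """Maps a single CLIP embedding to a sequence of GPT2-sized prefix embeddings."""
        def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10, num_layers=8, num_heads=8):
            super().__init__()
            self.prefix_len = prefix_len
            self.input_proj = nn.Linear(clip_dim, gpt_dim * prefix_len)
            # Learned constant tokens that attend to the projected CLIP features.
            self.prefix_const = nn.Parameter(torch.randn(prefix_len, gpt_dim))
            layer = nn.TransformerEncoderLayer(d_model=gpt_dim, nhead=num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, clip_emb):                  # clip_emb: (batch, clip_dim)
            batch = clip_emb.size(0)
            x = self.input_proj(clip_emb).view(batch, self.prefix_len, -1)
            x = torch.cat([x, self.prefix_const.expand(batch, -1, -1)], dim=1)
            out = self.encoder(x)                     # global self-attention over all tokens
            return out[:, self.prefix_len:]           # learned-constant positions become the prefix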


Methodology

  • In our experiments, to see whether our relatively small model performs well on the reduced dataset we considered, we took the model from Show, Attend and Tell as the control. It forms a classical baseline and has around the same number of parameters as our model (slightly more, in fact).
  • Both models were run on the same dataset for the same number of epochs, albeit with the hyperparameter choices that performed best for each model. The control model’s empirical hyperparameter choices were taken for its experiments.


Dataset detail and Results

  • The dataset we utilized is a subset of the MSCOCO 2014 dataset. We made use of the Karpathy splits for the captions, as made available by CSAIL.
  • The data was converted into two formats: one with tokenized captions, taken as input by our control model Show, Attend and Tell, and one with raw sentences, taken as input by the CLIP section of our model (a small illustration follows this list).
  • The final subset we considered for our experiments consists of 7,500 images for the train set and 2,500 images for the val/test set.
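A small illustration of the two formats; the helper functions below are hypothetical and only show the idea, not our exact preprocessing code.

    from collections import Counter

    def build_control_format(captions, min_freq=5):
        # Show, Attend and Tell consumes integer word tokens built from a vocabulary.
        counts = Counter(w for c in captions for w in c.lower().split())
        words = [w for w, n in counts.items() if n >= min_freq]
        vocab = {w: i + 4 for i, w in enumerate(words)}
        vocab.update({"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3})
        def encode(c):
            return [vocab["<start>"]] + [vocab.get(w, vocab["<unk>"]) for w in c.lower().split()] + [vocab["<end>"]]
        return [encode(c) for c in captions], vocab

    def build_clip_format(image_paths, captions):
        # The CLIP/GPT2 pipeline keeps raw sentences paired with image paths.
        return [{"image": p, "caption": c} for p, c in zip(image_paths, captions)]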


Dataset EDA

  • Average sentence length: 10.46 words
  • Train images: 7,500; train captions: 37,516
  • Validation images: 2,500; validation captions: 12,508


Analysis and Results

The results we obtained are as follows, with the parameter counts of the models and the training times we observed noted for comparison.

  • OSCAR: 135M parameters
  • Show, Attend and Tell: 57M parameters, 443 minutes of training
  • CLIP approach: 53M parameters, 290 minutes of training

Score / Model    Show, Attend and Tell    CLIP Approach
BLEU-1           0.487269                 0.490556
BLEU-2           0.351025                 0.350983
BLEU-3           0.239221                 0.238337
BLEU-4           0.133368                 0.133195
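For context, BLEU-1 through BLEU-4 can be computed with NLTK's corpus_bleu; this is an illustration of the metric, not necessarily the exact evaluation script we used.

    from nltk.translate.bleu_score import corpus_bleu

    def bleu_1_to_4(references, hypotheses):
        # references: one list of tokenized reference captions per image;
        # hypotheses: one tokenized generated caption per image.
        weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
                   (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]
        return [corpus_bleu(references, hypotheses, weights=w) for w in weights]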


Loss Plots


Member Contributions

Equal contribution from each member of the team.

Two papers were implemented (Show, Attend and Tell → Model 1; CLIP approach → Model 2).

  • Ansh Arora: literature review, architecture of Model 2, preprocessing of data, and Model 1
  • Daksh Thapar: literature review, architecture of Model 2, preprocessing of data, and Model 2
  • Hardik Garg: literature review, architecture of Model 2, preprocessing of data, and Model 2

We were all on campus, so all the work was done together.


References

  • Hu, X., Yin, X., and Lin, K. (2021). VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning. AAAI 2021.
  • Xu, K., Ba, J., Kiros, R., et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.
  • Luo, Y., Ji, J., et al. (2021). Dual-Level Collaborative Transformer for Image Captioning. AAAI 2021.
  • Anderson, P., et al. (2018). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR 2018.
  • Li, X., Yin, X., et al. (2020). Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. ECCV 2020.
  • Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.