1 of 15

Scaling Laws for Image Captioning

By Balaji Balasubramanian and Eshwanth Baskaran

2 of 15

What are scaling laws?

  • A language model’s performance improves smoothly as we increase the training compute, the dataset size, and the model size.
  • In each of the three plots below, one of the three factors is varied while the other two are left unconstrained, and the test loss follows a smooth power-law trend.
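For reference (reconstructed from Kaplan et al., not shown verbatim on the slide), each relationship is an empirical power law that holds when the other two factors are not the bottleneck:

    L(N)     = (N_c / N)^{alpha_N}
    L(D)     = (D_c / D)^{alpha_D}
    L(C_min) = (C_c^min / C_min)^{alpha_C_min}

Here L is the test loss, N the number of (non-embedding) model parameters, D the dataset size in tokens, and C_min the compute budget; the fitted exponents alpha are small (roughly 0.05 to 0.1 in the paper), which is why the curves look like straight lines on log-log axes.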

Kaplan et al. - Scaling Laws for Neural Language Models

3 of 15

What is transfer learning?

  • Transfer knowledge gained on one task and reuse it on another task.
  • In deep learning, take a pre-trained network and adapt (fine-tune) it to a custom task.
  • Especially helpful in the low-data regime. A minimal sketch follows below.

https://www.pinterest.com/pin/424745808604824736/
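As an illustration (not part of the original slides), a minimal PyTorch sketch of the usual recipe: load an ImageNet-pretrained ResNet, freeze its backbone, and swap in a new head for a hypothetical 10-class downstream task. The model choice and class count are assumptions made for the example.

    import torch.nn as nn
    from torchvision import models

    # Load a ResNet-50 pretrained on ImageNet.
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

    # Freeze the pretrained backbone so only the new head is trained.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the classification head for a hypothetical 10-class task.
    model.fc = nn.Linear(model.fc.in_features, 10)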

4 of 15

Image Captioning

  • Multi-modal task
  • Predict the text caption for a given image.

Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv. https://doi.org/10.48550/ARXIV.2111.09734

5 of 15

Image Captioning

  • The task of image captioning can be represented as follows:

Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv. https://doi.org/10.48550/ARXIV.2111.09734

(Figure from ClipCap: the image-caption pairs, the training objective, and the training objective for an autoregressive language model.)
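Written out (reconstructed here in the notation of the ClipCap paper rather than copied from the slide): given a dataset of image-caption pairs {x^i, c^i}, where each caption is a token sequence c^i = c^i_1, ..., c^i_l, the model is trained to maximize the autoregressive likelihood of each caption conditioned on its image:

    max_theta  sum_{i=1}^{N} sum_{j=1}^{l} log p_theta( c^i_j | x^i, c^i_1, ..., c^i_{j-1} )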

6 of 15

Image Captioning

  • Traditional approach for image captioning: a CNN encoder followed by an RNN decoder.

http://cs231n.stanford.edu/2021/slides/2021/lecture_10.pdf
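A toy version of this encoder-decoder pipeline (hypothetical dimensions and model choices, shown only to make the architecture concrete): a CNN turns the image into a feature vector, which is fed to an LSTM as the first step of the caption sequence.

    import torch
    import torch.nn as nn
    from torchvision import models

    class CNNRNNCaptioner(nn.Module):
        """Toy CNN + RNN captioner: image features condition an LSTM decoder."""
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
            super().__init__()
            cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
            self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            feats = self.encoder(images).flatten(1)       # (B, 512) image features
            img_tok = self.img_proj(feats).unsqueeze(1)   # (B, 1, E) image "token"
            word_toks = self.embed(captions)              # (B, T, E) caption tokens
            seq = torch.cat([img_tok, word_toks], dim=1)  # image first, then words
            hidden, _ = self.lstm(seq)
            return self.out(hidden)                       # logits over the vocabulary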

7 of 15

ClipCap for Image Captioning

  • Image captioning using large pretrained image and language models.
  • Image encoder: CLIP
    • Provides rich semantic features, since it was trained with textual context.
  • Text generator: GPT-2

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv. https://doi.org/10.48550/ARXIV.2103.00020

https://jalammar.github.io/illustrated-gpt2/
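As a concrete (assumed) starting point, the two pretrained pieces can be loaded with the Hugging Face transformers library; the checkpoint names below are illustrative choices, not necessarily the ones used in the paper.

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

    # CLIP image encoder: one embedding per image.
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("example.jpg")  # hypothetical input image
    inputs = processor(images=image, return_tensors="pt")
    image_embedding = clip.get_image_features(**inputs)  # shape (1, 512)

    # GPT-2 language model: generates the caption tokens.
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")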

8 of 15

ClipCap

  • The output of CLIP is converted to a fixed-length prefix by the mapping network; this prefix is then prepended to the caption embeddings that GPT-2 receives (see the sketch below).
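A minimal sketch of this idea (a simplified MLP mapper is shown here for brevity; the following slide and the paper use a transformer mapper, and all dimensions are assumptions):

    import torch
    import torch.nn as nn

    class PrefixMapper(nn.Module):
        """Maps one CLIP embedding to `prefix_length` GPT-2-sized prefix embeddings."""
        def __init__(self, clip_dim=512, gpt_dim=768, prefix_length=10):
            super().__init__()
            self.prefix_length = prefix_length
            self.gpt_dim = gpt_dim
            self.mlp = nn.Sequential(
                nn.Linear(clip_dim, gpt_dim * prefix_length // 2),
                nn.Tanh(),
                nn.Linear(gpt_dim * prefix_length // 2, gpt_dim * prefix_length),
            )

        def forward(self, clip_embedding):               # (B, clip_dim)
            prefix = self.mlp(clip_embedding)            # (B, prefix_length * gpt_dim)
            return prefix.view(-1, self.prefix_length, self.gpt_dim)

    # The prefix is concatenated with the caption token embeddings and fed to GPT-2:
    #   token_embeds = gpt2.transformer.wte(caption_ids)   # (B, T, gpt_dim)
    #   gpt2(inputs_embeds=torch.cat([prefix, token_embeds], dim=1))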

9 of 15

ClipCap

Mapping network:

  • Mapping from image feature space to language model space
  • Composed of transformer layers

Training:

  • Several options: fine-tune GPT-2 jointly with the mapping network, or keep GPT-2 frozen and train only the mapping network (a short sketch of the frozen option follows).
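A minimal sketch of the frozen-language-model option (the checkpoint name is an assumed example):

    from transformers import GPT2LMHeadModel

    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

    # Keep GPT-2 frozen so only the mapping network's parameters receive gradients.
    for param in gpt2.parameters():
        param.requires_grad = False

    # The alternative is to leave requires_grad=True and fine-tune GPT-2
    # jointly with the mapping network.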

10 of 15

Our study

  • Study scaling-law behaviour of ClipCap w.r.t. the dataset size and the number of model parameters.
  • Dataset:
    • COCO Captions: 100k train samples, 20k dev samples.
  • Compute credits: LAION AI

11 of 15

Our study - Scaling law with dataset size

12 of 15

Our study - Scaling law with #model parameters

13 of 15

Our study - Scaling law with #model parameters

Hyperparameters tuned:

  1. Number of transformer layers
  2. Prefix Length of Mapping Network
  3. Dropout (not very effective)

The best model had 4 transformer layers and a prefix length of 10.

14 of 15

Model performance

Model                     BLEU    METEOR
Ours (mapping network)    23.44   19.2

Example 1
  Pred: A woman is sitting with a fire hydrant.
  Ref: A lady sitting beside a fire hydrant with hand on head.

Example 2
  Pred: A woman is standing on a boat with fruit.
  Ref: An Asian woman with some vegetables in her boat.

15 of 15

Summary

  • Studied scaling-law behaviour for the image captioning task.
  • Measured performance as a function of the dataset size and the number of model parameters.
  • Next steps:
    • Analyse scaling law behavior with compute.
    • Perform analysis on a larger scale for accurate metrics.