1 of 15

Scaling Laws for Image Captioning

By Balaji Balasubramanian and Eshwanth Baskaran

2 of 15

What are scaling laws?

  • A language model’s performance improves smoothly as we increase the training compute, the dataset size, and the model size.
  • In each of the three plots below, one of the three factors is varied while the other two are left unconstrained, and the test loss follows a smooth power-law trend.
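For reference (reconstructed from Kaplan et al., not shown verbatim on the slide), each relationship is an empirical power law that holds when the other two factors are not the bottleneck:

    L(N)     = (N_c / N)^{alpha_N}
    L(D)     = (D_c / D)^{alpha_D}
    L(C_min) = (C_c^min / C_min)^{alpha_C_min}

Here L is the test loss, N the number of (non-embedding) model parameters, D the dataset size in tokens, and C_min the compute budget; the fitted exponents alpha are small (roughly 0.05 to 0.1 in the paper), which is why the curves look like straight lines on log-log axes.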

Kaplan et al. - Scaling Laws for Neural Language Models

3 of 15

What is transfer learning?

  • Transfer knowledge gained on one task and reuse it on another task.
  • In deep learning, take a pre-trained network and adapt (fine-tune) it to a custom task.
  • Especially helpful in the low-data regime. A minimal sketch follows below.

https://www.pinterest.com/pin/424745808604824736/
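As an illustration (not part of the original slides), a minimal PyTorch sketch of the usual recipe: load an ImageNet-pretrained ResNet, freeze its backbone, and swap in a new head for a hypothetical 10-class downstream task. The model choice and class count are assumptions made for the example.

    import torch.nn as nn
    from torchvision import models

    # Load a ResNet-50 pretrained on ImageNet.
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

    # Freeze the pretrained backbone so only the new head is trained.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the classification head for a hypothetical 10-class task.
    model.fc = nn.Linear(model.fc.in_features, 10)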

4 of 15

Image Captioning

  • Multi-modal task
  • Predict the text caption for a given image.

Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv. https://doi.org/10.48550/ARXIV.2111.09734

5 of 15

Image Captioning

  • The task of image captioning can be represented as follows:

Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv. https://doi.org/10.48550/ARXIV.2111.09734

(Figure from ClipCap: the image-caption pairs, the training objective, and the training objective for an autoregressive language model.)
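Written out (reconstructed here in the notation of the ClipCap paper rather than copied from the slide): given a dataset of image-caption pairs {x^i, c^i}, where each caption is a token sequence c^i = c^i_1, ..., c^i_l, the model is trained to maximize the autoregressive likelihood of each caption conditioned on its image:

    max_theta  sum_{i=1}^{N} sum_{j=1}^{l} log p_theta( c^i_j | x^i, c^i_1, ..., c^i_{j-1} )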

6 of 15

Image Captioning

  • Traditional approach for image captioning: a CNN encoder followed by an RNN decoder.

http://cs231n.stanford.edu/2021/slides/2021/lecture_10.pdf
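A toy version of this encoder-decoder pipeline (hypothetical dimensions and model choices, shown only to make the architecture concrete): a CNN turns the image into a feature vector, which is fed to an LSTM as the first step of the caption sequence.

    import torch
    import torch.nn as nn
    from torchvision import models

    class CNNRNNCaptioner(nn.Module):
        """Toy CNN + RNN captioner: image features condition an LSTM decoder."""
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
            super().__init__()
            cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
            self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            feats = self.encoder(images).flatten(1)       # (B, 512) image features
            img_tok = self.img_proj(feats).unsqueeze(1)   # (B, 1, E) image "token"
            word_toks = self.embed(captions)              # (B, T, E) caption tokens
            seq = torch.cat([img_tok, word_toks], dim=1)  # image first, then words
            hidden, _ = self.lstm(seq)
            return self.out(hidden)                       # logits over the vocabulary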

7 of 15

ClipCap for Image Captioning

  • Image captioning using large pretrained image and language models.
  • Image encoder: CLIP
    • Provides rich semantic features, since it was trained with textual context.
  • Text generator: GPT-2

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv. https://doi.org/10.48550/ARXIV.2103.00020

https://jalammar.github.io/illustrated-gpt2/
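As a concrete (assumed) starting point, the two pretrained pieces can be loaded with the Hugging Face transformers library; the checkpoint names below are illustrative choices, not necessarily the ones used in the paper.

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

    # CLIP image encoder: one embedding per image.
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("example.jpg")  # hypothetical input image
    inputs = processor(images=image, return_tensors="pt")
    image_embedding = clip.get_image_features(**inputs)  # shape (1, 512)

    # GPT-2 language model: generates the caption tokens.
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")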

8 of 15

ClipCap

  • The output of CLIP is converted to a fixed-length prefix by the mapping network; this prefix is then prepended to the caption embeddings that GPT-2 receives (see the sketch below).
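A minimal sketch of this idea (a simplified MLP mapper is shown here for brevity; the following slide and the paper use a transformer mapper, and all dimensions are assumptions):

    import torch
    import torch.nn as nn

    class PrefixMapper(nn.Module):
        """Maps one CLIP embedding to `prefix_length` GPT-2-sized prefix embeddings."""
        def __init__(self, clip_dim=512, gpt_dim=768, prefix_length=10):
            super().__init__()
            self.prefix_length = prefix_length
            self.gpt_dim = gpt_dim
            self.mlp = nn.Sequential(
                nn.Linear(clip_dim, gpt_dim * prefix_length // 2),
                nn.Tanh(),
                nn.Linear(gpt_dim * prefix_length // 2, gpt_dim * prefix_length),
            )

        def forward(self, clip_embedding):               # (B, clip_dim)
            prefix = self.mlp(clip_embedding)            # (B, prefix_length * gpt_dim)
            return prefix.view(-1, self.prefix_length, self.gpt_dim)

    # The prefix is concatenated with the caption token embeddings and fed to GPT-2:
    #   token_embeds = gpt2.transformer.wte(caption_ids)   # (B, T, gpt_dim)
    #   gpt2(inputs_embeds=torch.cat([prefix, token_embeds], dim=1))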

9 of 15

ClipCap

Mapping network:

  • Mapping from image feature space to language model space
  • Composed of transformer layers

Training:

  • Several options: fine-tune GPT-2 jointly with the mapping network, or keep GPT-2 frozen and train only the mapping network (a short sketch of the frozen option follows).
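A minimal sketch of the frozen-language-model option (the checkpoint name is an assumed example):

    from transformers import GPT2LMHeadModel

    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

    # Keep GPT-2 frozen so only the mapping network's parameters receive gradients.
    for param in gpt2.parameters():
        param.requires_grad = False

    # The alternative is to leave requires_grad=True and fine-tune GPT-2
    # jointly with the mapping network.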

10 of 15

Our study

  • Study scaling-law behaviour of ClipCap w.r.t. the dataset size and the number of model parameters.
  • Dataset:
    • COCO Captions: 100k train samples, 20k dev samples.
  • Compute credits: LAION AI

11 of 15

Our study - Scaling law with dataset size

12 of 15

Our study - Scaling law with #model parameters

13 of 15

Our study - Scaling law with #model parameters

Hyperparameters tuned:

  1. Number of transformer layers
  2. Prefix Length of Mapping Network
  3. Dropout (not very effective)

The best model had 4 transformer layers and a prefix length of 10.

14 of 15

Model performance

Model                     BLEU    METEOR
Ours (mapping network)    23.44   19.2

Example 1
  Pred: A woman is sitting with a fire hydrant.
  Ref: A lady sitting beside a fire hydrant with hand on head.

Example 2
  Pred: A woman is standing on a boat with fruit.
  Ref: An Asian woman with some vegetables in her boat.

15 of 15

Summary

  • Studied scaling-law behaviour for the image captioning task.
  • Measured performance as a function of the dataset size and the number of model parameters.
  • Next steps:
    • Analyse scaling law behavior with compute.
    • Perform analysis on a larger scale for accurate metrics.