CSE641 - Deep Learning Winter 2022
Group 35
Lightweight Image Captioning based on CLIP and GPT2 encodings
Ansh Arora (2019022), Daksh Thapar (2018137),
Hardik Garg (2019040)
Problem definition
Related works
Deep Captioning With Multimodal Recurrent Neural Networks (M-RNN)
Show and Tell (2014) and Show Attend and Tell (2015)
OSCAR: Object-Semantics Aligned Pre-training for Vision-Language Tasks
CLIP - Learning Transferable Visual Models From Natural Language Supervision
GPT2 - Language Models are Unsupervised Multitask Learners
Methodology
Our Proposed Model
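A minimal sketch of how CLIP image encodings can be fed to GPT2 as a learned prefix is shown below. The mapping-network design, prefix length, and layer sizes are illustrative assumptions, not the exact configuration of our model:

```python
# Illustrative sketch of a CLIP-prefix -> GPT2 captioning model.
# Hyperparameters (prefix_length, mapper width) are assumptions, not the project's exact values.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class ClipPrefixCaptioner(nn.Module):
    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        gpt_dim = self.gpt2.config.n_embd  # 768 for base GPT2
        # MLP that maps one CLIP image embedding to `prefix_length` GPT2 input embeddings.
        self.mapper = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_length // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_length // 2, gpt_dim * prefix_length),
        )

    def forward(self, clip_embed: torch.Tensor, caption_ids: torch.Tensor):
        # clip_embed: (B, clip_dim) image features from a frozen CLIP encoder
        # caption_ids: (B, T) GPT2 token ids of the ground-truth caption
        prefix = self.mapper(clip_embed).view(-1, self.prefix_length, self.gpt2.config.n_embd)
        token_embeds = self.gpt2.transformer.wte(caption_ids)
        inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
        # Mask the prefix positions out of the language-modelling loss (-100 is ignored).
        ignore = torch.full(prefix.shape[:2], -100, dtype=torch.long, device=caption_ids.device)
        labels = torch.cat([ignore, caption_ids], dim=1)
        return self.gpt2(inputs_embeds=inputs_embeds, labels=labels)
```

One reason such a pipeline can stay lightweight is that the CLIP image encoder can be kept frozen, so only the mapping network and (optionally) GPT2 receive gradient updates.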
Dataset detail and Results
Dataset EDA
Average Sentence Length: 10.46
Train Images: 7500
Train Captions: 37516
Validation Images: 2500
Validation Captions: 12508
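These counts can be reproduced with a short script over the caption annotations. The tab-separated file name and format assumed below are for illustration, not necessarily the exact preprocessing used:

```python
# Sketch: dataset EDA over a tab-separated captions file ("image_name<TAB>caption" per line).
# The file name and format are assumptions for illustration.
from collections import defaultdict

captions_per_image = defaultdict(list)
with open("train_captions.txt", encoding="utf-8") as f:
    for line in f:
        image_name, caption = line.rstrip("\n").split("\t", 1)
        captions_per_image[image_name].append(caption)

all_captions = [c for caps in captions_per_image.values() for c in caps]
avg_len = sum(len(c.split()) for c in all_captions) / len(all_captions)

print(f"Images:   {len(captions_per_image)}")
print(f"Captions: {len(all_captions)}")
print(f"Average sentence length: {avg_len:.2f}")
```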
Analysis and Results
The results obtained are listed below, together with the parameter counts and the training times observed for the two models we trained; a sketch of the BLEU computation follows the table.
Oscar (135M parameters)
Show-Attend-Tell (57M parameters, 443 minutes)
CLIP model (53M parameters, 290 minutes)
Score / Model | Show Attend and Tell | CLIP Approach
BLEU-1        | 0.487269             | 0.490556
BLEU-2        | 0.351025             | 0.350983
BLEU-3        | 0.239221             | 0.238337
BLEU-4        | 0.133368             | 0.133195
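The BLEU-1 to BLEU-4 scores compare each generated caption against all reference captions for its image. A minimal sketch of such an evaluation with NLTK's corpus_bleu (the exact evaluation script used may differ) is:

```python
# Sketch: corpus-level BLEU-1..4 with NLTK (example sentences are placeholders).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# references: per image, a list of tokenized ground-truth captions
# hypotheses: per image, one tokenized generated caption
references = [[["a", "dog", "runs", "on", "the", "grass"],
               ["a", "dog", "is", "running", "outside"]]]
hypotheses = [["a", "dog", "runs", "in", "the", "grass"]]

smooth = SmoothingFunction().method1
weights = {
    "BLEU-1": (1.0, 0, 0, 0),
    "BLEU-2": (0.5, 0.5, 0, 0),
    "BLEU-3": (1 / 3, 1 / 3, 1 / 3, 0),
    "BLEU-4": (0.25, 0.25, 0.25, 0.25),
}
for name, w in weights.items():
    score = corpus_bleu(references, hypotheses, weights=w, smoothing_function=smooth)
    print(f"{name}: {score:.6f}")
```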
Loss Plots
Member Contributions
Equal contribution from each member of the team.
Two papers were implemented (Show-Attend-Tell as Model 1, CLIP Approach as Model 2).
We were all on campus, so all the work was done together.
References