AlphaCook
Chenhao Lu, Nan Chen, Ping-Yao Shen, Ziao Zhang
Results
Lessons Learned:
The first major problem we encountered was severe overfitting. The cause was that we used every captioned frame as training data without accounting for the varying segment lengths, which produced a heavily biased dataset. The issue was mitigated once we changed our sampling strategy.
Our next biggest problem was improving the model's performance. We tried tuning the hyperparameters, changing our preprocessing methods, and using different model structures, but could not raise the performance to a desirable level. We propose several possible causes below.
Our team learned a lot through this deep learning project. On the non-academic side, the biggest takeaway is that cooperative teamwork is the most efficient way to solve challenging problems. We learned what makes a strong team: keep the team's goal in focus and be willing to set aside individual preferences to achieve it. Academically, the biggest takeaway was that we need to become better at reading and comprehending the literature in related fields. The papers only helped to a certain extent in this project because we could not fully understand the proposed procedures and thus were often unable to make our own modifications.
Limitations/Problems:
1. We sample frames randomly and pass each frame into the model individually, so we may be losing temporal information about the order of the frames.
2. The ResNet-34 features might not be the most suitable for this type of task.
3. Our model structure is still very simple, and therefore may not accurately learn the relationship between the image frames and captions.
4. Our metrics may not be the best for this task. We could explore other common metrics such as BLEU and cosine similarity (sketched below).
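As a concrete example of point 4, here is a minimal sketch of scoring one generated step against a reference caption with sentence-level BLEU using NLTK; the two strings are made up for illustration and are not actual model output or ground truth.

```python
# Hypothetical example: sentence-level BLEU between one generated recipe step
# and its reference caption. The strings below are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "roll the rice and seaweed into a tight roll".split()
candidate = "roll the seaweed rice".split()

# Smoothing avoids zero scores when short captions have no higher-order n-gram overlap.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```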
Future Work: Experimenting with the options mentioned above
Acknowledgements
Each video comes in the form of a 500x512 matrix, where 500 frames are sampled from the video and featurized using ResNet-34. We did not use the uncaptioned frames, and for each captioned segment we sampled 3 frames to use as training samples.
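A minimal sketch of this per-segment sampling, assuming the precomputed features are a NumPy array and segments are given as (start, end) frame indices; the function name and segment format are our own assumptions, not the exact pipeline code.

```python
import numpy as np

FRAMES_PER_SEGMENT = 3  # 3 frames per captioned segment, as described above

def sample_segment_features(features, segments, rng=np.random.default_rng(0)):
    """features: (500, 512) ResNet-34 feature matrix; segments: list of (start, end) frame indices."""
    samples = []
    for start, end in segments:
        # Draw 3 frames from the segment; resample with replacement if the segment is very short.
        idx = rng.choice(np.arange(start, end), size=FRAMES_PER_SEGMENT,
                         replace=(end - start) < FRAMES_PER_SEGMENT)
        samples.append(features[idx])   # (3, 512) per segment
    return np.stack(samples)            # (num_segments, 3, 512)

# Example with random features and two hypothetical captioned segments.
feats = np.random.rand(500, 512).astype("float32")
print(sample_segment_features(feats, [(10, 60), (200, 320)]).shape)  # (2, 3, 512)
```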
All captions are tokenized, masked, and padded. All numbers are replaced with a special <num> token. There are ~2,300 words in total, and we used the most frequent 1,800 as the vocabulary.
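A rough sketch of this caption preprocessing, assuming whitespace tokenization, a fixed maximum length, and the special-token set shown; MAX_LEN and the specials other than <num> are assumptions for illustration.

```python
import re
from collections import Counter

VOCAB_SIZE, MAX_LEN = 1800, 20                       # 1,800-word vocabulary; MAX_LEN is assumed
SPECIALS = ["<pad>", "<start>", "<end>", "<unk>", "<num>"]

def tokenize(caption):
    # Replace any number (including fractions and decimals) with the <num> token.
    return re.sub(r"\d+([./]\d+)?", "<num>", caption.lower()).split()

def build_vocab(captions):
    counts = Counter(tok for c in captions for tok in tokenize(c))
    words = [w for w, _ in counts.most_common(VOCAB_SIZE - len(SPECIALS)) if w not in SPECIALS]
    return {w: i for i, w in enumerate(SPECIALS + words)}

def encode(caption, vocab):
    ids = [vocab["<start>"]] + [vocab.get(t, vocab["<unk>"]) for t in tokenize(caption)]
    ids = ids[: MAX_LEN - 1] + [vocab["<end>"]]
    return ids + [vocab["<pad>"]] * (MAX_LEN - len(ids))  # pad to a fixed length

vocab = build_vocab(["cut the salmon into 2 cm cubes", "roll the seaweed rice"])
print(encode("cut the salmon into 2 cm cubes", vocab))
```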
We are building a model that generates recipes from cooking videos. While video is a great way to learn cooking, some videos are long, and it can be difficult to remember all the steps they mention. We believe it would be useful to generate a concise text recipe that gives the learner an overview of the cooking steps.
Dataset: YouCook2
Hugging Face - Vision Encoder Decoder Models:
https://huggingface.co/nlpconnect/vit-gpt2-image-captioning
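For reference, a minimal sketch of captioning a single sampled frame with this off-the-shelf model; the frame path and generation settings are placeholders, not our actual configuration.

```python
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

ckpt = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
processor = ViTImageProcessor.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

image = Image.open("frame.jpg").convert("RGB")  # a sampled video frame (placeholder path)
pixel_values = processor(images=[image], return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=16, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```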
Model
Discussion
Fig 1: Comparison of basic and pretrained models on an unlabeled test video segment
Fig 2: Perplexity and Loss per Epoch
Basic Transformer Model:
Training
- 31,011 training samples, 10,476 validation samples
- Adam optimizer with learning rate 0.005, 5 epochs
- Reached 3.22 sparse categorical cross-entropy (SCCE) loss, 0.359 token-wise accuracy, and ~24 perplexity on the validation set (a rough model sketch follows below)
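A rough sketch of what such a basic transformer captioner could look like under our training settings; the layer sizes, single decoder block, and maximum caption length are assumptions for illustration, while the optimizer, learning rate, and SCCE loss come from the setup above.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, D_MODEL = 1800, 20, 256          # D_MODEL and MAX_LEN are assumed

frame_feats = layers.Input(shape=(3, 512))            # 3 ResNet-34 frame features per segment
token_ids = layers.Input(shape=(MAX_LEN,), dtype="int32")

enc = layers.Dense(D_MODEL)(frame_feats)              # project frame features to model dim
emb = layers.Embedding(VOCAB_SIZE, D_MODEL)(token_ids)

# One decoder block: causal self-attention over the caption, then cross-attention to the frames.
x = layers.MultiHeadAttention(num_heads=4, key_dim=64)(emb, emb, use_causal_mask=True)  # TF >= 2.10
x = layers.LayerNormalization()(x + emb)
y = layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, enc)
y = layers.LayerNormalization()(y + x)
y = layers.Dense(D_MODEL, activation="relu")(y)
logits = layers.Dense(VOCAB_SIZE)(y)                  # per-token vocabulary logits

model = tf.keras.Model([frame_feats, token_ids], logits)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # the SCCE loss above
    metrics=["sparse_categorical_accuracy"],                               # token-wise accuracy
)
# model.fit([frame_features, input_tokens], target_tokens, epochs=5, validation_data=...)
```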
An example of generated captions on a sushi video:
California Roll How To Make California Rolls An Easy Sushi Recipe
(https://www.youtube.com/watch?v=C6boSYQalpU)
Generated Recipe:
1. roll the seaweed rice
2. place the salmon on a baking tray
3. place the crab on a grill
4. cut the salmon into cubes
Introduction & Dataset
Preprocessing
We want to thank our mentor TA Logan Bauman, and of course we extend our gratitude to Professor Chen Sun for teaching us!
Here is an example of a generated recipe from the video: California Roll How To Make California Rolls An Easy Sushi Recipe
Fig 3: Generated recipe for California rolls