AlphaCook
Chenhao Lu, Nan Chen, Ping-Yao Shen, Ziao Zhang
Results
Lessons Learned:
The first major problem we encountered was severe overfitting. The cause was that we used every captioned frame as training data without accounting for the varying segment lengths, which produced a heavily biased dataset. The issue was mitigated once we changed our sampling strategy.
Our next biggest problem was improving the model's performance. We tried tuning the hyperparameters, changing our preprocessing methods, and using different model structures, but could not raise the performance to a desirable level. We propose several possible causes below.
Our team learned a lot through this deep learning project. On the non-academic side, the biggest takeaway is that cooperative teamwork is the most efficient way to solve challenging problems. We learned what makes a strong team: keep the team's goal in focus and be willing to set aside individual preferences to achieve it. Academically, the biggest takeaway was that we need to become better at reading and comprehending the literature in related fields. The papers only helped to a certain extent in this project because we could not fully understand the proposed procedures and thus were often unable to make our own modifications.
Limitations/Problems:
1. We sample frames randomly and pass each frame into the model individually, so we may be losing temporal information about the order of the frames.
2. The ResNet-34 features might not be the most suitable for this type of task.
3. Our model structure is still very simple, and therefore may not accurately learn the relationship between the image frames and captions.
4. Our metrics may not be the best for this task. We could explore other common metrics such as BLEU and cosine similarity (sketched below).
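As a concrete example of point 4, here is a minimal sketch of scoring one generated step against a reference caption with sentence-level BLEU using NLTK; the two strings are made up for illustration and are not actual model output or ground truth.

```python
# Hypothetical example: sentence-level BLEU between one generated recipe step
# and its reference caption. The strings below are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "roll the rice and seaweed into a tight roll".split()
candidate = "roll the seaweed rice".split()

# Smoothing avoids zero scores when short captions have no higher-order n-gram overlap.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```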
Future Work: Experimenting with the options mentioned above
Acknowledgements
Each video comes in the form of a 500x512 matrix, where 500 frames are sampled from the video and featurized using ResNet-34. We did not use the uncaptioned frames, and for each captioned segment we sampled 3 frames to use as training samples.
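A minimal sketch of this per-segment sampling, assuming the precomputed features are a NumPy array and segments are given as (start, end) frame indices; the function name and segment format are our own assumptions, not the exact pipeline code.

```python
import numpy as np

FRAMES_PER_SEGMENT = 3  # 3 frames per captioned segment, as described above

def sample_segment_features(features, segments, rng=np.random.default_rng(0)):
    """features: (500, 512) ResNet-34 feature matrix; segments: list of (start, end) frame indices."""
    samples = []
    for start, end in segments:
        # Draw 3 frames from the segment; resample with replacement if the segment is very short.
        idx = rng.choice(np.arange(start, end), size=FRAMES_PER_SEGMENT,
                         replace=(end - start) < FRAMES_PER_SEGMENT)
        samples.append(features[idx])   # (3, 512) per segment
    return np.stack(samples)            # (num_segments, 3, 512)

# Example with random features and two hypothetical captioned segments.
feats = np.random.rand(500, 512).astype("float32")
print(sample_segment_features(feats, [(10, 60), (200, 320)]).shape)  # (2, 3, 512)
```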
All captions are tokenized, masked, and padded. All numbers are replaced with a special <num> token. There are ~2,300 words in total, and we used the most frequent 1,800 as the vocabulary.
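A rough sketch of this caption preprocessing, assuming whitespace tokenization, a fixed maximum length, and the special-token set shown; MAX_LEN and the specials other than <num> are assumptions for illustration.

```python
import re
from collections import Counter

VOCAB_SIZE, MAX_LEN = 1800, 20                       # 1,800-word vocabulary; MAX_LEN is assumed
SPECIALS = ["<pad>", "<start>", "<end>", "<unk>", "<num>"]

def tokenize(caption):
    # Replace any number (including fractions and decimals) with the <num> token.
    return re.sub(r"\d+([./]\d+)?", "<num>", caption.lower()).split()

def build_vocab(captions):
    counts = Counter(tok for c in captions for tok in tokenize(c))
    words = [w for w, _ in counts.most_common(VOCAB_SIZE - len(SPECIALS)) if w not in SPECIALS]
    return {w: i for i, w in enumerate(SPECIALS + words)}

def encode(caption, vocab):
    ids = [vocab["<start>"]] + [vocab.get(t, vocab["<unk>"]) for t in tokenize(caption)]
    ids = ids[: MAX_LEN - 1] + [vocab["<end>"]]
    return ids + [vocab["<pad>"]] * (MAX_LEN - len(ids))  # pad to a fixed length

vocab = build_vocab(["cut the salmon into 2 cm cubes", "roll the seaweed rice"])
print(encode("cut the salmon into 2 cm cubes", vocab))
```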
We are building a model that generates recipes from cooking videos. While video is a great way to learn cooking, some videos are long, and it can be difficult to remember all the steps they mention. We believe it would be useful to generate a concise text recipe that gives the learner an overview of the cooking steps.
Dataset: YouCook2
Hugging Face - Vision Encoder Decoder Models:
https://huggingface.co/nlpconnect/vit-gpt2-image-captioning
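For reference, a minimal sketch of captioning a single sampled frame with this off-the-shelf model; the frame path and generation settings are placeholders, not our actual configuration.

```python
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

ckpt = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
processor = ViTImageProcessor.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

image = Image.open("frame.jpg").convert("RGB")  # a sampled video frame (placeholder path)
pixel_values = processor(images=[image], return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=16, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```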
Model
Discussion
Fig 1: Comparison of basic and pretrained models on an unlabeled test video segment
Fig 2: Perplexity and Loss per Epoch
Basic Transformer Model:
Training
- 31,011 training samples, 10,476 validation samples
- Adam optimizer with learning rate 0.005, 5 epochs
- Reached 3.22 sparse categorical cross-entropy (SCCE) loss, 0.359 token-wise accuracy, and ~24 perplexity on the validation set (a rough model sketch follows below)
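A rough sketch of what such a basic transformer captioner could look like under our training settings; the layer sizes, single decoder block, and maximum caption length are assumptions for illustration, while the optimizer, learning rate, and SCCE loss come from the setup above.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, D_MODEL = 1800, 20, 256          # D_MODEL and MAX_LEN are assumed

frame_feats = layers.Input(shape=(3, 512))            # 3 ResNet-34 frame features per segment
token_ids = layers.Input(shape=(MAX_LEN,), dtype="int32")

enc = layers.Dense(D_MODEL)(frame_feats)              # project frame features to model dim
emb = layers.Embedding(VOCAB_SIZE, D_MODEL)(token_ids)

# One decoder block: causal self-attention over the caption, then cross-attention to the frames.
x = layers.MultiHeadAttention(num_heads=4, key_dim=64)(emb, emb, use_causal_mask=True)  # TF >= 2.10
x = layers.LayerNormalization()(x + emb)
y = layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, enc)
y = layers.LayerNormalization()(y + x)
y = layers.Dense(D_MODEL, activation="relu")(y)
logits = layers.Dense(VOCAB_SIZE)(y)                  # per-token vocabulary logits

model = tf.keras.Model([frame_feats, token_ids], logits)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # the SCCE loss above
    metrics=["sparse_categorical_accuracy"],                               # token-wise accuracy
)
# model.fit([frame_features, input_tokens], target_tokens, epochs=5, validation_data=...)
```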
An example of generated captions on a sushi video:
California Roll How To Make California Rolls An Easy Sushi Recipe
(https://www.youtube.com/watch?v=C6boSYQalpU)
Generated Recipe:
1. roll the seaweed rice
2. place the salmon on a baking tray
3. place the crab on a grill
4. cut the salmon into cubes
Introduction & Dataset
Preprocessing
We want to thank our mentor TA Logan Bauman, and of course we extend our gratitude to Professor Chen Sun for teaching us!
Here is an example of a generated recipe from the video: California Roll How To Make California Rolls An Easy Sushi Recipe
Fig 3: Generated recipe for California rolls