Lecture 25 Quiz
You are doing a sentiment analysis task for Marvel Studios, where you want to predict the prevailing sentiment with respect to the studio's movies. The studio has given you a hefty budget to perform this task, with the freedom to use it however you see fit. As your first step, and armed with knowledge from HW1, you scrape Twitter for tweets related to Marvel Studios and their movies. What would NOT be the next logical step?
Use available sentiment lexicons to create a rule-based classifier.
Use a portion of your budget to buy a GPU. Deep learning is state-of-the-art for sentiment analysis, and using a GPU would be an order of magnitude faster.
Use a portion of your budget to annotate a few thousand tweets via Mechanical Turk.
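The lexicon option above can be sketched in a few lines. This is a toy illustration with a made-up two-set lexicon (real systems would use an established resource such as the VADER or MPQA lexicons), not a production classifier:

```python
# Toy rule-based sentiment classifier over a hypothetical lexicon.
POSITIVE = {"great", "amazing", "love", "fun"}
NEGATIVE = {"boring", "terrible", "hate", "dull"}

def classify(tweet: str) -> str:
    # Lowercase, split on whitespace, and strip trailing punctuation.
    tokens = [t.strip(",.!?") for t in tweet.lower().split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("Endgame was amazing, I love it"))  # positive
print(classify("that movie was boring and dull"))  # negative
```

The appeal of this route is that it needs no labeled data or GPU at all, which is exactly why it is a sensible early baseline.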
After a week, you have trained an LSTM model (you finally got that GPU!) using Keras on 5K tweets to predict sentiment. The performance is dismal: 57% accuracy on the held-out set. Moreover, it takes over 2 hours to train 10 epochs. What do you do?
It's probably a lack of data: 5K tweets is too small to get good accuracy, and deep models are data-hungry (e.g., ImageNet has 1 million images). Use your budget to annotate a few thousand more tweets.
Train a Tree-LSTM-based classifier (it can handle negation better) with an attention mechanism. Initialise the weights via transfer learning: pre-train the model on larger available sentiment datasets, which helps with the small-data problem.
Use a TCN instead of an LSTM. CNNs are much faster, and recent work has shown that they outperform LSTMs on several tasks.
Review your code. Perhaps you called "fit_transform" instead of "transform" on the test data? Are you using a list instead of a dictionary to look up the index of a word?
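Both bugs hinted at in the last option are easy to demonstrate with scikit-learn's CountVectorizer (a sketch; the quiz does not say which vectorizer the model actually used):

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["great movie", "terrible movie"]
test = ["great acting"]

vec = CountVectorizer()
X_train = vec.fit_transform(train)  # learn the vocabulary on TRAIN only
X_test = vec.transform(test)        # reuse that vocabulary on test; calling
                                    # fit_transform here would rebuild the
                                    # vocabulary and silently mis-align the
                                    # feature columns between train and test

assert X_train.shape[1] == X_test.shape[1]  # same feature space

# Likewise for the lookup: list.index(word) scans the whole vocabulary
# (O(V) per call), while a dict lookup is O(1). CountVectorizer already
# exposes such a dict as vocabulary_.
assert vec.vocabulary_["great"] == 0  # terms are indexed alphabetically
```

With a list-based lookup, indexing every token of every tweet turns a linear pass over the corpus into a quadratic one, which fits the "2 hours for 10 epochs" symptom.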
After another few days, you finally have a model (based on the Bidirectional Attention Flow architecture) that gives ~98% F1 (you realised along the way that accuracy is not a good metric, since your classes were imbalanced). What do you do next?
Automate the training process so that data scraping and model retraining are done on a weekly basis. As new Marvel movies come out, the vocabulary of tweets will change, and your model needs to be updated to reflect the new data distribution.
Put your Dana Scully hat on, think "hmm... something's fishy here; this is too good to be true," and debug some more.
Open-source your code on GitHub so that the entire community can benefit.
The EMNLP (a top-tier NLP conference) deadline is coming up soon. Start writing a paper.
This form was created inside of Carnegie Mellon University.