
MIT Smart Confessions

Taking MIT confessions to the next level using machine learning

Robert M. Vunabandi, Jürgen Cito


What are MIT Confessions?

  • Facebook pages where MIT students write posts anonymously
  • Posts are seen by many other MIT students
  • These anonymous posts are called “confessions”
  • There are many MIT confessions pages (e.g., MIT Confessions, MIT Summer Confessions)


Problem Statement

Three problems that we wanted to tackle:

  • Predicting how popular a confession will be
    • Predicting how many of each Facebook reaction type (like, love, haha, wow, sad, angry) the confession will get
  • Generating popular confessions
    • Generating confessions such that the confession gets a lot of Facebook reactions
  • Understanding what makes or breaks a confession’s popularity


Data & Challenges


Data From MIT Confessions Pages

  • Scraping data from Facebook is nearly impossible, so we used its Graph API instead.
  • Using the Facebook Graph API, for all confessions in a confessions page:
    • Collect the text in the confession
    • Collect the number of reactions for each reaction type associated with the confession
    • Collect the page data on the day of that confession (number of users, user engagement, etc.)
  • Here’s a link to the data collected
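As a rough sketch, the per-post query can be assembled with the Graph API’s field aliasing. The API version, field list, and the `build_feed_url` helper below are illustrative assumptions, not the project’s actual scraper:

```python
from urllib.parse import urlencode

# The API version here is an assumption; use whatever version is current.
GRAPH_API_BASE = "https://graph.facebook.com/v3.2"

REACTION_TYPES = ("LIKE", "LOVE", "HAHA", "WOW", "SAD", "ANGRY")

def build_feed_url(page_id: str, access_token: str) -> str:
    """Build a Graph API URL that fetches a page's posts with the post text
    and a summarized count for each reaction type (via field aliasing)."""
    fields = ["message", "created_time"] + [
        f"reactions.type({r}).summary(total_count).as({r.lower()})"
        for r in REACTION_TYPES
    ]
    query = urlencode({"fields": ",".join(fields), "access_token": access_token})
    return f"{GRAPH_API_BASE}/{page_id}/posts?{query}"
```

Each page of results would then be fetched with an HTTP GET and followed via the `paging.next` cursor that the API returns.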


Data From MIT Confessions Pages

  • Overall, we collected 4,555 examples from MIT Confessions and MIT Summer Confessions.
  • We could definitely collect more, but the process is a pain.
    • The plan is to focus on training good models before requesting more data


Observations from the Data

  • The data is very skewed. Except for likes and comments, most reactions have a count of 0, and for all reaction types, counts below 5 are the most common.
    • About 67% of confessions get 0 reactions (including comments).
  • To get the highest number of reactions, a confession needs to be somewhere between 75 and 175 characters long.
  • Most reaction types are negatively correlated with each other (i.e., more likes tend to come with fewer of everything else).
    • This suggests that most confessions convey one dominant “feeling,” and most users react with that feeling
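The headline skew number can be checked in a few lines; the dictionary keys below are made-up field names for illustration:

```python
REACTION_KEYS = ("like", "love", "haha", "wow", "sad", "angry", "comments")

def zero_reaction_fraction(confessions):
    """Fraction of confessions whose reactions (including comments) are all zero."""
    zero = sum(
        1 for c in confessions
        if all(c.get(k, 0) == 0 for k in REACTION_KEYS)
    )
    return zero / len(confessions)
```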


Graphs of the Data: Data is extremely skewed (4,555 examples in total)


Graphs of the Data: Highest Reaction Counts Happen Below 200 Characters


Machine Learning Models


Two Models: Predictor and Generator

  • A “Bucket” Classification model to predict the number of reactions from a confession.
  • A Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) to generate confessions.


Bucket Classifier

  • For a given reaction type, we predict which range (i.e. bucket) of reaction count the confession will fall into.
  • E.g., out of 0-10, 11-20, 21-30, what is the probability that a given confession will fall into each of them?
    • We had to figure out a way to split the data evenly across the buckets
  • One model per reaction type
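One reading of “splitting the data evenly” is a quantile-style split, where each bucket receives roughly the same number of training examples. This sketch assumes that interpretation; the slide does not spell out the actual scheme:

```python
from bisect import bisect_right

def quantile_edges(counts, num_buckets):
    """Pick bucket boundaries so each bucket holds roughly the same
    number of examples (a quantile-style split)."""
    ordered = sorted(counts)
    n = len(ordered)
    return [ordered[i * n // num_buckets] for i in range(1, num_buckets)]

def bucket_index(count, edges):
    """Map a reaction count to its 0-based bucket label."""
    return bisect_right(edges, count)
```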

Model architecture (Text Input → Bucket Classification Model → Bucket Probabilities):

Text → Embedding (32) → Conv1D → ReLU → MaxPooling1D → Dropout (0.1) → Flatten → Dense (64) → ReLU → Dropout (0.1) → Dense (BC) → Softmax

  • The text input is a list of integers, where each integer represents a word.
  • The text embedding converts each word index into a vector of width 32.
  • BC is the number of buckets, which varies depending on the reaction type.
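A minimal Keras sketch of the stack above. The convolution filter count, kernel size, pooling size, optimizer, and loss are not given on the slide and are assumptions:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_bucket_classifier(vocab_size, seq_len, num_buckets):
    """Embedding(32) -> Conv1D -> ReLU -> MaxPooling1D -> Dropout(0.1)
    -> Flatten -> Dense(64) -> ReLU -> Dropout(0.1) -> Dense(BC) -> Softmax."""
    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, 32),
        layers.Conv1D(64, 5, activation="relu"),  # filter count/kernel size assumed
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.1),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(num_buckets, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

One such model is built per reaction type, with `num_buckets` varying accordingly.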


LSTM Generator

  • Generate confessions that will yield a lot of reactions
  • We trained this on confessions that had a total of at least 25 reactions (including comments)
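Selecting that training subset is a one-liner; the reaction field names here are illustrative:

```python
REACTION_KEYS = ("like", "love", "haha", "wow", "sad", "angry", "comments")

def popular_confessions(confessions, min_total=25):
    """Keep only confessions whose total reaction count (comments included)
    reaches the generator's training threshold."""
    return [
        c for c in confessions
        if sum(c.get(k, 0) for k in REACTION_KEYS) >= min_total
    ]
```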

Model architecture (Text Input → LSTM Classification Model → One-Hot Vector for Word Index):

Text → Embedding (300) → LSTM (300) → Dense (WC) → Softmax

  • The text input is a list of integers, where each integer represents a word.
  • The text embedding converts each word index into a vector of width 300.
  • The LSTM’s output space has dimensionality 300.
  • WC is the vocabulary size, so the output is a one-hot vector over words. Ideally, we would output a text-embedding vector instead.
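A minimal Keras sketch of the generator stack above; the optimizer and loss are assumptions since the slide does not state them:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_lstm_generator(vocab_size, seq_len):
    """Embedding(300) -> LSTM(300) -> Dense(WC) -> Softmax,
    trained as next-word classification over the vocabulary."""
    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, 300),
        layers.LSTM(300),
        layers.Dense(vocab_size, activation="softmax"),  # one-hot target over WC words
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```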


Training Results


Results on Bucket Classifier

  • For each reaction type, we trained for 75 epochs with a batch size of 64. For likes:
    • Validation loss went from: 0.13 to 0.23
    • Validation accuracy went from 0.96 to 0.96 → remained unchanged
    • Training loss went from: 0.13 to 0.00
    • Training accuracy went from: 0.96 to 0.99
  • Other reaction types had similar results
  • Shows signs of both overfitting and not really learning anything :(
    • The model is learning that most of the data looks the same, so it’s better off predicting (most of the time) that same thing (i.e. predict 0).
  • Some future extensions:
    • Try an LSTM or RNN layer to improve the results
    • A Word2Vec embedding would also help
    • We definitely need more data
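The “0.96 accuracy without learning anything” effect is just the majority-class baseline on skewed labels, which is easy to verify:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy achieved by always predicting the most common label —
    what a model converges to when it only learns the class imbalance."""
    (_, top_count), = Counter(labels).most_common(1)
    return top_count / len(labels)
```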


Results on LSTM Generator

  • We trained for 250 epochs with all confessions that have at least 25 reactions.
    • Validation loss went from: 6.50 to 5.3870
    • Validation accuracy went from 0.05 to 0.5185
    • Training loss went from: 6.87 to 0.1697
    • Training accuracy went from: 0.04 to 0.9632
  • Although the model overfits the training data, improvements on the training set correlate with improvements on the validation set.
    • Thus, the model is actually learning something, just not as well as it could
  • Some future suggestions:
    • Use a Word2Vec embedding


Some Results


Bucket Classifier Results

These text inputs were taken from the current MIT confessions page. The output is the highest predicted bucket. The model can definitely be improved.

---

  • Text: “b1 freshman lack nuke power must elevate”
  • Output: 2-17 wow, 0 sad, 16-18 likes, 3-12 love
  • Expected: 6 wows, 2 sads, 1 like, 1 love

---

  • Text: “Bow down you lowly phytoplankton, for I am your new sardine king.”
  • Output: 2-17 wow, 8-10 like, 6-27 haha, 0 haha
  • Expected: 7 likes, 6 hahas, 1 love

---

  • Text: “Watching all these YouTubers promote science for the masses is truly humbling. Shoutout to ElectroBOOM, 3Blue1Brown, VSauce, GreatScott, minutephysics, Veritasium, CGPGrey, numberphile, SciShow and infinitely many others!”
  • Output: 1-2 likes, 0 likes
  • Expected: 53 likes, 15 likes


LSTM Generator Results

---

  • Seed: Roast that
  • Output: Roast that 6 check mit midterms harder than international olympiads of course ...

---

  • Seed: People
  • Output: People who answer questions in large lectures where did you find such confidence bullshitters ? the animations are named asdf , ...

---

  • Seed: I love the
  • Output: I love the freshmen on my floor , they all have such funny and unique personalities ! so proud of them getting through their first real fight ...

NOTE: Model may have been re-trained and modified after generating these, so it may not output the same confessions given these seeds
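The seeded generation above follows the usual next-word loop: feed the running token sequence to the model, append the predicted word, and repeat. A minimal greedy version, with `predict_next` standing in for the trained model’s predict call:

```python
def generate(seed_ids, predict_next, num_words):
    """Greedy next-word sampling: extend the sequence one argmax word at a time."""
    ids = list(seed_ids)
    for _ in range(num_words):
        probs = predict_next(ids)  # distribution over the vocabulary
        ids.append(max(range(len(probs)), key=probs.__getitem__))
    return ids
```

In practice one would sample from `probs` (e.g., with a temperature) rather than always taking the argmax, to avoid repetitive output.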


LSTM Generator Results: It is Sometimes Sensible

  • Seed: MIT Midterms are
  • Output: mit midterms are by just being a p-set due at a potato, but it a few and we used to be bitter in a place like this, but just showing someone kindness, whether or not they are a friend, makes this place more rewarding. I feel like once a few years have gone by a lot of us lose sight of that. Keep doing nice things, it doesn’t take that much time or effort, and we can turn this place around, but unreadable enough that lazy female will have literally a plane and take.


Future Work & Extensions


Future Work & Extensions

  • Understanding why some confessions are more popular than others
  • Collect more data in order to learn better and train a better LSTM model
    • Overfitting in both models is a result of this lack of data
    • At the same time, there is a limit on the total number of confessions (about 20-30K in total across all pages)
    • Facebook reactions were introduced worldwide on February 24, 2016, so we can’t use posts before that for classification
  • Finish up and polish the website
  • Experiment with better data cleaning
  • Try using LSTM for bucket classification
  • Use a Word2Vec embedding for both models
  • Use the same model for all reactions (i.e. output contains all reactions)



Thanks

Project Mentor: Yaakov Helman

Industry Mentor: Charles Tam

GitHub Website: mit-smart-confessions-website

GitHub API: mit-smart-confessions-api