
MIT Smart Confessions

Taking MIT confessions to the next level using machine learning

Robert M. Vunabandi, Jürgen Cito


What are MIT Confessions?

  • Facebook pages where MIT students write posts anonymously
  • Posts are seen by many other MIT students
  • These anonymous posts are called “confessions”
  • There are many MIT confessions pages (e.g., MIT Confessions, MIT Summer Confessions)


Problem Statement

Three problems that we wanted to tackle:

  • Predicting how popular a confession will be
    • Predicting how many of each Facebook reaction type (like, love, haha, wow, sad, angry) the confession will get
  • Generating popular confessions
    • Generating confessions such that the confession gets a lot of Facebook reactions
  • Understanding what makes or breaks a confession’s popularity


Data & Challenges


Data From MIT Confessions Pages

  • Scraping data from Facebook is nearly impossible, so we used its Graph API instead.
  • Using the Facebook Graph API, for all confessions in a confessions page:
    • Collect the text in the confession
    • Collect the number of reactions for each reaction type associated with the confession
    • Collect the page data on the day of that confession (number of users, user engagement, etc.)
  • Here’s a link to the data collected
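As a rough sketch, the per-post query can be assembled with the Graph API’s field aliasing. The API version, field list, and the `build_feed_url` helper below are illustrative assumptions, not the project’s actual scraper:

```python
from urllib.parse import urlencode

# The API version here is an assumption; use whatever version is current.
GRAPH_API_BASE = "https://graph.facebook.com/v3.2"

REACTION_TYPES = ("LIKE", "LOVE", "HAHA", "WOW", "SAD", "ANGRY")

def build_feed_url(page_id: str, access_token: str) -> str:
    """Build a Graph API URL that fetches a page's posts with the post text
    and a summarized count for each reaction type (via field aliasing)."""
    fields = ["message", "created_time"] + [
        f"reactions.type({r}).summary(total_count).as({r.lower()})"
        for r in REACTION_TYPES
    ]
    query = urlencode({"fields": ",".join(fields), "access_token": access_token})
    return f"{GRAPH_API_BASE}/{page_id}/posts?{query}"
```

Each page of results would then be fetched with an HTTP GET and followed via the `paging.next` cursor that the API returns.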


Data From MIT Confessions Pages

  • Overall, we collected 4,555 examples from MIT Confessions and MIT Summer Confessions.
  • We could definitely collect more, but the process is a pain.
    • The plan is to focus on training good models before requesting more data


Observations from the Data

  • The data is very skewed. Except for likes and comments, most reactions have a count of 0, and for all reaction types, counts below 5 are the most common.
    • About 67% of confessions get 0 reactions (including comments).
  • To get the highest number of reactions, a confession needs to be somewhere between 75 and 175 characters long.
  • Most reaction types are negatively correlated with each other (i.e., more likes tend to come with fewer of everything else).
    • This suggests that most confessions convey one dominant “feeling,” and most users react with that feeling
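The headline skew number can be checked in a few lines; the dictionary keys below are made-up field names for illustration:

```python
REACTION_KEYS = ("like", "love", "haha", "wow", "sad", "angry", "comments")

def zero_reaction_fraction(confessions):
    """Fraction of confessions whose reactions (including comments) are all zero."""
    zero = sum(
        1 for c in confessions
        if all(c.get(k, 0) == 0 for k in REACTION_KEYS)
    )
    return zero / len(confessions)
```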


Graphs of the Data: Data is extremely skewed (4,555 examples in total)


Graphs of the Data: Highest Reaction Counts Happen Below 200 Characters


Machine Learning Models


Two Models: Predictor and Generator

  • A “Bucket” Classification model to predict the number of reactions from a confession.
  • A Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) to generate confessions.


Bucket Classifier

  • For a given reaction type, we predict which range (i.e. bucket) of reaction count the confession will fall into.
  • E.g., out of 0-10, 11-20, 21-30, what is the probability that a given confession will fall into each of them?
    • We had to figure out a way to split the data evenly across the buckets
  • One model per reaction type
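One reading of “splitting the data evenly” is a quantile-style split, where each bucket receives roughly the same number of training examples. This sketch assumes that interpretation; the slide does not spell out the actual scheme:

```python
from bisect import bisect_right

def quantile_edges(counts, num_buckets):
    """Pick bucket boundaries so each bucket holds roughly the same
    number of examples (a quantile-style split)."""
    ordered = sorted(counts)
    n = len(ordered)
    return [ordered[i * n // num_buckets] for i in range(1, num_buckets)]

def bucket_index(count, edges):
    """Map a reaction count to its 0-based bucket label."""
    return bisect_right(edges, count)
```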

Model architecture (Text Input → Bucket Classification Model → Bucket Probabilities):

Text → Embedding (32) → Conv1D → ReLU → MaxPooling1D → Dropout (0.1) → Flatten → Dense (64) → ReLU → Dropout (0.1) → Dense (BC) → Softmax

  • The text input is a list of integers, where each integer represents a word.
  • The text embedding converts each word index into a vector of width 32.
  • BC is the number of buckets, which varies depending on the reaction type.
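A minimal Keras sketch of the stack above. The convolution filter count, kernel size, pooling size, optimizer, and loss are not given on the slide and are assumptions:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_bucket_classifier(vocab_size, seq_len, num_buckets):
    """Embedding(32) -> Conv1D -> ReLU -> MaxPooling1D -> Dropout(0.1)
    -> Flatten -> Dense(64) -> ReLU -> Dropout(0.1) -> Dense(BC) -> Softmax."""
    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, 32),
        layers.Conv1D(64, 5, activation="relu"),  # filter count/kernel size assumed
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.1),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(num_buckets, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

One such model is built per reaction type, with `num_buckets` varying accordingly.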


LSTM Generator

  • Generate confessions that will yield a lot of reactions
  • We trained this on confessions that had a total of at least 25 reactions (including comments)
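Selecting that training subset is a one-liner; the reaction field names here are illustrative:

```python
REACTION_KEYS = ("like", "love", "haha", "wow", "sad", "angry", "comments")

def popular_confessions(confessions, min_total=25):
    """Keep only confessions whose total reaction count (comments included)
    reaches the generator's training threshold."""
    return [
        c for c in confessions
        if sum(c.get(k, 0) for k in REACTION_KEYS) >= min_total
    ]
```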

Model architecture (Text Input → LSTM Classification Model → One-Hot Vector for Word Index):

Text → Embedding (300) → LSTM (300) → Dense (WC) → Softmax

  • The text input is a list of integers, where each integer represents a word.
  • The text embedding converts each word index into a vector of width 300.
  • The LSTM’s output space has dimensionality 300.
  • WC is the vocabulary size, so the output is a one-hot vector over words. Ideally, we would output a text-embedding vector instead.
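A minimal Keras sketch of the generator stack above; the optimizer and loss are assumptions since the slide does not state them:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_lstm_generator(vocab_size, seq_len):
    """Embedding(300) -> LSTM(300) -> Dense(WC) -> Softmax,
    trained as next-word classification over the vocabulary."""
    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, 300),
        layers.LSTM(300),
        layers.Dense(vocab_size, activation="softmax"),  # one-hot target over WC words
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```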


Training Results


Results on Bucket Classifier

  • For each reaction type, we trained for 75 epochs with a batch size of 64. For likes:
    • Validation loss went from: 0.13 to 0.23
    • Validation accuracy went from 0.96 to 0.96 → remained unchanged
    • Training loss went from: 0.13 to 0.00
    • Training accuracy went from: 0.96 to 0.99
  • Other reaction types had similar results
  • Shows signs of both overfitting and not really learning anything :(
    • The model is learning that most of the data looks the same, so it’s better off predicting (most of the time) that same thing (i.e. predict 0).
  • Some future extensions:
    • Try an LSTM or RNN layer to improve the results
    • A Word2Vec embedding would also help
    • We definitely need more data
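The “0.96 accuracy without learning anything” effect is just the majority-class baseline on skewed labels, which is easy to verify:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy achieved by always predicting the most common label —
    what a model converges to when it only learns the class imbalance."""
    (_, top_count), = Counter(labels).most_common(1)
    return top_count / len(labels)
```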


Results on LSTM Generator

  • We trained for 250 epochs with all confessions that have at least 25 reactions.
    • Validation loss went from: 6.50 to 5.3870
    • Validation accuracy went from 0.05 to 0.5185
    • Training loss went from: 6.87 to 0.1697
    • Training accuracy went from: 0.04 to 0.9632
  • Although the model overfits the training data, improvements on the training set correlate with improvements on the validation set.
    • Thus, the model is actually learning something, just not as well as it could
  • Some future suggestions:
    • Use a Word2Vec embedding


Some Results


Bucket Classifier Results

These text inputs were taken from the current MIT confessions page. The output is the highest predicted bucket. The model can definitely be improved.

---

  • Text: “b1 freshman lack nuke power must elevate”
  • Output: 2-17 wow, 0 sad, 16-18 likes, 3-12 love
  • Expected: 6 wows, 2 sads, 1 like, 1 love

---

  • Text: “Bow down you lowly phytoplankton, for I am your new sardine king.”
  • Output: 2-17 wow, 8-10 like, 6-27 haha, 0 haha
  • Expected: 7 likes, 6 hahas, 1 love

---

  • Text: “Watching all these YouTubers promote science for the masses is truly humbling. Shoutout to ElectroBOOM, 3Blue1Brown, VSauce, GreatScott, minutephysics, Veritasium, CGPGrey, numberphile, SciShow and infinitely many others!”
  • Output: 1-2 likes, 0 likes
  • Expected: 53 likes, 15 likes


LSTM Generator Results

---

  • Seed: Roast that
  • Output: Roast that 6 check mit midterms harder than international olympiads of course ...

---

  • Seed: People
  • Output: People who answer questions in large lectures where did you find such confidence bullshitters ? the animations are named asdf , ...

---

  • Seed: I love the
  • Output: I love the freshmen on my floor , they all have such funny and unique personalities ! so proud of them getting through their first real fight ...

NOTE: Model may have been re-trained and modified after generating these, so it may not output the same confessions given these seeds
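The seeded generation above follows the usual next-word loop: feed the running token sequence to the model, append the predicted word, and repeat. A minimal greedy version, with `predict_next` standing in for the trained model’s predict call:

```python
def generate(seed_ids, predict_next, num_words):
    """Greedy next-word sampling: extend the sequence one argmax word at a time."""
    ids = list(seed_ids)
    for _ in range(num_words):
        probs = predict_next(ids)  # distribution over the vocabulary
        ids.append(max(range(len(probs)), key=probs.__getitem__))
    return ids
```

In practice one would sample from `probs` (e.g., with a temperature) rather than always taking the argmax, to avoid repetitive output.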


LSTM Generator Results: It is Sometimes Sensible

  • Seed: MIT Midterms are
  • Output: mit midterms are by just being a p-set due at a potato, but it a few and we used to be bitter in a place like this, but just showing someone kindness, whether or not they are a friend, makes this place more rewarding. I feel like once a few years have gone by a lot of us lose sight of that. Keep doing nice things, it doesn’t take that much time or effort, and we can turn this place around, but unreadable enough that lazy female will have literally a plane and take.


Future Work & Extensions


Future Work & Extensions

  • Understanding why some confessions are more popular than others
  • Collect more data in order to learn better and train a better LSTM model
    • Overfitting in both models is a result of this lack of data
    • At the same time, there is a limit on the total number of confessions (about 20-30K in total across all pages)
    • Facebook reactions were introduced worldwide on February 24, 2016, so we can’t use posts before that for classification
  • Finish up and polish the website
  • Experiment with better data cleaning
  • Try using LSTM for bucket classification
  • Use a Word2Vec embedding for both models
  • Use the same model for all reactions (i.e. output contains all reactions)



Thanks

Project Mentor: Yaakov Helman

Industry Mentor: Charles Tam

GitHub Website: mit-smart-confessions-website

GitHub API: mit-smart-confessions-api