Topical Classification and Predictive Modeling of a Large-Scale Text Corpus

Wesley Loo

Andrew Reece

Harvard University

Spring 2015


Head Instructor: Verena Kaynig-Fittkau


We describe an implementation of Latent Dirichlet Allocation (LDA), a form of topic modeling, to analyze reviews of movies released from 2010 to 2015. Using Gibbs sampling, we fit the model for a range of topic counts. We then used the 100-topic model to predict movie success on three outcomes: total domestic gross, critics’ scores and audience scores. Finally, we designed an interactive website for users to explore our data and analyses.


Using computers to understand human language is a difficult problem. Formal languages utilize strict, explicit syntax to convey specific meanings; in contrast, natural languages are less constrained and allow for nuance and implicit meanings which may not be apparent at first glance. Natural language processing focuses on utilizing algorithms to classify, summarize and determine the similarity of text documents, among other goals.

One approach to analyzing text documents is topic modeling, which applies a statistical model to describe the observed text. Latent Dirichlet Allocation (LDA) is a popular form of topic modeling developed by Blei, Ng and Jordan (Blei et al., 2003). Using a three-level hierarchical Bayesian model, LDA represents the observed text documents as mixtures of underlying topics. Each of these topics has a probability distribution across all words in the text corpus. Thus, the words in a document can be generated from the underlying topics.

Previously, LDA has been used to find the underlying topics of a variety of text documents, from abstracts published in the Proceedings of the National Academy of Sciences (Griffiths and Steyvers, 2004) to random Wikipedia articles (Hoffman et al., 2010). In this paper, we utilize LDA to analyze over 40,000 movie review blurbs from Rotten Tomatoes, covering over 2,000 movies.

Related work and libraries used

To our knowledge, no one has used LDA to model topics on movie reviews. For the implementation of Gibbs sampling, we used numpy, pandas, requests, and Python's regular expression and time modules. For scraping the movie review blurbs from Rotten Tomatoes and for text parsing, we used BeautifulSoup and the text feature extraction functions in scikit-learn. For the prediction models, we used the scikit-learn Ridge estimator for regularized linear regression. The website uses the Flask framework with Jinja2 templates.


We implemented LDA with Gibbs sampling, with guidance from Griffiths and Steyvers (2004) and from the chapter on probabilistic topic models in the Handbook of Latent Semantic Analysis (Steyvers and Griffiths, 2007). LDA aims to determine:

  1. the best distribution of T topics to accurately describe the content of a given document (out of a corpus of D documents)
  2. the best distribution of W words to accurately describe each topic

In the general case of topic modeling (not LDA-specific), we want to compute:

$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)$$

where $z_i$ is a latent variable indicating the topic from which the $i$th word was drawn, $P(w_i \mid z_i = j)$ is the probability of the word $w_i$ under the $j$th topic, and $P(z_i = j)$ is the probability of choosing a word from topic $j$ in the current document, which varies across documents.

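$P(w \mid z)$ indicates which words are important to a topic, whereas $P(z)$ is the prevalence of those topics within a document. For notational convenience, we write $P(w \mid z = j) \equiv \phi^{(j)}_{w}$ and $P(z = j) \equiv \theta^{(d)}_{j}$. With LDA, we maximize:

$$P(\mathbf{w} \mid \phi, \alpha) = \int P(\mathbf{w} \mid \phi, \theta)\, P(\theta \mid \alpha)\, d\theta, \qquad \theta \sim \mathrm{Dirichlet}(\alpha)$$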

As per Griffiths & Steyvers, we can also place a Dirichlet prior on $\phi$, so that $\phi \sim \mathrm{Dirichlet}(\beta)$.
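The generative story behind these priors can be sketched in a few lines. The snippet below is a toy illustration, not the authors' code: the function name `generate_corpus`, the corpus sizes, and the hyperparameter values are all assumptions chosen to keep the example small. It draws one word distribution per topic from Dirichlet(β), one topic mixture per document from Dirichlet(α), and then samples each word by first picking a topic and then picking a word from that topic.

```python
import numpy as np


def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Draw a toy corpus from the LDA generative model (illustrative only)."""
    rng = np.random.default_rng(seed)
    # phi[j] is the word distribution of topic j, drawn from Dirichlet(beta).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # theta[d] is the topic mixture of document d, drawn from Dirichlet(alpha).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    docs = []
    for d in range(n_docs):
        # Pick a topic for every word slot, then a word from that topic.
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        words = [int(rng.choice(vocab_size, p=phi[j])) for j in z]
        docs.append(words)
    return docs, theta, phi


# Toy sizes and hyperparameters (assumed values, not those used in the paper).
docs, theta, phi = generate_corpus(
    n_docs=5, doc_len=20, n_topics=3, vocab_size=10, alpha=1.0, beta=0.1
)
```

Inference runs this story in reverse: given only the words, it recovers plausible values of `theta` and `phi`.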

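Our random variables are:

$$\begin{aligned}
w_i \mid z_i, \phi^{(z_i)} &\sim \mathrm{Discrete}\left(\phi^{(z_i)}\right) \\
\phi &\sim \mathrm{Dirichlet}(\beta) \\
z_i \mid \theta^{(d_i)} &\sim \mathrm{Discrete}\left(\theta^{(d_i)}\right) \\
\theta &\sim \mathrm{Dirichlet}(\alpha)
\end{aligned}$$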

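Our fixed hyperparameters are $\alpha$ and $\beta$. The suggested values come from Steyvers and Griffiths (2007): $\alpha = 50/T$, $\beta = 0.01$. For LDA, we are interested in the posterior distribution:

$$P(\mathbf{z} \mid \mathbf{w}) = \frac{P(\mathbf{w}, \mathbf{z})}{\sum_{\mathbf{z}} P(\mathbf{w}, \mathbf{z})}$$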

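There are a couple of challenges we face with this formula. First, we need to compute the joint distribution $P(\mathbf{w}, \mathbf{z}) = P(\mathbf{w} \mid \mathbf{z})\, P(\mathbf{z})$. Recall that $\phi$ appears only in the first factor of this product and $\theta$ only in the second, so we can integrate out each variable separately. Integrating out $\phi$ gives:

$$P(\mathbf{w} \mid \mathbf{z}) = \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right)^{T} \prod_{j=1}^{T} \frac{\prod_{w} \Gamma\left(n^{(w)}_{j} + \beta\right)}{\Gamma\left(n^{(\cdot)}_{j} + W\beta\right)}$$

where $n^{(w)}_{j}$ is the number of times word $w$ has been assigned to topic $j$ in the vector of assignments $\mathbf{z}$, and $n^{(\cdot)}_{j} = \sum_{w} n^{(w)}_{j}$ is the total number of words assigned to topic $j$.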

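Integrating out $\theta$ gives us:

$$P(\mathbf{z}) = \left( \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \right)^{D} \prod_{d=1}^{D} \frac{\prod_{j} \Gamma\left(n^{(d)}_{j} + \alpha\right)}{\Gamma\left(n^{(d)}_{\cdot} + T\alpha\right)}$$

where $n^{(d)}_{j}$ is the number of times a word from document $d$ has been assigned to topic $j$, and $n^{(d)}_{\cdot} = \sum_{j} n^{(d)}_{j}$ is the number of words in document $d$.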

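Second, the denominator of the posterior sums over every possible assignment of topics to words, which cannot be evaluated directly. We can instead use a Gibbs sampler to estimate our target posterior if we have the conditional distribution of each assignment given all the others. This can be derived by cancelling terms in the above two expressions, to give:

$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$$

where $n^{(\cdot)}_{-i,j}$ is a count that does not include the current assignment of $z_i$. The first ratio expresses the probability of $w_i$ under topic $j$, and the second ratio expresses the probability of topic $j$ in document $d_i$.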

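After all documents have been iterated over, and all words in all documents have topics assigned, we can estimate $\hat{\phi}^{(j)}_{w}$ and $\hat{\theta}^{(d)}_{j}$ from our count matrices:

$$\hat{\phi}^{(j)}_{w} = \frac{n^{(w)}_{j} + \beta}{n^{(\cdot)}_{j} + W\beta}, \qquad \hat{\theta}^{(d)}_{j} = \frac{n^{(d)}_{j} + \alpha}{n^{(d)}_{\cdot} + T\alpha}$$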
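The sampling procedure described above can be sketched as follows. This is a minimal collapsed Gibbs sampler written for illustration, not the code in lda-gibbs.ipynb: the function name `gibbs_lda`, the array names, and the toy corpus and hyperparameter values are assumptions. Each word's current assignment is removed from the counts, a new topic is drawn from the conditional distribution, and the counts are restored. The document-length denominator of the second ratio is the same for every topic, so it is dropped before normalizing.

```python
import numpy as np


def gibbs_lda(docs, n_topics, vocab_size, alpha, beta, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((vocab_size, n_topics))  # times word w is assigned to topic j
    n_dt = np.zeros((len(docs), n_topics))   # words in document d assigned to topic j
    n_t = np.zeros(n_topics)                 # total words assigned to topic j
    z = []
    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for w, j in zip(doc, z_d):
            n_wt[w, j] += 1
            n_dt[d, j] += 1
            n_t[j] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                # Remove the current assignment from the counts.
                n_wt[w, j] -= 1
                n_dt[d, j] -= 1
                n_t[j] -= 1
                # Conditional over topics, up to a constant factor.
                p = (n_wt[w] + beta) / (n_t + vocab_size * beta) * (n_dt[d] + alpha)
                j = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment.
                z[d][i] = j
                n_wt[w, j] += 1
                n_dt[d, j] += 1
                n_t[j] += 1
    # Point estimates from the final counts; each row sums to 1.
    phi = ((n_wt + beta) / (n_t + vocab_size * beta)).T  # shape (T, W)
    theta = (n_dt + alpha) / (n_dt.sum(axis=1, keepdims=True) + n_topics * alpha)
    return phi, theta, z


# Toy corpus of word ids (assumed data, for illustration only).
toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 1, 0, 2], [4, 3, 5, 3, 4]]
phi, theta, z = gibbs_lda(toy_docs, n_topics=2, vocab_size=6, alpha=0.5, beta=0.01)
```

The nested Python loops keep the logic close to the equations; a corpus of tens of thousands of blurbs would likely need a faster inner loop.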


Data: Using BeautifulSoup, we scraped BoxOfficeMojo for movie titles, budgets, revenue, number of theaters and opening dates for movies from 2010 to present. We then scraped Rotten Tomatoes for the first 20 movie review blurbs for each movie. The final dataset was filtered to remove duplicates and only include movies with data for each of these fields.

Gibbs sampling: Using the equations described in the model, we used LDA to analyze the movie review data with a range of topic numbers (2, 4, 10, 70, 100). The 100-topic model was chosen for further analysis. For each topic, we viewed the 30 words with the highest probability for that topic and assigned summary phrases for both genre (e.g. “Horror/Thriller”) and critique (e.g. “Beautiful cinematography”).

Correlation and prediction: We chose to use the 100-topic model for predicting three outcomes: revenue, critics’ scores and audience scores. Correlations between each topic and these outcomes were also calculated in R. Using the Ridge estimator in scikit-learn, we fit regularized linear regressions to predict the outcomes from the topic distribution of each movie.

Website: We designed and implemented a website that lets the user explore the data and results interactively. After entering a movie of their choice, the user sees its top topics, the words associated with those topics, the correlations of those topics with revenue, critics’ score and audience score, and the predictions for those three outcomes.


We scraped movie titles, total gross revenue and number of opening theaters from the website Box Office Mojo for movies from 2010 to May 4, 2015. Using these movie titles, we then scraped the website Rotten Tomatoes for the movie review blurbs for each movie. In the final dataset, we had over 40,000 reviews covering over 2,000 movies. Treating each set of reviews for a single movie as a document, we proceeded with LDA as described above.
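Treating all blurbs for one movie as a single document can be sketched as below. The authors used the text feature extraction functions in scikit-learn; this standard-library version is a stand-in written for illustration, and the function name `build_documents`, the tokenizing pattern, and the sample blurbs are assumptions. It omits steps such as stop-word removal that a real pipeline would probably include.

```python
import re
from collections import defaultdict


def build_documents(reviews):
    """Group review blurbs by movie and map every token to an integer word id.

    reviews is an iterable of (movie_title, blurb) pairs.
    """
    tokens_by_movie = defaultdict(list)
    for title, blurb in reviews:
        # Lowercase and keep runs of letters and apostrophes as tokens.
        tokens_by_movie[title].extend(re.findall(r"[a-z']+", blurb.lower()))
    vocab = sorted({t for toks in tokens_by_movie.values() for t in toks})
    word_id = {w: i for i, w in enumerate(vocab)}
    docs = {
        title: [word_id[t] for t in toks]
        for title, toks in tokens_by_movie.items()
    }
    return vocab, docs


# Made-up blurbs, for illustration only.
sample_reviews = [
    ("A", "Scary and tense."),
    ("A", "Tense!"),
    ("B", "Funny, very funny."),
]
vocab, docs = build_documents(sample_reviews)
print(vocab)      # ['and', 'funny', 'scary', 'tense', 'very']
print(docs["A"])  # [2, 0, 3, 3]
```

The word-id lists are the form a Gibbs sampler for LDA typically consumes, one list per movie.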

Two user-defined parameters are important for our implementation of LDA: the number of topics and the number of iterations for Gibbs sampling. In Figure 1 we see that as we increase the number of topics, the improvement in log likelihood of the model is more drastic after the first few iterations. Across all runs, the log likelihood appears to plateau around 40 iterations.

Figure 1. Log likelihood over 80 iterations for different topic counts. After around 40 iterations, the relative increase in log likelihood has begun to plateau.
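The text does not state which log likelihood was tracked across iterations. One common choice, assumed here, is $\log P(\mathbf{w} \mid \mathbf{z})$ from Griffiths and Steyvers (2004), which depends only on the word-topic counts and $\beta$. The sketch below computes it from a count matrix; the function name `log_likelihood` and the matrix layout (words in rows, topics in columns) are assumptions.

```python
import math

import numpy as np


def log_likelihood(n_wt, beta):
    """log P(w | z) from a (W, T) matrix of word-topic counts."""
    n_wt = np.asarray(n_wt, dtype=float)
    W, T = n_wt.shape
    lgamma = np.vectorize(math.lgamma)
    # Normalizing constant of the T Dirichlet(beta) priors.
    ll = T * (math.lgamma(W * beta) - W * math.lgamma(beta))
    # Numerator: one gamma term per (word, topic) count.
    ll += lgamma(n_wt + beta).sum()
    # Denominator: one gamma term per topic total.
    ll -= lgamma(n_wt.sum(axis=0) + W * beta).sum()
    return float(ll)


# Two distinct words, one topic, beta = 1: P(w | z) = 1/2 * 1/3 = 1/6.
example_ll = log_likelihood(np.array([[1.0], [1.0]]), beta=1.0)
print(math.isclose(example_ll, -math.log(6)))  # True
```

Evaluating this quantity after every sweep of the sampler would give a convergence curve of the kind Figure 1 describes.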

Determining what each topic meant was a difficult task. As part of the exploratory analysis, we looked at the correlation of each topic with the three outcomes (revenue, critics’ scores and audience scores) to get a sense for what each topic meant. For example, in the 2-topic model, the first topic was positively correlated with critics’ and audience scores, but negatively correlated with gross revenue and number of opening theaters (Figure 2). In contrast, the second topic was positively correlated with gross revenue and number of opening theaters, but negatively correlated with both score metrics. For both topics, the words with the highest probabilities alone did not convey these meanings directly. Instead, we had to infer the topic meaning by comparing the topics with different metrics.
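The topic-outcome comparison above amounts to one Pearson correlation per topic and outcome. The authors computed these in R; the NumPy sketch below is an equivalent stand-in, and the function name `topic_outcome_correlations` and the toy numbers are assumptions.

```python
import numpy as np


def topic_outcome_correlations(theta, outcomes):
    """Pearson correlation of each topic's proportion with each outcome.

    theta is a (D, T) array of per-movie topic proportions; outcomes maps an
    outcome name to a length-D sequence of values.
    """
    theta = np.asarray(theta, dtype=float)
    result = {}
    for name, y in outcomes.items():
        y = np.asarray(y, dtype=float)
        result[name] = np.array(
            [np.corrcoef(theta[:, j], y)[0, 1] for j in range(theta.shape[1])]
        )
    return result


# Toy data: three movies, two topics, one made-up outcome.
toy_theta = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
toy_corr = topic_outcome_correlations(toy_theta, {"critics_score": [1.0, 2.0, 3.0]})
```

With only two topics the proportions sum to one, so the two correlations are mirror images of each other, as in the 2-topic results.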

By subsequently increasing the number of topics in the model, we saw clearer meanings emerge for some topics. For example, in the 100-topic model, some topics clearly described the genre of the movie, while others pointed to an element that critics picked up on in their reviews (e.g., a directorial debut) (Figure 3). Given our goal to make the analysis understandable by a lay user, we decided to use the 100-topic model for the website. We looked at the top 30 words associated with each topic and assigned phrases to describe the topic’s genre and critical comment attributes. However, there were a number of topics for which the meaning was more difficult to ascertain, and thus these fields were left empty.
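Listing the most probable words for each topic is a simple sort over the estimated topic-word matrix. The sketch below assumes the matrix has one row per topic and one column per vocabulary word; the function name `top_words` and the toy values are illustrative assumptions.

```python
import numpy as np


def top_words(phi, vocab, n_top=30):
    """Return the n_top most probable words for every topic.

    phi has shape (T, W); vocab[w] is the word with id w.
    """
    # Sort each row in descending order of probability and keep n_top ids.
    order = np.argsort(phi, axis=1)[:, ::-1][:, :n_top]
    return [[vocab[w] for w in row] for row in order]


# Made-up vocabulary and topic-word probabilities.
example_vocab = ["scary", "funny", "ghost", "laugh", "blood"]
example_phi = np.array([
    [0.40, 0.05, 0.30, 0.05, 0.20],
    [0.05, 0.45, 0.05, 0.40, 0.05],
])
print(top_words(example_phi, example_vocab, n_top=2))
# [['scary', 'ghost'], ['funny', 'laugh']]
```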


Figure 2. Results from the 2 topic model. While the top words for each topic may not signal a clear meaning, we can infer the meaning of the topic by exploring the correlated data.

Figure 3. Representative topics from the 100 topic model. These are examples of topics with clear meanings both along genre (sci-fi and thriller) and critical comment (good acting and director debut).

To assess the predictive capability of the LDA topics, we ran 100 iterations of repeated random sub-sampling cross-validation, comparing the Ridge model against naive mean estimates. Figure 4 shows the mean absolute error of the Ridge model versus the naive mean estimates using a 2-topic model for four response variables: total gross revenue, adjusted gross revenue, critics’ scores and audience scores. While the prediction for revenue does only about as well as the naive mean estimate, the mean absolute error of the predictions for the critics’ and audience scores is smaller than that of the naive mean estimate. We thus chose to continue using the Ridge model for the predictions shown on the website.

Figure 4. Mean absolute error for Ridge model vs Naive mean using a 2 topic LDA model using repeated random sub-sampling with 100 iterations. Overall, the predictions do as well as or better than the naive average MAE, particularly for the critic and audience scores.
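The comparison in Figure 4 can be sketched as below. The authors used the scikit-learn Ridge estimator; to stay self-contained this sketch swaps in a closed-form ridge solution in NumPy with an unpenalized intercept. The penalty strength, the test fraction, the function names, and the synthetic data are all assumptions, since the text does not report them.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc, yc = X_train - x_mean, y_train - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X_train.shape[1]), Xc.T @ yc)
    return y_mean + (X_test - x_mean) @ w


def compare_mae(X, y, n_splits=100, test_frac=0.2, lam=1.0, seed=0):
    """Mean absolute error of ridge vs. the training mean over random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    ridge_mae, naive_mae = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        pred = ridge_fit_predict(X[train], y[train], X[test], lam)
        ridge_mae.append(np.mean(np.abs(y[test] - pred)))
        # Naive baseline: predict the training-set mean for every test movie.
        naive_mae.append(np.mean(np.abs(y[test] - y[train].mean())))
    return float(np.mean(ridge_mae)), float(np.mean(naive_mae))


# Synthetic topic proportions and outcome (not the paper's data).
rng = np.random.default_rng(1)
X_demo = rng.dirichlet(np.ones(5), size=200)
y_demo = X_demo @ np.array([0.0, 10.0, 20.0, 30.0, 40.0]) + rng.normal(0.0, 1.0, 200)
ridge_err, naive_err = compare_mae(X_demo, y_demo, lam=0.1)
```

On this synthetic data the outcome is linear in the topic proportions by construction, so ridge beats the naive mean easily; real outcomes such as revenue need not behave this way.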


The LDA model offers a unique method to analyze movie reviews. We chose the 100-topic model as the basis for our website, which lets people interact with the analysis. For a given movie, the website shows the top 3 topics that have labels in the genre and critical comment sections (i.e., if the topic with the highest probability does not have a label in the genre section, the website displays the next highest topic that does). Hovering over the labels displays the 30 words with the highest probabilities for the given topic, as well as the correlation coefficients of the topic with total gross revenue, critics’ scores and audience scores. The page also displays the predictions from the regularized linear model for those three response variables.
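The label-skipping rule described above can be sketched as a filter over topics ranked by probability. The function name `top_labelled_topics`, the use of an empty string for an unlabelled topic, and the sample values are assumptions made for illustration; the website's actual code may differ.

```python
def top_labelled_topics(theta_d, labels, n_show=3):
    """Return up to n_show (topic_id, label) pairs, most probable first.

    theta_d holds one movie's topic probabilities; labels[j] is the label of
    topic j, or an empty string when no label was assigned.
    """
    ranked = sorted(range(len(theta_d)), key=lambda j: theta_d[j], reverse=True)
    # Skip unlabelled topics, then keep the first n_show that remain.
    return [(j, labels[j]) for j in ranked if labels[j]][:n_show]


# Made-up probabilities and genre labels for one movie.
example_theta = [0.05, 0.40, 0.10, 0.30, 0.15]
example_genres = ["Comedy", "", "Sci-fi", "Horror/Thriller", "Drama"]
print(top_labelled_topics(example_theta, example_genres))
# [(3, 'Horror/Thriller'), (4, 'Drama'), (2, 'Sci-fi')]
```

Topic 1 has the highest probability here but no genre label, so it is skipped in favor of the next labelled topics.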

Since we chose to use exploratory data analysis as the final output, we did not optimize the number of topics for any particular response variable. However, in cases where a clear output function could be optimized, it would be possible to tune this parameter. Based on our analyses, increasing the topic number simply allows for more ‘meaningless’ topics that are difficult to identify, but those topics with clear signals will remain (e.g. the horror/thriller genre topic).


Gibbs sampling is an efficient method to sample from the posterior of the Latent Dirichlet Allocation model. We have shown its use in describing movie reviews and subsequently predicting movie success using these latent topics.


Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research (2003).

Griffiths, T. L. & Steyvers, M. Finding scientific topics. PNAS (2004).

Hoffman, M., Bach, F. R. & Blei, D. M. Online learning for latent Dirichlet allocation. Advances in Neural Information (2010).

Steyvers, M. & Griffiths, T. Probabilistic topic models. Handbook of latent semantic analysis (2007).

Appendix 1: URL for website and data


Data can be found at:

Appendix 2: Description of ipython notebooks

movie-data-scrape.ipynb: Scrapes the data from BoxOfficeMojo and Rotten Tomatoes

lda-gibbs.ipynb: Implements Gibbs sampling of LDA posterior

prediction-alloneblock.ipynb: Regularized linear regression using the 100-topic model

lda-exploration-X-topics.ipynb: Notebook which includes analysis for X number of topics