Baris Yuksel (@baris_wonders)
September 2013
Obama said 'Yes, we can!' and Bush said 'The future will be better tomorrow', but which one would say 'It has been a long day, let’s go party!'? This was a question my mind whirled around a couple of years ago. I imagined both presidents getting up from their desks to tell their aides ‘It is party time’, and then taking off to a dance party with a DJ, a disco ball and all that. There are certainly many ways to look at this question, and most of them are more philosophical, but I still wondered if we could get close to an answer by taking a purely linguistic approach that incorporates machine learning techniques.
It is an interesting thought exercise to consider whether we can tell if someone is likely to make a certain statement. My curiosity about this reached new levels after seeing the work of Luke Dubois in “Hindsight is always 20/20” in ‘08 in New York City (1). He made a collage of the words from the State of the Union speeches of different presidents, using the word frequencies to set the type sizes. The words a president used more often are bigger, so you can tell what the biggest concerns of that presidency were. This made me realize that there is a statistical nature to the language used by different presidents, but could this be taken one step further to build this guessing machine?
In theory, it should be possible to build a simple machine learning model which can guess the answer. A similar technology is already used by Google Translate: when users type in the text to be translated, Google Translate can guess which language it is. If we can assume that each president used slightly different language during their speeches, then it should be possible to build an engine that can guess whether the entered phrase is more likely to be uttered by Obama or Bush, even though neither of them ever said it.
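To make the statistical idea concrete, here is a minimal sketch of a word-frequency classifier (naive Bayes with add-one smoothing). It is only an illustration of the principle, not the algorithm the Prediction API runs, and the function names `tokenize`, `train` and `predict` are hypothetical. The two training quotes are the ones from the opening paragraph; a real model needs far more data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count word occurrences per label from (label, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts = model
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Prior: share of the training quotes that carry this label.
        score = math.log(doc_counts[label]) - math.log(total_docs)
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log(counts[word] + 1)
            score -= math.log(total_words + len(vocabulary))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training set: one quote per president, taken from the introduction.
model = train([
    ("Obama", "Yes, we can!"),
    ("Bush", "The future will be better tomorrow"),
])
print(predict(model, "we can"))
```

With only two quotes the model simply rewards shared words, which is exactly why a balanced and reasonably large training set matters.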
Finally, earlier this year, I decided to take on the challenge. Since the Google Prediction API exposes to the public the very same machine learning algorithms we use at Google, I thought I might be able to get something working with the Prediction API.
This article assumes that you have a working knowledge of the App Engine development environment and are using the latest App Engine Python SDK. Before starting on the demo project, you should do the following preliminary steps:
The first step is to create an App Engine application:
In order to train a (supervised) model, we first need to collect some training data. This was as easy as doing a Google search, as the internet is swarming with quotes by Obama and Bush. In order to create a balanced set, I tried to keep the quote count to 100 per president. I had to do some manual cleaning by deleting newlines, double quotation marks, dates and references at the end of the sentences.
This is the link to my training set.
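Part of the cleaning described above can be scripted. The following is a minimal sketch, assuming the training file is a CSV with the label in the first column and the quote text in the second (check the Prediction API documentation for the exact layout it expects). The helper names `clean_quote` and `to_training_csv` are hypothetical, and dates or trailing references would still need to be removed by hand.

```python
import csv
import io
import re


def clean_quote(text):
    """Remove double quotation marks and collapse newlines and extra spaces."""
    text = re.sub(r'["\u201c\u201d]', "", text)
    return re.sub(r"\s+", " ", text).strip()


def to_training_csv(rows):
    """Serialize (label, quote) pairs as CSV text, label in the first column."""
    buffer = io.StringIO()
    # Quote every field so commas inside a quote do not split the column.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for label, quote in rows:
        writer.writerow([label, clean_quote(quote)])
    return buffer.getvalue()


# Example with a raw quote that still carries quotation marks and a newline.
print(to_training_csv([("Obama", '"Yes,\nwe can!"')]))
```

Keeping the cleaning in one function makes it easy to apply the same normalization later to the phrases you send for prediction.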
We will use Google Cloud Storage to get our data to the Prediction API.
Step 4: Train your model
This is when we can say “tell me any phrase, let me guess whether Obama or Bush is more likely to utter it.” You can call the API from App Engine, or you can use the “Google API Explorer” to issue example calls. Here, we will use the latter approach:
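Whichever route you take, the predict call needs a request body and returns a set of scored labels. The sketch below only builds that body and reads a response; it makes no network call. The field names `input`, `csvInstance`, `outputMulti`, `label` and `score` are assumptions based on the v1.6 reference for `trainedmodels.predict` and should be verified against the API Explorer, and the helper names are hypothetical.

```python
import json


def build_predict_body(phrase):
    """Build the JSON request body for a single text phrase.

    Assumed shape: {"input": {"csvInstance": [<features>]}}.
    """
    return json.dumps({"input": {"csvInstance": [phrase]}})


def most_likely_label(response):
    """Pick the label with the highest score from a predict response.

    Assumes the response holds a list under "outputMulti" whose entries
    have "label" and "score" fields; scores may arrive as strings.
    """
    best = max(response["outputMulti"], key=lambda entry: float(entry["score"]))
    return best["label"]


# The phrase from the introduction, ready to paste into the API Explorer.
print(build_predict_body("It has been a long day, let's go party!"))
```

Reading the full score list, rather than only the winning label, also shows how confident the model is in its guess.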
And, drumroll, the model we just trained guesses that this is the president who would go partying:
If you have any questions, do not hesitate to contact me @baris_wonders. I wish you a predictful day!
References:
(1) For the artist I mentioned, please visit Luke Dubois’ site: http://hindsightisalways2020.net/
(2) For a general machine learning framework which is a whitebox, try experimenting with Weka: http://www.cs.waikato.ac.nz/ml/weka/