Obama-Bush Engine with the Google Prediction API

Baris Yuksel (@baris_wonders)

September 2013

Obama said 'Yes, we can!' and Bush said 'The future will be better tomorrow', but which one would say 'It has been a long day, let’s go party!'? This was a question my mind whirled around a couple of years ago. I imagined both presidents getting up from their desks to tell their aides ‘It is party time’, and then taking off to a dance party with a DJ, a disco ball and all that. There are certainly many ways to look at this question, and most of them are more philosophical, but I still wondered whether we could get close to an answer by taking a purely linguistic approach that incorporates machine learning techniques.

It is an interesting thought exercise to consider whether we can tell if someone is likely to make a certain statement. My curiosity about this reached new levels after seeing Luke Dubois’ work “Hindsight is always 20/20” in New York City in ‘08 (1). He made a collage of the words from the State of the Union speeches of different presidents, using word frequencies to set the type sizes. The words a president used more often appear larger, so you can tell what the biggest concerns of that presidency were. This made me realize that there is a statistical nature to the language used by different presidents. Could that be taken one step further to build this guessing machine?

In theory, it should be possible to build a simple machine learning model that can guess the answer. A similar technology is already used by Google Translate: when users type in the text to be translated, Google Translate can guess which language it is written in. If we can assume that each president used slightly different language in his speeches, then it should be possible to build an engine that can guess whether an entered phrase is more likely to be uttered by Obama or by Bush, even though neither of them ever said it.
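To make the idea concrete, here is a minimal sketch of one classic way to build such a guesser: a bag-of-words naive Bayes classifier with add-one smoothing. It only illustrates the statistical intuition and is not meant to describe what the Prediction API does internally. The function names are made up for this example, and the two-sentence training set simply reuses the two quotes from the opening paragraph.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count word frequencies per label from (label, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, doc_counts = model
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Start from the prior: how often this label appears in training.
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training set: the two quotes from the opening paragraph.
model = train([
    ("Obama", "Yes, we can!"),
    ("Bush", "The future will be better tomorrow"),
])
print(predict(model, "We can do it"))  # prints "Obama"
```

With only two training sentences the guess is driven entirely by the overlapping words "we" and "can"; a realistic model needs far more text per president.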

Finally, earlier this year, I decided to take on the challenge. Since the Google Prediction API exposes to the public the very same machine learning algorithms we use at Google, I thought I might be able to get something working with it.

Before You Start

This article assumes that you have a working knowledge of the App Engine development environment and are using the latest App Engine Python SDK. Before starting on the demo project, you should complete the following preliminary step:

  1. If you aren't familiar with App Engine, take a look at the Getting Started Guide and familiarize yourself with the development environment.

Step 1: Create an App Engine application

The first step is to create an App Engine application:

  1. Search for “Google API Console” and open it.
  2. Click on “Create project”.
  3. Turn on the Prediction API in the “Services” tab.
  4. Turn on Google Cloud Storage in the same tab.
  5. Enable billing.
  6. Set up billing.
  7. You should then see a confirmation.

Step 2: Create a labelled training set

In order to train a (supervised) model, we first need to collect some training data. This was as easy as doing a Google search, as the internet is swarming with quotes by Obama and Bush. To create a balanced set, I tried to keep the quote count to 100 per president. I had to do some manual cleaning by deleting newlines, double quotation marks, and the dates and references at the end of the sentences.
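Part of that cleaning can be automated. The sketch below is one possible way to do it, not the procedure I actually followed: it strips double quotation marks, collapses newlines and extra whitespace, and writes one row per quote with the president's name in the first column and the quote in the second, which is the layout I understand the Prediction API expects for a classification training file. The function names are hypothetical, and trailing dates and references are not handled because their format varies from source to source.

```python
import csv
import io
import re


def clean_quote(text):
    """Drop straight and curly double quotes and collapse whitespace."""
    text = re.sub('["\u201c\u201d]', "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def build_training_csv(quotes_by_label):
    """Return CSV text with one row per quote: label first, then the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for label, quotes in quotes_by_label.items():
        for quote in quotes:
            writer.writerow([label, clean_quote(quote)])
    return buffer.getvalue()


# Example with the two quotes from the opening paragraph.
print(build_training_csv({
    "Obama": ['"Yes,\nwe can!"'],
    "Bush": ["The future will be better tomorrow"],
}))
```

The returned text can be saved to a file such as ObamaBush.csv and uploaded in the next step.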

This is the link to my training set.

Step 3: Upload your CSV data to Google Cloud Storage

We will use Google Cloud Storage to get our data to the Prediction API.

  1. While in the “Google API Console”, click on “Google Cloud Storage”. It should be at the bottom of the left menu once you have enabled it in the “Services” tab (Step 1, item 4).
  2. From the “Storage Access” menu, click on “Google Cloud Storage Manager”. This will take you to the “Google Cloud Console”.
  3. Click on “New Bucket” to create one. Here I created “politics-d”. You can then click on “politics-d” to open the bucket’s menu. Once you are in the bucket’s menu, click on “Upload” to upload the training data you created in Step 2. After you are done, it should look like this:

Step 4: Train your model

  1. Search for “Google API Explorer” and open it.
  2. Find “Prediction API” in the list and click on it to see its methods.
  3. Click on “prediction.trainedmodels.insert”. In the “project” field, enter the “Project Number” shown on the “Overview” page of the “Google API Console”.
  4. Then, in the “Request body”, add an “ID”, which is a made-up name for your model. In my example, it is “predictor”.
  5. Again in the “Request body”, add the “storageDataLocation”, which has the form “bucket you created in Cloud Storage/filename”. In my example, it is “politics-d/ObamaBush.csv”.
  6. Set the model type to “Classification”.
  7. Turn on authentication using the small button in the upper right part of the menu.
  8. When you hit “Execute”, the Prediction API should come back with a “200 OK” response.
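If you would rather script this call than click through the Explorer, the request can be sketched as follows. This only builds the URL and JSON body and does not send anything or handle authentication. The endpoint pattern and the field name “modelType” reflect my understanding of version 1.6 of the API and should be checked against the Explorer; the function name and the project number “123456789” are placeholders made up for the example.

```python
import json

# Assumed REST endpoint pattern for version 1.6 of the Prediction API.
BASE_URL = "https://www.googleapis.com/prediction/v1.6/projects"


def build_insert_request(project_number, model_id, bucket, filename):
    """Return the (url, JSON body) pair for prediction.trainedmodels.insert."""
    url = "{}/{}/trainedmodels".format(BASE_URL, project_number)
    body = {
        "id": model_id,  # made-up model name from item 4
        "storageDataLocation": "{}/{}".format(bucket, filename),  # item 5
        "modelType": "classification",  # item 6; field name is an assumption
    }
    return url, json.dumps(body, sort_keys=True)


# "123456789" is a placeholder, not a real project number.
url, body = build_insert_request(
    "123456789", "predictor", "politics-d", "ObamaBush.csv")
print(url)
print(body)
```

The body would be sent as an authenticated POST request with a JSON content type, which is what the Explorer does for you when you hit “Execute”.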

Step 5: Make predictions

This is when we can say, “Tell me any phrase, and let me guess whether Obama or Bush is more likely to utter it.” You can call the API from App Engine, or you can use the “Google API Explorer” to issue example calls. Here, we will use the latter approach:

  1. On the “Prediction API” screen of the “Google API Explorer”, click on “prediction.trainedmodels.predict”.
  2. In the “Project” field, enter the “Project Number” from the “Google API Console” for your project.
  3. In the “ID” field, enter the made-up name you created in Step 4, item 4.
  4. In the request body, add an “input” and then a “csvInstance”.
  5. Enter your sentence. In our example it is “It has been a long day. Let’s go party!”.
  6. Click the authorization button on the right-hand side and authenticate.
  7. Hit “Execute”. The final output should look like this:
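For the first approach, calling the API from your own code, the request and the handling of the response might look like the sketch below. As before, nothing is sent over the network here. The endpoint pattern and the response field names “outputLabel” and “outputMulti” are my understanding of version 1.6 of the API rather than something shown in this article, and the function names and project number are placeholders.

```python
import json

# Assumed REST endpoint pattern for version 1.6 of the Prediction API.
BASE_URL = "https://www.googleapis.com/prediction/v1.6/projects"


def build_predict_request(project_number, model_id, sentence):
    """Return the (url, JSON body) pair for prediction.trainedmodels.predict."""
    url = "{}/{}/trainedmodels/{}/predict".format(
        BASE_URL, project_number, model_id)
    # The sentence is the single feature of the instance to classify.
    body = {"input": {"csvInstance": [sentence]}}
    return url, json.dumps(body)


def read_prediction(response_text):
    """Return (winning label, {label: score}) from a predict response.

    Assumes the response carries "outputLabel" and a list "outputMulti"
    of {"label": ..., "score": ...} entries.
    """
    response = json.loads(response_text)
    scores = {}
    for item in response.get("outputMulti", []):
        scores[item["label"]] = float(item["score"])
    return response.get("outputLabel"), scores


# "123456789" is a placeholder, not a real project number.
url, body = build_predict_request(
    "123456789", "predictor", "It has been a long day. Let's go party!")
print(url)
print(body)
```

Passing the text of the “200 OK” response to read_prediction would give the predicted president together with the score reported for each label.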

Final Step: Who would say it?

And, drumroll, the model we just trained guesses that this is the president who would go partying:

If you have any questions, do not hesitate to contact me @baris_wonders. I wish you a predictful day!

References:

(1) For the artist I mentioned, please visit Luke Dubois’ site: http://hindsightisalways2020.net/

(2) For a general-purpose, white-box machine learning framework, try experimenting with Weka: http://www.cs.waikato.ac.nz/ml/weka/