Twitter Analytics Progress

Amit Balode & Chintan Tank      (for current project status scroll down)

B656 Web Mining
Progress Report

Introduction - Project Concept

Twitter is a social networking web application used by millions of people to update their status by answering "What are you doing?" These tweets facilitate micro-blogging, enabling friends, family, and co-workers to communicate and stay connected through the exchange of quick, frequent answers, which may lead to the sharing of common thoughts and discussions about similar topics. This implies that there are sets of similar users who could be recommended to one another to broaden their networks. However, Twitter lacks such a user recommendation feature. This motivated us to build a Recommendation Engine.

NOTE: Click to go to the LiveStatus page

Milestone 1 - Finalize Requirements                                    Complete    Due: Feb 10, 2009   

Week 1


After going through a couple of iterations, the requirements have been finalized. On the basis of discussions held with Jacob & our primary research on the topic, the project proposal was finalized and submitted. Barring one question about "user recommendation", the project proposal was approved. This question will be resolved as we make more progress on the project.

Milestone 2 - Data Extraction & Refining                              Complete (see note 1)                                    Due: Feb 24, 2009

    Week 2 - 4
   

For this milestone we started developing data extraction scripts while simultaneously reading up on various approaches for mining the tweet data.

 

  1. Codebase Development & Database Server setup
    • Set up the database server on dbserv.uits.indiana.edu.
    • Researched the Twitter APIs used to interface with the Twitter database.
    • Wrote Java programs for extracting
      • Twitter User Information - collecting information like user id, user name, location, follower count, screen name etc.
      • Tweets - collecting information like tweet content, user id, created date, status id.
    • Wrote Perl scripts to automate compiling the Java code on IU CS machines and executing it. We did this to leverage the resources on hand at the CS department.
    • Built a LiveStatus page to show the current statistics of our database. It can be viewed here.
    • Set up this progress report & the LiveStatus page.
  2. Researching different algorithms
      • Unsupervised Learning / Clustering Algorithms in general.
      • Bottom-up approach algorithm by Junghoo Cho, Narayanan Shivakumar & Hector Garcia-Molina. Link: http://is.gd/ksRd
      • K-means clustering algorithm
  3. Due to a cap on the database quota, we have moved our database to the CS MySQL database server. Data scraping is currently in progress.
  4. Added more statistics ("Avg. Tweet per User in our DB" and "Highest "Number of Tweet" User in our DB") to the statistics page.
  5. Collection of Geocoding information
      • Wrote a Java program to obtain geocode information from the Location fields in the user profiles. After researching the geocoding web services available, we settled on the Yahoo API for this.
      • Updated the Database Schema to save this location Information in the form of
        • user id, original location string, 
        • latitude, longitude,
        • address, city, state, zip, country
      • Made changes to the script to handle null, malformed & iPhone-specific location strings. Added another table, "TWITTER_USER_LOCATION_ERROR_INFO", for storing erroneous locations. The next step will be to clean this data in order of priority.
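
The handling of null, malformed & iPhone-specific location strings described above could look roughly like the following Python sketch. It is illustrative only and is not the project's Java code. The function name `classify_location`, the three result labels, and the assumed shape of an iPhone-style string (a client prefix followed by raw coordinates, e.g. "iPhone: 39.165325,-86.526382") are all assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Assumed format: an optional client prefix such as "iPhone:" followed by
# "latitude,longitude". This pattern is a guess, not taken from the report.
_COORD_RE = re.compile(
    r"^\s*(?:[A-Za-z]+\s*:\s*)?"
    r"(-?\d{1,3}(?:\.\d+)?)\s*,\s*(-?\d{1,3}(?:\.\d+)?)\s*$"
)


def classify_location(
    raw: Optional[str],
) -> Tuple[str, Optional[Tuple[float, float]]]:
    """Classify a profile location string.

    Returns (kind, coords) where kind is:
      "coords" - the string already holds usable coordinates,
      "text"   - free text that still needs a geocoding service,
      "error"  - null or malformed; a candidate for the error table.
    """
    if raw is None or not raw.strip():
        return ("error", None)

    match = _COORD_RE.match(raw)
    if match:
        lat, lon = float(match.group(1)), float(match.group(2))
        if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
            return ("coords", (lat, lon))
        return ("error", None)  # numbers present but out of range

    # Strings without any letters (e.g. "???") cannot be geocoded.
    if not any(ch.isalpha() for ch in raw):
        return ("error", None)

    return ("text", None)
```

Only the strings classified as "text" would need a call to the geocoding service; rows classified as "error" correspond to what the report stores in TWITTER_USER_LOCATION_ERROR_INFO.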

Milestone 3 - Implementation                                            In Progress Due: Apr 18, 2009

    Week 5 - 8

  1. Data Scraping & Extraction
    • After modifying our code as and when required, we have built up a corpus of Twitter data containing approximately 400,000 tweets with all their metadata (location, favorite, in-reply-to, etc.) and approximately  Twitter user profiles.
    • We faced a couple of issues along the way:
      • The Twitter API limits data extraction (100 requests per IP per hour), which slowed the extraction process. We mitigated this by running our scripts in parallel on several CS machines.
      • The data we were collecting was getting corrupted due to unchecked exceptions, and we had to restart the entire data extraction process.
    • We used the Yahoo API to convert the location information provided into a universal latitude, longitude format.
    • To leverage the location metric, we wrote a PHP script that provides the k-nearest neighbors of a particular user based on the user's latitude & longitude values.

  2. Data Manipulation
    • We have extracted the data into separate text files, in different configurations, for operations such as:
      • User_ID, Latitude, Longitude for the k-nearest neighbors operation
      • Screen_name, Tweets for:
        • Stop words removal testing
        • Stemming process testing
      • etc.

  3. Stop Words Removal
    • We searched for a good black-list of stop words and decided to use the comprehensive list provided by http://snowball.tartarus.org/
    • Using this list, we have experimented on a limited subset of our tweet corpus.

  4. Stemming algorithm (Porter)
    • This algorithm is applied to the tweets, and the stemmed output is saved back to a text file. We are still modifying the algorithm, adding or removing rules to suit our requirements.

  5. Usage of HAC (Hierarchical Agglomerative Clustering)
    • This will be used for finding clusters of similar users (users in one cluster will be recommended to each other). The threshold has not been decided yet, since it will be tuned based on how well-formed and strong the clusters are. Currently we are planning to use gCLUTO for performing HAC.


However, due to the restricted number of features per module, the user recommendation engine might not give good results for comparison. So we are now planning to merge all three factors into a common factor = tweets (n tweets per user) + location + interest. This factor will then be projected into vector space and used for comparison.
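
The location-based k-nearest-neighbors lookup mentioned in item 1 above can be sketched in a few lines. The following Python example is not the PHP script from the project; it assumes great-circle (haversine) distance as the metric, which the report does not specify, and the names `haversine_km` and `k_nearest` are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def k_nearest(
    user_id: str,
    coords: Dict[str, Tuple[float, float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k users closest to user_id as (other_user_id, distance_km)."""
    origin = coords[user_id]
    candidates = (
        (haversine_km(origin, location), other)
        for other, location in coords.items()
        if other != user_id
    )
    return [(other, dist) for dist, other in heapq.nsmallest(k, candidates)]
```

This brute-force scan is linear in the number of users per query, which should be acceptable for a corpus of a few tens of thousands of profiles; a spatial index would only matter at a much larger scale.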

  Week 9

  1. Current Important Stats,
    1. Tweets : 600,000
    2. Twitter Profiles : 31,000
  2. In order to verify our experiments, we have started collecting follower information, i.e. all the Twitter users that follow a particular user, for every user in our Twitter user database.
    While monitoring the user-follower pair data at regular intervals, we found a good amount of noise.
    • There were cases in which a particular user has a huge number of followers, possibly a celebrity like Digg's Kevin Rose with 405,726 followers. Considering that the goal of our project is to recommend similar users, we decided that collecting all of these followers was counter-productive. So the next step was to find only the friends of that user, i.e. the users that follow Kevin Rose and the users that Kevin Rose follows. This mitigated the problem of too much data and the time that would be spent processing it.
    • There was also a different kind of noise, where Twitter users followed a very large number of other users. Such was the case with Barack Obama's profile: currently this user follows 576,558 users, and 679,812 users follow Barack Obama. We observed this pattern in other profiles as well, typically ones that belong to social media operators. Again, since our project tries to recommend genuine users, we had to mitigate this. So, when a user has more than 1000 friends, we collect the ids of only the first 1000 friends.
    • As of now we have collected over 2 million user-follower pairs spanning approx. 2000 users.
  3. In parallel we have started analysing the tweets as mentioned earlier. To this end we had to tune the stop-word removal tool and the stemmer so that they do not corrupt certain tokens such as hyperlinks and references to other Twitter users inside the tweet (i.e. words that start with the '@' symbol).
  4. Implementation Details,
    • After having looked at a collection of tools, we have settled on using either the command line open source CLUTO tool (for HAC) or gCLUTO, a graphical extension for it, to form clusters in our data.
    • doc2mat.pl (a Perl script) takes a text file as input. Every row in the text file is a document to be represented as a vector in feature space, and the first string in every row is the unique user id/document id. The output of the script is a document * term matrix (.mat) file. The script has options to perform stemming and stop-word removal internally. (We are trying various combinations: skipping stemming/stop-word removal, including them, or modifying certain metrics.)
    • vcluster, an executable that is part of the CLUTO package, takes a document * term matrix (.mat) as input. It has several options for clustering (k-means, HAC, etc.) and for similarity measures (Euclidean, cosine similarity, etc.). Currently we are applying HAC with a suitable k value (number of clusters) as the cutoff for the tree.
  5. Output: so far we have run the algorithm for 100 users, varying the k value from 10 to 50. Performance improved as k increased; however, the results were poor when the recommendations were compared with actual friends from Twitter. We expect to get better results after increasing the user count.
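
The tuning described in item 3 above (removing stop words without corrupting hyperlinks or '@' references) can be illustrated with a short Python sketch. It is not the project's tool: the stop-word set is a tiny placeholder rather than the Snowball list, stemming is left as a pluggable `stem` callback that defaults to doing nothing, and the names `clean_tweet` and `to_document_row` are hypothetical. The row layout (id first, then terms) follows the input format described in item 4.

```python
import re
from typing import Callable, Iterable, List, Set

# Placeholder stop words for illustration; the report uses the Snowball list.
STOP_WORDS: Set[str] = {
    "a", "an", "and", "are", "is", "the", "to", "what", "with", "you",
}


def clean_tweet(
    text: str,
    stop_words: Set[str] = STOP_WORDS,
    stem: Callable[[str], str] = lambda word: word,
) -> List[str]:
    """Tokenize a tweet, keeping @mentions and URLs intact."""
    tokens: List[str] = []
    for raw in text.split():
        is_mention = raw.startswith("@")
        is_link = raw.lower().startswith(("http://", "https://"))
        if is_mention or is_link:
            # Protected token: only trailing punctuation is trimmed,
            # and neither stop-word removal nor stemming is applied.
            tokens.append(raw.rstrip(".,;:!?"))
            continue
        word = re.sub(r"[^a-z0-9]", "", raw.lower())
        if word and word not in stop_words:
            tokens.append(stem(word))
    return tokens


def to_document_row(user_id: str, tweets: Iterable[str]) -> str:
    """Build one text row: the user id followed by all cleaned terms."""
    terms: List[str] = []
    for tweet in tweets:
        terms.extend(clean_tweet(tweet))
    return " ".join([user_id] + terms)
```

For example, `clean_tweet("Reading http://is.gd/ksRd with @kevinrose!")` keeps the link and the mention unchanged while dropping "with".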

Week 10

  1. In order to test our hypothesis, we started experimenting with bigger datasets containing 5000, 10000 & 30000 Twitter users, thus encompassing our entire tweets database (approx. 700,000 tweets). We performed top-to-bottom partitioning on the corpus obtained after passing the content through the stop-word remover & stemmer.
  2. The numbers of clusters formed were 1250, 2500 & 6250 respectively. The ratio of 1250 clusters per 5000 users was chosen after running some tests on smaller numbers of users, where it gave us better results. Also, users that had a large number of tweets in our datasets ended up in better clusters, in terms of similarity with the other users in the cluster.
  3. Running these tests became very costly in terms of time, so we decided to start using Quarry, IU's newest general-purpose Unix supercomputer. Even then, the time taken was very high.
  4. At first we tried the straightforward top-to-bottom partitioning algorithm, which took 5 minutes for 5,000 users & 11 hours for 10,000 users. When we attempted HAC on 30,000 users, we had to abort the process after waiting for 36 hours.
  5. This led us to change our strategy. We decided to run our tests on only 5,500 users, but this time we picked only users with a high number of tweets. Over the next couple of days we began scraping tweets selectively for these users to maximize the benefit.
  6. Simultaneously we consulted Jacob on our next step. We were concerned with the quality of the results we were getting. The main issue was that even if user A & user B are friends on Twitter, there is no guarantee that user A's tweets are similar to user B's; they might also live in totally different places.
  7. So, comparing our cluster results with already-formed Twitter relationships was not fruitful.
  8. As per our discussions with Jacob & the Professor, we decided that in addition to this evaluation we would conduct a limited user study: we would contact actual Twitter users & provide each of them with a couple of users that are similar to them or in the same cluster & a couple of users that are not at all similar to them.
  9. To this effect,
    • we have created a web portlet at http://www.cs.indiana.edu/cgi-pub/cdtank/websiteBuild/twitterUserStudy.php where the users can come and log in using the credentials provided to them in an earlier email.
    • after logging in they will be presented with a list of users (hyperlink to each user's twitter profile is also provided) with 3 choices for each user,
      • I Liked the user profile
      • I Disliked the user profile
      • I have mixed feelings about this profile
    • they can then submit their votes & can also send us comments about their profile.
  10. This will provide us with real feedback, which will help us better gauge the performance of our recommendation engine.

Week 11

  1. We have identified 13 Twitter users to whom we will send out invitations for testing our recommendations. In order to streamline all the Twitter activities for our user recommendation engine, we have created the account @tw_user_suggest.
  2. Until now we were running our clustering scripts on generic nodes of the Quarry supercomputer. The algorithm we ran was repeated bisections. When we shifted to HAC, our script was aborted many times due to its extreme memory & processing requirements. We decided to ask the Quarry sys-admin for a feasible solution. George Turner, the quarry/hps sys-admin, was kind enough to allocate a dedicated node for our computations.
  3. After this we tried various combinations of cluster granularity & similarity methods such as cos, correlation, etc. Script runtimes varied from 2 minutes to 15 hours depending upon the parameters chosen.
  4. Simultaneously we developed scripts to analyse the clustering results. For each member of a cluster we checked whether the other members of the cluster were directly related to it, either as a follower or a friend. We aggregated the results per member, per cluster & over all the results. This evaluation serves as an alternative to the user study that we will perform. We did not consider clusters with only 1 member in our computations, since such a member has no one to be compared with.
  5. After the initial roll-out of user study invitations to the chosen 13 users, we plan to contact Twitter users from our clustering results to see if they are interested in our recommendations. We will try to add their results as & when time permits.
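
The cluster analysis from item 4 above can be sketched as follows. This Python example is an assumption-laden illustration, not the project's scripts: it scores each member by the fraction of its cluster-mates that it follows or is followed by, averages those scores per cluster and overall, and skips single-member clusters. The exact metric used in the project is not stated in this report, and the names `related` and `evaluate_clusters` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def related(a: str, b: str, follows: Dict[str, Set[str]]) -> bool:
    """True if a follows b or b follows a (a direct relationship)."""
    return b in follows.get(a, set()) or a in follows.get(b, set())


def evaluate_clusters(
    clusters: Dict[int, List[str]],
    follows: Dict[str, Set[str]],
) -> Tuple[Dict[str, float], Dict[int, float], float]:
    """Return (score per member, score per cluster, overall score).

    Assumes each user appears in exactly one cluster and that
    follows[u] is the set of users that u follows.
    """
    member_scores: Dict[str, float] = {}
    cluster_scores: Dict[int, float] = {}

    for cluster_id, members in clusters.items():
        if len(members) < 2:
            continue  # single-member clusters are not considered

        scores: List[float] = []
        for member in members:
            others = [other for other in members if other != member]
            hits = sum(1 for other in others if related(member, other, follows))
            member_scores[member] = hits / len(others)
            scores.append(member_scores[member])
        cluster_scores[cluster_id] = sum(scores) / len(scores)

    overall = (
        sum(member_scores.values()) / len(member_scores)
        if member_scores
        else 0.0
    )
    return member_scores, cluster_scores, overall
```

A score close to 1.0 for a cluster means most of its members are already connected on Twitter; as noted in Week 10, a low score does not by itself show that the recommendations are poor.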

Week 12 - current

  1. In addition to the 13 Twitter users, we sent out survey invitations to 24 more Twitter users who were in the same clusters as the 13 users.
  2. We have received results from 7 users so far & are waiting for the other responses.
  3. Simultaneously we have started work on our paper & presentation.




 

Milestone                             Start Date     End Date       Revised Start Date   Revised End Date

1  Finalize Requirements              Feb 3, 2009    Feb 10, 2009   Feb 3, 2009          Feb 10, 2009
2  Data Extraction & Refining         Feb 10, 2009   Feb 24, 2009   Feb 10, 2009         Mar 2, 2009
3  Implementation:
     a. Feature Selection             Feb 24, 2009   Mar 10, 2009   Mar 2, 2009          Mar 27, 2009
     b. Mining Algorithms             Mar 10, 2009   Mar 31, 2009   Mar 27, 2009         Apr 18, 2009
4  Testing, Performance Evaluation    Mar 31, 2009   --             Apr 18, 2009



notes



1 The amount of data required by the team has been extracted, but extraction is a continuing process. We will freeze data collection/scraping later in the project.