Introduction - Project Concept
Twitter is a social networking web application used by millions of people to update their status by answering "What are you doing?" This tweet facilitates micro blogging enabling friends, family, and co–workers to communicate and stay connected through the exchange of quick, frequent answers which may lead to sharing of common thoughts, discussions about the similar topic(s). It implies a possibility of similar set of users who can be recommended to broaden their network. However, Twitter lacks this feature of user recommendation. This motivated us towards building such a Recommendation Engine.
NOTE : Click to goto LiveStatus Page
Milestone 1 - Finalize Requirements Complete Due: Feb 10, 2009
Week 1
After going through couple of iterations the requirements have been finalized. On the basis of disucssions held with Jacob & our primary research on the topic the project proposal was finalized and submitted. Barring one question about "user recommendation" the project proposal was approved. This will be resolved as we make more progress on the project.
Milestone 2 - Data Extraction & Refining Complete1 Due: Feb 24, 2009
Week 2 - 4 For this milestone we have started developing data extraction scripts & reading on various approaches for mining the tweet data simultaneously.
- Codebase Development & Database Server setup
- Set up the database server on dbserv.uits.indiana.edu.
- Researched & found about Twitter APIs used to interface with the Twitter Database.
- Wrote Java scripts for extracting
- Twitter User Information - collecting information like user id, user name, location, follower count, screen name etc.
- Tweets - collecting information like tweet content, user id, created date, status id.
- Wrote PERL scripts to automate the process of compiling the Java Script on IU CS machines and executing them. Did this to leverage the resources on hand at CS department.
- LiveStatus page to show the current statistics of our database results. It can be viewed here.
- Setting up this progress report & the LiveStatus page.
- Researching on different Algorithms
- Unsupervised Learning / Clustering Algorithms in general.
- Bottom-up Approach algorithm by Junghoo Cho, Narayanan Shivakumar Hector Garcia-Molina Link http://is.gd/ksRd
- K-means clustering algorithm
- Due to cap on the database quota we have moved our database to CS MySQL Database server. The process of scraping data is going on right now.
- Added more statistics "Avg. Tweet per User in our DB", "Highest "Number of Tweet" User in our DB " to the statistics page.
- Collection of Geocoding information
- Wrote Java script to scrape geocode information from the Location fields in the User Profiles. After researching on geocoding web services available, settled on the use of Yahoo API for this.
- Updated the Database Schema to save this location Information in the form of
- user id, original location string,
- latitude, longitude,
- address, city, state, zip, country
- Made changes to the script to handle null, malformed & iPhone specific location strings. Added another table "TWITTER_USER_LOCATION_ERROR_INFO" for storing erroneous locations. Next step will be to clean this data based on the priority.
Milestone 3 - Implementation In Progress Due: Apr 18, 2009
Week 5 - 8 However, due to restrictive amount of features per module, the user recommendation engine might not give good results for comparison. So we, are now planning to merge all these three factors into a common field factor = tweets (n number of tweets per user) + location + interest. This factor will then be projected in vector space and then used for comparison.
Week 9- Current Important Stats,
- Tweets : 600,000
- Twitter Profiles : 31,000
- In order to verify our experiments we have started collecting the information regarding followers i.e. all the twitter users that are following a particular user for all the users in our twitter users database.
Monitoring the user-follower pair data at regular intervals we found out that there was good amount of noise.
- There were cases in which a particular user has huge amount of followers possibly a celebrity like Digg's Kevin Rose with 405,726 followers. When considering the goal of our project which is to recommend similar users we decided that this is getting counter-intuitive. So the next step was to find out only the friends for that user i.e. the users that follow Kevin Rose and the users that Kevin Rose follows. This mitigated the problem of too much data and time that would be spent processing it.
- There was also a different kind of noise where twitter users followed a high number of other users. Such was a case with Barack Obama's profile. Currently this user follows 576,558 users and 679,812 users follow Barack Obama. This kind of pattern, we observed in other profiles as well, where they typically belong to social media operators profile. Again for the purpose of our project where we would try to recommend genuine users, we had to mitigate this. So, the case where a user has more than 1000 friends, we only collect the ids of only the first 1000 friends.
- As of now we have collected over 2 million user-follower pairs spanning approx 2000 users.
- In parallel we have started analysing the tweets as mentioned earlier. To this end we have had to tune the stop-word removal tool and stemmer so that it does not corrupt certain assets like hyper-links, references to other twitter users inside the tweet (i.e. words that start with '@' symbol).
- Implementation Details,
- After having looked at a collection of tools, we have settled on using either the command line open source CLUTO algorithm (for HAC) or a graphical extension for it gCLUTO, for formation of clusters in our data.
- doc2mat.pl (Perl script) takes a text file as an input. Every row in the text file is a vector to be represented in feature space. The first string in every row represents unique user id/document id. Output of the script is a document * term matrix(.mat) file. This script has an option to internally perform stemming and removing stop words. (We are trying various combinations of these scripts either by ignoring stemming/stop words removal OR by including them OR by modifying certain metrics).
- vcluster, an executable file as part of the CLUTO package, takes an input as document * term matrix(.mat). This script has several options for clustering (k-means, HAC etc.) and other options for similiarity measures (Eucledian, cosine similiarity etc.). Currently we are applying HAC and suitable k value (number of clusters) cutoff value for the tree.
- Output, Currently we ran the algorithm for 100 users and varying k value from 10 to 50. The performance was better as the value of k increased, however the results were poor after verifying/ comparing the recommendations with actual friends from Twitter. We expect to get better results after increasing the user count.