udacity contest

A study of Urank alogrithm: Wordpress social network as an example

by C.H. Chan aka Chainsaw Riot [e-mail / udacity id: chainsawtiney@gmail.com]

Honorary Research Director, Department of Paediatrics, Kwong Wah Hospital, Hong Kong

License:

This article is copyrighted to CH Chan and released under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license.

1.Aims

The primary objective of this study is to study the properties of Urank algorithm from social network perspective.

2.Introduction

The term “blogosphere” is used to describe the community or social network of interconnected blogs.[1] The interconnectivity of blogs can be best demonstrated by the blogrolls. Blogroll is a list of blogs that the blogger might recommended by providing links to them.[2] Blogroll is usually presented as a sidebar list. Figure 1 shows an example of a blogroll.

Figure 1: A typical blogroll

It is possible to harvest the links of blog on a blogroll using a webcrawler and construct a directed graph of interconnected blogs. Figure 2 shows an schematic diagram of the directed graph of interconnected blogs.

Using a link analysis algorithm such as Urank, we can measure the “relative popularity” of each blogs inside the blogosphere.

Figure 2: The schematic diagram of the directed graph of interconnected blogs

3.Implementation

This study was divided into three parts: 1) Crawling wordpress blogs, 2) Ranking wordpress blogs using Urank, 3) Studying Urank using social network analysis (SNA).

3.1.Crawling wordpress blogs

The web crawler used in this study was based on the web crawler built in Udacity CS 101. The web crawler ‘cwp’ was customized to crawl only the Wordpress blogs and the links in blogrolls. Wordpress blogs can be identified with the ‘generator’ meta tag in the HTML source. Similarly, blogrolls in Wordpress blogs can be identified with the “class = xoxo-blogroll” of UL (unordered list) HTML tag. Parsing of these tag was facilitated by the Beautiful Soup library. [3]

Compared to the original web crawler, the following changes were made in cwp: 1) Instead of crawling the newly added URL of the tocrawl queue, cwp would crawl the oldest member of the tocrawl queue. Therefore, cwp is “first in, first out”. 2) Error logging was added. 3) URLs were cleaned up before adding to the graph. 4) The crawling was terminated when the length of graph equals to 200. 5) The graph and errorlog were “pickled” for further analysis. [4]

The seed blog selected was my blog. (http://blog.tiney.com) [5]

Several well-known Wordpress blogs were not added to the graph by cwp because of HTTP 403 error or Beautiful Soup error. These blogs were added manually. [6]

3.2.Ranking Wordpress blogs and study the convergence of Urank

The graph created and “pickled” from cwp was “cleaned” before feeding to the compute_ranks procedure. For each blog, the outlinks were trimmed to include only the Wordpress blog crawled. The cleaned graph was then fed to the compute_ranks to calculate the Urank for each blog. The compute_ranks procedure was modified from the one built in Udacity. The original compute_ranks had the default value of 10 iterations. This number might be too small in real world situation and the iterative Urank might not been converged. The plot of total absolute successive change of Uranks for each blogs (epsilon) and the number of iterations was plotted. And alternative stopping rule was proposed in compute_ranks2 which based on the value of epsilon to ensure the convergence.

The cleaned graph was formatted into edge list [7] and exported as csv file for further analysis using R. [8]

3.3 Studying Urank using social network analysis

Social network analysis (SNA) is focused on uncovering the pattern of people’s interaction. [9] In the context of SNA, centrality is defined as the measures of the relative importance of a vertex within the graph. [10] There are four commonly used centrality measures, namely, degree centrality, betweenness, closeness, and eigenvector centrality. The interpretations of these measures are listed in table 1. The igraph package [11] is a free software package for visualization and analysis of social network and it is available for C, Ruby, Python and R. In this part of the study, igraph for R was used. [11]

The graph from 3.2 was visualised and the centrality measures of each blog were calculated. The correlations among these centrality measures and Urank was calculated as the non-parametric Kandell Tau rank correlation coefficient.

Measure	Interpretation	Correlation with Urank (Tau)
Degree centrality	Number of links from and to the blog in question	0.549
Betweenness	The frequency of passing through the blog in question through linkages of all blogs	0.339
Closeness	The total distance to other blogs from the blog in question	-0.182
Eigenvector centrality	Influence of the blog in question within all blogs	0.128
A trademarked-ranking algorithm invented by one of the co-founders of the world biggest search engine. I can’t say the P word.	Similar to Eigenvector centrality	0.857

Table 1: The interpretation of centrality measures and their correlation with Urank

4. Findings

4.1 The convergence of Urank

From the current example of 204 Wordpress blogs, the relationship between epsilon and the number of iterations is presented in figure 3. The convergence was quick. The selection of 10 iterations in the original Urank calculation was justified. An alternative version of Urank calculation procedure was proposed based on the tolerated epsilon value. (compute_ranks2)

Figure 3: The relationship between number of iterations and the epsilon

4.2 Summation of Urank

The trademarked-ranking algorithm that inspired the Urank has a beautiful property: for a graph, the summation of calculated scores for each nodes would be added up to one. This properties was not found in Urank for unknown reason. [12] It was reproduced in the current empirical example of 204 Wordpress blogs, the total Uranks was only 0.35. [13]

4.3 Visualization of the graph

The visualization of the graph is presented in figure 4. The size of bubbles is corresponding to the Urank and labelled bubbles are those with Urank larger than the 50th percentile of all Uranks among all 204 Wordpress blogs.

Figure 4: The visualization of the social network of 204 Wordpress blogs. The size of the bubbles corresponding to the Urank of that particular blog. Blogs with red labels are those with Uranks higher than the 50th percentile [Full size: http://dl.dropbox.com/u/5409929/graph%20(2).png]

4.4 Correlation with other centrality measures

The scatterplots are presented in figure 5. The correlation coefficients were presented in table x. The correlation between Urank and degree centrality was highest. The correlations of Urank with closeness and eigenvector centrality were only mild at best.

Figure 5: The correlations of Uranks with various centrality measures. Topleft: Degree, Topright: Betweenness, Bottomleft: Closeness, Bottomright: Eigenvector centrality. The X-axis of all scatterplots is log-scaled.

5. Conclusion

This study has shown some properties of Urank using an example of 204 Wordpress blogs.

6. Acknowledgement

The author would like to thank Prof David Evans and Mr Peter Chapman for the wonderful Udacity CS 101 course.

7. Reference and footnote

Wikipedia: blogosphere
Wikipedia: Glossary of blogging
http://www.crummy.com/software/BeautifulSoup/
Python v2.7.2 documentation 11.1 - pickle - Python Object Serialization
Narcissistic
The actual reason for manual addition was that these are the blogs of my friends.
NetworkX documentation - edge list
R Development Core Team. R: A Language and Environment for Statistical Computing.
What is Social Network Analysis? International Network for Social Network Analysis
Wikipedia: Centrality
The reason for using R version instead of Python version is that the author is more confident with statistical analysis using R than using Python.
In the video of Unit 6.28, the Uranks for each nodes in the graph were 0.038, 0.116, 0.039, 0.054, 0.033, 0.097. The sum of all Urank score is 0.377.
However, the scores generated by the trademarked-ranking algorithm were added up to one in this Wordpress example. The ranks can be calculated using igraph.