A study of Urank alogrithm: Wordpress social network as an example
by C.H. Chan aka Chainsaw Riot [e-mail / udacity id: chainsawtiney@gmail.com]
Honorary Research Director, Department of Paediatrics, Kwong Wah Hospital, Hong Kong
License:
This article is copyrighted to CH Chan and released under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license.
1.Aims
The primary objective of this study is to study the properties of Urank algorithm from social network perspective.
2.Introduction
The term “blogosphere” is used to describe the community or social network of interconnected blogs.[1] The interconnectivity of blogs can be best demonstrated by the blogrolls. Blogroll is a list of blogs that the blogger might recommended by providing links to them.[2] Blogroll is usually presented as a sidebar list. Figure 1 shows an example of a blogroll.
Figure 1: A typical blogroll
It is possible to harvest the links of blog on a blogroll using a webcrawler and construct a directed graph of interconnected blogs. Figure 2 shows an schematic diagram of the directed graph of interconnected blogs.
Using a link analysis algorithm such as Urank, we can measure the “relative popularity” of each blogs inside the blogosphere.
Figure 2: The schematic diagram of the directed graph of interconnected blogs
3.Implementation
This study was divided into three parts: 1) Crawling wordpress blogs, 2) Ranking wordpress blogs using Urank, 3) Studying Urank using social network analysis (SNA).
3.1.Crawling wordpress blogs
The web crawler used in this study was based on the web crawler built in Udacity CS 101. The web crawler ‘cwp’ was customized to crawl only the Wordpress blogs and the links in blogrolls. Wordpress blogs can be identified with the ‘generator’ meta tag in the HTML source. Similarly, blogrolls in Wordpress blogs can be identified with the “class = xoxo-blogroll” of UL (unordered list) HTML tag. Parsing of these tag was facilitated by the Beautiful Soup library. [3]
Compared to the original web crawler, the following changes were made in cwp: 1) Instead of crawling the newly added URL of the tocrawl queue, cwp would crawl the oldest member of the tocrawl queue. Therefore, cwp is “first in, first out”. 2) Error logging was added. 3) URLs were cleaned up before adding to the graph. 4) The crawling was terminated when the length of graph equals to 200. 5) The graph and errorlog were “pickled” for further analysis. [4]
The seed blog selected was my blog. (http://blog.tiney.com) [5]
Several well-known Wordpress blogs were not added to the graph by cwp because of HTTP 403 error or Beautiful Soup error. These blogs were added manually. [6]
3.2.Ranking Wordpress blogs and study the convergence of Urank
The graph created and “pickled” from cwp was “cleaned” before feeding to the compute_ranks procedure. For each blog, the outlinks were trimmed to include only the Wordpress blog crawled. The cleaned graph was then fed to the compute_ranks to calculate the Urank for each blog. The compute_ranks procedure was modified from the one built in Udacity. The original compute_ranks had the default value of 10 iterations. This number might be too small in real world situation and the iterative Urank might not been converged. The plot of total absolute successive change of Uranks for each blogs (epsilon) and the number of iterations was plotted. And alternative stopping rule was proposed in compute_ranks2 which based on the value of epsilon to ensure the convergence.
The cleaned graph was formatted into edge list [7] and exported as csv file for further analysis using R. [8]
3.3 Studying Urank using social network analysis
Social network analysis (SNA) is focused on uncovering the pattern of people’s interaction. [9] In the context of SNA, centrality is defined as the measures of the relative importance of a vertex within the graph. [10] There are four commonly used centrality measures, namely, degree centrality, betweenness, closeness, and eigenvector centrality. The interpretations of these measures are listed in table 1. The igraph package [11] is a free software package for visualization and analysis of social network and it is available for C, Ruby, Python and R. In this part of the study, igraph for R was used. [11]
The graph from 3.2 was visualised and the centrality measures of each blog were calculated. The correlations among these centrality measures and Urank was calculated as the non-parametric Kandell Tau rank correlation coefficient.
Measure | Interpretation | Correlation with Urank (Tau) |
Degree centrality | Number of links from and to the blog in question | 0.549 |
Betweenness | The frequency of passing through the blog in question through linkages of all blogs | 0.339 |
Closeness | The total distance to other blogs from the blog in question | -0.182 |
Eigenvector centrality | Influence of the blog in question within all blogs | 0.128 |
A trademarked-ranking algorithm invented by one of the co-founders of the world biggest search engine. I can’t say the P word. | Similar to Eigenvector centrality | 0.857 |
Table 1: The interpretation of centrality measures and their correlation with Urank
4. Findings
4.1 The convergence of Urank
From the current example of 204 Wordpress blogs, the relationship between epsilon and the number of iterations is presented in figure 3. The convergence was quick. The selection of 10 iterations in the original Urank calculation was justified. An alternative version of Urank calculation procedure was proposed based on the tolerated epsilon value. (compute_ranks2)
Figure 3: The relationship between number of iterations and the epsilon
4.2 Summation of Urank
The trademarked-ranking algorithm that inspired the Urank has a beautiful property: for a graph, the summation of calculated scores for each nodes would be added up to one. This properties was not found in Urank for unknown reason. [12] It was reproduced in the current empirical example of 204 Wordpress blogs, the total Uranks was only 0.35. [13]
4.3 Visualization of the graph
The visualization of the graph is presented in figure 4. The size of bubbles is corresponding to the Urank and labelled bubbles are those with Urank larger than the 50th percentile of all Uranks among all 204 Wordpress blogs.
Figure 4: The visualization of the social network of 204 Wordpress blogs. The size of the bubbles corresponding to the Urank of that particular blog. Blogs with red labels are those with Uranks higher than the 50th percentile [Full size: http://dl.dropbox.com/u/5409929/graph%20(2).png]
4.4 Correlation with other centrality measures
The scatterplots are presented in figure 5. The correlation coefficients were presented in table x. The correlation between Urank and degree centrality was highest. The correlations of Urank with closeness and eigenvector centrality were only mild at best.
Figure 5: The correlations of Uranks with various centrality measures. Topleft: Degree, Topright: Betweenness, Bottomleft: Closeness, Bottomright: Eigenvector centrality. The X-axis of all scatterplots is log-scaled.
5. Conclusion
This study has shown some properties of Urank using an example of 204 Wordpress blogs.
6. Acknowledgement
The author would like to thank Prof David Evans and Mr Peter Chapman for the wonderful Udacity CS 101 course.
7. Reference and footnote