An in-depth look into the world of aggregated review scores

By Uri Gorelik

Abstract. Aggregated review scores are a non-biased way to add objectivity to highly subjective reviews. This article describes Aggregamer, a wrapper for the popular review score aggregator Metacritic. Aggregamer uses data-mining techniques to draw insights from Metacritic’s review score data, such as the average review score and which critics are relevant. This report also describes how Aggregamer uses collaborative filtering to recommend products and RF-Rec to predict review scores.

1 Introduction

How useful is a recommendation? A recommendation is only as useful as the authority under which it is issued. A friend, colleague or family member can provide meaningful, relevant, and accurate recommendations given that they have adequate knowledge within the domain of the recommended item. However, these recommenders can be limited to a small breadth of items and an individual might seek to explore other recommendations provided by professionals.

Professional recommenders, or critics, often have some kind of speciality within a domain which makes them an authority on the subject. At first, it may seem that a critic can provide better recommendations than a friend, but a critic lacks one fundamental piece of information: to whom the item is being recommended. It is the responsibility of the individual to find critics that match their own preferences. In fact, an individual might create a pool of trusted critics and use a combination of their ratings to make a decision on a product.

Video games are hot candidates for critic reviews. With tens of thousands of games being released yearly[1], many games must rely on advertising or recommendations. As a consumer, it is difficult to wade through games that might be worth the investment of time and money. The importance of review scores goes even beyond the consumer. Many smaller games, which do not receive the same attention or budget as “blockbuster” games, may live or die based solely on review scores. Entire companies may even be at risk of crumbling if their review scores are inadequate. Take, for example, development studio Obsidian Entertainment, which missed a crucial bonus from its publisher, Bethesda Softworks, LLC, by falling one point short of a targeted aggregated review score[2].

Aggregated review scores are important to the video game industry. Though many may disagree with the practice, aggregated review scores are used as a quick measure of a game’s quality. They are commonly obtained through a leading review website called Metacritic, which is a CBS Interactive subsidiary. Metacritic’s dominance in the field is evident from simple searches, which yield no obvious competitors. Metacritic is operated like a business, using advertisements and promoting content, and it lacks transparency. It does not clearly state how a critic can apply to be aggregated, nor does it reveal how it weighs individual critics’ reviews:

“Can you tell me how each of the different critics are weighted in your formula?

  Absolutely not.”

        Aggregamer is, at its core, a wrapper for Metacritic. It uses all of the critic scores that Metacritic has collected for a game and weighs them equally. Aggregamer also identifies “critical” critics, a process referred to as partial prejudice critic partitioning, and provides the ability to recalculate the aggregated review score. This article discusses how Aggregamer obtained its data from Metacritic, how it selects which scores to consider, its use of frequency-based score prediction, and its item-to-item recommendation system.

2 Background

5-point rating scale. A common rating scale that uses ratings from 0-5. On this scale 0 is the lowest possible rating while 5 is the highest possible rating. Ratings are commonly referred to as stars, e.g. a 3-star rating, and the scale is loosely related to the Likert scale.

10-point rating scale. A common rating scale that uses ratings from 0-10. On this scale 0 is the lowest possible rating while 10 is the highest possible rating.

10-point-decimal rating scale. A rating scale from 0-100 that is sometimes represented with a decimal, e.g. a score of 74 might be represented as 7.4. This kind of rating scale allows for more precise ratings. Reviewers typically use it as a 20-point rating scale (using only .5 steps between whole numbers), but some rate to the decimal. The 10-point-decimal scale helps to combat the game rating skew by increasing the number of distinct possible ratings (e.g. for a positive review score, it increases the number of ratings from three on a 10-point scale to 30 on a 10-point-decimal scale).

Anecdotal scale. The following (figure 2.1) is an anecdotal description of the rating tiers. Notice that as the scores increase, the range that each tier covers becomes narrower. This is because, as the score moves away from the mean, the set of games that fall within those scores becomes smaller and is more easily defined. Note also that there is currently no game with an aggregated score higher than 97.

“The Wild West,” bad review but no consistency

Bad, but there are worse

Not bad

Massive impact, very wide recognition

Figure 2.1: Anecdotal rating descriptions

Collaborative filtering. Algorithms typically run on large datasets to create recommendations.

Console. A console is the piece of hardware that a game operates on, e.g. Xbox One, Playstation 4, Nintendo Wii U. Computers are generally distinguished from consoles, but can also play games.

Crawler. A tool used to systematically discover and parse URLs on a given domain or many domains.

Game rating skew. A skew that only applies to the 10-point rating scale and the 10-point-decimal rating scale, where the mean rating is 7 (or 70). This means that on a 10-point rating scale a positive review is between 7-10 and on a 10-point-decimal rating scale a positive review is typically between 75-100 (or 80-100 based on opinion).

Metacritic. A website which aggregates review scores for many items, such as: movies, TV, music, and video games. It is the leading website in its area and is widely used as a first resort for seeking reliable review scores within its vast collection of items in each of its reviewed sections.

RF-Rec. A rating prediction algorithm based on rating frequencies.

Wrapper. A general term for something that augments existing functionality with something new.

XPath. A query language for selecting XML nodes.

3 Related Work

Work directly related to this article includes Amazon’s Item-to-Item collaborative filtering algorithm and the RF-Rec algorithm. This article also dabbles in a less sophisticated interpretation of inter-rater reliability.

Amazon’s Item-to-Item collaborative filtering algorithm was originally proposed in 2003 as a scalable way to quickly make recommendations while still using a large dataset. This method creates a product similarity table offline and then uses it in conjunction with the user’s purchased items.

RF-Rec is an algorithm that can be used to predict the scores of items that have not been rated. The user’s most commonly used score (mode) is weighed against the mode score of the item to produce a predicted score.

Inter-rater-reliability is a measurement to determine if raters agree with one another. It is not directly used within this project but there are some considerations to filter out critics who are rating too high or too low and may skew a good aggregated rating.

4 Methodology

This section will discuss four distinct steps in the realization of Aggregamer: obtaining the data from Metacritic, calculating the averages, predicting scores using RF-Rec, and recommending games using item-to-item collaborative filtering.

4.1 Obtaining the data. Metacritic was parsed using a Ruby library called Nokogiri. This library allows an XML document to be navigated using XPaths. First, a list of consoles was obtained through a drop-down menu[3]. Each console has an alphabetical list (letter index) of games, with each list page containing up to 100 items and a next-page link if the list has more than 100 items.

        Secondly, a rudimentary crawler was created to navigate all of the letter index pages as well as any next-page links that were present. Each crawler request was statically delayed by three seconds.
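
The crawl described above can be sketched as follows. This is a minimal illustration in Python rather than the Ruby used for Aggregamer; the function name crawl_index, the fetch_page callable, and the page shape (a list of game links plus an optional next-page URL) are assumptions made for the example and do not come from the original implementation.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical page shape: (game links found on the page, next-page URL or None).
Page = Tuple[List[str], Optional[str]]


def crawl_index(
    start_urls: Iterable[str],
    fetch_page: Callable[[str], Page],
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Visit every letter-index page and follow next-page links.

    A static delay is applied after each request, mirroring the
    three-second delay mentioned in the text.
    """
    games: List[str] = []
    seen = set()
    for url in start_urls:
        current: Optional[str] = url
        while current is not None and current not in seen:
            seen.add(current)  # guard against next-page loops
            links, next_url = fetch_page(current)
            games.extend(links)
            sleep(delay_seconds)  # static politeness delay
            current = next_url
    return games
```

Passing the fetcher and the sleep function as parameters keeps the sketch independent of any particular HTTP or parsing library and makes it easy to exercise against canned pages.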

Figure 4.1.1: An example of an index page for the PC “console” under the letter ‘F’

        In the previous step, each game was stored in a database along with its name and a URL where details about the game can be viewed. Appending the path /critic-reviews[4] to this URL produced a page listing all of the game’s review scores and the critics who gave them. Once this page was obtained, a series of XPaths was used to extract the relevant data. Each page request to obtain a game’s scores was statically delayed by two seconds[5].

.//div[@class='review_section']/div[@class='review_stats']/div[@class='review_grade']/div[contains(@class, 'indiv')]

Figure 4.1.2: An example of an XPath that was used to obtain the reviewer’s score
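
To illustrate how such an XPath selects score nodes, the sketch below applies a close equivalent using Python's standard library instead of Nokogiri. The standard library's limited XPath support has no contains() function, so the final class test is done in Python. The sample markup is invented for the example: only the class names review_section, review_stats, review_grade and indiv come from the XPath above, while critic_reviews, metascore_w and label are placeholders.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (not a real Metacritic page).
SAMPLE_PAGE = """
<div class="critic_reviews">
  <div class="review_section">
    <div class="review_stats">
      <div class="review_grade">
        <div class="metascore_w indiv">85</div>
      </div>
    </div>
  </div>
  <div class="review_section">
    <div class="review_stats">
      <div class="review_grade">
        <div class="metascore_w indiv">70</div>
        <div class="label">n/a</div>
      </div>
    </div>
  </div>
</div>
"""

# Same path as figure 4.1.2, minus the final contains() predicate.
GRADE_PATH = (
    ".//div[@class='review_section']"
    "/div[@class='review_stats']"
    "/div[@class='review_grade']"
    "/div"
)


def extract_scores(markup):
    """Return the integer scores of every node whose class contains 'indiv'."""
    root = ET.fromstring(markup.strip())
    scores = []
    for node in root.findall(GRADE_PATH):
        # Stand-in for contains(@class, 'indiv') in the original XPath.
        if "indiv" in node.get("class", ""):
            scores.append(int(node.text.strip()))
    return scores
```

Real pages are HTML and are not guaranteed to be well-formed XML, which is one reason a tolerant parser such as Nokogiri is the more practical choice for the actual crawl.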

4.2 Calculating the averages. Once all of the review scores had been obtained, a few simple techniques were used to determine the mean review score. The scale on which a critic rated was determined by looking at the critic’s distinct rating scores. For example, if a critic’s distinct scoring set contained five or six elements (e.g. {0,1,2,3,4,5}), then the critic was considered to rate on a 5-point scale. Likewise, if the scoring set contained 10 or 11 elements, then the critic was considered to rate on a 10-point scale. It is necessary to accept both five and six elements, as well as both 10 and 11, because some critics never give the maximum or minimum value. For example, prolific critic GiantBomb does not rate any game as a zero, so its distinct scoring set contains only five elements, whereas a critic who does rate with zeros will have six elements. Critics that rate on a 10-point-decimal scale were assumed to be the complement of the 5-point and 10-point reviewers.
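
The scale-detection rule in this paragraph can be written as a short function. This is a sketch of the rule as described; the function name infer_scale and the string labels are chosen for the example.

```python
def infer_scale(scores):
    """Guess a critic's rating scale from the number of distinct scores used."""
    distinct = len(set(scores))
    if distinct in (5, 6):
        # Five or six distinct values: 5-point scale, with or without zero.
        return "5-point"
    if distinct in (10, 11):
        # Ten or eleven distinct values: 10-point scale, with or without zero.
        return "10-point"
    # Everything else is treated as the complement: the 10-point-decimal scale.
    return "10-point-decimal"
```

Because the rule only counts distinct values, a critic with very few reviews falls into the 10-point-decimal group by default, so the rule is only meaningful for critics with a reasonable number of reviews.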

        The mean review score was calculated individually for each of the scales and it was found that the 5-point reviewers had a mean score of 71, the 10-point reviewers had a mean score of 70, and the 10-point-decimal reviewers had a mean score of 74.

        It is important to consider only critics with ample reviews; otherwise, critics with few reviews could have a large impact on the overall system. For this reason, critics with fewer than 50 reviews were deemed to be in the irrelevant-critic set. After removing all irrelevant critics from the previously mentioned segments, 6 relevant 5-point, 32 relevant 10-point, and 248 relevant 10-point-decimal critics remained.
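
A minimal sketch of the relevance filter follows, assuming reviews are available as (critic, score) pairs. The function name and the data layout are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def relevant_critic_means(reviews, min_reviews=50):
    """Return {critic: mean score} for critics with at least min_reviews reviews.

    reviews is an iterable of (critic, score) pairs.
    """
    by_critic = defaultdict(list)
    for critic, score in reviews:
        by_critic[critic].append(score)
    return {
        critic: mean(scores)
        for critic, scores in by_critic.items()
        if len(scores) >= min_reviews  # fewer than 50 reviews: irrelevant critic
    }
```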


4.3 Predicting scores using RF-Rec. One possible flaw of the 10-point-decimal scale is that the rating frequencies will be quite spread out and yield inaccurate results when compared to other scales. This implementation of RF-Rec therefore rounds to the nearest half point, i.e. a score of 73 is rounded to 75 and a score of 72 is rounded to 70, and then performs the algorithm. With fewer distinct scores to count frequencies over, the hope is to predict scores for games that have fewer reviews or are reviewed by critics that use different scales.
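
The rounding step and the frequency rule can be sketched as follows. The prediction rule used here is the one published by Gedikli and Jannach for RF-Rec: for each candidate rating, multiply (user frequency + 1 + average-rating indicator) by (item frequency + 1 + average-rating indicator) and pick the rating with the largest product. Whether Aggregamer's implementation follows that rule term for term is an assumption, as are the function names and the choice to resolve ties in favour of the lowest rating.

```python
import math
from collections import Counter


def round_to_half(score):
    """Round a 0-100 score to the nearest multiple of 5 (73 -> 75, 72 -> 70)."""
    return int(5 * math.floor(score / 5 + 0.5))


def rf_rec_predict(critic_scores, game_scores):
    """Predict a critic's score for a game from rating frequencies.

    critic_scores: scores the critic has given to other games.
    game_scores: scores the game has received from other critics.
    """
    critic = [round_to_half(s) for s in critic_scores]
    game = [round_to_half(s) for s in game_scores]
    critic_freq, game_freq = Counter(critic), Counter(game)
    # Rounded average ratings, used as the indicator terms.
    critic_avg = round_to_half(sum(critic) / len(critic)) if critic else None
    game_avg = round_to_half(sum(game) / len(game)) if game else None

    best_rating, best_weight = None, -1
    for rating in range(0, 101, 5):
        weight = (critic_freq[rating] + 1 + (rating == critic_avg)) * (
            game_freq[rating] + 1 + (rating == game_avg)
        )
        if weight > best_weight:  # ties keep the lower rating
            best_rating, best_weight = rating, weight
    return best_rating
```

Rounding first means that, for example, scores of 88, 90 and 92 all count toward the same bucket, which concentrates the frequencies that the rule depends on.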

4.4 Recommending games using Item-to-Item collaborative filtering. Using the review scores, item-to-item collaborative filtering was implemented to recommend new games. The algorithm compares two items based on common reviews. Using both sets of reviews the cosine similarity is computed between the two items. The pair with the highest similarity produces a recommendation. In this specific implementation, a required minimum of overlapping reviews was set to 20, i.e. if the two items did not share at least 20 critics they were considered irrelevant.
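
The similarity computation can be sketched as follows, assuming each game's reviews are stored as a mapping from critic to score. Computing the cosine over the shared critics only is an assumption of this sketch, and the names cosine_similarity and recommend are illustrative.

```python
import math


def cosine_similarity(reviews_a, reviews_b, min_overlap=20):
    """Cosine similarity of two games over the critics they share.

    Returns None when fewer than min_overlap critics reviewed both games.
    """
    shared = reviews_a.keys() & reviews_b.keys()
    if len(shared) < min_overlap:
        return None
    dot = sum(reviews_a[c] * reviews_b[c] for c in shared)
    norm_a = math.sqrt(sum(reviews_a[c] ** 2 for c in shared))
    norm_b = math.sqrt(sum(reviews_b[c] ** 2 for c in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target, catalog, top_n=5, min_overlap=20):
    """Return up to top_n titles most similar to target, best first."""
    scored = []
    for title, reviews in catalog.items():
        if title == target:
            continue
        similarity = cosine_similarity(catalog[target], reviews, min_overlap)
        if similarity is not None:  # skip pairs with too little overlap
            scored.append((similarity, title))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_n]]
```

Because review scores are all non-negative and cluster around the same mean, raw cosine values tend to be close to one another; the ranking, not the absolute value, is what drives the recommendation.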

5 Discussion

5.1 Average review score. Through manual analysis of the review scores for critics who rated on the 5-point scale, it was revealed that Metacritic actually weighs scores differently for some critics. For instance, if critic A rates a game with a score of one, Metacritic might translate that score to a 20, whereas for critic B it might be translated to 40. For this reason all reviewers that were found to rate using the 5-point scale were removed from the average.

        However, since there were only six relevant 5-point-critics, the previous average of 74 was not meaningfully affected.

Figure 5.1.1: A bar graph showing the frequency of the top ten most used review scores

        This subset of reviewers was further reduced by only considering reviewers who had a mean review score that was less than or equal to the overall mean of 74; selecting this subset is called partial prejudice critic partitioning. Recalculating the aggregated review score for a sample of popular games yielded very similar scores. Only a handful of games show significant score changes (see figure 5.1.2).

Game Title | Metacritic Score | Aggregamer Score
Grand Theft Auto V | |
Last of Us | |
Mass Effect 3 | |
Pokemon X | |
Super Mario Galaxy | |

Figure 5.1.2: Comparison of Aggregamer’s “Partial Prejudice Critic Partitioning”
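
The partitioning described in this subsection can be sketched as follows: keep the critics whose mean score is at or below the overall mean, then average only their scores for a given game. The function names and data layout are assumptions made for the example, and the numbers in the usage note are invented.

```python
from statistics import mean


def critical_critics(critic_scores, overall_mean=74):
    """Return the critics whose mean score is <= overall_mean.

    critic_scores maps each critic to the list of scores they have given.
    """
    return {
        critic
        for critic, scores in critic_scores.items()
        if scores and mean(scores) <= overall_mean
    }


def recalculated_score(game_reviews, critics):
    """Equal-weight mean of a game's scores from the selected critics only.

    game_reviews maps critic to score for one game. Returns None when no
    selected critic reviewed the game.
    """
    kept = [score for critic, score in game_reviews.items() if critic in critics]
    return mean(kept) if kept else None
```

For example, with critics A (mean 70), B (mean 92.5) and C (mean 74), only A and C are kept, so a game scored 80, 95 and 90 by A, B and C respectively is recalculated from 80 and 90 alone.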

5.2 Predicting scores using RF-Rec. Frequency-based score prediction performed well. Using the same sample of games as in figure 5.1.2, and taking relevant reviewer IGN, a comparison is made between IGN’s original score and its predicted score, obtained via RF-Rec. RF-Rec is very accurate for games whose scores are close to the mean review score of 70, so the list in figure 5.2.1 showcases games that were reviewed above the average to see how it deals with challenging predictions.

Game Title | Real Score | Predicted Score
Grand Theft Auto V | |
Last of Us | |
Mass Effect 3 | |
Pokemon X | |
Super Mario Galaxy | |

Figure 5.2.1: Predicting critic IGN’s review scores for popular games.

        Once Aggregamer allows non-critics to review games, RF-Rec can be expanded to support user-based recommendations (discussed in section 6). By asking a user to review already released games, RF-Rec can be used to predict the user’s score for a new game.

5.3 Recommending games using Item-to-Item collaborative filtering. Recommendations through Aggregamer showed promising results. Choosing a popular title with a sequel produced the sequel in the list of recommendations. For example, finding recommendations for Super Mario Galaxy yielded: BioShock, Half-Life 2, Super Mario Galaxy 2, Rome: Total War, Uncharted 2: Among Thieves. Similarly, recommendations for Pokemon X yielded: Batman: Arkham City, Call of Duty: Modern Warfare 2, Pokemon Y, BlazBlue: Calamity Trigger, Empire: Total War.

        Changing the minimum number of overlapping critics had a significant impact on the recommendations. At first the minimum was set to five, which produced less than favourable results; it finally settled at 20, at which point the recommendations began to include sequels and other related titles. It would also be interesting to optimize the minimum number of overlapping critics required for the cosine similarity calculation.

        While these two sample recommendations did produce popular sequels to the original games, they also produced titles that were hard to imagine as accurate recommendations, given their differences in genre and style. This does not, however, eliminate the possibility that the results would be favourable to the user.

        Similarities can be considered static: one of the benefits of critic-only reviews is that there will eventually come a time when no new reviews are accepted for a game. Thus the similarities can be calculated offline and then quickly looked up when requested. This method offers low overhead for displaying a recommendation while still leveraging the entire system.

6 Conclusion

The world of game reviews can often be misleading as well as intimidating. It is difficult to find reliable and trustworthy reviews when seeking information for a potential new purchase. Websites such as Metacritic offer an extensive database of information on a large variety of games; however, their lack of transparency may raise eyebrows.

        The analysis of mean review scores and relevant critics aims to provide a clearer and truer view of aggregated scores, potentially even eliminating critics who are solely motivated by monetary gain or who are endorsed by a biased party.

Aggregamer also offers a possible solution to what review websites are lacking: game recommendations. By utilizing item-to-item collaborative filtering, Aggregamer was able to offer arguably meaningful recommendations.

        The possibilities for expanding Aggregamer are extensive. One is to compare more games using partial prejudice critic partitioning, to determine whether there are prevalent critics who consistently skew the aggregated score. It would also be beneficial to investigate scoping recommendations to a specific genre or console. A way to evaluate the quality of a recommendation, perhaps through a third-party source, would also be beneficial.

        Aggregamer hopes to one day open its doors to the public and allow for non-critic users to review games and receive meaningful recommendations.

7 References

Linden, G., B. Smith, and J. York. "Amazon.com Recommendations: Item-to-Item Collaborative Filtering." IEEE Internet Computing 7, no. 1 (2003): 76-80. doi:10.1109/MIC.2003.1167344.

"More Games Have Released on Steam so Far in 2014 than All of Last Year." Gamasutra Article. Accessed April 02, 2015.

"Obsidian Missed Fallout: New Vegas Metacritic Bonus by One Point." Engadget. Accessed April 02, 2015.

Gedikli, F., and D. Jannach. "Recommending Based on Rating Frequencies: Accurate Enough?" Commerce and Enterprise Computing (CEC), 2011 IEEE 13th Conference on, September 5-7, 2011. E-ISBN: 978-0-7695-4535-6.

[1] See Gamasutra reference.

[2] See Engadget article.

[3] A common form of navigation which lists a topic and then expands to more specific terms.

[4] Example of a real URL: ``

[5] The entire process took around 18 hours to complete.