In this project, you will formulate a recommendation engine framework by running the Pearson Correlation Similarity Measure on a given dataset of movie recommendations, and then refine your predictions using clustering algorithms.
Parts of this project will be automatically verified through the Udacity autograder, but the majority of the project will be self-assessed using this rubric.
Part 1: Collaborative Filtering
Recommendation engines generally use two different techniques to make predictions:
We will use Collaborative Filtering to build our recommendation engine.
We will start by mathematically formulating a recommendation engine framework. In a recommendation-system application, there are two classes of entities, users and items. Users have preferences for certain items and these preferences must be discovered from the data. The data is represented as a utility matrix, a value that represents the rating given by that user for that item and is given for each user-item pair. The ratings can take a value between 1 to 5, representing the number of stars that the user gave as a rating for that item.
We will assume that the matrix is sparse, meaning that most of the entries are unknown.
Here is an example of a utility matrix, depicting four users: Ann, Bob, Carl and Doug. The available movie names are HP1, HP2 and HP3 for the Harry Potter movies, and SW1, SW2 and SW3 for Star Wars episodes 1, 2 and 3.
Table 1. An example Utility Matrix for Movie Recommendation (you can fill this out here)
HP1 | HP2 | HP3 | SW1 | SW2 | SW3 | |
Ann | 4 | 1 | ||||
Bob | 5 | 5 | 4 | |||
Carl | 4 | 5 | ||||
Doug | 3 | 3 |
The goal of the recommendation engine is to predict the blanks in a utility matrix. For example, how will Ann rate SW2?
In order to make these predictions, we must first measure similarity of users or items from the rows and columns of the Utility Matrix. The data is too small to draw any reliable conclusions. Examine the preferences of Ann and Carl. They both have rated the same movie but have diametrically opposite opinions on it. We would expect that a good distance measure would set them far apart. We can use different similarity measures for this task, but in this project we will use the Pearson Correlation Similarity Measure.
Let denote the rating given by user to item . If is the set of all items that two users and have rated, then the Pearson Correlation Similarity Measure between the two users is given by
,
where denotes the average rating given by user to all items. To calculate we only consider items that were rated by the user.
We still need to take an additional step to recommend items to users. One way of predicting the value of the utility matrix entry of a given user, U, and a given item, I, is to find the n number of users most similar to U and average their ratings for item I, counting only those among the n similar users who have rated I.
It is generally better to normalize the matrix first. That is, for each of the n users, subtract the average rating for all items from the rating of each item, i. Average the difference for those users who have rated I, and add this average to the average rating that U gives for all items. This normalization adjusts the estimate in the case that U tends to give very high or very low ratings.
Part 1 Instructions:
Part 2: Clustering Users and Items
In this part of the project, you will be using the MovieLens dataset to create a utility matrix. Each of the users and items will be defined in the dataset itself. You should download the part2.py template to get started. Check the entries in the data/README for finding out how the data is given to you.
It is difficult to detect the similarity between items and users because we have little information about user-item pairs in this sparse utility matrix. Even if two items belong to the same genre, they are likely to be very few users who bought or rated both. Similarly, even if two users both like a genre or multiple genres, they may not have bought any items to indicate that.
One way of dealing with this lack of data is to cluster items and/or users. In this example, we will cluster similar movies.
HP | SW | |
Ann | 4 | 1 |
Bob | 4.67 | |
Carl | 4.5 | |
Doug | 3 | 3 |
After clustering the items, we can revise the utility matrix so the columns represent clusters of items and the entry for User U and Cluster C is the average rating that U gave to the members of Cluster C that U did not individually rate. Note that U may have rated none of the cluster members, in which case the entry for U and C is left blank.
Part 2 Instructions: