1 of 86

By Kai Xin

Data Scientist, Lazada

Building large hybrid

recommender systems

2 of 86

What’s a recommender?

3 of 86

We have the

classic retail examples

4 of 86

It is no longer good enough to recommend an average product to an average user.

Check out more of Seth Godin’s thoughts and books:

http://sethgodin.typepad.com/

50% male 50% female?

5 of 86

A good recommender understands consumer needs and provides timely, useful, non-obvious and diverse recommendations to users.

Now. Instant. At time of purchase

Discounts. Discovery of new stuff.

I bought the book, can you recommend me similar books from other authors?

6 of 86

7 of 86

Or in plain language of the me generation:

I demand that you customize your recommendations to me.

Also, I want it now.

8 of 86

Or in plain language of the me generation:

Customize your recommendations to me.

And I want it now.

How? By Data science + Technology

9 of 86

Data science + Technology:

I can build a recommender using 21M movie ratings from MovieLens and optimize it in 30 minutes on an Amazon EC2 cluster, using a distributed collaborative filtering algorithm (alternating least squares) from Apache Spark’s MLlib. Total cost: $1

10 of 86

I can tweak my $1 recommender and

personalize education for millions of children to reduce dropouts.

11 of 86

I can tweak my $1 recommender to

customize a healthy lifestyle plan

for individuals.

12 of 86

I can tweak my $1 recommender and

match donors with clients to

improve donation frequency.

13 of 86

Data science + Technology =

Power to change the world

14 of 86

Two parts to this talk

Data Science

  • How does it work? What are the algorithms?
  • Cold start problem: how do we recommend to new users?
  • How do we tune & evaluate the model?

Technology

  • How do we scale up the solution to 20M, 200M, 2B, 20B ratings?
  • How can we recommend in real time?

We will start with a small, simple movie recommender and move on to larger recommenders for the me me me generation.

15 of 86

Future work...

Social, psychological aspects of human behavior

  • Motivations & Rewards
  • Predictably Irrational
  • Social network theories

16 of 86

How does a recommender work?

17 of 86

Crowd behavior helps us understand individual behavior

18 of 86

The user’s actions and history feed into the system

Recommendation System

19 of 86

Collaborative Filtering

20 of 86

Content Filtering

21 of 86

Content Filtering (Fail?)

22 of 86

Usually we use a mix of both: for example, use content filtering to narrow down the options and collaborative filtering to recommend the most relevant items.
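A tiny sketch of that cascade idea; matches_content_profile and cf_score are hypothetical helpers, not from the slides:

def recommend(user, catalog, top_n=10):
    # 1) content filter: keep only items that match the user's profile (hypothetical helper)
    candidates = [item for item in catalog if matches_content_profile(user, item)]
    # 2) collaborative filter: rank the remaining candidates by a CF score (hypothetical helper)
    return sorted(candidates, key=lambda item: cf_score(user, item), reverse=True)[:top_n]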

23 of 86

Part I : Technology

24 of 86

Movie recommender case

GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org).

The full data contains 21,000,000 ratings and 470,000 tag applications applied to 27,000 movies by 230,000 users.

Last updated 4/2015.

25 of 86

References

26 of 86

How do we scale up recommenders? What’s the big deal about Spark?

27 of 86

Ways to run Spark

28 of 86

Spark directed acyclic graph

“Compared to MapReduce, which creates a DAG with two predefined stages - Map and Reduce, DAGs created by Spark can contain any number of stages.

This allows some jobs to complete faster than they would in MapReduce, with simple jobs completing after just one stage, and more complex tasks completing in a single run of many stages, rather than having to be split into multiple jobs.”

~ MapR Anatomy of a Spark Job

29 of 86

How Spark works - stages

30 of 86

Spark: Flexible and Lazy

Basically, Spark’s DAG setup makes it flexible enough to optimize the workflow, and it allows iterative and interactive data exploration (great for building data science models).

Also, Spark jobs are lazy in nature. This design lets Spark run more efficiently: it can recognize that a dataset created through map will only be used in a reduce, and return just the result of the reduce to the driver rather than the larger mapped dataset.
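A minimal sketch of that lazy behavior in PySpark (assuming an existing SparkContext named sc; the path placeholder mirrors the examples later in this deck):

lines = sc.textFile("hdfs://...")              # transformation: nothing is read yet
lengths = lines.map(lambda line: len(line))    # transformation: still nothing executed
total = lengths.reduce(lambda a, b: a + b)     # action: runs the whole pipeline and
                                               # returns only a single number to the driver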

31 of 86

Resilient Distributed Datasets (RDD)

“Spark jobs perform work on Resilient Distributed Datasets (RDDs), an abstraction for a collection of elements that can be operated on in parallel.

When running Spark in a Hadoop cluster, RDDs are created from files in the distributed file system in any format supported by Hadoop, such as text files, SequenceFiles, or anything else supported by a Hadoop InputFormat.”

~ MapR Anatomy of a Spark Job

32 of 86

Common RDD operations

  • Transformations: return a new, modified RDD:
    • map(), filter(), sample(), and union().
  • Actions: return the result of a computation performed on an RDD:
    • reduce(), count(), first(), and foreach().
  • Hold RDDs in storage (RDDs are non-persistent by default):
    • cache() [memory only] or persist() [memory / disk] (see the sketch after this list)
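A minimal PySpark sketch of the operations listed above (assuming an existing SparkContext sc; the path and sample fraction are illustrative):

from pyspark import StorageLevel

words = sc.textFile("hdfs://...").flatMap(lambda line: line.split())   # transformations
sample = words.sample(False, 0.01)             # transformation: ~1% sample, no replacement
words.cache()                                  # keep in memory once computed
# words.persist(StorageLevel.MEMORY_AND_DISK)  # alternative: spill to disk if memory is short
print(words.count())                           # action: triggers the computation
print(words.first())                           # action: reuses the cached RDD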

33 of 86

Spark: Speed & Streaming

34 of 86

Fast and scalable

machine learning libraries (MLlib)

35 of 86

Spark’s collaborative filtering algorithm: Alternating least squares (ALS)

36 of 86

More precisely: Blocked Alternating least squares (ALS)

http://data-artisans.com/als.html , http://www.netlib.org/lapack/

37 of 86

Blocked Alternating least squares (ALS)

http://data-artisans.com/als.html , http://www.netlib.org/lapack/

Why is only blocked ALS implemented in Spark? Because many fancier algorithms don’t scale well and are not yet suitable for large systems.

38 of 86

Code with Python, Java, Scala

file = sc.textFile("hdfs://...")            # sc: an existing SparkContext
counts = (file.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

39 of 86

Fairly easy to set up & submit jobs to a cluster

Create cluster with 4 workers on EC2

spark-ec2 -k key -i key.pem --region=ap-southeast-1 -s 4 launch myCluster

Submit python code to cluster

spark-submit code.py hdfs://… --verbose

40 of 86

How do we set up spark on EC2?

(Self Study)

41 of 86

EC2 instances price table:

http://www.ec2instances.info/

42 of 86

Spot Instance price table:

Bid for leftover instances to save money, but AWS can disconnect (terminate) them if the market price rises above your bid.

43 of 86

Read Spark Documentation: https://spark.apache.org/docs/latest/

44 of 86

45 of 86

2. Run example

(if you get an error, check JAVA_HOME or go back to step 0)

  • cd to spark folder
  • ./bin/run-example SparkPi 10

> Pi is roughly 3.13918

46 of 86

3. Code pySpark file

from pyspark.mllib.recommendation import ALS

# load training and test data into (user, product, rating) tuples
def parseRating(line):
    fields = line.split()
    return (int(fields[0]), int(fields[1]), float(fields[2]))

training = sc.textFile("...").map(parseRating).cache()
test = sc.textFile("...").map(parseRating)

# train a recommendation model
model = ALS.train(training, rank=10, iterations=5)

# make predictions on (user, product) pairs from the test data
predictions = model.predictAll(test.map(lambda x: (x[0], x[1])))
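A follow-on sketch (not from the slides) of how the predictions could be joined back to the test ratings to compute RMSE, reusing test and predictions from the code above:

from math import sqrt

# key both RDDs by (user, product) so they can be joined
predicted = predictions.map(lambda r: ((r[0], r[1]), r[2]))
actual = test.map(lambda x: ((x[0], x[1]), x[2]))

# squared error for each (user, product) pair, then the root of the mean
squaredErrors = predicted.join(actual).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2)
print("Test RMSE = %.4f" % sqrt(squaredErrors.mean()))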

47 of 86

Important, do this!

48 of 86

4.1 Create Amazon Keypair

Remember to chmod key to 400!

49 of 86

4.2 Security credentials

Obtain Access Keys (Access Key ID and Secret Access Key)

50 of 86

5. Initiate cluster

(if you get an error, check your AWS keys; see the images in step 4)

  • cd to spark/ec2 folder

  • ./spark-ec2 -k kx-us -i ~/awsKey/kx-us.pem \

--region=us-east-1 --zone=us-east-1c --spot-price=0.141 \

--instance-type=r3.large --slaves 3 launch kx-spark-cluster

51 of 86

5. Initiate cluster

(if you get an error, check your AWS keys; see the images in step 4)

  • ./spark-ec2 ⇒ the command
  • -k kx-us -i ~/awsKey/kx-us.pem ⇒ the keys
  • --region=us-east-1 ⇒ region to launch in
  • --zone=us-east-1c ⇒ zone to launch in
  • --spot-price=0.141 ⇒ the price you are willing to pay
  • --slaves 3 ⇒ number of slaves (worker nodes)
  • launch ⇒ the command to launch the cluster
  • kx-spark-cluster ⇒ name of the cluster

52 of 86

6. Login to cluster

  • cd to spark/ec2 folder

  • ./spark-ec2 -k kx-us -i ~/awsKey/kx-us.pem \

--region=us-east-1 --zone=us-east-1c \

login kx-spark-cluster

53 of 86

7. Spark command-line log too verbose

  • cd to spark/conf folder

  • sed 's/rootCategory=INFO/rootCategory=WARN/' \

log4j.properties.template>log4j.properties


54 of 86

8. Add files to hdfs storage

*HDFS files will be lost once you power off the EC2 instance

  • (on local computer) scp -i ~/awsKey/kx-us.pem \

movielens.tar.bz2 root@[masterNodeIP]:~/data

  • (on master node) tar jxf movielens.tar.bz2

  • (on master node) ~/ephemeral-hdfs/bin/hadoop fs \

-put movielens movielens

55 of 86

9. Submit python job to cluster

~/spark/bin/spark-submit \

~/data/movielens/mlALS2.py \

hdfs:///user/root/movielens/data --verbose

56 of 86

9. Submit python job to cluster

  • ~/spark/bin/spark-submit ⇒ the command
  • ~/data/movielens/mlALS2.py ⇒ the Python code
  • hdfs:///user/root/movielens/data --verbose ⇒ the data (with verbose logging)

57 of 86

10. Stop / Start / Destroy Cluster

  • cd to spark/ec2 folder
  • ./spark-ec2 -k kx-us -i ~/awsKey/kx-us.pem \

--region=us-east-1 --zone=us-east-1c \

stop kx-spark-cluster

[stop / start / destroy]

58 of 86

59 of 86

Coming soon… in Part II

Python code

Results

Tuning ALS

Evaluation metrics

More print screens

60 of 86

61 of 86

Part II - Data Science

62 of 86

Overview of recommenders

Non Personalized

  • Based on overall historical data.
  • Best sellers.
  • New.
  • Editor’s pick.
  • Featured.

Personalized

  • Customized to each user’s behavior.
  • People who bought X also bought Y. (Collaborative filtering)
  • Explicit
    • Ratings, reviews, comments
  • Implicit
    • What they clicked, purchased, followed.

63 of 86

Overview of recommenders - Audible

Non personalized, overall best sellers

Personalized: implicit, based on past purchases

64 of 86

Overview of recommenders - Audible

Improve the recommender by training it with your preferences

65 of 86

Overview of recommenders

Non Personalized

  • Based on overall historical data.
  • Best sellers.
  • New.
  • Editor’s pick.
  • Featured.

Personalized

  • Customized to each user’s behavior.
  • People who bought X also bought Y. (Collaborative filtering)
  • Explicit
    • Ratings, reviews, comments
  • Implicit
    • What they clicked, purchased, followed.

66 of 86

67 of 86

68 of 86

Good recommenders are not only about matching past behaviors but also about discovery of new interests and adjusting future behaviors.

69 of 86

First rule of evaluation metrics:

Normalization is extremely important!

For a very strict customer, 3 stars is "good".

For a lenient user who rates everything 4 stars and above, 3 stars is "very bad".
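A minimal sketch of one common normalization, per-user mean-centering, assuming an RDD of (user, item, rating) tuples named ratings (the name is illustrative, not from the slides):

# average rating per user
userMeans = (ratings.map(lambda x: (x[0], (x[2], 1)))
                    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                    .mapValues(lambda s: s[0] / s[1]))

# subtract each user's mean so "above average for this user" is comparable across users
centered = (ratings.map(lambda x: (x[0], (x[1], x[2])))
                   .join(userMeans)
                   .map(lambda kv: (kv[0], kv[1][0][0], kv[1][0][1] - kv[1][1])))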

70 of 86

Second rule of evaluation metrics:

We care about long-term statistical significance, not one-off 0.1% improvements. Also watch out for confidence intervals.

Common tools: the sign test or a paired Student’s t-test (a small sketch follows the list below).

In essence, we compare:

  • The number of users for whom algorithm A outperforms algorithm B.
  • The number of users for whom algorithm B outperforms algorithm A.
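A small self-contained sketch of the sign test via the normal approximation to the binomial (the win counts are purely illustrative):

from math import sqrt, erf

a_wins, b_wins = 5400, 4900          # users for whom A beat B, and vice versa (ties dropped)
n = a_wins + b_wins

# under the null hypothesis each user is a fair coin flip between A and B
z = (a_wins - 0.5 * n) / sqrt(0.25 * n)
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print("z = %.2f, two-sided p-value = %.4f" % (z, p_value))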

71 of 86

Overview of metrics

Experiments

  • Offline
  • Focus group study
  • Large online experiments (A-B test)

Historical matching

  • RMSE
  • Ranking
  • ROC curve
  • precision@n

Discovery & Adjustment

  • Cold start
  • Serendipity
  • Diversity
  • Utility (lift, up-sell, cross-sell, conversion rate (CR), click-through rate (CTR)).

72 of 86

Experiments

Components

  • Hypothesis
  • Controlling variables (data & algos)
  • Generalization power

Note: Not all users are equal; a user who buys a lot of items may deserve a higher weight on his or her opinions.

73 of 86

Offline Experiments

(build and run models on machines)

Strength

  • Low cost.
  • Good at evaluating models based on historical data.
  • Good at filtering out really bad ideas / models.

Weakness

  • No interaction with customers.
  • Assumes that past behavior represents future behavior.
  • Bad at differentiating between good and great models.

74 of 86

Tips for Offline Experiments

  • Watch out for data bias introduced by sampling (take care when filtering out low counts or randomly sampling data).

  • Ways to split the data for experiments (a small sketch follows below):

Temporal order | random time | fixed time | random sample

(These options trade off cost against accuracy: cheaper at one end, more accurate at the other.)

  • Simulate user behavior (danger zone: if the simulation is wrong, we are building a model on the wrong input).

  • How about streaming, i.e. real-time ordering of events?
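A minimal sketch of two of those splits in PySpark, assuming an RDD of (user, item, rating, timestamp) tuples named ratings; the cutoff value is purely illustrative:

# fixed-time (temporal order) split: train on the past, test on the future
cutoff = 1104537600                                # hypothetical epoch timestamp
training = ratings.filter(lambda x: x[3] < cutoff)
test = ratings.filter(lambda x: x[3] >= cutoff)

# cheaper alternative: a plain random 80/20 split
trainRandom, testRandom = ratings.randomSplit([0.8, 0.2], seed=42)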

75 of 86

User Study Group

Strength

  • Get direct feedback from customers and observe usage behavior.
  • Collect qualitative data that is often crucial for interpreting the quantitative results.
  • Good at testing Serendipity, Diversity.

Weakness

  • Expensive for the small number of people tested.
  • Customers’ behavior is affected by the study-group setting; they may not give true answers.
  • Self selection of volunteers - not representative.
  • Requires an understanding of proper user-study design to frame the questions and setting.

76 of 86

Online Experiments

(A-B testing on website)

Strength

  • Best way to determine utility and long-term metrics (retention, estimated customer lifetime value, etc.).
  • Able to compare and run multiple algorithms and designs on multiple variations of the site.

Weakness

  • Takes time.
  • Can only test one variant at a time.
  • Be careful not to irritate users with constant trials, new design or bad recommendations.

77 of 86

Tips for Online Experiments

  • Sample (redirect) users randomly, so that the comparisons between alternatives are fair.

  • Single out the different aspects of the recommenders, don’t test too many things in one experiment.

  • Run an online evaluation last, after an extensive offline study provides evidence that the models have reasonable performance, and perhaps after a user study that measures users’ attitudes towards the system.

78 of 86

Accuracy Metrics

How closely can you model historical data

  • Root mean square error (RMSE) and its variations (normalized, average, mean absolute error, etc.)
  • Precision@N: the number of items relevant to the customer among the top N recommendations (see the sketch after this list)
  • Ranking: Spearman’s rank correlation, Breese R-Score, Normalized Discounted Cumulative Gain (NDCG)
  • ROC curve and its derivatives (F-measure, area under the curve, global ROC, customer ROC)
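A small plain-Python sketch of Precision@N (the inputs are illustrative, not from the slides):

def precision_at_n(recommended, relevant, n=10):
    # fraction of the top-n recommended items the user actually found relevant
    top_n = recommended[:n]
    hits = sum(1 for item in top_n if item in relevant)
    return hits / float(n)

# hypothetical example: 2 of the top 4 recommendations were relevant -> 0.5
print(precision_at_n(["a", "b", "c", "d"], {"b", "d", "z"}, n=4))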

79 of 86

Beyond accuracy -

Novelty & Diversity & Serendipity

  • Recommending a movie that I do not know, but by the same actor, is novel but not surprising. Recommending a movie I didn’t know I would like, by actors I am not familiar with, is serendipitous. In terms of distance metrics (a small sketch follows this list):

    • Novelty: recommend items within the same cluster.
    • Diversity: distance between the recommended items.
    • Serendipity: recommend items as far away from the customer’s current profile as possible, while scoring whether the customer still takes up the recommendation / increases activity / gives good feedback. (Needs to be tested over time; some users might click the links just because of the surprise factor.)
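A small sketch of one distance-based measure, intra-list diversity (the average pairwise cosine distance among a user's recommended items); it assumes item feature vectors are available, which the slides do not show:

import numpy as np

def intra_list_diversity(item_vectors):
    # average pairwise cosine distance among recommended items (higher = more diverse)
    vecs = np.asarray(item_vectors, dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit.dot(unit.T)                      # cosine similarities, 1.0 on the diagonal
    n = len(vecs)
    return float(np.sum(1.0 - sims)) / (n * (n - 1))

# hypothetical 3-item recommendation list with 2-dimensional content features
print(intra_list_diversity([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))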

80 of 86

Beyond accuracy -

Robustness, Adaptivity & Scalability

  • Robustness is the stability of the recommendations in the presence of fake information or under extreme conditions, such as a large number of requests.
    • Test the system offline by generating fake reviews / traffic (how much fake data or traffic can the system handle before it goes down?).

  • Adaptivity to promotions, sudden trends or user behavior (e.g. after a user rates / comments on an item).
    • Test offline: the amount of difference in recommendations before and after the event.

  • Scalability: run offline tests with larger and larger data sets; does the algorithm scale linearly or exponentially? Trade off accuracy for speed (a small timing sketch follows this list).
    • Throughput: recommendations / second.
    • Latency: the time required to make a recommendation.
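A tiny timing sketch for throughput and latency; recommend and sample_users are hypothetical placeholders for your recommender call and a list of test users:

import time

start = time.time()
for user in sample_users:                 # hypothetical list of user ids
    recommend(user)                       # hypothetical call into the recommender
elapsed = time.time() - start

print("Throughput: %.1f recommendations/second" % (len(sample_users) / elapsed))
print("Mean latency: %.4f seconds/recommendation" % (elapsed / len(sample_users)))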

81 of 86

Beyond accuracy -

Coverage

  • Item coverage & diversity: the percentage of all items that are recommended to users, weighted by popularity or utility, a.k.a. distributional equality. Use the Gini index or Shannon entropy (a small sketch follows this list).

  • User coverage: how to deal with cold start / outliers.

  • Cold start: what is the threshold that defines coldness (less than one week old? no purchases?)? We might need to build separate models tuned to cold items.
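A small sketch of item coverage plus a Gini index over how often each catalog item gets recommended (the counts are purely illustrative):

import numpy as np

def gini(counts):
    # Gini index of the recommendation distribution: 0 = perfectly even, near 1 = concentrated
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2.0 * np.sum(cum) / cum[-1]) / n

rec_counts = [120, 80, 5, 3, 0, 0]      # hypothetical: times each catalog item was recommended
coverage = sum(1 for c in rec_counts if c > 0) / float(len(rec_counts))
print("Item coverage: %.2f, Gini index: %.2f" % (coverage, gini(rec_counts)))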

82 of 86

Beyond accuracy -

Trust and Risk

  • Trust: Build the trust and confidence of customers by anchoring the recommendations with a few items the user already knows and likes (and that people they respect / like also like). Can only be measured online (count the number of recommendations taken up, or the increase in user activity) or by asking users for feedback.

  • Risk: Paying a low price for a potentially fake product, or a shirt size that does not fit and needs to be returned. Different users have different risk appetites.

83 of 86

Beyond accuracy -

Utility & Privacy

  • System utility: lift, up-sell, cross-sell, conversions.

  • User utility: how useful users find the recommendations (click-through rates). Recommendations that offend users are penalized heavily (e.g. recommending a Xiaomi to an avid Apple user). Also measure the ranking of items.

  • Privacy: Be open about what data is being collected, why, and how it will improve the experience. Test offline whether the additional private data really improves model accuracy; if the improvement is minor, maybe don’t collect the data.

84 of 86

85 of 86

86 of 86

Hybrid recommenders

  • It is very complex and expensive to create a single recommender that will work across multiple countries and culture preferences.

  • Instead, we start with several “good enough” recommenders:
    • Collaborative filtering, content-based, demographic, utility- & knowledge-based.
    • Each with its strengths and weaknesses (cold start, data requirements, gray sheep problem, stability vs. plasticity).
    • Deploy them as a hybrid (weighted, switch & mix, feature creation, cascade) - see the sketch below.
    • Measure their performance with the various metrics we learnt from the first paper.
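A minimal sketch of one of those hybrid strategies, a weighted blend; the component recommender functions and weights are hypothetical placeholders, not part of the slides:

def weighted_hybrid(user, recommenders, weights, top_n=10):
    # blend several recommenders' {item: score} outputs using fixed weights
    combined = {}
    for recommend, w in zip(recommenders, weights):
        for item, score in recommend(user).items():
            combined[item] = combined.get(item, 0.0) + w * score
    return sorted(combined, key=combined.get, reverse=True)[:top_n]

# hypothetical usage: weight collaborative filtering higher than the content-based scores
# top_items = weighted_hybrid(user_id, [cf_scores, content_scores], [0.7, 0.3])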