1 of 86

By Kai Xin

Data Scientist, Lazada

Building large hybrid

recommender systems

2 of 86

What’s a recommender?

3 of 86

We have the

classic retail examples

4 of 86

It is no longer good enough to recommend an average product to an average user.

Check out more of Seth Godin’s thoughts and books:

http://sethgodin.typepad.com/

50% male 50% female?

5 of 86

A good recommender understands consumer needs and provides timely, useful, non-obvious and diverse recommendations to users.

Now. Instant. At time of purchase

Discounts. Discovery of new stuff.

I bought the book, can you recommend me similar books from other authors?

6 of 86

7 of 86

Or in plain language of the me generation:

I demand that you customize your recommendations to me.

Also, I want it now.

8 of 86

Or in plain language of the me generation:

Customize your recommendations to me.

And I want it now.

How? By Data science + Technology

9 of 86

Data science + Technology:

I can build a recommender using 21M movie ratings from MovieLens and optimize it in 30 minutes on an Amazon EC2 cluster, using a distributed collaborative filtering algorithm (alternating least squares) from Apache Spark’s MLlib. Total cost: $1

10 of 86

I can tweak my $1 recommender and

personalize education for millions of children to reduce dropouts.

11 of 86

I can tweak my $1 recommender to

customize a healthy lifestyle plan

for individuals.

12 of 86

I can tweak my $1 recommender and

match donors with clients to

improve donation frequency.

13 of 86

Data science + Technology =

Power to change the world

14 of 86

Two parts to this talk

Data Science

  • How does it work? What are the algorithms?
  • Cold start problem: how do we recommend to new users?
  • How do we tune & evaluate the model?

Technology

  • How do we scale up the solution to 20M, 200M, 2B, 20B ratings?
  • How can we recommend in real time?

We will start with a small, simple movie recommender and move on to larger recommenders for the me me me generation.

15 of 86

Future work...

Social, psychological aspects of human behavior

  • Motivations & Rewards
  • Predictably Irrational
  • Social network theories

16 of 86

How does a recommender work?

17 of 86

Crowd behavior helps us understand individual behavior

18 of 86

The user’s actions and history feed into the system

Recommendation System

19 of 86

Collaborative Filtering

20 of 86

Content Filtering

21 of 86

Content Filtering (Fail?)

22 of 86

Usually we use a mix of both: for example, use content filtering to narrow down the options and collaborative filtering to recommend the most relevant items.
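A tiny sketch of that cascade idea; matches_content_profile and cf_score are hypothetical helpers, not from the slides:

def recommend(user, catalog, top_n=10):
    # 1) content filter: keep only items that match the user's profile (hypothetical helper)
    candidates = [item for item in catalog if matches_content_profile(user, item)]
    # 2) collaborative filter: rank the remaining candidates by a CF score (hypothetical helper)
    return sorted(candidates, key=lambda item: cf_score(user, item), reverse=True)[:top_n]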

23 of 86

Part I : Technology

24 of 86

Movie recommender case

GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org).

The full data contains 21,000,000 ratings and 470,000 tag applications applied to 27,000 movies by 230,000 users.

Last updated 4/2015.

25 of 86

References

26 of 86

How do we scale up recommenders? What’s the big deal about Spark?

27 of 86

Ways to run Spark

28 of 86

Spark directed acyclic graph

“Compared to MapReduce, which creates a DAG with two predefined stages - Map and Reduce, DAGs created by Spark can contain any number of stages.

This allows some jobs to complete faster than they would in MapReduce, with simple jobs completing after just one stage, and more complex tasks completing in a single run of many stages, rather than having to be split into multiple jobs.”

~ MapR Anatomy of a Spark Job

29 of 86

How Spark works - stages

30 of 86

Spark: Flexible and Lazy

Basically, Spark’s DAG setup makes it flexible enough to optimize the workflow, and it allows iterative and interactive data exploration (great for building data science models).

Also, Spark jobs are lazy in nature. This design lets Spark run more efficiently: it can recognize that a dataset created through map will only be used in a reduce, and return just the result of the reduce to the driver rather than the larger mapped dataset.
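A minimal sketch of that lazy behavior in PySpark (assuming an existing SparkContext named sc; the path placeholder mirrors the examples later in this deck):

lines = sc.textFile("hdfs://...")              # transformation: nothing is read yet
lengths = lines.map(lambda line: len(line))    # transformation: still nothing executed
total = lengths.reduce(lambda a, b: a + b)     # action: runs the whole pipeline and
                                               # returns only a single number to the driver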

31 of 86

Resilient Distributed Datasets (RDD)

“Spark jobs perform work on Resilient Distributed Datasets (RDDs), an abstraction for a collection of elements that can be operated on in parallel.

When running Spark in a Hadoop cluster, RDDs are created from files in the distributed file system in any format supported by Hadoop, such as text files, SequenceFiles, or anything else supported by a Hadoop InputFormat.”

~ MapR Anatomy of a Spark Job

32 of 86

Common RDD operations

  • Transformations: return a new, modified RDD:
    • map(), filter(), sample(), and union().
  • Actions: return the result of a computation performed on an RDD:
    • reduce(), count(), first(), and foreach().
  • Hold RDDs in storage (RDDs are non-persistent by default):
    • cache() [memory only] or persist() [memory / disk] (see the sketch after this list)
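A minimal PySpark sketch of the operations listed above (assuming an existing SparkContext sc; the path and sample fraction are illustrative):

from pyspark import StorageLevel

words = sc.textFile("hdfs://...").flatMap(lambda line: line.split())   # transformations
sample = words.sample(False, 0.01)             # transformation: ~1% sample, no replacement
words.cache()                                  # keep in memory once computed
# words.persist(StorageLevel.MEMORY_AND_DISK)  # alternative: spill to disk if memory is short
print(words.count())                           # action: triggers the computation
print(words.first())                           # action: reuses the cached RDD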

33 of 86

Spark: Speed & Streaming

34 of 86

Fast and scalable

machine learning libraries (MLlib)

35 of 86

Spark’s collaborative filtering algorithm: Alternating least squares (ALS)

36 of 86

More precisely: Blocked Alternating least squares (ALS)

http://data-artisans.com/als.html , http://www.netlib.org/lapack/

37 of 86

Blocked Alternating least squares (ALS)

http://data-artisans.com/als.html , http://www.netlib.org/lapack/

Why is only blocked ALS implemented in Spark? Because many fancier algorithms don’t scale well and are not yet suitable for large systems.

38 of 86

Code with Python, Java, Scala

file = sc.textFile("hdfs://...")            # sc: an existing SparkContext
counts = (file.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

39 of 86

Fairly easy to set up & submit jobs to a cluster

Create cluster with 4 workers on EC2

spark-ec2 -k key -i key.pem --region=ap-southeast-1 -s 4 launch myCluster

Submit python code to cluster

spark-submit code.py hdfs://… --verbose

40 of 86

How do we set up spark on EC2?

(Self Study)

41 of 86

EC2 instances price table:

http://www.ec2instances.info/

42 of 86

Spot Instance price table:

Bid for leftover instances to save money, but AWS can disconnect (terminate) them if the market price rises above your bid.

43 of 86

Read Spark Documentation: https://spark.apache.org/docs/latest/

44 of 86

45 of 86

2. Run example

(if you get an error, check JAVA_HOME or go back to step 0)

  • cd to spark folder
  • ./bin/run-example SparkPi 10

> Pi is roughly 3.13918

46 of 86

3. Code pySpark file

from pyspark.mllib.recommendation import ALS

# load training and test data into (user, product, rating) tuples
def parseRating(line):
    fields = line.split()
    return (int(fields[0]), int(fields[1]), float(fields[2]))

training = sc.textFile("...").map(parseRating).cache()
test = sc.textFile("...").map(parseRating)

# train a recommendation model
model = ALS.train(training, rank=10, iterations=5)

# make predictions on (user, product) pairs from the test data
predictions = model.predictAll(test.map(lambda x: (x[0], x[1])))
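A follow-on sketch (not from the slides) of how the predictions could be joined back to the test ratings to compute RMSE, reusing test and predictions from the code above:

from math import sqrt

# key both RDDs by (user, product) so they can be joined
predicted = predictions.map(lambda r: ((r[0], r[1]), r[2]))
actual = test.map(lambda x: ((x[0], x[1]), x[2]))

# squared error for each (user, product) pair, then the root of the mean
squaredErrors = predicted.join(actual).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2)
print("Test RMSE = %.4f" % sqrt(squaredErrors.mean()))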

47 of 86

Important, do this!

48 of 86

4.1 Create Amazon Keypair

Remember to chmod key to 400!

49 of 86

4.2 Security credentials

Obtain Access Keys (Access Key ID and Secret Access Key)

50 of 86

5. Initiate cluster

(if you get an error, check your AWS keys; see the images in step 4)

  • cd to spark/ec2 folder

  • ./spark-ec2 -k kx-us -i ~/awsKey/kx-us.pem \

--region=us-east-1 --zone=us-east-1c --spot-price=0.141 \

--instance-type=r3.large --slaves 3 launch kx-spark-cluster

51 of 86

5. Initiate cluster

(if you get an error, check your AWS keys; see the images in step 4)

  • ./spark-ec2 ⇒ the command
  • -k kx-us -i ~/awsKey/kx-us.pem ⇒ the keys
  • --region=us-east-1 ⇒ region to launch in
  • --zone=us-east-1c ⇒ zone to launch in
  • --spot-price=0.141 ⇒ the price you are willing to pay
  • --slaves 3 ⇒ number of slaves (worker nodes)
  • launch ⇒ the command to launch the cluster
  • kx-spark-cluster ⇒ name of the cluster

52 of 86

6. Login to cluster

  • cd to spark/ec2 folder

  • ./spark-ec2 -k kx-us -i ~/awsKey/kx-us.pem \

--region=us-east-1 --zone=us-east-1c \

login kx-spark-cluster

53 of 86

7. Spark command-line log too verbose

  • cd to spark/conf folder

  • sed 's/rootCategory=INFO/rootCategory=WARN/' \

log4j.properties.template>log4j.properties


54 of 86

8. Add files to hdfs storage

*HDFS files will be lost once you power off the EC2 instance

  • (on local computer) scp -i ~/awsKey/kx-us.pem \

movielens.tar.bz2 root@[masterNodeIP]:~/data

  • (on master node) tar jxf movielens.tar.bz2

  • (on master node) ~/ephemeral-hdfs/bin/hadoop fs \

-put movielens movielens

55 of 86

9. Submit python job to cluster

~/spark/bin/spark-submit \

~/data/movielens/mlALS2.py \

hdfs:///user/root/movielens/data --verbose

56 of 86

9. Submit python job to cluster

  • ~/spark/bin/spark-submit ⇒ the command
  • ~/data/movielens/mlALS2.py ⇒ the Python code
  • hdfs:///user/root/movielens/data --verbose ⇒ the data (with verbose logging)

57 of 86

10. Stop / Start / Destroy Cluster

  • cd to spark/ec2 folder
  • ./spark-ec2 -k kx-us -i ~/awsKey/kx-us.pem \

--region=us-east-1 --zone=us-east-1c \

stop kx-spark-cluster

[stop / start / destroy]

58 of 86

59 of 86

Coming soon… in Part II

Python code

Results

Tuning ALS

Evaluation metrics

More print screens

60 of 86

61 of 86

Part II - Data Science

62 of 86

Overview of recommenders

Non Personalized

  • Based on overall historical data.
  • Best sellers.
  • New.
  • Editor’s pick.
  • Featured.

Personalized

  • Customized to each user’s behavior.
  • People who bought X also bought Y. (Collaborative filtering)
  • Explicit
    • Ratings, reviews, comments
  • Implicit
    • What they clicked, purchased, followed.

63 of 86

Overview of recommenders - Audible

Non personalized, overall best sellers

Personalized: implicit, based on past purchases

64 of 86

Overview of recommenders - Audible

Improve the recommender by training it with your preferences

65 of 86

Overview of recommenders

Non Personalized

  • Based on overall historical data.
  • Best sellers.
  • New.
  • Editor’s pick.
  • Featured.

Personalized

  • Customized to each user’s behavior.
  • People who bought X also bought Y. (Collaborative filtering)
  • Explicit
    • Ratings, reviews, comments
  • Implicit
    • What they clicked, purchased, followed.

66 of 86

67 of 86

68 of 86

Good recommenders are not only about matching past behaviors but also about discovery of new interests and adjusting future behaviors.

69 of 86

First rule of evaluation metrics:

Normalization is extremely important!

For a very strict customer, 3 stars is "good".

For a lenient user who rates everything 4 stars and above, 3 stars is "very bad".
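A minimal sketch of one common normalization, per-user mean-centering, assuming an RDD of (user, item, rating) tuples named ratings (the name is illustrative, not from the slides):

# average rating per user
userMeans = (ratings.map(lambda x: (x[0], (x[2], 1)))
                    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                    .mapValues(lambda s: s[0] / s[1]))

# subtract each user's mean so "above average for this user" is comparable across users
centered = (ratings.map(lambda x: (x[0], (x[1], x[2])))
                   .join(userMeans)
                   .map(lambda kv: (kv[0], kv[1][0][0], kv[1][0][1] - kv[1][1])))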

70 of 86

Second rule of evaluation metrics:

We care about long-term statistical significance, not one-off 0.1% improvements. Also watch out for confidence intervals.

Common tools: the sign test or a paired Student’s t-test (a small sketch follows the list below).

In essence, we compare:

  • The number of users for whom algorithm A outperforms algorithm B.
  • The number of users for whom algorithm B outperforms algorithm A.
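A small self-contained sketch of the sign test via the normal approximation to the binomial (the win counts are purely illustrative):

from math import sqrt, erf

a_wins, b_wins = 5400, 4900          # users for whom A beat B, and vice versa (ties dropped)
n = a_wins + b_wins

# under the null hypothesis each user is a fair coin flip between A and B
z = (a_wins - 0.5 * n) / sqrt(0.25 * n)
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print("z = %.2f, two-sided p-value = %.4f" % (z, p_value))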

71 of 86

Overview of metrics

Experiments

  • Offline
  • Focus group study
  • Large online experiments (A-B test)

Historical matching

  • RMSE
  • Ranking
  • ROC curve
  • precision@n

Discovery & Adjustment

  • Cold start
  • Serendipity
  • Diversity
  • Utility (lift, up-sell, cross-sell, conversion rate (CR), click-through rate (CTR)).

72 of 86

Experiments

Components

  • Hypothesis
  • Controlling variables (data & algos)
  • Generalization power

Note: Not all users are equal; a user who buys a lot of items may deserve a higher weight on his or her opinions.

73 of 86

Offline Experiments

(build and run models on machines)

Strength

  • Low cost.
  • Good at evaluating models based on historical data.
  • Good at filtering out really bad ideas / models.

Weakness

  • No interaction with customers.
  • Assumes that past behavior represents future behavior.
  • Bad at differentiating between good and great models.

74 of 86

Tips for Offline Experiments

  • Watch out for data bias introduced by sampling (take care when filtering out low counts or randomly sampling data).

  • Ways to split the data for experiments (a small sketch follows below):

Temporal order | random time | fixed time | random sample

(These options trade off cost against accuracy: cheaper at one end, more accurate at the other.)

  • Simulate user behavior (danger zone: if the simulation is wrong, we are building a model on the wrong input).

  • How about streaming, i.e. real-time ordering of events?
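A minimal sketch of two of those splits in PySpark, assuming an RDD of (user, item, rating, timestamp) tuples named ratings; the cutoff value is purely illustrative:

# fixed-time (temporal order) split: train on the past, test on the future
cutoff = 1104537600                                # hypothetical epoch timestamp
training = ratings.filter(lambda x: x[3] < cutoff)
test = ratings.filter(lambda x: x[3] >= cutoff)

# cheaper alternative: a plain random 80/20 split
trainRandom, testRandom = ratings.randomSplit([0.8, 0.2], seed=42)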

75 of 86

User Study Group

Strength

  • Get direct feedback from customers and observe usage behavior.
  • Collect qualitative data that is often crucial for interpreting the quantitative results.
  • Good at testing Serendipity, Diversity.

Weakness

  • Expensive for the small number of people tested.
  • Customers’ behavior is affected by the study-group setting; they may not give true answers.
  • Self selection of volunteers - not representative.
  • Requires an understanding of proper user-study design to frame the questions and setting.

76 of 86

Online Experiments

(A-B testing on website)

Strength

  • Best way to determine utility and long-term metrics (retention, estimated customer lifetime value, etc.).
  • Able to compare and run multiple algorithms and designs on multiple variations of the site.

Weakness

  • Takes time.
  • Can only test one variant at a time.
  • Be careful not to irritate users with constant trials, new design or bad recommendations.

77 of 86

Tips for Online Experiments

  • Sample (redirect) users randomly, so that the comparisons between alternatives are fair.

  • Single out the different aspects of the recommenders, don’t test too many things in one experiment.

  • Run an online evaluation last, after an extensive offline study provides evidence that the models have reasonable performance, and perhaps after a user study that measures users’ attitudes towards the system.

78 of 86

Accuracy Metrics

How closely can you model historical data

  • Root mean square error (RMSE) and its variations (normalized, average, mean absolute error, etc.)
  • Precision@N: the number of items relevant to the customer among the top N recommendations (see the sketch after this list)
  • Ranking: Spearman’s rank correlation, Breese R-Score, Normalized Discounted Cumulative Gain (NDCG)
  • ROC curve and its derivatives (F-measure, area under the curve, global ROC, customer ROC)
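A small plain-Python sketch of Precision@N (the inputs are illustrative, not from the slides):

def precision_at_n(recommended, relevant, n=10):
    # fraction of the top-n recommended items the user actually found relevant
    top_n = recommended[:n]
    hits = sum(1 for item in top_n if item in relevant)
    return hits / float(n)

# hypothetical example: 2 of the top 4 recommendations were relevant -> 0.5
print(precision_at_n(["a", "b", "c", "d"], {"b", "d", "z"}, n=4))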

79 of 86

Beyond accuracy -

Novelty & Diversity & Serendipity

  • Recommending a movie that I do not know, but by the same actor, is novel but not surprising. Recommending a movie I didn’t know I would like, by actors I am not familiar with, is serendipitous. In terms of distance metrics (a small sketch follows this list):

    • Novelty: recommend items within the same cluster.
    • Diversity: distance between the recommended items.
    • Serendipity: recommend items as far away from the customer’s current profile as possible, while scoring whether the customer still takes up the recommendation / increases activity / gives good feedback. (Needs to be tested over time; some users might click the links just because of the surprise factor.)
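A small sketch of one distance-based measure, intra-list diversity (the average pairwise cosine distance among a user's recommended items); it assumes item feature vectors are available, which the slides do not show:

import numpy as np

def intra_list_diversity(item_vectors):
    # average pairwise cosine distance among recommended items (higher = more diverse)
    vecs = np.asarray(item_vectors, dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit.dot(unit.T)                      # cosine similarities, 1.0 on the diagonal
    n = len(vecs)
    return float(np.sum(1.0 - sims)) / (n * (n - 1))

# hypothetical 3-item recommendation list with 2-dimensional content features
print(intra_list_diversity([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))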

80 of 86

Beyond accuracy -

Robustness, Adaptivity & Scalability

  • Robustness is the stability of the recommendations in the presence of fake information or under extreme conditions, such as a large number of requests.
    • Test the system offline by generating fake reviews / traffic (how much fake data or traffic can the system handle before it goes down?).

  • Adaptivity to promotions, sudden trends or user behavior (e.g. after a user rates / comments on an item).
    • Test offline: the amount of difference in recommendations before and after the event.

  • Scalability: run offline tests with larger and larger data sets; does the algorithm scale linearly or exponentially? Trade off accuracy for speed (a small timing sketch follows this list).
    • Throughput: recommendations / second.
    • Latency: the time required to make a recommendation.
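A tiny timing sketch for throughput and latency; recommend and sample_users are hypothetical placeholders for your recommender call and a list of test users:

import time

start = time.time()
for user in sample_users:                 # hypothetical list of user ids
    recommend(user)                       # hypothetical call into the recommender
elapsed = time.time() - start

print("Throughput: %.1f recommendations/second" % (len(sample_users) / elapsed))
print("Mean latency: %.4f seconds/recommendation" % (elapsed / len(sample_users)))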

81 of 86

Beyond accuracy -

Coverage

  • Item coverage & diversity: the percentage of all items that are recommended to users, weighted by popularity or utility, a.k.a. distributional equality. Use the Gini index or Shannon entropy (a small sketch follows this list).

  • User coverage: how to deal with cold start / outliers.

  • Cold start: what is the threshold that defines coldness (less than one week old? no purchases?)? We might need to build separate models tuned to cold items.
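A small sketch of item coverage plus a Gini index over how often each catalog item gets recommended (the counts are purely illustrative):

import numpy as np

def gini(counts):
    # Gini index of the recommendation distribution: 0 = perfectly even, near 1 = concentrated
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2.0 * np.sum(cum) / cum[-1]) / n

rec_counts = [120, 80, 5, 3, 0, 0]      # hypothetical: times each catalog item was recommended
coverage = sum(1 for c in rec_counts if c > 0) / float(len(rec_counts))
print("Item coverage: %.2f, Gini index: %.2f" % (coverage, gini(rec_counts)))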

82 of 86

Beyond accuracy -

Trust and Risk

  • Trust: Build the trust and confidence of customers by anchoring the recommendations with a few items the user already knows and likes (and that people they respect / like also like). Can only be measured online (count the number of recommendations taken up, or the increase in user activity) or by asking users for feedback.

  • Risk: Paying a low price for a potentially fake product, or a shirt size that does not fit and needs to be returned. Different users have different risk appetites.

83 of 86

Beyond accuracy -

Utility & Privacy

  • System utility: lift, up-sell, cross-sell, conversions.

  • User utility: how useful users find the recommendations (click-through rates). Recommendations that offend users are penalized heavily (e.g. recommending a Xiaomi to an avid Apple user). Also measure the ranking of items.

  • Privacy: Be open about what data is being collected, why, and how it will improve the experience. Test offline whether the additional private data really improves model accuracy; if the improvement is minor, maybe don’t collect the data.

84 of 86

85 of 86

86 of 86

Hybrid recommenders

  • It is very complex and expensive to create a single recommender that will work across multiple countries and culture preferences.

  • Instead, we start with several “good enough” recommenders:
    • Collaborative filtering, content-based, demographic, utility- & knowledge-based.
    • Each with its strengths and weaknesses (cold start, data requirements, gray sheep problem, stability vs. plasticity).
    • Deploy them as a hybrid (weighted, switch & mix, feature creation, cascade) - see the sketch below.
    • Measure their performance with the various metrics we learnt from the first paper.
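A minimal sketch of one of those hybrid strategies, a weighted blend; the component recommender functions and weights are hypothetical placeholders, not part of the slides:

def weighted_hybrid(user, recommenders, weights, top_n=10):
    # blend several recommenders' {item: score} outputs using fixed weights
    combined = {}
    for recommend, w in zip(recommenders, weights):
        for item, score in recommend(user).items():
            combined[item] = combined.get(item, 0.0) + w * score
    return sorted(combined, key=combined.get, reverse=True)[:top_n]

# hypothetical usage: weight collaborative filtering higher than the content-based scores
# top_items = weighted_hybrid(user_id, [cf_scores, content_scores], [0.7, 0.3])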