By Kai Xin
Data Scientist, Lazada
Building large hybrid
recommender systems
What’s a recommender?
We have the
classic retail examples
It is no longer good enough to recommend an average product to an average user.
50% male 50% female?
A good recommender understands consumer needs and provides timely, useful, non-obvious and diverse recommendations to users.
Now. Instant. At time of purchase
Discounts. Discovery of new stuff.
I bought the book; can you recommend similar books by other authors?
Or in plain language of the me generation:
I demand that you customize your recommendations to me.
Also, I want it now.
How? By Data science + Technology
Data science + Technology:
I can build a recommender using 21M movie ratings from MovieLens and optimize it in 30 minutes on an Amazon EC2 cluster, using a distributed collaborative filtering algorithm (alternating least squares) from Apache Spark’s MLlib. Total cost: $1
I can tweak my $1 recommender and
personalize education for millions of children to reduce dropouts.
I can tweak my $1 recommender to
customize a healthy lifestyle plan
for individuals.
I can tweak my $1 recommender and
match donors with clients to
improve donation frequency.
Data science + Technology =
Power to change the world
Two parts to this talk
Data Science
Technology
We will start with a small, simple movie recommender and move on to larger recommenders for the me me me generation.
Future work...
Social, psychological aspects of human behavior
How does a recommender work?
Crowd behavior helps understand individual behavior
The user’s actions and history feed into the system
Recommendation System
Collaborative Filtering
Content Filtering
Content Filtering (Fail?)
Usually we use a mix of both: for example, content filtering to narrow down the options and collaborative filtering to rank the most relevant ones.
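A minimal sketch of that hybrid idea in Python (the toy data structures and the cf_score function are assumptions for illustration, not Lazada’s system):

def hybrid_recommend(user, items, cf_score, top_n=10):
    # content filter: keep items in categories the user has shown interest in
    candidates = [item for item in items
                  if item["category"] in user["preferred_categories"]]
    # collaborative filter: rank the narrowed set by predicted rating
    candidates.sort(key=lambda item: cf_score(user["id"], item["id"]), reverse=True)
    return candidates[:top_n]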
Part I: Technology
Movie recommender case
GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org).
The full data set contains 21,000,000 ratings and 470,000 tag applications applied to 27,000 movies by 230,000 users.
Last updated 4/2015.
How do we scale up recommenders? What’s the big deal about Spark?
Ways to run Spark
Spark directed acyclic graph
“Compared to MapReduce, which creates a DAG with two predefined stages - Map and Reduce, DAGs created by Spark can contain any number of stages.
This allows some jobs to complete faster than they would in MapReduce, with simple jobs completing after just one stage, and more complex tasks completing in a single run of many stages, rather than having to be split into multiple jobs.”
~ MapR Anatomy of a Spark Job
How Spark works - stages
Spark: Flexible and Lazy
Basically, Spark’s DAG setup makes it possible to optimize the workflow and allows iterative and interactive data exploration (great for building data science models).
Also, Spark jobs are lazy by nature, which lets Spark run more efficiently: it can realize that a dataset created through map will only be used in a reduce, and return only the result of the reduce to the driver rather than the larger mapped dataset.
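A quick illustration of that laziness (assuming a live SparkContext sc; this mirrors the example in the Spark programming guide):

lines = sc.textFile("hdfs://...")              # transformation: nothing is read yet
lengths = lines.map(lambda line: len(line))    # transformation: still nothing computed
total = lengths.reduce(lambda a, b: a + b)     # action: the whole pipeline runs now
# only `total`, a single number, travels back to the driver;
# the intermediate `lengths` dataset is never shipped there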
Resilient Distributed Datasets (RDD)
“Spark jobs perform work on Resilient Distributed Datasets (RDDs), an abstraction for a collection of elements that can be operated on in parallel.
When running Spark in a Hadoop cluster, RDDs are created from files in the distributed file system in any format supported by Hadoop, such as text files, SequenceFiles, or anything else supported by a Hadoop InputFormat.”
~ MapR Anatomy of a Spark Job
Common RDD operations
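As a rough pySpark cheat sheet (my summary, assuming a SparkContext sc):

rdd = sc.parallelize([1, 2, 3, 4, 5])
# transformations (lazy) return a new RDD
evens = rdd.filter(lambda x: x % 2 == 0)
pairs = rdd.map(lambda x: (x % 2, x))
sums = pairs.reduceByKey(lambda a, b: a + b)   # sum the values per key
# actions (eager) trigger computation and return results to the driver
rdd.count()                      # 5
evens.collect()                  # [2, 4]
rdd.take(3)                      # [1, 2, 3]
rdd.reduce(lambda a, b: a + b)   # 15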
Spark: Speed & Streaming
Fast and scalable
machine learning libraries (MLlib)
Spark’s collaborative filtering algorithm: Alternating least squares (ALS)
More precisely: Blocked Alternating least squares (ALS)
http://data-artisans.com/als.html , http://www.netlib.org/lapack/
Why is only blocked ALS implemented in Spark? Because many fancier algorithms don’t scale well and are not yet suitable for large systems
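For reference, a sketch of the objective ALS minimizes (notation mine; plain L2 regularization shown): learn user factors x_u and item factors y_i so that r_ui ≈ x_u^T y_i:

\min_{X,Y} \sum_{(u,i)\ \text{observed}} \left( r_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

Fixing Y turns each x_u into an ordinary least-squares solve (and vice versa), hence the alternation; the blocked variant partitions users and items into blocks so each node only ships the factor vectors its block actually needs.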
Code in Python, Java or Scala
# word count: split lines into words, pair each word with 1, then sum the counts
file = sc.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
Fairly easy to set up & submit jobs to a cluster
Create cluster with 4 workers on EC2
spark-ec2 -k key -i key.pem --region=ap-southeast-1 -s 4 launch myCluster
Submit python code to cluster
spark-submit --verbose code.py hdfs://…
How do we set up Spark on EC2?
(Self Study)
EC2 instances price table:
Spot Instance price table:
Bid for leftover instances to save money, but AWS can terminate them if the market price rises above your bid.
Read Spark Documentation: https://spark.apache.org/docs/latest/
1. Download spark: https://spark.apache.org/downloads.html
2. Run example, e.g. ./bin/run-example SparkPi 10
(if error, check JAVA_HOME or go back to step 0)
> Pi is roughly 3.13918
3. Code pySpark file
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

# create the Spark context (the app name is arbitrary)
sc = SparkContext(appName="MovieLensALS")

# load training and test data into (user, product, rating) tuples
def parseRating(line):
    fields = line.split()
    return (int(fields[0]), int(fields[1]), float(fields[2]))

training = sc.textFile("...").map(parseRating).cache()
test = sc.textFile("...").map(parseRating)

# train a recommendation model
model = ALS.train(training, rank = 10, iterations = 5)

# make predictions on (user, product) pairs from the test data
predictions = model.predictAll(test.map(lambda x: (x[0], x[1])))
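To score the model, a common next step (a sketch following the pattern in the MLlib docs) is to join the predictions back to the held-out ratings and compute the RMSE:

# key both RDDs by (user, product), join actual with predicted, average squared error
ratesAndPreds = test.map(lambda x: ((x[0], x[1]), x[2])) \
    .join(predictions.map(lambda p: ((p[0], p[1]), p[2])))
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()
print("Test RMSE = %f" % (MSE ** 0.5))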
4. Read documentation on EC2
Important, do this!
4.1 Create Amazon Keypair
Remember to chmod the key to 400 (chmod 400 key.pem)!
4.2 Security credentials
Obtain Access Keys (Access Key ID and Secret Access Key)
5. Initiate cluster
(if error, check AWS keys see image in step 4)
spark-ec2 -k key -i key.pem \
--region=us-east-1 --zone=us-east-1c --spot-price=0.141 \
--instance-type=r3.large --slaves 3 launch kx-spark-cluster
6. Login to cluster
spark-ec2 -k key -i key.pem \
--region=us-east-1 --zone=us-east-1c \
login kx-spark-cluster
7. Spark command line log too verbose
Copy ~/spark/conf/log4j.properties.template to log4j.properties, then change
log4j.rootCategory=INFO, console => log4j.rootCategory=WARN, console
8. Add files to hdfs storage
*HDFS files will be lost once you power off the EC2 instance
scp movielens.tar.bz2 root@[masterNodeIP]:~/data
~/ephemeral-hdfs/bin/hadoop fs -put movielens movielens
9. Submit python job to cluster
~/spark/bin/spark-submit --verbose \
~/data/movielens/mlALS2.py \
hdfs:///user/root/movielens/data
9. Submit python job to cluster
~/spark/bin/spark-submit --verbose ⇒ command
~/data/movielens/mlALS2.py ⇒ python code
hdfs:///user/root/movielens/data ⇒ data
10. Stop / Start / Destroy Cluster
spark-ec2 -k key -i key.pem --region=us-east-1 --zone=us-east-1c \
stop kx-spark-cluster
[stop / start / destroy]
Coming soon… in Part II
Python code
Results: Tuning ALS
Evaluation metrics
More print screens
Part II - Data Science
Overview of recommenders
Non Personalized
Personalized
Overview of recommenders - Audible
Non personalized, overall best sellers
Personalized - implicit, based on past purchases
Overview of recommenders - Audible
Improve the recommender by training it with your preferences
Overview of recommenders
Non Personalized
Personalized
Good recommenders are not only about matching past behaviors but also about discovering new interests and adjusting future behaviors.
First rule of evaluation metrics:
Normalization is extremely important!
For a very strict customer, 3 stars is “good”.
For a lenient user who rates everything 4 stars and above, 3 stars is “very bad”.
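A common normalization (a plain-Python sketch; the toy data structure is mine) is to mean-center each user’s ratings, so that “above this user’s average” becomes comparable across users:

def mean_center(ratings_by_user):
    # ratings_by_user: dict of user id -> list of (item, rating) pairs (hypothetical)
    normalized = {}
    for user, ratings in ratings_by_user.items():
        avg = sum(r for _, r in ratings) / len(ratings)
        normalized[user] = [(item, r - avg) for item, r in ratings]
    return normalized
# the strict user's 3 stars becomes a positive score,
# the lenient user's 3 stars becomes a negative one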
Second rule of evaluation metrics:
We care about long-term statistical significance, not one-off 0.1% improvements. Also watch out for confidence intervals.
Common tools: the sign test or the paired Student’s t-test.
In essence, we compare the paired per-user results of two recommenders, as in the sketch below:
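For instance, with per-user scores from recommenders A and B over the same users (the numbers here are made up), scipy gives the paired t-test directly:

from scipy import stats

scores_a = [0.61, 0.55, 0.70, 0.64, 0.58]   # hypothetical per-user metric, system A
scores_b = [0.66, 0.54, 0.75, 0.69, 0.63]   # same users, system B

# paired Student's t-test: are the per-user differences significantly nonzero?
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print("t = %.3f, p = %.3f" % (t_stat, p_value))
# the sign test is the distribution-free cousin: count the users where B beats A
# and test that count against a fair coin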
Overview of metrics
Experiments
Historical matching
Discovery & Adjustment
Experiments
Components
Note: not all users are equal; a user who buys many items may deserve a higher weight on their opinions.
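One simple way to encode that in an offline metric (a sketch; the purchase-count weighting is my illustration, not from the talk):

def weighted_mean_error(error_by_user, purchases_by_user):
    # both args: dicts keyed by user id (hypothetical structure)
    total_weight = sum(purchases_by_user[u] for u in error_by_user)
    return sum(error_by_user[u] * purchases_by_user[u]
               for u in error_by_user) / total_weight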
Offline Experiments
(build and run models on machines)
Strength
Weakness
Tips for Offline Experiments
Ways to split the data: temporal order | random time | fixed time | random sample
(a trade-off between cheaper and more accurate: temporal-order splits simulate real usage most closely, random samples are cheapest)
How about streaming - real time order?
User Study Group
Strength
Weakness
Online Experiments
(A-B testing on website)
Strength
Weakness
Tips for Online Experiments
Accuracy Metrics
How closely can you model historical data
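For reference, the two standard accuracy metrics (textbook definitions, with \hat{r}_{ui} the predicted and r_{ui} the actual rating over N test pairs):

\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{(u,i)} \left( r_{ui} - \hat{r}_{ui} \right)^2 } \qquad \mathrm{MAE} = \frac{1}{N} \sum_{(u,i)} \left| r_{ui} - \hat{r}_{ui} \right|

RMSE penalizes large errors more heavily than MAE.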
Beyond accuracy -
Novelty & Diversity & Serendipity
Beyond accuracy -
Robustness, Adaptivity & Scalability
Beyond accuracy -
Coverage
Beyond accuracy -
Trust and Risk
Beyond accuracy -
Utility & Privacy