Introduction to
Recommender Systems

Walid Krichene
Nassma Summer School

06/27/2019

Recommendation

YouTube Homepage

Recommended videos

  • related to previous videos I watched
  • currently trending

Recommendation

YouTube video page

Recommended videos

  • related to current video, watch history

Recommendation

Many other applications

  • Retail: recommend product
  • Media: recommend video, news article, social media post
  • Travel: recommend hotel
  • Advertising
  • ...

Amazon Homepage

Why Recommendation?

“More than 70 percent of the time people spend watching videos on the site is now driven by YouTube’s algorithmic recommendations.”

  • Content discovery: corpus size can be too large for manual browsing
  • Revenue
  • User engagement

Recommender System Overview

Candidate generation

Ranking

Terminology

  • Query: information about the user and context
  • Item: entity to be recommended

YouTube homepage recommendation example

  • Query: user watch history
  • Item: video id, title, categories, etc.

Candidate Generation

Candidate Generation

Goal: given query, find set of relevant candidates from entire corpus.

Example: find, say, 100 relevant videos from a corpus of 100M.

Measuring similarity

Training: Learn embeddings for queries and items

  • Content-based filtering
  • Collaborative filtering

Retrieval: given query embedding, return best candidates
according to
similarity measure

[Figure: 2-D embedding space with a query vector and five item embeddings labeled 1–5]

  • dot-product similarity: 3 > 2 > 1 > 4 > 5
  • cosine similarity: 4 > 5 > 3 > 2 > 1
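The two orderings above can be reproduced with a toy computation. The 2-D embeddings below are hypothetical, chosen so the two similarity measures disagree exactly as in the figure: the dot product favors large-norm items, cosine favors items aligned with the query's direction.

```python
import numpy as np

def dot_scores(q, items):
    """Dot-product similarity of the query with every item."""
    return items @ q

def cosine_scores(q, items):
    """Cosine similarity: dot product normalized by vector lengths."""
    return (items @ q) / (np.linalg.norm(items, axis=1) * np.linalg.norm(q))

query = np.array([1.0, 1.0])
# Hypothetical item embeddings 1..5 (rows).
items = np.array([
    [2.0, 0.2],   # 1: large norm, off-direction
    [2.0, 0.5],   # 2
    [3.0, 1.0],   # 3: largest norm -> wins on dot product
    [0.9, 1.0],   # 4: nearly aligned with query -> wins on cosine
    [0.8, 0.9],   # 5
])

print(np.argsort(-dot_scores(query, items)) + 1)     # [3 2 1 4 5]
print(np.argsort(-cosine_scores(query, items)) + 1)  # [4 5 3 2 1]
```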

Content-based filtering

[Figure: binary feature matrix — rows: channels (LastWeekTonight, The Beatles, 3Blue1Brown); columns: features (Music, Math, News, Jazz, Rock)]


Content-based filtering

Similarity based on item features

No need for data about other users

Can capture niche interests

Needs some domain knowledge; recommendations are only as good as the hand-engineered features
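As a small illustration of item-feature similarity, the sketch below scores channels by the dot product of binary feature vectors. The feature assignments follow the figure's labels; "Miles Davis" is a hypothetical extra channel added so that at least one pair shares a feature.

```python
import numpy as np

# Binary feature vectors over [Music, Math, News, Jazz, Rock].
features = {
    "LastWeekTonight": np.array([0, 0, 1, 0, 0]),
    "The Beatles":     np.array([1, 0, 0, 0, 1]),
    "3Blue1Brown":     np.array([0, 1, 0, 0, 0]),
    "Miles Davis":     np.array([1, 0, 0, 1, 0]),  # hypothetical channel
}

def most_similar(name):
    """Return the other item with the highest dot-product feature overlap."""
    q = features[name]
    scores = {k: int(v @ q) for k, v in features.items() if k != name}
    return max(scores, key=scores.get)

print(most_similar("The Beatles"))  # shares the Music feature -> Miles Davis
```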

Collaborative filtering

Uses similarity between users and items simultaneously

  • user A and user B are similar (watched similar movies)
  • user B likes movie Inception
  • recommend Inception to user A

Results in serendipitous recommendations

Collaborative filtering

[Figure: user-movie feedback matrix with observed ratings (4s and 5s)]

Embeddings

[Figure: movies arranged along a 1-D Blockbuster–Arthouse axis]

Collaborative filtering

[Figure: feedback matrix annotated with 1-D user and item embeddings (scalars in [-1, 1]) along the Blockbuster–Arthouse axis]

Embeddings

[Figure: movies in a 2-D embedding space with Blockbuster–Arthouse and Complex–Simple axes]

Collaborative filtering

[Figure: feedback matrix annotated with 2-D user and item embeddings]


Serving: Nearest Neighbors in Embedding Space

Given a query/user embedding u, look for items close to it in embedding space.

  • Exhaustive scoring can be expensive
  • Can use approximate nearest neighbors:
    Maximum Inner Product Search (MIPS), sublinear in the number of items

[Figure: retrieval — stored item embeddings scored against the query embedding]
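Exhaustive scoring is the baseline that approximate MIPS methods try to beat. A minimal sketch (hypothetical corpus; `top_k` is an illustrative name) that scores every item and keeps the top k:

```python
import numpy as np

def top_k(query_emb, item_embs, k=5):
    """Exhaustive maximum inner product search: score all items, keep top k.
    Linear in corpus size; approximate MIPS trades exactness for sublinear time."""
    scores = item_embs @ query_emb
    idx = np.argpartition(-scores, k)[:k]   # unordered top-k in O(n)
    return idx[np.argsort(-scores[idx])]    # sort only the k winners

rng = np.random.default_rng(0)
items = rng.normal(size=(100_000, 32))      # hypothetical corpus embeddings
query = rng.normal(size=32)
print(top_k(query, items, k=5))
```

`argpartition` avoids sorting the whole corpus when only the top k are needed.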

Automatically learning the embeddings

Toy example: manually designed embeddings.

Machine learning: can automatically learn the embeddings.

Many techniques:

  • Matrix Factorization
  • Neural Embedding Models

Matrix Factorization

Low-rank approximation

[Figure: A ≈ U Vᵀ — the feedback matrix factored into 2-D user embeddings U and item embeddings V]
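A low-rank approximation can be sketched with a truncated SVD. The tiny rating matrix below is hypothetical (0 stands for unobserved and is treated as 0 here, as in the SVD loss discussed next):

```python
import numpy as np

# Hypothetical 4x5 feedback matrix (0 = unobserved, treated as 0).
A = np.array([
    [5, 0, 4, 0, 0],
    [0, 4, 0, 5, 0],
    [5, 0, 5, 0, 4],
    [0, 4, 0, 4, 0],
], dtype=float)

# Rank-2 approximation A ~ U V^T via truncated SVD,
# splitting the singular values between the two factors.
u, s, vt = np.linalg.svd(A, full_matrices=False)
k = 2
U = u[:, :k] * np.sqrt(s[:k])        # user embeddings, shape (4, 2)
V = vt[:k, :].T * np.sqrt(s[:k])     # item embeddings, shape (5, 2)
print(np.round(U @ V.T, 1))          # reconstruction of A
```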

Defining a loss function

[Figure: feedback matrix; observed entries shown as 1s]

Observations

Defining a loss function

All entries (SVD): treat unobserved entries as 0

[Figure: full binary matrix — observed entries 1, unobserved entries 0]

Defining a loss function

  • Observed entries only
  • All entries (SVD)
  • Weighted Matrix Factorization

[Figure: full binary matrix with observed and unobserved entries weighted differently]

\sum_{(i, j) \in \text{obs}} (A_{ij} - \langle U_i, V_j\rangle)^2 + \alpha \sum_{(i, j) \not\in \text{obs}} \langle U_i, V_j \rangle^2
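A minimal numeric check of this loss on hypothetical tiny matrices (`weighted_mf_loss` is an illustrative name; the mask marks observed entries):

```python
import numpy as np

def weighted_mf_loss(A, obs_mask, U, V, alpha=0.1):
    """Squared error on observed entries plus an alpha-weighted
    penalty on predictions for unobserved entries."""
    pred = U @ V.T
    obs_err = np.sum(obs_mask * (A - pred) ** 2)
    unobs_err = np.sum((1 - obs_mask) * pred ** 2)
    return obs_err + alpha * unobs_err

A = np.array([[1., 0.], [0., 1.]])
mask = np.array([[1., 0.], [0., 1.]])   # 1 = observed
U = np.array([[1.], [0.]])
V = np.array([[1.], [1.]])
print(weighted_mf_loss(A, mask, U, V, alpha=0.1))  # → 1.1
```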

Negatives and folding

Suppose the data is clustered (colors in the figure distinguish query and item clusters): queries and items from different clusters may end up close in embedding space by chance if no negatives are used ("folding").

\sum_{(i, j) \in \text{obs}} (A_{ij} - \langle U_i, V_j\rangle)^2 + \alpha \sum_{(i, j) \not\in \text{obs}} \langle U_i, V_j \rangle^2

Optimizing the loss

SGD (Stochastic Gradient Descent)

  • Generic (other losses)
  • Parallel
  • Slower
  • Need to sample unobserved

WALS (Weighted Alternating Least Squares)

  • Specific to this loss
  • Parallel
  • Fast (unconstrained quadratic in each variable)
  • Handles unobserved

\sum_{(i, j) \in \text{obs}} (A_{ij} - \langle U_i, V_j\rangle)^2 + \alpha \sum_{(i, j) \not\in \text{obs}} \langle U_i, V_j \rangle^2
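One WALS round for this loss can be sketched as follows: with one side fixed, each row's subproblem is an unconstrained quadratic, solved exactly via its normal equations. Names (`wals_step`, `reg`) and the small ridge term are assumptions for the sketch, not part of the slides.

```python
import numpy as np

def wals_step(A, obs_mask, V, alpha=0.1, reg=1e-3):
    """One round of weighted alternating least squares: solve for U with V
    fixed, then for V with U fixed. Unobserved entries get weight alpha and
    target 0; reg is a small ridge term for numerical stability."""
    def solve_side(X, M, Y):
        out = np.empty((X.shape[0], Y.shape[1]))
        for i in range(X.shape[0]):
            w = np.where(M[i] == 1, 1.0, alpha)            # per-entry weights
            G = (Y * w[:, None]).T @ Y + reg * np.eye(Y.shape[1])
            b = (Y * w[:, None]).T @ (X[i] * M[i])         # unobserved target 0
            out[i] = np.linalg.solve(G, b)
        return out

    U = solve_side(A, obs_mask, V)
    V = solve_side(A.T, obs_mask.T, U)
    return U, V
```

Each solve is exact, which is why WALS converges much faster per step than SGD on this loss.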

Matrix Factorization

  • No domain knowledge needed
  • Serendipitous recommendations

Limitations:

  • Bilinear model only
  • Hard to incorporate context features
  • Hard to handle fresh items / users

Neural Collaborative Filtering

Neural Collaborative Filtering

Neural embedding models address the limitations of traditional matrix factorization:

  • Non-linear model
  • Seamlessly incorporates features

We will look at a simple Softmax model architecture:
predict a distribution over the corpus.

Softmax model

Inputs:
User id, watch history, interests

Softmax model

Hidden layers

Softmax model

Softmax layer:

Maps a vector of scores to a probability distribution

\hat p = \sigma(\psi(x) V^\top)

where \sigma is the softmax function, \psi(x) is the output of the hidden layers, and V is the matrix of item embeddings.

Embedding interpretation

The probability of item i is \hat p_i = \frac{e^{\langle \psi(x), V_i \rangle}}{\sum_j e^{\langle \psi(x), V_j \rangle}}, where

  • \psi(x): query embedding
  • V_i: item embedding

Retrieval:

  • Find i that maximizes \langle \psi(x), V_i \rangle

\hat p = \sigma(\psi(x) V^\top)
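The scoring and retrieval steps above can be sketched numerically. The query embedding psi_x and item matrix V below are hypothetical stand-ins for the hidden-layer output and learned item embeddings:

```python
import numpy as np

def softmax(z):
    """Map a score vector to a probability distribution."""
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

psi_x = np.array([0.5, -1.0, 2.0])   # hypothetical query embedding psi(x)
V = np.array([                        # hypothetical item embeddings (1 row/item)
    [0.1, 0.3, 1.0],
    [1.0, -0.5, 0.2],
    [-0.3, 0.8, -1.0],
])

p_hat = softmax(V @ psi_x)            # distribution over the corpus
top_item = int(np.argmax(V @ psi_x))  # retrieval: maximize <psi(x), V_i>
print(p_hat, top_item)
```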

Loss function

Loss function: \ell({\color{OliveGreen} \hat p(x)}, {\color{NavyBlue} p}), e.g. cross-entropy, where

  • \hat p(x): model output
  • p: ground truth

Negative sampling

The softmax normalizer sums over the entire corpus; this is expensive, so it is approximated, e.g. via sampling.
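A sketch of the idea, assuming uniform negative sampling: score the positive item against a handful of sampled negatives instead of the full corpus. (Production systems also correct for the sampling distribution; `sampled_softmax_loss` and the corpus below are illustrative.)

```python
import numpy as np

def sampled_softmax_loss(psi_x, V, pos, num_neg=5, rng=None):
    """Cross-entropy for the positive item, with the softmax normalizer
    approximated over a few uniformly sampled negatives."""
    rng = rng or np.random.default_rng()
    candidates = np.setdiff1d(np.arange(len(V)), [pos])   # exclude the positive
    neg = rng.choice(candidates, size=num_neg, replace=False)
    idx = np.concatenate(([pos], neg))
    logits = V[idx] @ psi_x
    logits = logits - logits.max()        # numerical stability
    return -logits[0] + np.log(np.exp(logits).sum())

rng = np.random.default_rng(0)
V = rng.normal(size=(1000, 16))   # hypothetical corpus of item embeddings
psi_x = rng.normal(size=16)
print(sampled_softmax_loss(psi_x, V, pos=42, num_neg=5, rng=rng))
```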

Ranking

Ranking

Train a model to score candidates, rank them according to score.

[Figure: several candidate generators feeding a single ranker]

Ranking

Why not score/rank with the candidate generator?

  • Scores from different candidate generators are heterogeneous
  • Smaller pool of candidates => can use more context / a more expensive model

Examples:

  • Learn P(watch | query, item)
  • Learn a pairwise ranking function
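One common form of pairwise ranking objective is the pairwise logistic loss, sketched below on hypothetical scores (this is an assumed concrete choice; the slides do not fix one):

```python
import numpy as np

def pairwise_logistic_loss(score_pos, score_neg):
    """-log sigmoid(score_pos - score_neg): small when the preferred
    candidate outscores the other, large when the order is wrong."""
    return np.log1p(np.exp(-(score_pos - score_neg)))

print(pairwise_logistic_loss(2.0, 0.5))  # small loss: correct order
print(pairwise_logistic_loss(0.5, 2.0))  # larger loss: wrong order
```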

Defining an Objective Function

ML is like a mischievous genie: careful what you wish for!

  • Optimize clicks?
  • Optimize watch time?
  • Optimize a better measure of engagement

Beyond popular items

Want to recommend more than popular items

Solutions:

  • Weight training examples
  • Using more features can help capture relevance

New items (the cold-start problem)

What to do if item is not seen in training?

Solutions:

  • Refresh model often (warm-start)
  • Use models with item features
  • Heuristics to get new item embedding
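One such heuristic can be sketched as initializing a fresh item's embedding from the embeddings of existing items with similar features. Everything below (item names, embeddings, `cold_start_embedding`) is hypothetical:

```python
import numpy as np

# Hypothetical learned embeddings for existing items.
item_embs = {
    "video_a": np.array([0.9, -0.2]),
    "video_b": np.array([0.7, 0.1]),
}

def cold_start_embedding(similar_items):
    """Initialize a new item's embedding as the mean embedding of
    existing items that share its features/category."""
    return np.mean([item_embs[i] for i in similar_items], axis=0)

new_emb = cold_start_embedding(["video_a", "video_b"])
print(new_emb)  # midpoint of the two similar items
```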

Summary

Recommendation models rely on embeddings of users/items

  • Matrix Factorization
  • Neural embedding models

Many important topics not covered: Diversity, Fairness, etc.

This afternoon: build a (simple) movie recommendation engine!

Thank you
