1 of 35

Overview of recommendation systems

Important Disclaimer: This presentation is for educational purposes only. Any assumptions, opinions stated here are those of the author alone and do not necessarily represent any official strategy, positioning,statement of author’s current or past employers or any of its affiliates.

2 of 35

Outline

3 of 35

Framework for recommender systems

Static: Given the history of past interactions between users and items, recommend the most item most likely to be engaged with to each user.

Additional information: metadata about items, information about the user and the context.

Temporal: Everything in the framework evolves over time, and the systems need to generalize to future occurrences.

4 of 35

  1. Retrieval ~ from millions to few hundreds�
  2. Reranking ~ score few hundred candidates
    1. Failed impression demotion�
  3. Diversity boosting + multi objective optimization ~ produce final list

Stages in practical recommender systems

5 of 35

Item-Item Similarity

  1. For each pair of items, compute a similarity function �
  2. Given a set if items that the user has interacted with in the past, infer the likelihood of interaction for the user for each remaining item as:�����Where I(u) refers to the set of items the user has interacted with in the past.�
  3. Some examples of item-item similarity:
    1. Jaccard index of categories / topics
    2. Dot-product of document embeddings�
  4. Note that items are sort of helping similar items in being recommended next. As long as the similarity function is relevant to the domain, they are being “collaborative”.�

Example : Cowatch / Covisitation based recommender systems

Notation: We assume we are given interaction scores as a U by I matrix,

6 of 35

User-User Similarity

  1. For each user, from the interactions observed in the past, compute a score S(user, item), that is more positive when the user has interacted more favorably with the item.�
  2. Find users that are similar to the user being served right now using a similarity function over pairs of users�
  3. The personalized score of each item eligible for recommendation is claimed to be a average over scores for that item for all other users, weighted by the similarity of the served-user to the other source user.��

Note:

  1. These scores are “collaborative”. As soon as a user interacts with an item, we should cascade this information to the scores of a number of other users.�
  2. Either, serving recommendations for each user is O(U) or updating recommendations for users after an interaction is O(U). However, in practice, smaller “information networks” should suffice.

7 of 35

Two ideas of progress

  1. Smaller information networks between users i.e. Clustering
  2. Using both user-user and item-item collaboration together

8 of 35

Clustering

  1. Find a clustering of users (often using features or embeddings trained in some other domain)�
  2. For each user find the cluster whose centroid is nearest to the user. �
  3. For each cluster score items: �
  4. Store the ranked list of items per cluster in a table keyed by cluster ids�
  5. Store the cluster numbers for the user in a table keyed by user ids.�
  6. This serving recommendations with a user clustering based approach and is very low latency.

9 of 35

  • Instead of one cluster per user, we find distances to all �
  • Find scores of each item for the cluster�
  • Find scores of items for a user by looking up the ranked lists of multiple clusters and the distances for each cluster.������
  • Helps in diversity of recommendations
  • Closest cluster can change every time we train. This makes recommendations stable.

Clustering | multiple clusters i.e. Gaussian Mixture Model (lite)

10 of 35

For example, in a social network setting with user-user edges and user-group membership edges:

where F(u) is the set of users who are friends of u, F(u, v) is a ��score depicting the strength of this relationship (or activity with each other), S(v,i) is the score of the item for the friend.�Please note that this is similar to the trust propagation setup that was used in the PageRank algorithm (ref: video by Dr. Leskovec).���Similarly, incorporating the group relationship�where G(u) is the set of groups that u is a member of�M(g) is the set of members of group g�A(u, g) is a score of the affinity of user u to group g��

Seed based recommendation

References:

  1. The Youtube Video Recommendation System (2010)
  2. Recommendations using graphs

11 of 35

  1. Find k dimensional vectors for each user and item such that the given score S(u,i) is very similar to the dot product of the user and item vectors.��
  2. The loss function including regularization���Here and are k dimensional embeddings of a user and item respectively.�And refers to the strength of the interaction. �
  3. These embeddings can be trained with Weighted Alternating Least Squares, which keeps one set of embeddings fixed and solves the other one using closed form solutions and then alternates.�

Learn similarity functions from implicit feedback (Matrix Factorization)

References:

  1. Bayesian Personalized Ranking for Implicit Feedback
  2. Collaborative Filtering for implicit datasets

12 of 35

Matrix Factorization (continued)

4. C(u,i) term allows us to � - downsample negatives, especially for users with low number of positives� - encoding features of the interaction, like listen time for a podcast, or a 👎 from the user.��5. Serving recommendations in this approach is also more efficient.�After training, we have k-dimensional embeddings of items. Using approximate nearest neighbor search systems(ref: Hashing with Graphs), the top-N items by closeness to a given user embedding are retrieved.�

Why dot-product?

Since high dot-product retrieval is efficient, we are preferring the loss function that maximizes the dot product of positive examples and minimizes the dot-product of negative examples.

Otherwise a more general formulation would be to add user embedding and item embedding as the input layer of a MLP (ref: Neual Collaborative Filtering)

�Another way to choose recommendations would be to develop a multi-label classification with the (learned) user embedding as input and one label for each of the eligible items. This is similar to the dot-product approach since the softmax weights for an item can be considered its embedding. (ref: End-to-End Deep Attentive Personalized Item Retrieval for Online Content-sharing Platforms)

13 of 35

Some drawbacks of (vanilla) Matrix Factorization

At this point, we are familiar with users and items as points in a k-dimensional space. However some problems with vanilla matrix factorization are:

(a) sparsity of user-item links

(b) data imbalance (ref Matrix Completion from Power-Law Distributed Samples)

(c) the framework does not naturally extend to side features. (Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback, Bayesian Matrix Factorization with Side Information and Dirichlet Process Mixtures, Factorization Machines)

(d) the cold-start problem

However, we have already established:

u as a dense embedding to represent the user, which is currently a function of the user id only and

i, is currently a function of just the item id.�

Idea: To learn embeddings as outputs of trainable functions, i.e. neural networks.

14 of 35

Two tower models

���Where e_u = f ( x_u /*from userid*/, � x_h /*items in history*/,� x_t /*terms in user history*/,� x_f /*user features, context features*/ )

Where e_i = g ( y_i /*from itemid*/, � y_f /*item features*/ )

15 of 35

WARP : for each positive, search a negative that we are not doing well on and use that to compute the gradient��

Negative sampling : Take a lot of negatives, hoping one or more of then we are not doing well on already.

Candidate sampling | WARP loss and Negative sampling

16 of 35

Two tower models of personalized recommendations

Sigmoid,

dot product

Latent embedding

User

history

Terms salient to the user

User/contextfeatures

Latent embedding

Terms in

description

Item features

Embeddings of items in user history

Embeddings of terms in documents in user history, or terms searched by the user.

User Embeddings

Item Embeddings

Embeddings of terms in the document

17 of 35

Two tower models of personalized recommendations

Sigmoid,

dot product

User

history

Terms salient to the user

User/contextfeatures

Terms in

description

Item features

Embeddings of items in user history

Embeddings of terms in documents in user history, or terms searched by the user.

User Embeddings

Item Embeddings

Embeddings of terms in the document

(without trainable latent embeddings)

18 of 35

Personalized Search / Contextual personalized recommendations

Sigmoid,

dot product

Latent embedding

User

History embedding

Terms salient to the user

User/contextfeatures

Latent embedding

Terms in

description

Item features

Embeddings of items in user history

Embeddings of terms in documents in user history, or terms searched by the user.

User Embeddings

Item Embeddings

Embeddings of terms in the document

Element-wise product (user, context)

Context / Query embedding

Query

embedding

19 of 35

Visualizing personalized recommendation as a graph

E.g. in video recommendation

Users

Features

Context

Videos

Features

Channels

Features

Terms

20 of 35

Graph convolutional networks

21 of 35

Heterogeneous Information Networks

Another way to view Seed based recommendations is to look at the information available as a graph/network of nodes of different types and edges depicting different relationships. (ref: Recommendation in heterogeneous information network via dual similarity regularization),

To recommend in this setting of various relationship types, these algorithms construct various “meta-paths” i.e. random walks and then try to find embeddings of nodes whose inner product approximate the observed probability of two nodes co-occurring in these random paths. (ref: DeepWalk: Online Learning of Social Representations )

22 of 35

Deep Dive into GNNs

23 of 35

Agenda

  1. Introduction to Graph Neural Networks (GNNs)
    1. Convolution-based GNNs (“Graph Convolutional Networks”)
    2. GNNs with attention mechanisms
    3. Neural Message Passing formulation
  2. Encoder-decoder GNNs for graph representation learning
    • Link prediction
    • Pre-training
  3. Scaling and temporal graphs

23

24 of 35

Graph Neural Networks (GNNs)

24

Main idea: Pass messages along edges of graph, agglomerate & transform

The bigger picture:

Graph

Node features

Edge features

Scarselli et al., The Graph Neural Network Model (2009)

25 of 35

CNNs (on grids) as message passing

25

(Animation by

Vincent Dumoulin)

Single CNN layer �with 3x3 filter:

Update for a single pixel:

  • Transform messages individually
  • Add everything up

are (hidden layer) activations of a pixel/node

Full update:

26 of 35

Convolution-based Graph Neural Networks

26

Consider this

undirected graph:

Calculate update

for node in red:

Neural FP: Duvenaud et al. (NIPS 2015), Graph Convolutional Networks (GCN): Kipf & Welling (ICLR 2017)

(Optional) normalization constant

Neighbor indices

27 of 35

Different relationships and node types

27

GG-NN: Li et al. (ICLR 2016), R-GCN: Schlichtkrull et al. (2017)

Adding support for relation types�(e.g. knowledge graph relations, or for connecting data from different verticals)

Figure from Vaswani et al. (NIPS 2017)

Attention-based aggregation

GAT: Veličković et al. (ICLR 2018), Transformers: Vaswani et al. (NIPS 2017)

28 of 35

Neural Message Passing formulation of GNNs

28

Message Passing Neural Networks (MPNN): Gilmer et al. (ICML 2017), GraphNets: Battaglia et al. (2018)

Edge update (message)

Node update

Node features

29 of 35

GNNs in summary

29

Message Passing Neural Networks (MPNN): Gilmer et al. (2017), GraphNets: Battaglia et al. (2018)

Edge update (message)

Node update

Definition includes:

  • Convolution-based GNNs (GCNs)�e.g. Duvenaud et al. (2015); Kipf & Welling (2016)
  • Relational GCNs�e.g. Li et al. (2016), Schlichtkrull et al. (2017)
  • Multi-head self-attention�e.g. Vaswani et al. (2017); Veličković et al. (2018)

30 of 35

Encoder-decoder GNNs

30

GNN encoder

Decoder variants

Nodes:

Node prediction

Edge prediction

Graph prediction

31 of 35

Graph Auto-Encoders for link prediction

31

GNN encoder

Node embeddings:

Inner-product decoder

(for undirected graphs)

Trained using negative log likelihood (NLL) loss + negative sampling

Similar to matrix factorization, but using an encoder to amortize inference of embeddings

Two-tower models do the same, but without GNN message passing in the encoder

32 of 35

Thank you

33 of 35

Appendix

34 of 35

Negative sampling is related to WARP loss (ref WSABIE (Weston, et al. 2011)),

where negatives are sampled at random and

the first negative where the difference is less than �The margin is used to update parameters.�However, for better efficiency, one can approximate this with in-batch negatives.

See (ref StarSpace (Weston et al. 2017)), an extension of (ref TagSpace (Weston, et al. 2014)), to see how dot-product loss and two-tower models apply to a number of problems in NLP and recommendation.�

Candidate sampling methods can impact training time and quality. See a short survey here.�

It is also useful to think of two layers of a two-tower model, where hard negatives, i.e. unclicked impressions are used to train the second layer, sometimes known as “reranking layer”.

Sampling negatives in Two Tower Models

35 of 35

Singular Value Decomposition of interaction matrix

S can be decomposed into a set of orthonormal unitary vectors U, a set of orthonormal unitary vectors V and a diagonal matrix with decreasing values Σ�

We can also limit U and V the first k columns and Σ to the first k values. This would be the most optimal k-rank approximation of S, upto mean squared error loss. (Eckart–Young–Mirsky theorem)

The k values in U for each user can be thought of as a k-dimensional “embedding” of the user. Similarly the k values in V for each item can be thought of as a k-dimensional “embedding” of the item. These embeddings have but one purpose, to explain the latent ways (i.e. topics if you will) by which they interact in this domain.��How do you choose ‘k’? (Read Gavish and Donoho here to understand how to model the noise as a Gaussian and under that assumption compute at what level of variance, the eigenvector is likely to be coming from noise)

Please see Dr. Brunton’s video on Singular Value Decomposition to understand the mathematical elegance and universal applicability of this method.