Overview of recommendation systems
Important Disclaimer: This presentation is for educational purposes only. Any assumptions or opinions stated here are those of the author alone and do not necessarily represent any official strategy, positioning, or statement of the author's current or past employers or any of their affiliates.
Outline
Framework for recommender systems
Static: Given the history of past interactions between users and items, recommend to each user the item most likely to be engaged with.
Additional information: metadata about items, information about the user and the context.
Temporal: Everything in the framework evolves over time, and the systems need to generalize to future occurrences.
Stages in practical recommender systems
Item-Item Similarity
Example: Cowatch / covisitation-based recommender systems
Notation: We assume we are given interaction scores as a U × I matrix S, where U is the number of users and I the number of items.
Reference: Video from Mining Massive Datasets
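To make the covisitation idea concrete, here is a minimal numpy sketch (not from the referenced video) that treats columns of the U × I interaction matrix S as item vectors and ranks items by cosine similarity to a seed item; the function names and the toy matrix are illustrative assumptions.

import numpy as np

def item_item_cosine(S, eps=1e-9):
    """Cosine similarity between item columns of the interaction matrix S."""
    norms = np.linalg.norm(S, axis=0) + eps          # per-item norm
    normalized = S / norms                           # column-normalize
    return normalized.T @ normalized                 # I x I similarity matrix

def similar_items(S, item_id, top_n=5):
    sims = item_item_cosine(S)[item_id]
    sims[item_id] = -np.inf                          # exclude the seed item itself
    return np.argsort(-sims)[:top_n]

# toy example: 4 users x 5 items of binary covisitation data
S = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1]], dtype=float)
print(similar_items(S, item_id=0))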
User-User Similarity
Note:
Two ideas of progress
Clustering
Clustering | multiple clusters, i.e. a Gaussian Mixture Model (lite)
For example, in a social network setting with user-user edges and user-group membership edges:
S(u, i) ≈ Σ_{v ∈ F(u)} F(u, v) · S(v, i)

where F(u) is the set of users who are friends of u, F(u, v) is a score depicting the strength of this relationship (or their activity with each other), and S(v, i) is the score of the item for the friend.

Please note that this is similar to the trust-propagation setup that was used in the PageRank algorithm (ref: video by Dr. Leskovec).

Similarly, incorporating the group relationship:

S(u, i) ≈ Σ_{g ∈ G(u)} A(u, g) · Σ_{v ∈ M(g)} S(v, i)

where G(u) is the set of groups that u is a member of, M(g) is the set of members of group g, and A(u, g) is a score of the affinity of user u to group g.
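A small Python sketch of the two propagation scores above; the dict-based graph representation and the default weights are illustrative assumptions, not part of the original setup.

from collections import defaultdict

def friend_scores(u, friends, friend_weight, item_score):
    """S(u, i) = sum over v in F(u) of F(u, v) * S(v, i)."""
    scores = defaultdict(float)
    for v in friends.get(u, []):
        for i, s in item_score.get(v, {}).items():
            scores[i] += friend_weight.get((u, v), 1.0) * s
    return scores

def group_scores(u, groups, group_affinity, members, item_score):
    """S(u, i) = sum over g in G(u) of A(u, g) * sum over v in M(g) of S(v, i)."""
    scores = defaultdict(float)
    for g in groups.get(u, []):
        a = group_affinity.get((u, g), 1.0)
        for v in members.get(g, []):
            for i, s in item_score.get(v, {}).items():
                scores[i] += a * s
    return scores

# toy example
friends = {"u1": ["u2", "u3"]}
friend_weight = {("u1", "u2"): 0.8, ("u1", "u3"): 0.2}
item_score = {"u2": {"itemA": 1.0}, "u3": {"itemA": 0.5, "itemB": 1.0}}
print(dict(friend_scores("u1", friends, friend_weight, item_score)))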
Seed based recommendation
Learn similarity functions from implicit feedback (Matrix Factorization)
References:
Matrix Factorization (continued)
4. The C(u, i) term allows us to
- downsample negatives, especially for users with a low number of positives
- encode features of the interaction, like listen time for a podcast, or a 👎 from the user.

5. Serving recommendations in this approach is also more efficient. After training, we have k-dimensional embeddings of items. Using approximate nearest neighbor search systems (ref: Hashing with Graphs), the top-N items by closeness to a given user embedding are retrieved.
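A minimal numpy sketch of implicit-feedback matrix factorization with a per-interaction confidence weight C(u, i), trained by gradient descent on a weighted squared-error loss. This is an illustrative assumption of the objective rather than the exact formulation in the slides; the hyperparameters and toy data are arbitrary.

import numpy as np

def weighted_mf(S, C, k=8, lr=0.05, reg=0.01, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = S.shape
    U = 0.1 * rng.standard_normal((n_users, k))      # user embeddings
    V = 0.1 * rng.standard_normal((n_items, k))      # item embeddings
    for _ in range(epochs):
        err = C * (S - U @ V.T)                      # confidence-weighted residuals
        U_grad = -err @ V + reg * U
        V_grad = -err.T @ U + reg * V
        U -= lr * U_grad
        V -= lr * V_grad
    return U, V

# toy data: binary interactions with higher confidence on observed positives
S = (np.random.default_rng(1).random((20, 30)) > 0.8).astype(float)
C = 1.0 + 9.0 * S
U, V = weighted_mf(S, C)
print(np.argsort(-(U[0] @ V.T))[:5])                 # top-5 items for user 0 (ANN search replaces this exact scan at scale)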
Why dot-product?
Since high dot-product retrieval is efficient, we prefer a loss function that maximizes the dot product for positive examples and minimizes it for negative examples.
Otherwise, a more general formulation would be to feed the user embedding and item embedding as the input layer of an MLP (ref: Neural Collaborative Filtering).
Another way to choose recommendations would be to train a multi-label classifier with the (learned) user embedding as input and one label for each of the eligible items. This is similar to the dot-product approach, since the softmax weights for an item can be considered its embedding. (ref: End-to-End Deep Attentive Personalized Item Retrieval for Online Content-sharing Platforms)
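A hedged PyTorch sketch contrasting the two scoring heads discussed above: a plain dot product of user and item embeddings versus an MLP over their concatenation, in the spirit of Neural Collaborative Filtering. Embedding sizes, layer widths, and class names are illustrative assumptions.

import torch
import torch.nn as nn

class DotScorer(nn.Module):
    def __init__(self, n_users, n_items, k=32):
        super().__init__()
        self.user = nn.Embedding(n_users, k)
        self.item = nn.Embedding(n_items, k)

    def forward(self, u, i):
        return (self.user(u) * self.item(i)).sum(-1)          # dot-product score

class MLPScorer(nn.Module):
    def __init__(self, n_users, n_items, k=32):
        super().__init__()
        self.user = nn.Embedding(n_users, k)
        self.item = nn.Embedding(n_items, k)
        self.mlp = nn.Sequential(nn.Linear(2 * k, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, u, i):
        x = torch.cat([self.user(u), self.item(i)], dim=-1)   # concatenated embeddings
        return self.mlp(x).squeeze(-1)                        # learned, non-dot-product score

u, i = torch.tensor([0, 1]), torch.tensor([3, 7])
print(DotScorer(10, 20)(u, i), MLPScorer(10, 20)(u, i))

The dot-product head keeps retrieval ANN-friendly, while the MLP head trades that efficiency for a more flexible interaction function.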
Some drawbacks of (vanilla) Matrix Factorization
At this point, we are familiar with users and items as points in a k-dimensional space. However, some problems with vanilla matrix factorization are:
(a) sparsity of user-item links
(b) data imbalance (ref: Matrix Completion from Power-Law Distributed Samples)
(c) the framework does not naturally extend to side features (refs: Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback; Bayesian Matrix Factorization with Side Information and Dirichlet Process Mixtures; Factorization Machines)
(d) the cold-start problem
However, we have already established:
u, a dense embedding representing the user, which is currently a function of the user id only, and
i, a dense embedding representing the item, which is currently a function of just the item id.
Idea: To learn embeddings as outputs of trainable functions, i.e. neural networks.
Two tower models
where e_u = f( x_u /* from user id */, x_h /* items in history */, x_t /* terms in user history */, x_f /* user features, context features */ )
and e_i = g( y_i /* from item id */, y_f /* item features */ )
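A minimal PyTorch sketch of the f and g towers above. Mean-pooling the history/term embeddings, the layer sizes, and the feature dimensions are all assumptions for illustration only.

import torch
import torch.nn as nn

class UserTower(nn.Module):
    def __init__(self, n_users, n_items, n_terms, n_feats, k=64):
        super().__init__()
        self.user_id = nn.Embedding(n_users, k)               # x_u, from user id
        self.hist_item = nn.EmbeddingBag(n_items, k)          # x_h, items in history (mean-pooled)
        self.hist_term = nn.EmbeddingBag(n_terms, k)          # x_t, terms in user history (mean-pooled)
        self.feats = nn.Linear(n_feats, k)                    # x_f, user/context features
        self.proj = nn.Linear(4 * k, k)

    def forward(self, u, hist_items, hist_terms, feats):
        x = torch.cat([self.user_id(u), self.hist_item(hist_items),
                       self.hist_term(hist_terms), self.feats(feats)], dim=-1)
        return self.proj(torch.relu(x))                       # e_u

class ItemTower(nn.Module):
    def __init__(self, n_items, n_feats, k=64):
        super().__init__()
        self.item_id = nn.Embedding(n_items, k)               # y_i, from item id
        self.feats = nn.Linear(n_feats, k)                    # y_f, item features
        self.proj = nn.Linear(2 * k, k)

    def forward(self, i, feats):
        x = torch.cat([self.item_id(i), self.feats(feats)], dim=-1)
        return self.proj(torch.relu(x))                       # e_i

# illustrative shapes: a batch of 2 users, each with 3 history items and 4 history terms
e_u = UserTower(100, 1000, 5000, 10)(torch.tensor([0, 1]),
                                     torch.tensor([[1, 2, 3], [4, 5, 6]]),
                                     torch.tensor([[0, 1, 2, 3], [4, 5, 6, 7]]),
                                     torch.randn(2, 10))
e_i = ItemTower(1000, 6)(torch.tensor([10, 20]), torch.randn(2, 6))
score = torch.sigmoid((e_u * e_i).sum(-1))                    # sigmoid of the dot product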
WARP: for each positive, search for a negative that we are not doing well on and use that to compute the gradient.
Negative sampling: take many negatives, hoping that we are already not doing well on one or more of them.
Candidate sampling | WARP loss and Negative sampling
Two tower models of personalized recommendations
[Figure: Two-tower architecture. The user tower combines a latent user embedding, embeddings of items in the user history, embeddings of terms salient to the user (terms in documents in the user history, or terms searched by the user), and user/context features into a user embedding. The item tower combines a latent item embedding, embeddings of terms in the item description, and item features into an item embedding. The two embeddings are combined with a dot product followed by a sigmoid.]
Two tower models of personalized recommendations
[Figure: The same two-tower architecture without trainable latent embeddings: the user tower uses only embeddings of items in the user history, embeddings of terms salient to the user, and user/context features; the item tower uses only embeddings of terms in the item description and item features. The towers are again combined with a dot product followed by a sigmoid.]
Personalized Search / Contextual personalized recommendations
[Figure: Two-tower architecture extended to personalized search / contextual recommendation. The user side combines a latent embedding, a history embedding, terms salient to the user, and user/context features; it is then combined with a context/query embedding (built from the query terms) via an element-wise product. The item tower combines a latent embedding, terms in the description, and item features. The combined (user, context) representation and the item embedding are scored with a dot product followed by a sigmoid.]
Visualizing personalized recommendation as a graph
E.g. in video recommendation
[Figure: A heterogeneous graph for video recommendation with node types Users, Videos, Channels, and Terms; users carry features and context, and videos and channels carry features.]
Graph convolutional networks
Heterogeneous Information Networks
Another way to view seed-based recommendations is to treat the available information as a graph/network of nodes of different types and edges depicting different relationships (ref: Recommendation in heterogeneous information network via dual similarity regularization).
To recommend in this setting with various relationship types, these algorithms construct various "meta-paths", i.e. random walks, and then try to find embeddings of nodes whose inner products approximate the observed probability of two nodes co-occurring on these random walks (ref: DeepWalk: Online Learning of Social Representations).
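A hedged Python sketch of the first half of this recipe: generating random walks over a heterogeneous graph (the adjacency dict, node naming, and walk parameters are illustrative assumptions). The walks can then be fed, as "sentences" of node ids, to any skip-gram/word2vec-style trainer so that inner products of the learned embeddings approximate co-occurrence on walks.

import random

def random_walks(adj, walk_len=10, walks_per_node=5, seed=0):
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len and adj.get(walk[-1]):
                walk.append(rng.choice(adj[walk[-1]]))          # uniform next hop
            walks.append(walk)
    return walks

# toy heterogeneous graph: user, video, and channel nodes in one adjacency dict
adj = {"user:1": ["video:a", "video:b"],
       "video:a": ["user:1", "channel:x"],
       "video:b": ["user:1"],
       "channel:x": ["video:a"]}
print(random_walks(adj)[0])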
Deep Dive into GNNs
Agenda
Graph Neural Networks (GNNs)
Main idea: Pass messages along edges of graph, agglomerate & transform
The bigger picture: the input is a graph together with node features and edge features, and the GNN outputs updated representations of them.
Scarselli et al., The Graph Neural Network Model (2009)
CNNs (on grids) as message passing
(Animation by Vincent Dumoulin)
Single CNN layer with a 3x3 filter:
Update for a single pixel: h_i^{(l+1)} = σ( Σ_{j ∈ N(i)} W_j^{(l)} h_j^{(l)} ), where h_i^{(l)} are the (hidden layer) activations of a pixel/node, N(i) is the 3x3 neighborhood of pixel i, and there is a separate weight per relative position.
Full update: the same weight-sharing update applied to every pixel in parallel, i.e. a message-passing step on the grid graph.
Convolution-based Graph Neural Networks
Consider an undirected graph and calculate the update for the node in red:
h_i^{(l+1)} = σ( W_0^{(l)} h_i^{(l)} + Σ_{j ∈ N(i)} (1 / c_{ij}) W^{(l)} h_j^{(l)} )
where c_{ij} is an (optional) normalization constant and N(i) are the neighbor indices of node i.
Neural FP: Duvenaud et al. (NIPS 2015), Graph Convolutional Networks (GCN): Kipf & Welling (ICLR 2017)
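A minimal numpy sketch of this convolutional update, assuming a dense adjacency matrix, added self-loops, and symmetric degree normalization as the optional constant c_ij (as in Kipf & Welling).

import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                   # add self-loops
    d = A_hat.sum(axis=1)                            # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# toy undirected graph: 4 nodes, 3-dimensional input features, 8-dimensional output
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
H = np.random.default_rng(0).standard_normal((4, 3))
W = np.random.default_rng(1).standard_normal((3, 8))
print(gcn_layer(A, H, W).shape)                      # (4, 8)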
Different relationships and node types
GG-NN: Li et al. (ICLR 2016), R-GCN: Schlichtkrull et al. (2017)
Adding support for relation types (e.g. knowledge graph relations, or for connecting data from different verticals)
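A hedged numpy sketch of relation-aware aggregation in the spirit of R-GCN: one weight matrix per relation type, neighbors grouped and normalized per relation, plus a self-connection. The dict-of-adjacencies representation and the mean normalization are illustrative assumptions.

import numpy as np

def rgcn_layer(adj_per_rel, H, W_rel, W_self):
    """adj_per_rel: {relation name: dense adjacency}; W_rel: one weight matrix per relation."""
    out = H @ W_self                                            # self-connection
    for rel, A in adj_per_rel.items():
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)     # per-node, per-relation normalization
        out += (A / deg) @ H @ W_rel[rel]                       # mean over neighbors of this relation
    return np.maximum(0, out)

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))
adj_per_rel = {"follows":   np.array([[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]], dtype=float),
               "member_of": np.array([[0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0]], dtype=float)}
W_rel = {r: rng.standard_normal((3, 8)) for r in adj_per_rel}
W_self = rng.standard_normal((3, 8))
print(rgcn_layer(adj_per_rel, H, W_rel, W_self).shape)          # (4, 8)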
Figure from Vaswani et al. (NIPS 2017)
Attention-based aggregation
GAT: Veličković et al. (ICLR 2018), Transformers: Vaswani et al. (NIPS 2017)
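A hedged numpy sketch of attention-based neighbor aggregation in the spirit of GAT: attention logits are computed from the (linearly transformed) features of each node/neighbor pair, softmax-normalized over the neighborhood, and used to weight the aggregation. The single-vector scoring function and the omission of the usual LeakyReLU/multi-head details are simplifying assumptions.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_aggregate(i, neighbors, H, W, a):
    """Aggregate transformed neighbor features of node i with attention weights."""
    hi = H[i] @ W
    hj = H[neighbors] @ W                             # transformed neighbor features
    logits = np.array([a @ np.concatenate([hi, h]) for h in hj])
    alpha = softmax(logits)                           # attention weights over the neighborhood
    return (alpha[:, None] * hj).sum(axis=0)

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 4))                       # 5 nodes, 4-dim features
W = rng.standard_normal((4, 8))
a = rng.standard_normal(16)                           # scoring vector over the concatenated pair
print(attention_aggregate(0, [1, 2, 4], H, W, a).shape)   # (8,)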
Neural Message Passing formulation of GNNs
Message Passing Neural Networks (MPNN): Gilmer et al. (ICML 2017), GraphNets: Battaglia et al. (2018)
Figure: T. N. Kipf, Deep Learning with Graph-Structured Representations, 2020
Edge update (message): m_{ij} = f_e( h_i, h_j, e_{ij} )
Node update: h_i' = f_v( h_i, Σ_{j ∈ N(i)} m_{ij} )
where h_i are the node features and e_{ij} the edge features.
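A hedged PyTorch sketch of this decomposition: an edge-update (message) function f_e over (h_i, h_j, e_ij) and a node-update function f_v over (h_i, aggregated messages). Using small MLPs for f_e and f_v and sum aggregation is an assumption; MPNN/GraphNets allow other choices.

import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    def __init__(self, node_dim, edge_dim, hidden=32):
        super().__init__()
        self.f_e = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, hidden), nn.ReLU())
        self.f_v = nn.Sequential(nn.Linear(node_dim + hidden, node_dim), nn.ReLU())

    def forward(self, h, edge_index, e):
        src, dst = edge_index                                     # directed edges src -> dst
        m = self.f_e(torch.cat([h[src], h[dst], e], dim=-1))      # edge update (message)
        agg = torch.zeros(h.size(0), m.size(1)).index_add_(0, dst, m)   # sum incoming messages
        return self.f_v(torch.cat([h, agg], dim=-1))              # node update

h = torch.randn(4, 8)                                 # node features
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])     # 3 directed edges
e = torch.randn(3, 5)                                 # edge features
print(MPNNLayer(8, 5)(h, edge_index, e).shape)        # torch.Size([4, 8])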
GNNs in summary
Message Passing Neural Networks (MPNN): Gilmer et al. (2017), GraphNets: Battaglia et al. (2018)
Edge update (message) and node update, as in the message-passing formulation above.
Definition includes the convolution- and attention-based layers seen earlier as special cases.
Encoder-decoder GNNs
GNN encoder: produces an embedding for every node.
Decoder variants:
Node prediction (a prediction from each node embedding)
Edge prediction (a score from each pair of node embeddings)
Graph prediction (a prediction from a pooled, graph-level representation)
Graph Auto-Encoders for link prediction
GNN encoder produces node embeddings z_i.
Inner-product decoder (for undirected graphs): p(A_{ij} = 1) = σ( z_i · z_j )
Trained using negative log likelihood (NLL) loss + negative sampling
Similar to matrix factorization, but using an encoder to amortize inference of embeddings
Two-tower models do the same, but without GNN message passing in the encoder
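A hedged PyTorch sketch of the inner-product decoder with a negative-sampled NLL objective; z could come from any GNN encoder (e.g. the GCN layer sketched earlier), and the uniform negative sampling here is an illustrative simplification.

import torch
import torch.nn.functional as F

def link_prediction_loss(z, pos_edges, num_neg=10):
    """NLL of sigmoid(z_i . z_j) on observed edges plus uniformly sampled non-edges."""
    src, dst = pos_edges
    pos_logits = (z[src] * z[dst]).sum(-1)
    neg_src = torch.randint(0, z.size(0), (num_neg,))
    neg_dst = torch.randint(0, z.size(0), (num_neg,))
    neg_logits = (z[neg_src] * z[neg_dst]).sum(-1)
    pos_loss = F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
    neg_loss = F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
    return pos_loss + neg_loss

z = torch.randn(6, 16, requires_grad=True)            # node embeddings from some encoder
pos_edges = torch.tensor([[0, 1, 2], [1, 2, 3]])      # observed edges
print(link_prediction_loss(z, pos_edges))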
Thank you
Appendix
Negative sampling is related to the WARP loss (ref: WSABIE, Weston et al. 2011),
where negatives are sampled at random and
the first negative for which the positive-versus-negative score difference falls below the margin is used to update the parameters. However, for better efficiency, one can approximate this with in-batch negatives.
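A hedged PyTorch sketch of the in-batch-negatives approximation: within a batch of (user, positive item) pairs, every other item in the batch serves as a negative for a given user, via a softmax cross-entropy over the batch similarity matrix. This replaces WARP's explicit search for a violating negative with whatever negatives happen to be in the batch.

import torch
import torch.nn.functional as F

def in_batch_negatives_loss(user_emb, item_emb):
    """user_emb, item_emb: [batch, k]; row b of item_emb is the positive for user b."""
    logits = user_emb @ item_emb.T                    # [batch, batch] dot-product scores
    labels = torch.arange(user_emb.size(0))           # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

user_emb = torch.randn(8, 32)
item_emb = torch.randn(8, 32)
print(in_batch_negatives_loss(user_emb, item_emb))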
See StarSpace (Weston et al., 2017), an extension of TagSpace (Weston et al., 2014), for how dot-product losses and two-tower models apply to a number of problems in NLP and recommendation.
Candidate sampling methods can impact training time and quality. See a short survey here.
It is also useful to think in terms of two layers of two-tower models, where hard negatives, i.e. unclicked impressions, are used to train the second layer, sometimes known as the "reranking" layer.
Sampling negatives in Two Tower Models
Singular Value Decomposition of interaction matrix
S can be decomposed as S = U Σ Vᵀ, where the columns of U are orthonormal, the columns of V are orthonormal, and Σ is a diagonal matrix with non-increasing singular values.
We can also limit U and V to their first k columns and Σ to its first k values. This is the optimal rank-k approximation of S with respect to mean squared error (Eckart–Young–Mirsky theorem).
The k values in U for each user can be thought of as a k-dimensional "embedding" of the user. Similarly, the k values in V for each item can be thought of as a k-dimensional "embedding" of the item. These embeddings have but one purpose: to explain the latent ways (topics, if you will) by which users and items interact in this domain.

How do you choose 'k'? (Read Gavish and Donoho here to understand how to model the noise as a Gaussian and, under that assumption, compute at what magnitude a singular value is likely to be coming from noise.)
Please see Dr. Brunton’s video on Singular Value Decomposition to understand the mathematical elegance and universal applicability of this method.
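A minimal numpy sketch of the rank-k truncation described above. How to split the singular values between the user and item factors is a modeling choice; splitting them symmetrically here is an assumption.

import numpy as np

def truncated_svd_embeddings(S, k):
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    U_k, sigma_k, Vt_k = U[:, :k], sigma[:k], Vt[:k, :]
    user_emb = U_k * np.sqrt(sigma_k)                 # k-dimensional user embeddings
    item_emb = Vt_k.T * np.sqrt(sigma_k)              # k-dimensional item embeddings
    return user_emb, item_emb

S = np.random.default_rng(0).random((50, 40))         # toy interaction matrix
user_emb, item_emb = truncated_svd_embeddings(S, k=5)
approx = user_emb @ item_emb.T                        # best rank-5 approximation of S
print(np.linalg.norm(S - approx))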