Collaborative Filtering with Implicit Feedback
Loss, Negative Sampling and Embeddings
Challenges with Implicit Feedback
Interaction data is typically very sparse and lacks explicit negative labels
Common matrix factorisation algorithms cannot be used directly
Collaborative Filtering Model
Architecture (figure): user (features) and item (features) pass through an embedding model to produce user embeddings and item embeddings; an optional transform model yields transformed user and item embeddings; a scoring model compares the two towers and feeds a loss model. The same model serves both retrieval and ranking; a minimal sketch follows.
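A minimal PyTorch sketch of this two-tower setup (illustrative only; class and argument names are my own, not the linked implementation's):

```python
import torch
import torch.nn.functional as F
from torch import nn


class TwoTowerMF(nn.Module):
    """Two-tower matrix factorisation: embed, transform, score."""

    def __init__(self, num_users: int, num_items: int, dim: int = 32) -> None:
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)  # user embedding model
        self.item_emb = nn.Embedding(num_items, dim)  # item embedding model

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        # transform model: L2 normalisation, so the dot product below
        # becomes cosine similarity (a bounded score)
        user = F.normalize(self.user_emb(user_ids), dim=-1)
        item = F.normalize(self.item_emb(item_ids), dim=-1)
        return (user * item).sum(dim=-1)  # scoring model
```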
Loss Functions
Pointwise loss: generally low performance
Pairwise loss: one negative sample per positive example
Setwise loss: multiple negative samples per positive example
Other losses: no negative samples needed (e.g. DirectAU)
Relationship between loss functions
Cosine Contrastive Loss (CCL)
Inspired by contrastive loss in computer vision tasks
Automatically emphasises hard negatives due to hinge loss
Not straightforward to include negative labels (user thumbs down, explicit “not interested”)
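A hedged sketch of CCL as described in the SimpleX paper, with `margin` and `neg_weight` standing in for the paper's m and w hyperparameters:

```python
import torch


def ccl_loss(pos_sim: torch.Tensor, neg_sim: torch.Tensor,
             margin: float = 0.8, neg_weight: float = 1.0) -> torch.Tensor:
    # pos_sim: (batch,) cosine similarities of positive pairs
    # neg_sim: (batch, num_neg) cosine similarities of negative pairs
    pos_term = 1.0 - pos_sim
    # hinge: only negatives scoring above the margin contribute,
    # which is why hard negatives are emphasised automatically
    neg_term = torch.relu(neg_sim - margin).mean(dim=-1)
    return (pos_term + neg_weight * neg_term).mean()
```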
Sampled Softmax (SSM), InfoNCE
Computing the softmax over all items is inefficient, hence sampling is used
A temperature factor can be applied to the scores to control the smoothness of the distribution
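An illustrative InfoNCE / sampled-softmax loss with a temperature; the positive logit sits in column 0 and sampled negatives fill the rest (names are mine):

```python
import torch
import torch.nn.functional as F


def info_nce_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    # pos_score: (batch,); neg_scores: (batch, num_neg) sampled negatives
    logits = torch.cat([pos_score[:, None], neg_scores], dim=-1) / temperature
    # softmax over the sampled set only, instead of over the full catalogue
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

A lower temperature sharpens the distribution and upweights hard negatives.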
InfoNCE+
Generalises InfoNCE with hyperparameters λ and ε; reduces to InfoNCE when λ = 1, ε = 1
Mutual Information Neural Estimator + (MINE+)
Equivalent to InfoNCE+ when ε = 0
DirectAU
Aligns embeddings of positive user-item pairs while encouraging all embeddings to be uniformly distributed on the surface of a unit sphere
Negative samples are not needed (see the sketch below)
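A sketch of DirectAU's two terms, assuming L2-normalised embeddings; `gamma` trades alignment off against uniformity as in the paper:

```python
import torch


def direct_au_loss(user: torch.Tensor, item: torch.Tensor,
                   gamma: float = 1.0) -> torch.Tensor:
    # user, item: (batch, dim), rows are L2-normalised positive pairs
    # alignment: positive pairs should be close on the unit sphere
    align = (user - item).pow(2).sum(dim=-1).mean()

    # uniformity: embeddings should spread out over the sphere
    def uniformity(x: torch.Tensor) -> torch.Tensor:
        return torch.pdist(x, p=2).pow(2).mul(-2).exp().mean().log()

    uniform = (uniformity(user) + uniformity(item)) / 2
    return align + gamma * uniform  # no negative samples required
```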
Negative Sampling Strategy
Random negative sampling
In-batch negative sampling
Hard negative sampling
In-Batch Negative Sampling
Mixed Negative Sampling
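A sketch of how the two strategies combine for a batch of positive (user, item) pairs; `sampled_item` holds extra uniformly sampled item embeddings (names are illustrative):

```python
import torch


def mixed_negative_scores(user: torch.Tensor, item: torch.Tensor,
                          sampled_item: torch.Tensor) -> torch.Tensor:
    # user, item: (batch, dim); sampled_item: (num_random, dim)
    in_batch = user @ item.T  # (batch, batch); diagonal entries are positives
    random_neg = user @ sampled_item.T  # (batch, num_random) extra negatives
    # each row: 1 positive (diagonal) + (batch - 1 + num_random) negatives
    return torch.cat([in_batch, random_neg], dim=-1)
```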
Parameter Reduction with Bloom Embeddings
Standard (dictionary-indexed) embeddings use a lot of parameters and memory
The hashing trick reduces the embedding table size at the cost of collisions
Bloom embeddings use multiple hash functions, further reducing the chance of full collisions
Hashing Trick
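A toy Bloom embedding layer: each ID is hashed by several independent hash functions into one small table and the looked-up rows are summed, so two IDs only fully collide if all their hashes collide. The random-multiplier hash here is an illustrative stand-in for a proper hash function:

```python
import torch
from torch import nn


class BloomEmbedding(nn.Module):
    def __init__(self, num_embeddings: int, dim: int, num_hashes: int = 2) -> None:
        super().__init__()
        self.num_embeddings = num_embeddings
        self.emb = nn.Embedding(num_embeddings, dim)
        # random odd multipliers act as cheap, illustrative hash functions
        mults = torch.randint(1, 2**31 - 1, (num_hashes,)) * 2 + 1
        self.register_buffer("mults", mults)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (...,) -> hashed: (..., num_hashes) indices into one table
        hashed = (ids[..., None] * self.mults) % self.num_embeddings
        return self.emb(hashed).sum(dim=-2)  # aggregate across hashes
```

The experiments below use num_embeddings = 2^16 + 1 = 65537, a prime, which helps the modulo spread indices evenly.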
Other Embedding Issues
ID-based embeddings cannot produce embeddings for unseen entities (cold start)
Solution: in addition to the ID, also map user/item features to embeddings and aggregate them (weighted sum/mean), as sketched below
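A minimal sketch of that aggregation (a plain mean here; a learned weighted sum is the obvious variant):

```python
import torch


def aggregate_embeddings(id_emb: torch.Tensor,
                         feature_embs: torch.Tensor) -> torch.Tensor:
    # id_emb: (batch, dim); feature_embs: (batch, num_features, dim)
    stacked = torch.cat([id_emb[:, None, :], feature_embs], dim=1)
    # an unseen entity has an untrained ID embedding but still inherits
    # a meaningful vector from its feature embeddings
    return stacked.mean(dim=1)
```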
Other Findings
Recent research articles mainly use naive matrix factorisation and LightGCN
Score function: dot product and L2 distance are unbounded, whereas cosine similarity is bounded
Evaluation Metrics
If retrieved results are shown directly, use Normalised Discounted Cumulative Gain (NDCG) or Average Precision (AP)
If retrieval is followed by a reranking step, use Recall@k with k equal to the number of candidates to be reranked
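One way to compute these per user, assuming a recent torchmetrics (the scores and relevance labels below are made up):

```python
import torch
from torchmetrics.functional import (
    retrieval_average_precision,
    retrieval_normalized_dcg,
)

scores = torch.tensor([0.9, 0.7, 0.4, 0.1])  # model scores for one user
relevance = torch.tensor([1, 0, 1, 0])       # ground-truth interactions
ndcg_at_20 = retrieval_normalized_dcg(scores, relevance, top_k=20)
ap = retrieval_average_precision(scores, relevance)
```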
Implementation Details
| Setting | Value |
| --- | --- |
| Dataset | MovieLens 1M |
| Training set | First 80% of ratings by timestamp |
| Validation set | Top 2% of users by number of ratings |
| Evaluation metrics | NDCG, MAP; other retrieval metrics also tracked |
| Optimizer | SparseAdam with learning_rate = 0.1 |
| Embeddings | num_hashes = 2, num_embeddings = 2^16 + 1, dim = 32 |
| Batch size | 1024 |
| Hyperparameters | max_norm, normalize, train_loss, etc. |
ID vs feature embeddings
PHL & BPR perform very strongly; MINE is underwhelming
ID + features generally performed worse, likely due to poor aggregation of the embeddings
| Loss Function | ID only (NDCG) | ID only (MAP) | ID + Features (NDCG) | ID + Features (MAP) |
| --- | --- | --- | --- | --- |
| BPR | 0.19279 | 0.29531 | 0.16380 | 0.27287 |
| PHL | 0.20747 | 0.32241 | 0.15502 | 0.27447 |
| CCL | 0.19973 | 0.29249 | 0.15080 | 0.26334 |
| DAU | 0.17919 | 0.31267 | 0.15406 | 0.25741 |
| MINE | 0.18659 | 0.28979 | 0.15135 | 0.26323 |
In-Batch vs Mixed Negative Sampling
Large improvements with mixed negative sampling, probably also because the number of negative samples is effectively doubled
PHL & BPR perform best; DAU does not use negative samples effectively
| Loss Function | In-Batch NS (NDCG) | In-Batch NS (MAP) | Mixed NS (NDCG) | Mixed NS (MAP) |
| --- | --- | --- | --- | --- |
| BPR | 0.19279 | 0.29531 | 0.34759 | 0.45224 |
| PHL | 0.20747 | 0.32241 | 0.36227 | 0.46431 |
| CCL | 0.19973 | 0.29249 | 0.32482 | 0.42916 |
| DAU | 0.17919 | 0.31267 | 0.26003 | 0.36187 |
| MINE | 0.18659 | 0.28979 | 0.34687 | 0.44993 |
In-Batch vs Mixed Negative Sampling, Controlling for Number of Negatives
Even with the same number of negative samples per interaction, MNS shows significant improvements
Additional random negative samples beyond that do not increase performance further
| Negative multiple | 0 | 1 | 3 |
| --- | --- | --- | --- |
| Batch size | 1024 | 512 | 256 |
| Effective num negative samples | 1023 | 1023 | 1023 |
| Loss | PairwiseHingeLoss | PairwiseHingeLoss | PairwiseHingeLoss |
| NDCG | 0.20371 | 0.35314 | 0.35155 |
| MAP | 0.30506 | 0.45319 | 0.45154 |
Mixed Negative Sampling
ID + features improved even further with mixed negative sampling
PHL and BPR perform best in all settings
| Loss Function | ID only (NDCG) | ID only (MAP) | ID + Features (NDCG) | ID + Features (MAP) |
| --- | --- | --- | --- | --- |
| BPR | 0.34759 | 0.45224 | 0.36690 | 0.46988 |
| PHL | 0.36227 | 0.46431 | 0.38848 | 0.48709 |
| CCL | 0.32482 | 0.42916 | 0.34538 | 0.44670 |
| DAU | 0.26003 | 0.36187 | 0.24984 | 0.34482 |
| MINE | 0.34687 | 0.44993 | 0.36637 | 0.46351 |
Hyperparameters Search with flaml.BlendSearch
| Setting | Search space | ID only | ID + features |
| --- | --- | --- | --- |
| Loss | 5 losses | PairwiseHingeLoss | PairwiseHingeLoss |
| NDCG | | 0.40475 | 0.42165 |
| MAP | | 0.50735 | 0.51984 |
| Negative multiple | 0 - 4 | 3 | 3 |
| Embedding dim | 4 - 64 | 32 | 32 |
| Learning rate | 0.01 - 1.0 | 0.1 | 0.2 |
| Num hashes | 1 - 4 | 3 | 2 |
| Num embeddings | 1025 - 65537 | 65537 | 65537 |
| Precision | | bf16-true | bf16-true |
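A hedged sketch of wiring up such a search with flaml (BlendSearch is flaml.tune's default search algorithm); `train_and_eval` is a hypothetical function that trains one model and returns its validation metric:

```python
from flaml import tune


def train_and_eval(config: dict) -> dict:
    # train a model with `config` and evaluate on the validation split
    ndcg = ...  # placeholder: run training and compute validation NDCG
    return {"ndcg": ndcg}


analysis = tune.run(
    train_and_eval,
    config={
        "negative_multiple": tune.randint(0, 5),  # 0 - 4
        "embedding_dim": tune.randint(4, 65),     # 4 - 64
        "learning_rate": tune.loguniform(0.01, 1.0),
        "num_hashes": tune.randint(1, 5),         # 1 - 4
        "num_embeddings": tune.lograndint(1025, 65538),
    },
    metric="ndcg",
    mode="max",
    num_samples=100,
)
print(analysis.best_config)
```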
Matrix Factorisation
Architecture (figure): user (features) and item (features) → Bloom embeddings → user and item embeddings → cosine similarity scoring, trained with pairwise hinge loss on mixed negatives
Final Comments
Retrieval terminology
The retrieval query is not always a user context
Hence, this generic model can be used for both user-item and item-item recommendation
References
Implementation: https://github.com/yxtay/matrix-factorization-pytorch
TensorFlow Recommenders: tfrs.tasks.Retrieval
BPR: [1205.2618] BPR: Bayesian Personalized Ranking from Implicit Feedback
CCL: [2109.12613] SimpleX: A Simple and Strong Baseline for Collaborative Filtering
SSM: [2201.02327] On the Effectiveness of Sampled Softmax Loss for Item Recommendation
DirectAU: [2206.12811] Towards Representation Alignment and Uniformity in Collaborative Filtering
MAWU: [2308.06091] Toward a Better Understanding of Loss Functions for Collaborative Filtering
InfoNCE+, MINE+: [2312.08520] Revisiting Recommendation Loss Functions through Contrastive Learning (Technical Report)
[2101.08769] Item Recommendation from Implicit Feedback
Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations
Sampling: the secret of how to train good embeddings
MNS: Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations
Hashing Trick: [0902.2206] Feature Hashing for Large Scale Multitask Learning
Hash Embeddings: [1709.03933] Hash Embeddings for Efficient Word Representations
Unified Embeddings: [2305.12102] Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems
Bloom embeddings: Compact word vectors with Bloom embeddings
Appendix: Alignment & Uniformity (DirectAU)