
Collaborative Filtering with Implicit Feedback

Loss, Negative Sampling and Embeddings


Challenges with Implicit Feedback

Interactions are likely very sparse

  • Most of the time, only positive signals are recorded
  • A missing interaction may mean the user dislikes the item, or simply has not been exposed to it yet

Common matrix factorisation algorithms cannot be applied directly

  • They assume missing interactions are zeroes (i.e. true negatives)


Collaborative Filtering Model

(Diagram: user (features) and item (features) pass through an embedding model to produce user and item embeddings; a transform model yields transformed user / item embeddings; a scoring model compares them and a loss model trains them. The same embeddings serve both retrieval and ranking.)


Loss Functions

Pointwise loss (e.g. binary cross-entropy per interaction): generally low performance

Pairwise loss: 1 negative sample for each positive example

  • Bayesian Personalised Ranking (BPR): -logsigmoid(distance_neg - distance_pos)
  • Pairwise Hinge (PHL): relu(distance_pos - distance_neg + margin)

Setwise loss functions: multiple negative samples

  • Cosine Contrastive Loss (CCL): distance_pos + mean(relu(m - distance_neg))
  • Sampled Softmax, InfoNCE: cross_entropy(cat(score_pos, score_neg), 0)

Other loss: no negative samples

  • DirectAU (DAU): distance_pos + uniformity loss
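A minimal PyTorch sketch of the sampled-negative losses above, assuming `dist_*` are distances (lower is better) and `score_*` are similarities (higher is better); shapes and helper names are illustrative:

```python
import torch
import torch.nn.functional as F

def bpr_loss(dist_pos, dist_neg):
    # BPR: push the positive pair closer than the negative pair
    return -F.logsigmoid(dist_neg - dist_pos).mean()

def pairwise_hinge_loss(dist_pos, dist_neg, margin=1.0):
    # PHL: zero loss once the positive beats the negative by `margin`
    return F.relu(dist_pos - dist_neg + margin).mean()

def cosine_contrastive_loss(dist_pos, dist_neg, margin=0.8):
    # CCL: dist_neg is (batch, num_negatives); the hinge keeps only hard negatives
    return (dist_pos + F.relu(margin - dist_neg).mean(dim=1)).mean()

def sampled_softmax_loss(score_pos, score_neg):
    # SSM / InfoNCE: cross-entropy with the positive placed at index 0
    logits = torch.cat([score_pos.unsqueeze(1), score_neg], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```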


Relationship between loss functions


Cosine Contrastive Loss (CCL)

Inspired by contrastive loss in computer vision tasks

Automatically emphasises hard negatives due to hinge loss

Not straightforward to include negative labels (user thumbs down, explicit “not interested”)


Sampled Softmax (SSM), InfoNCE

Computing the softmax over all items is inefficient, hence sampling is used

  • The softmax estimate is biased if samples are not drawn uniformly
  • For example, with popularity sampling, popular items are penalised more
  • The log(q) correction, subtracting the log sampling probability from each sampled logit, debiases the estimate

Optionally, a temperature factor on the scores controls the smoothness of the distribution (see the sketch below)
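A sketch of the log(q) correction and temperature together, assuming `log_q_neg` holds the log sampling probability of each negative (e.g. log item popularity under in-batch sampling); names are illustrative:

```python
import torch
import torch.nn.functional as F

def sampled_softmax_logq(score_pos, score_neg, log_q_neg, temperature=0.05):
    # subtract log(q) from sampled logits so frequently sampled (popular)
    # negatives are not over-penalised
    logits = torch.cat(
        [score_pos.unsqueeze(1), score_neg - log_q_neg], dim=1
    ) / temperature  # lower temperature -> sharper (less smooth) distribution
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```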


InfoNCE+

Mutual Information Neural Estimator + (MINE+)

Reduces to the MINE+ objective when ε = 0

Equivalent to InfoNCE when λ = 1, ε = 1
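The equations on this slide were images in the original deck; the following is one plausible reconstruction consistent with the two equivalences above, treating s(u, i) as the similarity score (an assumption about the exact form, not the deck's notation):

$$\mathcal{L}_{\mathrm{InfoNCE+}} = -\,s(u, i^{+}) + \lambda \,\log\!\Big(\epsilon\, e^{\,s(u, i^{+})} + \sum_{j \in \mathcal{N}} e^{\,s(u, j)}\Big)$$

Setting λ = 1 and ε = 1 puts the positive back inside the log-sum-exp, recovering InfoNCE; setting ε = 0 leaves only negatives inside the log, giving the MINE-style bound.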


DirectAU

Encourages embeddings to be well distributed on the surface of a unit sphere

Negative samples not needed
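A sketch of the alignment and uniformity terms, following the published DirectAU formulation (the `gamma` weight and `t = 2` are common defaults, not values from this deck):

```python
import torch
import torch.nn.functional as F

def alignment(user_emb, item_emb):
    # pull matched user-item pairs together on the unit sphere
    return (user_emb - item_emb).norm(p=2, dim=1).pow(2).mean()

def uniformity(emb, t=2.0):
    # log of the average Gaussian potential between all pairs:
    # lower when points spread evenly over the sphere
    return torch.pdist(emb, p=2).pow(2).mul(-t).exp().mean().log()

def direct_au_loss(user_emb, item_emb, gamma=1.0):
    user_emb = F.normalize(user_emb, dim=1)
    item_emb = F.normalize(item_emb, dim=1)
    return alignment(user_emb, item_emb) + gamma * (
        uniformity(user_emb) + uniformity(item_emb)
    ) / 2
```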


Negative Sampling Strategy

Random negative sampling

  • Noisy false negatives are averaged out during training
  • Requires computing many extra embeddings for the loss and batch update

In-batch negative sampling (sketched below)

  • Equivalent to popularity sampling, since negatives are other users' positives
  • Efficiently reuses the embeddings already computed for the batch
  • Does not train against trivial negatives

Hard negative sampling

  • Requires scoring all user-item pairs every few batches / epochs
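A minimal sketch of the in-batch strategy: score every user in the batch against every item, then read positives off the diagonal and negatives off the rest (helper name is illustrative):

```python
import torch

def in_batch_pos_neg(user_emb, item_emb):
    # user_emb, item_emb: (batch, dim) for matched user-item pairs
    scores = user_emb @ item_emb.T                   # (batch, batch) similarities
    pos = scores.diagonal()                          # matched pairs
    mask = ~torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg = scores[mask].view(scores.size(0), -1)      # other users' positives as negatives
    return pos, neg
```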


In-Batch Negative Sampling


Mixed Negative Sampling
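Mixed negative sampling combines popularity-skewed in-batch negatives with uniformly sampled ones to also cover the long tail; a sketch reusing the helper above (`negative_multiple` mirrors the experiment settings later in this deck):

```python
import torch

def mixed_pos_neg(user_emb, item_emb, all_item_emb, negative_multiple=1):
    # in-batch negatives (popularity-skewed) ...
    pos, in_batch_neg = in_batch_pos_neg(user_emb, item_emb)
    # ... plus uniformly sampled negatives from the whole catalogue
    n = user_emb.size(0) * negative_multiple
    idx = torch.randint(0, all_item_emb.size(0), (n,), device=user_emb.device)
    random_neg = user_emb @ all_item_emb[idx].T      # (batch, n)
    return pos, torch.cat([in_batch_neg, random_neg], dim=1)
```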


Parameter Reduction with Bloom Embeddings

Standard (dictionary-indexed) embeddings use a lot of parameters and memory

  • One unique embedding per ID
  • Also rules out online training, since the dictionary is fixed in advance

The hashing trick reduces the embedding table size at the cost of collisions

  • Collisions are problematic for matrix factorisation because the embedding alone completely defines the user / item

Bloom embeddings further reduce the chance of (full) collision

  • Hash each user / item multiple times and aggregate the resulting embeddings (sketched below)
  • Unified embeddings: one shared table for all user and item features
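A minimal sketch of a multi-hash Bloom embedding module, assuming integer IDs; the affine-functions-mod-table-size hash scheme is illustrative, not the deck's exact implementation:

```python
import torch
import torch.nn as nn

class BloomEmbedding(nn.Module):
    """Hash each id with several seeds into one shared table, then average."""

    def __init__(self, num_embeddings=2**16 + 1, dim=32, num_hashes=2):
        super().__init__()
        self.num_embeddings = num_embeddings
        self.table = nn.Embedding(num_embeddings, dim, sparse=True)
        # random odd multipliers + offsets stand in for independent hash functions
        self.register_buffer("a", torch.randint(1, 2**31 - 1, (num_hashes,)) * 2 + 1)
        self.register_buffer("b", torch.randint(0, 2**31 - 1, (num_hashes,)))

    def forward(self, ids):
        # ids: (...,) integer tensor; two ids fully collide only if ALL hashes collide
        idx = (ids.unsqueeze(-1) * self.a + self.b) % self.num_embeddings
        return self.table(idx).mean(dim=-2)  # aggregate across hashes
```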


Hashing Trick
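The hashing trick in its simplest form, for reference (a sketch; `md5` is just one stable choice of hash function):

```python
import hashlib

def hashed_index(entity_id: str, num_embeddings: int = 2**16 + 1) -> int:
    # map an unbounded id space onto a fixed-size table; distinct ids may collide
    digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_embeddings
```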


Other Embedding Issues

ID-based embeddings cannot produce meaningful vectors for unseen entities

Solution: in addition to the ID, also map user/item features to embeddings and aggregate them (weighted sum / mean), as sketched below

  • A new user/item gets a meaningful embedding as long as it has valid features
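A sketch of the aggregation, reusing the `BloomEmbedding` module from earlier (mean aggregation and the tensor layout are illustrative):

```python
import torch

def entity_embedding(bloom, id_tokens, feature_tokens):
    # id_tokens: (batch, 1); feature_tokens: (batch, num_features) hashed indices
    tokens = torch.cat([id_tokens, feature_tokens], dim=1)
    # mean aggregation: an unseen ID still gets a useful vector from its features
    return bloom(tokens).mean(dim=1)
```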


Other Findings

Recent research articles mainly use vanilla matrix factorisation and LightGCN

  • LightGCN is a very simple yet strong GNN model
  • However, it does not allow online training, because the graph structure changes as new interactions arrive

Score: dot product / L2 distance (unbounded) vs cosine similarity (bounded)

  • Cosine similarity (normalised embeddings) improves training performance significantly
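Normalising the embeddings turns the dot product into cosine similarity, which is the bounded-score option above; a one-liner sketch:

```python
import torch.nn.functional as F

def cosine_scores(user_emb, item_emb):
    # normalised dot product == cosine similarity, bounded in [-1, 1]
    return F.normalize(user_emb, dim=-1) @ F.normalize(item_emb, dim=-1).T
```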


Evaluation Metrics

If retrieved results are directly shown, use Normalised Discounted Cumulative Gain (NDCG) or Average Precision (AP)

  • These metrics weight top-ranked results more heavily

If the retrieved results feed a reranking step, use Recall@k with k equal to the number of candidates to be reranked
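Plain-Python reference implementations of the two regimes, assuming `ranked` is the model's ordering and `relevant` the held-out positives (binary relevance assumed for NDCG):

```python
import math

def recall_at_k(ranked, relevant, k):
    # retrieval-before-reranking: did the candidate set catch the positives?
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    # directly shown results: discount gains by rank position
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg
```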


Implementation Details

Dataset: MovieLens 1M
Training set: first 80% of ratings by timestamp
Validation set: top 2% of users by number of ratings
Evaluation metrics: NDCG, MAP; other retrieval metrics also tracked
Optimizer: SparseAdam with learning_rate = 0.1
Embeddings: num_hashes = 2, num_embeddings = 2^16 + 1, dim = 32
Batch size: 1024
Hyperparameters: max_norm, normalize, train_loss, etc.
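How the optimizer settings above might be wired up (a sketch; the single sparse embedding table stands in for the full model, since SparseAdam expects parameters with sparse gradients, i.e. tables created with `sparse=True`):

```python
import torch
import torch.nn as nn

# stand-in for the model: one sparse embedding table as its only parameter
model = nn.Embedding(2**16 + 1, 32, sparse=True)
optimizer = torch.optim.SparseAdam(model.parameters(), lr=0.1)
```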


ID vs feature embeddings

PHL and BPR perform very strongly; MINE is underwhelming

ID + features generally performed worse, likely due to poor aggregation of the embeddings

Loss Function | ID only (NDCG) | ID only (MAP) | ID + Features (NDCG) | ID + Features (MAP)
BPR           | 0.19279        | 0.29531       | 0.16380              | 0.27287
PHL           | 0.20747        | 0.32241       | 0.15502              | 0.27447
CCL           | 0.19973        | 0.29249       | 0.15080              | 0.26334
DAU           | 0.17919        | 0.31267       | 0.15406              | 0.25741
MINE          | 0.18659        | 0.28979       | 0.15135              | 0.26323


In-Batch vs Mixed Negative Sampling

Large improvements with mixed negative sampling, probably also because the number of negative samples is effectively doubled

PHL and BPR perform best; DAU does not use the negative samples effectively

Loss Function | In-Batch NS (NDCG) | In-Batch NS (MAP) | Mixed NS (NDCG) | Mixed NS (MAP)
BPR           | 0.19279            | 0.29531           | 0.34759         | 0.45224
PHL           | 0.20747            | 0.32241           | 0.36227         | 0.46431
CCL           | 0.19973            | 0.29249           | 0.32482         | 0.42916
DAU           | 0.17919            | 0.31267           | 0.26003         | 0.36187
MINE          | 0.18659            | 0.28979           | 0.34687         | 0.44993


In-Batch vs Mixed Negative Sampling

Even with the same number of negative samples per interaction, MNS shows significant improvements

Additional random negative samples beyond that do not increase performance

Negative multiple              | 0                 | 1                 | 3
Batch size                     | 1024              | 512               | 256
Effective num negative samples | 1023              | 1023              | 1023
Loss                           | PairwiseHingeLoss | PairwiseHingeLoss | PairwiseHingeLoss
NDCG                           | 0.20371           | 0.35314           | 0.35155
MAP                            | 0.30506           | 0.45319           | 0.45154

Each column keeps 1023 effective negatives: (batch_size - 1) in-batch negatives plus negative_multiple x batch_size random negatives.


Mixed Negative Sampling

ID + features improved even more with mixed negative sampling

PHL and BPR perform best in all settings

Loss Function | ID only (NDCG) | ID only (MAP) | ID + Features (NDCG) | ID + Features (MAP)
BPR           | 0.34759        | 0.45224       | 0.36690              | 0.46988
PHL           | 0.36227        | 0.46431       | 0.38848              | 0.48709
CCL           | 0.32482        | 0.42916       | 0.34538              | 0.44670
DAU           | 0.26003        | 0.36187       | 0.24984              | 0.34482
MINE          | 0.34687        | 0.44993       | 0.36637              | 0.46351


Hyperparameter Search with flaml.BlendSearch

Parameter          | Search space  | ID only           | ID + features
Loss               | 5 losses      | PairwiseHingeLoss | PairwiseHingeLoss
NDCG               | -             | 0.40475           | 0.42165
MAP                | -             | 0.50735           | 0.51984
Negative multiple  | 0 - 4         | 3                 | 3
Embedding dim      | 4 - 64        | 32                | 32
Learning rate      | 0.01 - 1.0    | 0.1               | 0.2
Num hashes         | 1 - 4         | 3                 | 2
Num embeddings     | 1025 - 65537  | 65537             | 65537
Precision          | -             | bf16-true         | bf16-true


Matrix Factorisation

(Diagram: user (features) and item (features) → Bloom embeddings → user / item embeddings → cosine similarity, trained with pairwise hinge loss on mixed negatives.)


Final Comments

Retrieval terminology

  • Query: user context
  • Candidate: item context

Retrieval query is not always user context

  • For item-item recommendation, use item context

Hence, this generic model can be used for both user-item and item-item recommendation


Alignment & Uniformity


DirectAU