
Collaborative Filtering with Implicit Feedback

Loss, Negative Sampling and Embeddings


Challenges with Implicit Feedback

Interactions are likely very sparse

  • Most of the time, only positive signals are recorded
  • A missing interaction may mean the user dislikes the item, or simply has not been exposed to it yet

Common matrix factorisation algorithms cannot be applied directly

  • They assume missing interactions are zeroes (i.e. true negatives)


Collaborative Filtering Model

(Diagram: user (features) and item (features) pass through an embedding model to produce user and item embeddings; a transform model yields transformed user / item embeddings; a scoring model compares them and a loss model trains them. The same embeddings serve both retrieval and ranking.)


Loss Functions

Pointwise loss (e.g. binary cross-entropy per interaction): generally low performance

Pairwise loss: 1 negative sample for each positive example

  • Bayesian Personalised Ranking (BPR): -logsigmoid(distance_neg - distance_pos)
  • Pairwise Hinge (PHL): relu(distance_pos - distance_neg + margin)

Setwise loss functions: multiple negative samples

  • Cosine Contrastive Loss (CCL): distance_pos + mean(relu(m - distance_neg))
  • Sampled Softmax, InfoNCE: cross_entropy(cat(score_pos, score_neg), 0)

Other loss: no negative samples

  • DirectAU (DAU): distance_pos + uniformity loss
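A minimal PyTorch sketch of the sampled-negative losses above, assuming `dist_*` are distances (lower is better) and `score_*` are similarities (higher is better); shapes and helper names are illustrative:

```python
import torch
import torch.nn.functional as F

def bpr_loss(dist_pos, dist_neg):
    # BPR: push the positive pair closer than the negative pair
    return -F.logsigmoid(dist_neg - dist_pos).mean()

def pairwise_hinge_loss(dist_pos, dist_neg, margin=1.0):
    # PHL: zero loss once the positive beats the negative by `margin`
    return F.relu(dist_pos - dist_neg + margin).mean()

def cosine_contrastive_loss(dist_pos, dist_neg, margin=0.8):
    # CCL: dist_neg is (batch, num_negatives); the hinge keeps only hard negatives
    return (dist_pos + F.relu(margin - dist_neg).mean(dim=1)).mean()

def sampled_softmax_loss(score_pos, score_neg):
    # SSM / InfoNCE: cross-entropy with the positive placed at index 0
    logits = torch.cat([score_pos.unsqueeze(1), score_neg], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```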


Relationship between loss functions


Cosine Contrastive Loss (CCL)

Inspired by contrastive loss in computer vision tasks

Automatically emphasises hard negatives due to hinge loss

Not straightforward to include negative labels (user thumbs down, explicit “not interested”)


Sampled Softmax (SSM), InfoNCE

Computing the softmax over all items is inefficient, hence sampling is used

  • The softmax estimate is biased if samples are not drawn uniformly
  • For example, with popularity sampling, popular items are penalised more
  • The log(q) correction, subtracting the log sampling probability from each sampled logit, debiases the estimate

Optionally, a temperature factor on the scores controls the smoothness of the distribution (see the sketch below)
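A sketch of the log(q) correction and temperature together, assuming `log_q_neg` holds the log sampling probability of each negative (e.g. log item popularity under in-batch sampling); names are illustrative:

```python
import torch
import torch.nn.functional as F

def sampled_softmax_logq(score_pos, score_neg, log_q_neg, temperature=0.05):
    # subtract log(q) from sampled logits so frequently sampled (popular)
    # negatives are not over-penalised
    logits = torch.cat(
        [score_pos.unsqueeze(1), score_neg - log_q_neg], dim=1
    ) / temperature  # lower temperature -> sharper (less smooth) distribution
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```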


InfoNCE+

Mutual Information Neural Estimator + (MINE+)

Reduces to the MINE+ objective when ε = 0

Equivalent to InfoNCE when λ = 1, ε = 1
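The equations on this slide were images in the original deck; the following is one plausible reconstruction consistent with the two equivalences above, treating s(u, i) as the similarity score (an assumption about the exact form, not the deck's notation):

$$\mathcal{L}_{\mathrm{InfoNCE+}} = -\,s(u, i^{+}) + \lambda \,\log\!\Big(\epsilon\, e^{\,s(u, i^{+})} + \sum_{j \in \mathcal{N}} e^{\,s(u, j)}\Big)$$

Setting λ = 1 and ε = 1 puts the positive back inside the log-sum-exp, recovering InfoNCE; setting ε = 0 leaves only negatives inside the log, giving the MINE-style bound.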


DirectAU

Encourages embeddings to be well distributed on the surface of a unit sphere

Negative samples not needed
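A sketch of the alignment and uniformity terms, following the published DirectAU formulation (the `gamma` weight and `t = 2` are common defaults, not values from this deck):

```python
import torch
import torch.nn.functional as F

def alignment(user_emb, item_emb):
    # pull matched user-item pairs together on the unit sphere
    return (user_emb - item_emb).norm(p=2, dim=1).pow(2).mean()

def uniformity(emb, t=2.0):
    # log of the average Gaussian potential between all pairs:
    # lower when points spread evenly over the sphere
    return torch.pdist(emb, p=2).pow(2).mul(-t).exp().mean().log()

def direct_au_loss(user_emb, item_emb, gamma=1.0):
    user_emb = F.normalize(user_emb, dim=1)
    item_emb = F.normalize(item_emb, dim=1)
    return alignment(user_emb, item_emb) + gamma * (
        uniformity(user_emb) + uniformity(item_emb)
    ) / 2
```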


Negative Sampling Strategy

Random negative sampling

  • Noisy false negatives are averaged out during training
  • Requires computing many extra embeddings for the loss and batch update

In-batch negative sampling (sketched below)

  • Equivalent to popularity sampling, since negatives are other users' positives
  • Efficiently reuses the embeddings already computed for the batch
  • Does not train against trivial negatives

Hard negative sampling

  • Requires scoring all user-item pairs every few batches / epochs
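A minimal sketch of the in-batch strategy: score every user in the batch against every item, then read positives off the diagonal and negatives off the rest (helper name is illustrative):

```python
import torch

def in_batch_pos_neg(user_emb, item_emb):
    # user_emb, item_emb: (batch, dim) for matched user-item pairs
    scores = user_emb @ item_emb.T                   # (batch, batch) similarities
    pos = scores.diagonal()                          # matched pairs
    mask = ~torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg = scores[mask].view(scores.size(0), -1)      # other users' positives as negatives
    return pos, neg
```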


In-Batch Negative Sampling


Mixed Negative Sampling
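Mixed negative sampling combines popularity-skewed in-batch negatives with uniformly sampled ones to also cover the long tail; a sketch reusing the helper above (`negative_multiple` mirrors the experiment settings later in this deck):

```python
import torch

def mixed_pos_neg(user_emb, item_emb, all_item_emb, negative_multiple=1):
    # in-batch negatives (popularity-skewed) ...
    pos, in_batch_neg = in_batch_pos_neg(user_emb, item_emb)
    # ... plus uniformly sampled negatives from the whole catalogue
    n = user_emb.size(0) * negative_multiple
    idx = torch.randint(0, all_item_emb.size(0), (n,), device=user_emb.device)
    random_neg = user_emb @ all_item_emb[idx].T      # (batch, n)
    return pos, torch.cat([in_batch_neg, random_neg], dim=1)
```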


Parameter Reduction with Bloom Embeddings

Standard (dictionary-indexed) embeddings use a lot of parameters and memory

  • One unique embedding per ID
  • Also rules out online training, since the dictionary is fixed in advance

The hashing trick reduces the embedding table size at the cost of collisions

  • Collisions are problematic for matrix factorisation because the embedding alone completely defines the user / item

Bloom embeddings further reduce the chance of (full) collision

  • Hash each user / item multiple times and aggregate the resulting embeddings (sketched below)
  • Unified embeddings: one shared table for all user and item features
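A minimal sketch of a multi-hash Bloom embedding module, assuming integer IDs; the affine-functions-mod-table-size hash scheme is illustrative, not the deck's exact implementation:

```python
import torch
import torch.nn as nn

class BloomEmbedding(nn.Module):
    """Hash each id with several seeds into one shared table, then average."""

    def __init__(self, num_embeddings=2**16 + 1, dim=32, num_hashes=2):
        super().__init__()
        self.num_embeddings = num_embeddings
        self.table = nn.Embedding(num_embeddings, dim, sparse=True)
        # random odd multipliers + offsets stand in for independent hash functions
        self.register_buffer("a", torch.randint(1, 2**31 - 1, (num_hashes,)) * 2 + 1)
        self.register_buffer("b", torch.randint(0, 2**31 - 1, (num_hashes,)))

    def forward(self, ids):
        # ids: (...,) integer tensor; two ids fully collide only if ALL hashes collide
        idx = (ids.unsqueeze(-1) * self.a + self.b) % self.num_embeddings
        return self.table(idx).mean(dim=-2)  # aggregate across hashes
```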


Hashing Trick
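The hashing trick in its simplest form, for reference (a sketch; `md5` is just one stable choice of hash function):

```python
import hashlib

def hashed_index(entity_id: str, num_embeddings: int = 2**16 + 1) -> int:
    # map an unbounded id space onto a fixed-size table; distinct ids may collide
    digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_embeddings
```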


Other Embedding Issues

ID-based embeddings cannot produce meaningful vectors for unseen entities

Solution: in addition to the ID, also map user/item features to embeddings and aggregate them (weighted sum / mean), as sketched below

  • A new user/item gets a meaningful embedding as long as it has valid features
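A sketch of the aggregation, reusing the `BloomEmbedding` module from earlier (mean aggregation and the tensor layout are illustrative):

```python
import torch

def entity_embedding(bloom, id_tokens, feature_tokens):
    # id_tokens: (batch, 1); feature_tokens: (batch, num_features) hashed indices
    tokens = torch.cat([id_tokens, feature_tokens], dim=1)
    # mean aggregation: an unseen ID still gets a useful vector from its features
    return bloom(tokens).mean(dim=1)
```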


Other Findings

Recent research articles mainly use vanilla matrix factorisation and LightGCN

  • LightGCN is a very simple yet strong GNN model
  • However, it does not allow online training, because the graph structure changes as new interactions arrive

Score: dot product / L2 distance (unbounded) vs cosine similarity (bounded)

  • Cosine similarity (normalised embeddings) improves training performance significantly
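Normalising the embeddings turns the dot product into cosine similarity, which is the bounded-score option above; a one-liner sketch:

```python
import torch.nn.functional as F

def cosine_scores(user_emb, item_emb):
    # normalised dot product == cosine similarity, bounded in [-1, 1]
    return F.normalize(user_emb, dim=-1) @ F.normalize(item_emb, dim=-1).T
```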


Evaluation Metrics

If retrieved results are directly shown, use Normalised Discounted Cumulative Gain (NDCG) or Average Precision (AP)

  • These metrics weight top-ranked results more heavily

If the retrieved results feed a reranking step, use Recall@k with k equal to the number of candidates to be reranked
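Plain-Python reference implementations of the two regimes, assuming `ranked` is the model's ordering and `relevant` the held-out positives (binary relevance assumed for NDCG):

```python
import math

def recall_at_k(ranked, relevant, k):
    # retrieval-before-reranking: did the candidate set catch the positives?
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    # directly shown results: discount gains by rank position
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg
```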


Implementation Details

Dataset: MovieLens 1M
Training set: first 80% of ratings by timestamp
Validation set: top 2% of users by number of ratings
Evaluation metrics: NDCG, MAP; other retrieval metrics also tracked
Optimizer: SparseAdam with learning_rate = 0.1
Embeddings: num_hashes = 2, num_embeddings = 2^16 + 1, dim = 32
Batch size: 1024
Hyperparameters: max_norm, normalize, train_loss, etc.
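How the optimizer settings above might be wired up (a sketch; the single sparse embedding table stands in for the full model, since SparseAdam expects parameters with sparse gradients, i.e. tables created with `sparse=True`):

```python
import torch
import torch.nn as nn

# stand-in for the model: one sparse embedding table as its only parameter
model = nn.Embedding(2**16 + 1, 32, sparse=True)
optimizer = torch.optim.SparseAdam(model.parameters(), lr=0.1)
```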


ID vs feature embeddings

PHL and BPR perform very strongly; MINE is underwhelming

ID + features generally performed worse, likely due to poor aggregation of the embeddings

Loss Function | ID only (NDCG) | ID only (MAP) | ID + Features (NDCG) | ID + Features (MAP)
BPR           | 0.19279        | 0.29531       | 0.16380              | 0.27287
PHL           | 0.20747        | 0.32241       | 0.15502              | 0.27447
CCL           | 0.19973        | 0.29249       | 0.15080              | 0.26334
DAU           | 0.17919        | 0.31267       | 0.15406              | 0.25741
MINE          | 0.18659        | 0.28979       | 0.15135              | 0.26323


In-Batch vs Mixed Negative Sampling

Large improvements with mixed negative sampling, probably also because the number of negative samples is effectively doubled

PHL and BPR perform best; DAU does not use the negative samples effectively

Loss Function | In-Batch NS (NDCG) | In-Batch NS (MAP) | Mixed NS (NDCG) | Mixed NS (MAP)
BPR           | 0.19279            | 0.29531           | 0.34759         | 0.45224
PHL           | 0.20747            | 0.32241           | 0.36227         | 0.46431
CCL           | 0.19973            | 0.29249           | 0.32482         | 0.42916
DAU           | 0.17919            | 0.31267           | 0.26003         | 0.36187
MINE          | 0.18659            | 0.28979           | 0.34687         | 0.44993


In-Batch vs Mixed Negative Sampling

Even with the same number of negative samples per interaction, MNS shows significant improvements

Additional random negative samples beyond that do not increase performance

Negative multiple              | 0                 | 1                 | 3
Batch size                     | 1024              | 512               | 256
Effective num negative samples | 1023              | 1023              | 1023
Loss                           | PairwiseHingeLoss | PairwiseHingeLoss | PairwiseHingeLoss
NDCG                           | 0.20371           | 0.35314           | 0.35155
MAP                            | 0.30506           | 0.45319           | 0.45154

Each column keeps 1023 effective negatives: (batch_size - 1) in-batch negatives plus negative_multiple x batch_size random negatives.


Mixed Negative Sampling

ID + features improved even more with mixed negative sampling

PHL and BPR perform best in all settings

Loss Function | ID only (NDCG) | ID only (MAP) | ID + Features (NDCG) | ID + Features (MAP)
BPR           | 0.34759        | 0.45224       | 0.36690              | 0.46988
PHL           | 0.36227        | 0.46431       | 0.38848              | 0.48709
CCL           | 0.32482        | 0.42916       | 0.34538              | 0.44670
DAU           | 0.26003        | 0.36187       | 0.24984              | 0.34482
MINE          | 0.34687        | 0.44993       | 0.36637              | 0.46351


Hyperparameter Search with flaml.BlendSearch

Parameter          | Search space  | ID only           | ID + features
Loss               | 5 losses      | PairwiseHingeLoss | PairwiseHingeLoss
NDCG               | -             | 0.40475           | 0.42165
MAP                | -             | 0.50735           | 0.51984
Negative multiple  | 0 - 4         | 3                 | 3
Embedding dim      | 4 - 64        | 32                | 32
Learning rate      | 0.01 - 1.0    | 0.1               | 0.2
Num hashes         | 1 - 4         | 3                 | 2
Num embeddings     | 1025 - 65537  | 65537             | 65537
Precision          | -             | bf16-true         | bf16-true


Matrix Factorisation

(Diagram: user (features) and item (features) → Bloom embeddings → user / item embeddings → cosine similarity, trained with pairwise hinge loss on mixed negatives.)


Final Comments

Retrieval terminology

  • Query: user context
  • Candidate: item context

Retrieval query is not always user context

  • For item-item recommendation, use item context

Hence, this generic model can be used for both user-item and item-item recommendation


Alignment & Uniformity


DirectAU