You got served: How Deliveroo improved the ranking of restaurants

Jonny Brooks-Bartlett - Data scientist, Algorithms


What I’ll be talking about today

  • Introduction
  • Approach to ranking
  • Choosing tools and processes
  • Lessons learned
  • Summary

Note: the task is to rank a list of restaurants, as opposed to scoring restaurants in isolation

Introduction

Deliveroo

  • Restaurants: 10,000’s
  • Riders: 10,000’s
  • Consumers: over 300 cities across 13 countries

For example, you might order from Wagamama’s: they don’t have their own delivery fleet, so Deliveroo provides the riders who deliver the food

Deliveroo platform: web and app

Enter Merchandising Algorithms team

Our initial goal: Present the most relevant restaurants to the consumer at the top of the feed

Creating a ranking model

The objective

Given a list of restaurants, rank them “optimally”

Optimal = Rank in order of relevance to the consumer

How do we quantify this?

Quantifying the objective

Online metrics

Order volume

Session-level conversion = (# of sessions that resulted in an order) / (# of sessions)
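For example, if 30 of 150 sessions resulted in an order, the session-level conversion would be 30/150 = 0.2.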

Framing the problem

Each session presents a ranked list of restaurants (lists can be 100’s of restaurants long), with a binary label per restaurant:

  • Session 1 - Converted? 0, 1, 0, 0, …
  • Session 2 - Converted? 1, 0, 0, 0, …

We only train on converted sessions

Classification problem - pointwise approach

What’s the probability that the user purchases from this restaurant? Each restaurant in the list gets a score, e.g. 0.8, 0.6, 0.2, 0.1

Models can be trained with the log loss

Scores are computed pointwise, one restaurant at a time

Sorting the scores in descending order gives the final ranking
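A minimal sketch of the pointwise approach in Python; `model` is any classifier exposing predict_proba (e.g. scikit-learn’s LogisticRegression) and the feature matrix is hypothetical:

```python
# A minimal sketch (not the production code): score each restaurant
# independently, then sort by score descending to get the final ranking.
def rank_restaurants(model, restaurant_ids, features):
    scores = model.predict_proba(features)[:, 1]  # P(conversion) per restaurant
    return sorted(zip(restaurant_ids, scores), key=lambda x: x[1], reverse=True)
```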

Relevance to user = Conversion score

Probability of the user purchasing from the restaurant (conversion)

≈ f(popularity, estimated time of arrival (ETA), restaurant rating, does the restaurant have an image?, …)

  • Initially used a heuristic - a mixture of popularity and ETA

  • This allowed us to focus on getting the end-to-end pipeline working

  • Moved on to logistic regression models (sketched below)

  • Can move on to more complex models later

Start simple and iterate
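A minimal sketch of the “start simple” step, assuming scikit-learn and hypothetical feature names (the real feature set and pipeline are not shown in the talk):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per (session, restaurant) pair; label = did this restaurant convert?
train = pd.DataFrame({
    'popularity':  [0.9, 0.4, 0.7, 0.1],
    'eta_minutes': [20, 45, 30, 60],
    'converted':   [1, 0, 1, 0],
})

# Logistic regression is trained by minimising the log loss.
model = LogisticRegression()
model.fit(train[['popularity', 'eta_minutes']], train['converted'])
```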

Evaluating models

Offline metrics (proxy to online metrics)

  • Mean reciprocal rank (MRR)
  • Precision at k
  • Recall at k
  • (Normalised) discounted cumulative gain (NDCG)

Calculating the MRR

Five example converted sessions; in each, the reciprocal rank is 1/(position of the converted restaurant):

Reciprocal ranks: 1/3, 1/4, 1/3, 1/4, 1/5

Mean reciprocal rank = (1/3 + 1/4 + 1/3 + 1/4 + 1/5) / 5 = 41/150 ≈ 0.273

We only train on converted sessions.

Offline we care about the position of the conversion, whereas online we only care whether the session converted at all. The business mainly cares about whether sessions convert, so the online metric is closer to the business goal.

We can calculate an online version of MRR and have done some work on correlating online and offline metrics.
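Below is a minimal sketch of the MRR computation in Python; the data layout (a ranked list of restaurant IDs plus the converted ID per session) is an assumption for illustration:

```python
# A minimal sketch: MRR over converted sessions. Each session is a
# (ranked_restaurant_ids, converted_id) pair; rank positions are 1-based.
def mean_reciprocal_rank(sessions):
    reciprocal_ranks = [1.0 / (ranked.index(converted) + 1)
                        for ranked, converted in sessions]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# The worked example above: conversions at positions 3, 4, 3, 4 and 5.
sessions = [(['a', 'b', 'c', 'd', 'e'], p) for p in ['c', 'd', 'c', 'd', 'e']]
print(mean_reciprocal_rank(sessions))  # 41/150 ≈ 0.273
```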

Model selection workflow

  • Data Warehouse → build train/test datasets (SQL)
  • Validate the data
  • Train multiple models: Model 1, Model 2, …, Model n
  • Calculate the MRR for each model: MRR 1, MRR 2, …, MRR n
  • Choose the model with the best MRR → Model (best)
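For the final step, a minimal sketch (the candidate names and MRR values are made up):

```python
# A minimal sketch: choose the candidate model with the best offline MRR.
mrr_by_model = {'heuristic': 0.251, 'logreg_v1': 0.262, 'logreg_v2': 0.273}
best_model = max(mrr_by_model, key=mrr_by_model.get)
print(best_model)  # 'logreg_v2'
```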

Productionising the model

Run CircleCI - run tests, then push Docker containers

Run Canoe - model building pipeline


Run A/B tests (an iterative process)

User-level, 50/50 split between Algorithm A and Algorithm B

Current work

More complex models and feature engineering

Probability of conversion ≈ f(popularity, estimated time of arrival (ETA), restaurant rating, does the restaurant have an image?, …) - with richer features and a more complex f

Choosing tools and processes

How to productionise models

  • Wrap the chosen model in a new service that handles requests
  • Integrate a serialised version of the chosen model into the existing production service
  • Rewrite the model from the prototype language in the production language

We need to predict in real time because some features are only known at serve time

Choosing the modelling framework

  • Good documentation and community
  • Includes linear models and neural networks
  • Estimator API
  • Can call easily from other languages

Build and train a model with the TensorFlow Estimator API

  • Define how data flows into the model
  • Create the features
  • Create the estimator
  • Train the model
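A minimal sketch of these four steps, assuming the TensorFlow 1.x-style Estimator API and hypothetical feature names (the real input pipeline and feature set are not shown in the talk):

```python
import tensorflow as tf

# 1. Define how data flows into the model.
def train_input_fn(features, labels, batch_size=256):
    ds = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    return ds.shuffle(10_000).repeat().batch(batch_size)

# 2. Create the features.
feature_columns = [
    tf.feature_column.numeric_column('popularity'),
    tf.feature_column.numeric_column('eta_minutes'),
    tf.feature_column.numeric_column('rating'),
    tf.feature_column.numeric_column('has_image'),
]

# 3. Create the estimator - a linear (logistic regression) classifier.
estimator = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    model_dir='conversion_model',  # checkpoints are written here
)

# 4. Train the model (train_features / train_labels prepared elsewhere).
# estimator.train(input_fn=lambda: train_input_fn(train_features, train_labels),
#                 steps=1000)
```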

Inference in Go

  • Get the model
  • Feed in the input features
  • Read the output node


Lessons learned

We’ve run a number of experiments and deployed several iterations

Check for skew in production vs training data

[Chart: offline rank vs production rank]

[Chart: % error per feature - offline vs production feature discrepancy]

A feature store can help address the train-serve skew problem

Log and monitor EVERYTHING - and error early

Example: a change to tracking once meant we lost most of the data

Evaluation of ranking models is very hard

  • Single global evaluation metrics like MRR can be misleading

  • Sometimes an improvement in MRR doesn’t lead to an improvement in online metrics

  • We need to look at several things to be sure that the ranking model is working as expected

Evaluation of ranking models is very hard (cont)

Rank correlation metrics help us determine whether two ranking algorithms are sufficiently different/similar to warrant releasing.

A Spearman’s rank correlation near 1 means we likely won’t see much change
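A minimal sketch using scipy, with made-up positions for the same restaurants under two algorithms:

```python
# A minimal sketch: Spearman's rank correlation between two algorithms'
# rankings of the same restaurant list (positions are made up).
from scipy.stats import spearmanr

positions_algo_a = [1, 2, 3, 4, 5]  # current algorithm
positions_algo_b = [2, 1, 3, 4, 5]  # candidate algorithm

rho, _ = spearmanr(positions_algo_a, positions_algo_b)
print(rho)  # near 1 => the rankings are very similar
```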

Evaluation of ranking models is very hard (cont)

Employees can look at their individual restaurant lists before we release a model for an A/B test.

This is a sense check and is great for spotting specific issues with algorithms

Look at both the app and the ranking insights tool to get a good idea about whether a ranking algorithm is working as expected.

Periodically read Google’s Rules of ML (https://developers.google.com/machine-learning/guides/rules-of-ml)


Lessons learned summary!

  • Check for differences between training and production environments; this lets us work at pace and be sure that we’re impacting the right metrics

  • Log and monitor EVERYTHING!

  • Don’t just rely on global metrics. You may need to look at multiple metrics to be confident that your model works

  • Read (and re-read) Google’s rules of ML

Wrapping up

Summary - what we’ve covered

  • The Merchandising Algorithms team was set up with the initial aim of presenting the most relevant restaurants at the top of the user’s list

  • We’ve learned a lot along the way and are still learning as we go on


Summary - future work for the merchandising algos team

  • Ultimately we want to algorithmically generate the consumer pages

  • Algorithms to impact search results, carousel placement, marketing offers etc.


Summary - other data science teams

  • Pricing algorithms

  • Logistics algorithms

  • Experimentation platform


Thanks for listening

Appendix

(answers to potential questions)

Affinity modelling

[Diagram: example sessions, with each restaurant labelled Converted = 1 or Converted = 0]

MRR indicated that we SHOULDN’T downsample the negative class

[Chart: MRR against increasing downsampling of the negative class]

Session-based metrics can be misleading

A move towards user-level metrics may be beneficial in any case, given that the interpretation of rate metrics such as conversion can be ambiguous when the unit of analysis, the denominator, is not the randomisation unit. For example, an increase in session-level conversion-rate could indicate either an improved, diminished or un-changed user experience depending on whether the numerator (conversions) or denominator (sessions), or both, have changed.

https://towardsdatascience.com/the-second-ghost-of-experimentation-the-fallacy-of-session-based-metrics-fb65006d30ff

