
CUPED-like variance reduction

Its usage in SEO metrics and beyond

August 2023

Bunsen metrics


CUPED:

our history with the method



What is CUPED?


CUPED is a well-established variance reduction technique for experiments, introduced in Deng, Xu, Kohavi & Walker (2013), “Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data”.

Unlike methods such as outlier removal or winsorization, it doesn’t sacrifice any data integrity.

It can be used together with winsorization, which enhances its effectiveness.

It requires pre-experimentation data - which we generally have in Bunsen.

For a rather unpredictable event like ad clicks, we get a ~5% reduction in standard deviation (which is a ~10% reduction in variance, and hence in experiment run time). For something more predictable, like sessions, that 5% can grow to something like 30%.
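To spell out the arithmetic behind that parenthetical (using the standard rule that the sample size required for a given power scales linearly with the variance):

$$(0.95\,\sigma)^2 = 0.9025\,\sigma^2 \approx 0.9\,\sigma^2, \qquad n_{\text{required}} \propto \sigma^2,$$

so a 5% cut in standard deviation is roughly a 10% cut in variance, and hence in required sample size and run time.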


That sounds awesome! Why haven’t we used it?


I (dcjkwon) first learned about CUPED in mid-2019. The DS team back then all agreed that this was awesome and that we wanted to implement it.

But there were several problems:

  • CUPED requires linear regression and pre-experiment data.
  • This greatly increases the SQL query complexity.
  • This complex change would have to be implemented for each metric separately.
  • For all that, it often results in only a few percent gains for our most problematic metrics.

The result is that CUPED has not seen widespread use in our experimental analysis.


So why should we start using it now?


Several new developments in the last year have come together to negate nearly all of the previous obstacles.

  • The new jinja metrics writing system has macros,
    • which allows the complexity to be encapsulated, and
    • greatly eases the implementation across multiple metrics.
  • SEO metrics require a pre-period, which naturally lends itself to CUPED-like methods.
  • The “simplified diff-in-diff” initially adopted by the SEO metrics is really a CUPED-like approach, so writing such metrics already put me in the right mindset.

All this gave me a better understanding of the principles behind CUPED, and let me write something CUPED-like in a macro that “just works” for all our metrics.


CUPED:

The “official” method


How do we implement CUPED?


You can read the paper here. The part that people remember for implementation goes like this:

Define (eq. 3):

$$\hat{Y}_{cv} = \bar{Y} - \theta\bar{X} + \theta E(X)$$

where Y is the quantity you want to measure (for example, sessions),

X is the pre-experiment value for that quantity (sessions one week before the experiment), and

θ is the OLS linear regression coefficient from regressing Y on X.

Then the effect size (eq. 7) is the difference of the CUPED-adjusted means between the treatment and control cohorts:

$$\Delta_{cv} = \hat{Y}_{cv}^{(t)} - \hat{Y}_{cv}^{(c)}$$


How do we implement CUPED?

Example with actual data


For the sake of coding (and my convenience), we will write

$$\hat{Y}_{cv} = \bar{Y} - \theta\bar{X} + \theta E(X)$$

as:

Y_cv = Y_bar - theta * X_bar + theta * E(X).

Then this notebook shows how to implement CUPED line-by-line. Let’s take a look.
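For readers without access to the notebook, here is a minimal Python sketch of the same eq. 3 / eq. 7 computation. The data here is simulated purely for illustration; the notebook remains the line-by-line reference.

  import numpy as np

  rng = np.random.default_rng(0)

  # Simulated stand-in data: x = pre-experiment sessions per guv,
  # y = in-experiment sessions per guv, with a fake +0.1 effect on cohort A.
  x = rng.poisson(10, 20_000).astype(float)
  y = x + rng.normal(0, 3, 20_000)
  y[:10_000] += 0.1

  # theta and E(X) are estimated on the pooled data from ALL cohorts.
  theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

  # Eq. 3, per guv: Y_cv = Y - theta * X + theta * E(X)
  y_cv = y - theta * x + theta * x.mean()

  # Eq. 7: effect size = difference of the CUPED-adjusted cohort means.
  print(y_cv[:10_000].mean() - y_cv[10_000:].mean())  # ~0.1, unchanged effect
  print(y_cv.std(ddof=1), y.std(ddof=1))              # noticeably smaller std dev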


How do we implement CUPED?

A deeper method:


I, personally, don’t like the above method, because I’m bad at stats notation. Fortunately, there is a deeper, better way of doing CUPED, which the paper mentions. But few pay attention to it, because it’s only mentioned in passing and the key formula can’t be used directly for implementation. The formula is this (eq. 6):

$$\hat{Y}_{cv} = \bar{Y} - \overline{f(X)} + E(f(X))$$

where f(X) is ANY reasonable regression function for Y based on X. As before, we will write this as:

Y_cv = Y_bar - f(X)_bar + E(f(X))

We will next explore the meaning of this formula.


CUPED:

An intuitive understanding


Predicted uncertainty is not uncertainty


Predicted uncertainty is not uncertainty.

So the variance in your experiment metric should only take into account the unpredictable part of your measurement. If you predict that:

  • guv A will have 5 sessions during the experiment, and
  • guv B will have 20 sessions,

and when you actually measure the sessions, you get:

  • guv A has 4 sessions (-1 from prediction)
  • guv B has 21 sessions (+1 from prediction),

Then the variance, or the uncertainty, in your measurement can be calculated on the residuals of your predictions, rather than on the measurements themselves. This gives var([-1, +1]), rather than var([4, 21]), which is a great reduction.
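A two-line check of those numbers (the values are the hypothetical ones from the example above):

  import numpy as np

  measured  = np.array([4.0, 21.0])   # guv A, guv B
  predicted = np.array([5.0, 20.0])
  residuals = measured - predicted    # [-1.0, +1.0]

  print(np.var(measured))    # 72.25: variance of the raw measurements
  print(np.var(residuals))   # 1.0:   variance of what we couldn't predict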


Connecting with intuition: the general CUPED equation


Consider the general CUPED equation:

Y_cv = Y_bar - f(X)_bar + E(f(X))

First, let’s remove the cohort-wide aggregation; we’ll bring it back at the end, as the last step, as we do in our metrics. Then, on a per-guv basis, we have:

Y_cv = Y - f(X) + E(f(X))

Remember, f(X) is the result of ANY regression for Y based on X. So we’ll re-label f(X) as Y_pred. Then Y - Y_pred is the residual of your prediction, the part you couldn’t predict. We’ll call it Y_residual.

Y_residual is expected to be zero on average. So we add back in E(f(X)), the average predicted value - a constant for the whole experimental population. This simply ensures that Y_cv has the same average value as Y.


Connecting with intuition: the linear regression equation


So we’ve rewritten: Y_cv = Y_bar - f(X)_bar + E(f(X))

To:

Y_cv = Y_residual + Y_avg_predicted_value.

Remember the linear regression version of CUPED. There the equation (after removing the bars) was:

Y_cv = Y - theta * X + theta * E(X).

The prediction of linear regression is Y_pred = theta * X - theta * E(X) + E(Y). Rearranging and substituting into the above equation, we get:

Y_cv = Y - Y_pred + E(Y)

= Y_residual + Y_avg_predicted_value

So the linear regression CUPED is just a special case of the more general CUPED equation. Remember, you can use ANY reasonable prediction model for CUPED. Linear regression is just one possibility.
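A quick numerical check of that substitution, on simulated data (nothing here is Bunsen-specific):

  import numpy as np

  rng = np.random.default_rng(1)
  x = rng.normal(size=1_000)
  y = 2.0 * x + rng.normal(size=1_000)

  theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

  # OLS prediction, written so that avg(y_pred) == avg(y):
  y_pred = theta * x - theta * x.mean() + y.mean()

  general = y - y_pred + y.mean()             # Y - Y_pred + E(Y)
  classic = y - theta * x + theta * x.mean()  # Y - theta*X + theta*E(X)

  print(np.allclose(general, classic))        # True: identical per-unit values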


Connecting with intuition: the overall program


So here’s how to implement: Y_cv = Y_residual + Y_avg_predicted_value.

  1. Use any “reasonable” model to predict Y from X, using data from all the cohorts. We’ll come back to what “reasonable” means later.
  2. Get Y_pred using the model, for each experimental unit.
    1. The difference between Y and Y_pred is Y_residual - a separate value for each experimental unit.
    2. The average of Y_pred is Y_avg_predicted_value - a constant for all experimental units, since it’s a single aggregated value for the entire experimental population.
  3. Construct Y_cv = Y_residual + Y_avg_predicted_value. Then apply the standard mechanisms of the t-test to get the same treatment effect as the t-test on Y, but with reduced variance.

Let’s see how it works, in this notebook:
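Since the notebook isn’t reproduced here, this is a rough Python sketch of steps 1-3 with a pluggable model; the function name and the sklearn choice are mine, not the notebook’s:

  import numpy as np
  from sklearn.linear_model import LinearRegression  # any "reasonable" model

  def cuped_general(y, X, model):
      """Y_cv = Y_residual + Y_avg_predicted_value, for ANY prediction model."""
      model.fit(X, y)                  # step 1: fit on ALL cohorts pooled
      y_pred = model.predict(X)        # step 2: per-unit predictions
      y_residual = y - y_pred          # step 2a: per-unit residuals
      y_avg_pred = y_pred.mean()       # step 2b: one constant for everyone
      return y_residual + y_avg_pred   # step 3: Y_cv, ready for the t-test

  rng = np.random.default_rng(2)
  X = rng.poisson(10, (20_000, 1)).astype(float)  # pre-experiment covariate(s)
  y = X[:, 0] + rng.normal(0, 3, 20_000)
  y[:10_000] += 0.1                               # simulated treatment effect

  y_cv = cuped_general(y, X, LinearRegression())
  print(y_cv[:10_000].mean() - y_cv[10_000:].mean())  # effect estimate, ~0.1
  print(y_cv.std(ddof=1), y.std(ddof=1))              # reduced std dev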


The rules for a “reasonable” prediction model


But what about the objection that we’re “cheating” in our “prediction” by using all the data? It turns out that this is okay. In order for our program to work, we only need two things:

  1. X must be completely independent of the cohorts.

This was the point of using pre-experiment values. It means the model, and its predictions, DO NOT KNOW which cohort a prediction is for. This prevents the model from overfitting to Y at the cohort level, and therefore preserves the genuine “surprise” in the experimental treatment effect that shows up in Y_residual.

  2. avg(f(X)) has to equal avg(Y) for any set of X, Y.

The average of the predictions has to be the same as the average of the actual Y values used to train the model. This means the model isn’t biased, which will be true for virtually any ML model (OLS with an intercept satisfies it exactly).

Let’s again see this demonstrated in the notebook.
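And a minimal demonstration of condition 2, assuming an OLS model (which satisfies it exactly on its training data):

  import numpy as np
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(3)
  X = rng.normal(size=(5_000, 1))
  y = 3.0 * X[:, 0] + rng.normal(size=5_000)

  y_pred = LinearRegression().fit(X, y).predict(X)

  # avg(f(X)) == avg(Y): OLS with an intercept is exactly unbiased on its
  # training data, so Y_cv = Y - f(X) + E(f(X)) keeps the same mean as Y.
  print(np.isclose(y_pred.mean(), y.mean()))  # True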


CUPED:

Conclusion, links, etc.


Everything is CUPED


With this understanding, we can think of all kinds of different experimental analysis techniques as CUPED, depending on how you make your predictions.

Prediction: Y_pred = 0.

Leads to: standard t-test (the vast majority of our metrics)

Prediction: Y_pred = X, where X is the same quantity measured in the pre period.

Leads to: “simplified diff-in-diff”, or post-minus-pre analysis

Prediction: Y_pred calculated by stratifying X, and taking the mean of each stratum (see the sketch after this list)

Leads to: stratification CUPED (current macro in bunsen_metrics)

Prediction: Y_pred calculated by linear regression on X

Leads to: standard CUPED


Prediction: Y_pred based on full-blown ML on multi-dimensional X

Leads to: general ML CUPED

Prediction: Y_pred = X, where X is the paired value in the control cohort. Basically, just stratification trained only on the control cohort.

Leads to: paired t-test

Prediction: Y_pred = AVG(X), where X is both values in the pair. Basically, just stratification trained on all cohorts.

Leads to: paired t-test (be careful about covariances!)
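As a concrete example of the macro’s method, here is a rough Python analogue of stratification CUPED. The real macro lives in SQL in bunsen_metrics; the column names and stratum count below are illustrative only:

  import numpy as np
  import pandas as pd

  def stratification_cuped(df, y_col, x_col, n_strata=10):
      """Y_pred = mean(Y) within each stratum of X, fit on all cohorts pooled."""
      strata = pd.qcut(df[x_col], q=n_strata, duplicates="drop")
      y_pred = df.groupby(strata, observed=True)[y_col].transform("mean")
      return df[y_col] - y_pred + y_pred.mean()  # Y_residual + avg prediction

  # Hypothetical per-guv usage:
  rng = np.random.default_rng(4)
  df = pd.DataFrame({"pre_sessions": rng.poisson(10, 20_000).astype(float)})
  df["sessions"] = df["pre_sessions"] + rng.normal(0, 3, 20_000)
  df["sessions_cv"] = stratification_cuped(df, "sessions", "pre_sessions")
  print(df["sessions_cv"].std(), df["sessions"].std())  # reduced std dev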


Next steps


Currently, the macro in bunsen_metrics implements the stratification method, and the SEO sessions metrics and the guv-based ad click inventory metrics use it.

The macro is quite flexible. Any other method of prediction can be added to it, provided that it can be implemented in SQL.

More metrics should start using the macro. It basically gets you a ~5-40% reduction in standard deviation, at no sacrifice in data integrity and only a minor increase in query complexity.

HOWEVER, there is currently an issue where previous experiment_runs of the same experiment interfere with the CUPED process: because cohorts are not re-randomized between experiment runs, “pre-experiment” values taken from a previous run carry current cohort information. We should discuss this.
