Ronny Kohavi
July 29, 2023
TL;DR: Surprising A/B tests that are published should be viewed with skepticism. Here I review a recent example where my Twyman’s Law meter spiked, and I share my observations to raise awareness of the trust checks readers should apply when reading or listening to the results of an A/B test.
Surprising A/B tests, where a small change resulted in a big change to key metrics, are the hallmark of books, conferences, courses, and articles about A/B testing. My book (https://experimentguide.com) and online class (https://bit.ly/ABClassRKLI) are no exception, as both start with such examples.
The question to ask is how much trust you should assign to these. Over the years, I developed a healthy skepticism towards extreme results. When these were presented at Microsoft and Airbnb, where I worked, I invoked Twyman’s Law (any figure that looks interesting or different is usually wrong) many times, and probably 9 out of 10 times we found an issue that invalidated the result. The surprising results I have shared were properly powered, and the extreme ones (like the opening example in the book) replicated several times. I previously called for raising the bar on shared A/B tests, so this is a continuation of that effort.
The example I share below comes from GuessTheTest, a site that publishes A/B tests regularly (about every other week). As expected from the need to publish regularly, the quality and trust-level vary. Over time, Deborah O’Malley has improved the evaluation and now regularly highlights trust issues with every experiment. She even posted criticism of extreme results herself. I like the overall site and I am a “Pro Member” because it provides positive value. I do hope that my criticism of this experiment is constructive and helps readers better evaluate the trust level of experiments that are published, and experiments that they run in their own organization.
The test was run by Optimizely, an A/B testing vendor, on its own site. The site uses “Get Started” as the Call to Action (CTA) in the upper-right of its pages, as shown below in Figure 1. Additional details are on GuessTheTest - Which CTA copy won?
The Treatment replaced that copy with “Watch a demo” on the Orchestrate product page, as shown below in Figure 2.
The test ran for 44 days with a 50%/50% design. 22,208 visitors saw the Control and 22,129 visitors saw the Treatment.
The Overall Evaluation Criterion (OEC) was clicks on the button, that is, click-through rate.
The results showed that Control had 0.91% click-through rate and the Treatment had 1.59% click-through rate, a 75% lift. [d][e][f]
Figure 1: Control with "Get started" in the upper-right
Figure 2: Treatment with "Watch a demo" in the upper-right
There are several things that were done well, but I have three concerns that undermine the trust I place in the result.
Plugging the 0.91% click-through rate and a relative MDE of 5% into the power calculator mentioned in the article yields a minimum sample size of about 688,000 per variation.
The test ran with about 22,000 users per variation, so the experiment is severely under-powered.
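As a rough check, here is a sketch of the standard normal-approximation sample-size formula for comparing two proportions; it lands around 700,000 per variation, in the same ballpark as the 688,000 quoted (different calculators use slightly different formulas):

```python
from scipy.stats import norm

# Baseline click-through rate and a 5% relative MDE (minimum detectable effect)
p1 = 0.0091
p2 = p1 * 1.05                       # 0.9555%
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)    # 1.96 for a two-sided test
z_beta = norm.ppf(power)             # 0.84 for 80% power

# Per-variation sample size for a two-proportion z-test
n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(f"~{n:,.0f} users per variation")   # ~700,000
```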
The GuessTheTest article computes post-hoc power using 74.7% as the MDE. This is noisy and misleading, as shown in A/B Testing Intuition Busters (Section 5). There is a great article, The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis by Hoenig and Heisey (2001) (official, PDF), which explains this point in detail. Basically, if you have a statistically significant result, the post-hoc power calculation will always show that you have at least 50% power. Since the p-value for this experiment was 0.02, it appears that there is enough power to detect a 74.7% change, but that is a catch-22 argument and it is incorrect. I strongly disagree with Deborah’s use of post-hoc power calculations and her conclusion that the sample size is sufficient.
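To see why the argument is circular, note that a result exactly at the significance boundary (p = 0.05) always has post-hoc power of 50% when the observed effect is plugged back in as the assumed true effect, and any smaller p-value pushes it higher. A sketch (normal approximation; posthoc_power is a hypothetical helper, not from the article or the spreadsheet):

```python
from scipy.stats import norm

def posthoc_power(p_value, alpha=0.05):
    """Post-hoc power when the 'assumed' true effect is the observed effect itself."""
    z_obs = norm.ppf(1 - p_value / 2)    # observed z-score from a two-sided p-value
    z_alpha = norm.ppf(1 - alpha / 2)
    # Probability of crossing the threshold if the true effect equals the observed one
    return norm.cdf(z_obs - z_alpha) + norm.cdf(-z_obs - z_alpha)

print(posthoc_power(0.05))   # ~0.50: a barely significant result always "has" 50% power
print(posthoc_power(0.02))   # ~0.64: any significant result shows more than 50% power
```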
In this case, the estimated power (spreadsheet) isn’t 50%, or anywhere close to 80%; it’s 7.3%!
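The power that matters is the power against a plausible effect size, such as the 5% relative lift used above. Here is a sketch of that calculation (normal approximation, assuming about 22,000 users per variation; the spreadsheet may differ slightly in the details):

```python
from scipy.stats import norm

p1, n = 0.0091, 22_000    # baseline CTR and approximate users per variation
p2 = p1 * 1.05            # assumed true effect: a 5% relative lift
alpha = 0.05

se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n) ** 0.5
z_effect = (p2 - p1) / se
z_alpha = norm.ppf(1 - alpha / 2)

power = norm.cdf(z_effect - z_alpha) + norm.cdf(-z_effect - z_alpha)
print(f"power = {power:.1%}")   # roughly 7-8%, in line with the 7.3% in the spreadsheet
```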
What happens when you have an under-powered experiment? The winner’s curse: if you do get a statistically significant result, the treatment effect estimate is likely to be highly exaggerated.
Here is the distribution of p-values you will get from running 10,000 experiments with a true 5% lift at the 7.3% power computed above (spreadsheet):
Figure 3: P-value distribution with 7.3% power
The distribution is close to uniform, which is the p-value distribution when there is no effect (when the Null is true). If you do get a statistically significant result, the average lift you will get is 23% (instead of 5%), which exaggerates the lift by a factor of 4.6.
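The exaggeration is easy to reproduce with a quick simulation (a sketch, assuming a 0.91% baseline, a true 5% relative lift, and roughly 22,000 users per variation; exact numbers will vary slightly from the spreadsheet):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
p1, lift, n, runs = 0.0091, 0.05, 22_000, 10_000
p2 = p1 * (1 + lift)

# Simulate 10,000 A/B tests and run a pooled two-proportion z-test on each
c = rng.binomial(n, p1, runs)            # control clicks
t = rng.binomial(n, p2, runs)            # treatment clicks
pooled = (c + t) / (2 * n)
se = np.sqrt(pooled * (1 - pooled) * 2 / n)
z = (t - c) / n / se
p_values = 2 * norm.sf(np.abs(z))

sig = p_values < 0.05
obs_lift = t / c - 1                     # observed relative lift in each experiment
print(f"stat-sig fraction: {sig.mean():.1%}")    # ~7-8%, i.e., the power
print(f"average observed lift among stat-sig positive results: "
      f"{obs_lift[sig & (t > c)].mean():.0%}")   # ~20-25%, versus the true 5%
```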
There is a high probability that this is simply a false positive result.
The article says: “…the wording ‘Get Started’ worked best globally, across all pages on the site. However, the team wondered if this same finding would hold true on specific product pages.”
This is the famous jelly bean example from xkcd, where testing 20 colors of jelly beans will likely lead to a stat-sig result at the 95% confidence level:
While the p-value claimed in the article is 0.02, we don’t know how many pages were evaluated and the text seems to imply that this wasn’t the only one.
When testing multiple hypotheses, a lower alpha threshold should be used. For example, if 10 independent pages were tested, a Bonferroni correction recommends an alpha threshold of 0.05/10 = 0.005 to determine statistical significance.
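Here is the arithmetic behind that recommendation (a sketch; the 10 pages are hypothetical, since the write-up does not say how many were evaluated):

```python
# Family-wise error rate when testing k independent pages with no real effect anywhere
k, alpha = 10, 0.05
fwer = 1 - (1 - alpha) ** k
print(f"chance of at least one false positive at alpha=0.05: {fwer:.0%}")   # ~40%

# Bonferroni keeps the family-wise error rate near 5% by tightening the per-test threshold
print(f"Bonferroni per-test alpha: {alpha / k}")   # 0.005
```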
Alex Deng suggested that to raise the trust level, we should know how many other product pages were evaluated and the distribution of lifts in these other versions.
The button is shown below the fold.
A similar design, where the button is at the bottom, is shown in narrow PC browser windows.
If the user doesn’t see the copy, any treatment effect is likely to be diluted, as the treatment effect for those users is zero.
[g][h][i][j]It isn’t clear from the description whether this was a PC-only test or whether triggering was employed to limit the analysis to users who actually saw the button.
The article claims that “Overall, this test appears to satisfy the criteria for a trustworthy test result and is an exemplary study.” I disagree and I provided three reasons for my skepticism.
The title of Section 3 of our paper A/B Testing Intuition Busters is “Surprising Results Require Strong Evidence – Lower p-values.” I’ve seen tens of thousands of experiments at Microsoft, Airbnb, and Amazon, and it is extremely rare to see any lift over 10% to a key metric. The metric here is local (click on a button), so larger lifts are possible, but the 75% lift in this example seems unexpectedly large. With the concerns above, I’m skeptical this is a trustworthy result. At a minimum, I would recommend a replication run or two, and I suspect this is simply a nice example of the winner’s curse.
Thanks to Alex Deng and Lukas Vermeer for early feedback.
[a]This document is available at https://bit.ly/WhenNotToTrustAPublishedAB
LinkedIn Post: https://www.linkedin.com/posts/ronnyk_abtesting-experimentguide-cta-activity-7091158457780752385-VjoM
Please comment here
[d]What’s interesting about this, which makes me think the result could be real, is that the level of commitment is much more aligned to where the customer most likely is on their buying journey. “Get started” is a high-commitment action to take, one that most people are unlikely to be ready for so far up in a funnel. “Watch a demo” requires no commitment and is still about discovery vs. action.
The level of mental commitment required in the action is profoundly different. This makes me think the result could be real.
But for all the reasons you suggested here—I’d like to see them replicate it. 😄
[e]Erin, GuessTheTest's description includes the following: "the wording 'Get Started' worked best globally, across all pages on the site."
Any explanation of why "watch a demo" should do better contradicts this statement. The claim in this test is that it worked well on the Orchestrate page, not globally.
[f]This is all just speculation, obviously. But I’d imagine that it could have something to do with it being a CMS product instead of the core testing tool product. There’s a certain level of confidence people need to have to click on “Get started” on anything that isn’t what they’re very well known for—testing. Lots of great CMS tools exist out there already. So it could take a harder sell to move to Optimizely for something that isn’t their core capability.
That said, I know you mentioned that the size of the button probably has little to do with it—but I think perhaps it could also play a bigger role than we’d imagine. Fitts’s Law (about target size) could be at play here, too. In order to understand what’s happening from the design side of things, I’d need to look at base and variant side-by-side, play with them a bit, and go through the “typical” flows one would go through to get here.
FWIW, I don’t think the hypothesis is that great. I disagree with “Get started” being vague as the write-up implied. Though the words are vague, it’s crystal clear in terms of commitment. These are my morning thoughts though. 😂 They could change with more investigation and most importantly—coffee. 😛
[g]James Pelham wrote:
Curious your thoughts on this:
Diluted effects are one problem. I'd also be concerned (depending on the design of the experiment, which isn't made clear) that there is some segment imbalance that would result in biased effect sizes (e.g., more users in treatment making it to this product page on a larger viewport).
[h]If you're referring to an SRM (Sample Ratio Mismatch) when you say segment imbalance, then that shouldn't be a problem. The button appears in the same spot for Control/Treatment, so assuming proper triggering, there wouldn't be an SRM.
[i]realized you might have meant comment on linked in so I was going to move there.
Agreed, if we assume proper triggering that shouldn't be a problem. I interpreted the below statement from the original post differently and assumed there might be potential for users in one arm being more likely to see the button by being on a larger viewpoint.
"During this time, over 44 thousand visitors saw the Orchestrate product page with either the original "Get started" CTA or the alternative "Watch a demo" CTA. Traffic was initially split 50/50."
Taking the narrower interpretation of "we analyzed users who visited the page" rather than "users who saw one of the two CTAs," it's possible there may have been fewer users actually exposed to the CTA in one variant.
I will concede this may be too narrow an interpretation generally speaking, but it nonetheless raised a red flag for me, as it's something I see go unchecked too often.
[j]James, to Deborah's credit, she did do an SRM test, and the p-value was 0.71 (see https://guessthetest.com/test/which-cta-copy-won-get-started-or-watch-a-demo/?referrer=Guessed), so it's very unlikely that there's a problem here.
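For reference, the SRM check can be reproduced from the visitor counts reported above (a sketch using a chi-squared goodness-of-fit test against the designed 50/50 split):

```python
from scipy.stats import chisquare

# Observed visitors per variation vs. the expected 50/50 split
observed = [22_208, 22_129]
expected = [sum(observed) / 2] * 2
stat, p = chisquare(observed, f_exp=expected)
print(f"SRM p-value: {p:.2f}")   # ~0.71 -- no evidence of a sample ratio mismatch
```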