3 ways to run experiments faster with valid stats

by Ryan Thomas

May 6, 2024


When you're testing on a lower-traffic site, and you go through the process of properly planning a test with a reasonable MDE, power, and significance level, you'll get hit with the stark reality of your situation: instead of being able to complete the test within a matter of weeks, you're going to need months or even years.

Cookie issues aside, it's going to be really difficult to get buy-in for your testing program if you tell the higher-ups that it'll take 6 months to make a decision on which version of the landing page to go with. Business moves pretty fast, and experimentation is supposed to support decision making, not slow it down to a crawl.

In this situation, a lot of practitioners are tempted to play fast and loose with the rules. They'll throw test planning out the window, use a Bayesian calculator (gasp!), and just run the test until they get a signal one way or the other. A slightly less reckless version of this is to reduce your confidence level to 80% or even lower, and/or test with MDEs so high that you need your variant to absolutely dominate in order to detect the lift. While a certain amount of accepting higher error rates is fine as long as everyone understands the risks, if you take this too far you'd be better off just consulting a magic 8 ball.

Luckily there are some options for reducing your runtimes without sacrificing statistical rigor. It all boils down to asking the right questions, and matching your statistical methodology to the context. There's no free lunch though, and each of these has some assumptions, tradeoffs, and risks, but we'll go into them here so you can decide whether any of these approaches fits your situation.

Non-inferiority tests

Normally when people talk about testing for non-inferiority rather than superiority, they mean they will ship the variant as long as it's not significantly worse, which implies that they are running a 2-tailed test. There are a couple of issues with this approach:

  1. A 2-tailed test takes much more sample size (and therefore runtime) to get the same precision as a 1-tailed test, and it's answering a non-directional question that nobody is interested in. The people who advocate for 2-tailed tests are actually using them as an indirect and confusing way to run two one-sided tests (TOST).

  2. Your minimum detectable effect (MDE) applies to the negative tail as well, so if your test is inconclusive, you could have a true negative effect that goes undetected. If your MDE is 5% and the test is planned for 80% power, then at a true difference of -5% you'd have a 20% chance of missing it, and that chance goes up as the true difference gets closer to zero.
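To put rough numbers on that second point, here's a minimal sketch of how the chance of missing a real negative effect grows as the effect shrinks. It assumes a 2-tailed test planned at alpha 0.05 and 80% power, uses a normal approximation, and ignores the negligible chance of a significant result in the wrong direction. The function name and parameters are made up for illustration.

```python
from statistics import NormalDist

def miss_probability(fraction_of_mde, alpha=0.05, power=0.80):
    """Chance that a 2-tailed test fails to flag a true effect.

    fraction_of_mde: size of the true effect relative to the planned MDE
    (1.0 means the true effect equals the MDE, 0.5 means half of it).
    Assumes the sample size was planned exactly for the given alpha and power.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # critical value for a 2-tailed test
    z_beta = nd.inv_cdf(power)           # quantile matching the planned power
    # At the planned sample size, a true effect equal to the MDE sits
    # (z_alpha + z_beta) standard errors away from zero.
    detect = nd.cdf(fraction_of_mde * (z_alpha + z_beta) - z_alpha)
    return 1 - detect

for fraction in (1.0, 0.75, 0.5, 0.25):
    print(f"true effect = {fraction:.2f} x MDE -> "
          f"miss chance {miss_probability(fraction):.0%}")
```

Under these assumptions, a true loss half the size of your MDE slips through far more often than it gets caught.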

This is where the true non-inferiority test comes in. It's a 1-tailed test, so you get the efficiency benefit of answering a directional question and zeroing in on the tail you are interested in, and you get to explicitly set a non-inferiority margin that you are comfortable with, which reduces your sample size needs quite drastically.

How drastic? In the article linked below, Georgi Georgiev works through an example with a 9% baseline conversion rate and a target MDE of 5%: running the test with a non-inferiority margin of 2% means you only need 8.35% of the sample size that you would need for an equivalent superiority test. Yes, you read that right: less than a tenth of the runtime. If the target MDE is lowered to 2%, you still only need 22% of the sample size of a superiority test.

The reason this massive efficiency gain is possible is that you are actually using a different null hypothesis. Instead of the null being that the variant is no better than control, which puts the comparison point at zero, you shift the comparison point to -2%. So by accepting the possibility that the variant is a little bit worse, you make it easier for the variant to "win".
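Here's a minimal sketch of what shifting the comparison point looks like in practice. It is not Georgiev's calculation, just a plain 1-tailed z-test on two proportions with an unpooled standard error, and it takes the margin as an absolute difference in conversion rate (a 2% relative margin on a 9% baseline is 0.0018). The function name and the traffic numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(conv_a, n_a, conv_b, n_b, margin=0.0):
    """p-value for the null 'variant rate - control rate <= -margin'.

    margin=0.0 gives a plain superiority test; margin > 0 gives a
    non-inferiority test. The margin is an absolute rate difference.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Unpooled standard error of the difference (a simplifying assumption).
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    # Moving the null from 0 to -margin adds the margin to the numerator.
    z = (rate_b - rate_a + margin) / se
    return 1 - NormalDist().cdf(z)

# Hypothetical data: both arms convert at exactly 9%.
p_superiority = one_sided_p_value(18_000, 200_000, 18_000, 200_000)
p_non_inferiority = one_sided_p_value(18_000, 200_000, 18_000, 200_000,
                                      margin=0.0018)
print(f"superiority p-value:     {p_superiority:.3f}")
print(f"non-inferiority p-value: {p_non_inferiority:.3f}")
```

With identical observed rates the superiority test has nothing to go on, while the non-inferiority test can still conclude that the variant is not worse than control by more than the margin.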

So what's the catch? Well, it's that even if you get a significant result, the variant might perform a bit worse than control. And if you run a series of non-inferiority tests one after another, each time implementing the "winning" variant and using that as the new baseline, you run the risk of what's called "cascading losses", where a 2% loss each time starts to compound into a situation where you're moving backwards quite substantially.

Luckily there are ways to mitigate this risk. Because of the massive efficiency gain, you can run a series of non-inferiority tests, and then group those changes together into a single superiority test to confirm the results. But really the best answer is to make sure that your testing protocol is a match for the context of the test. Is it a change that's quick and cheap to implement, with other business factors pushing for it to be done? Then a non-inferiority test might be a good choice.

Further reading: https://blog.analytics-toolkit.com/2017/case-non-inferiority-designs-ab-testing/

Sequential tests

This is another one that people often get wrong based on the name. "Sequential" makes people think about the sort of "test" where you look at the baseline performance, launch a change, and then see how the performance has or hasn't improved. This is often called "before / after testing" or time series analysis, and it doesn't have anything to do with sequential testing. It also introduces a whole bunch of validity threats since you don't get the benefit of randomization.

As you may know, a typical AB test where you plan your sample size in advance and keep the test running until you hit that sample size is called a "fixed horizon" test. In that setting there's a concept known as "peeking", which is where you look at the data before the test is concluded and decide whether to stop or keep the test running based on what you see. This may seem harmless, but what you're effectively doing is running two tests one after another (or more than two depending on how many times you peek), which will increase the chances of seeing a false positive result if you don't correct for the inflated error. Making that correction is the purpose of a true sequential test. You plan in advance how many times you want to check the data and possibly make a decision before the test has hit its total sample size, and the stats are adjusted to compensate for the extra error that is introduced by the peeking.
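If you want to see the peeking problem for yourself, here's a minimal simulation sketch. It assumes there is no true difference between the variants, that the looks are evenly spaced, and that the z-statistic at each look can be approximated by a running sum of standard normal increments. It applies the same uncorrected 1-tailed threshold at every look, which is exactly what a sequential method would not do. The function name and defaults are made up for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(looks, alpha=0.05, simulations=20_000, seed=1):
    """Share of A/A tests declared a 'win' when peeking `looks` times.

    Each simulated test stops at the first look where the uncorrected
    1-tailed z-test is significant at `alpha`.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for look in range(1, looks + 1):
            # One equal-sized batch of data arrives between looks.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(look)
            if z > z_crit:
                false_positives += 1
                break  # the peeker stops the test and ships the "winner"
    return false_positives / simulations

for looks in (1, 2, 5, 10):
    print(f"{looks:>2} look(s): false positive rate "
          f"{false_positive_rate(looks):.3f}")
```

With a single look the rate stays near the nominal 5%, and it climbs with every extra uncorrected peek.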

There are a few flavours of sequential testing, but the most popular one in the CRO community is the AGILE method developed by Georgi Georgiev. This method appears in his Analytics Toolkit A/B test statistics platform, the ABsmartly experimentation platform, and a slightly modified version is used in Forward Digital's sequential planning & analysis tool. If you prefer not to use a dedicated tool for this, there's a simplified method where you add around 10% to your sample size and then check a table for your upper and lower p-value boundaries, described in this CXL article on peeking and sequential testing by Merritt Aho.

The benefit of sequential testing is pretty clear: on average, you can run tests 20-80% faster than fixed horizon tests. This is a pretty huge improvement in test duration. The downside is that in order to compensate for the optional stopping, you need to increase the final sample size to maintain the same level of statistical power. In practice what this means is that tests with smaller effects will take a bit longer to run. But according to Georgi Georgiev's simulations, this would account for less than 1% of all experiments.

Further reading:



Sneaky bandit tests

This is one you probably haven't heard of at all, and that's because I just made up the name myself. This method is from Matt Gershoff at Conductrics, and he called it "go for it", "testing without p-values", or the "easy method", but none of those has quite the same ring to it as "sneaky bandit". I'm not 100% settled on that name though, so feel free to contribute your suggestions.

The general idea is that you set your alpha to 0.5 instead of the typical 0.05, and your power level to 95% instead of 80% in order to calculate the sample size, and then once the test is done, you simply pick the variant with the highest conversion rate. This might seem a bit... insane, since on the surface it would seem to result in a 50% false positive rate, but as with everything else in AB statistics, it all depends on what question you are asking and what risk you are comfortable with.

There's a pretty specific scenario where you might consider using this approach: either there is no "control", or you are already committed to abandoning it and just want to choose between two completely new options. Gershoff uses the example of choosing a headline for a new article, but this could also apply to a situation where the HiPPO has decided that they absolutely need to redesign a certain page. Maybe you tried as hard as you could to talk them out of it in favour of testing a new version against the original, but political considerations won out and the compromise you were able to reach was to at least test two new versions of the page.

In that scenario (well actually in all AB testing scenarios) there are two possibilities: there is a real difference between A and B, or there isn't. Remember that observed difference and true difference are, well, different. This is the whole reason we have things like MDE and error rates in the first place. The observed effect gives us evidence of what the true effect might be, but we never really know what the true effect is.

So practically speaking there are two "no difference" possibilities for this method: no true difference, and no observed difference. If there is no true difference, then it doesn't actually matter which one you pick, so you can just let the observed difference guide your decision. But if there's no observed difference, then you can flip a coin, ask that magic 8 ball, read some tea leaves, phone a friend, you get the idea.

If there is a true difference, then this method gives you a very good chance at figuring out which version is best just by picking the one with the highest conversion rate. In fact, the power level (in this case 95%) becomes the % chance that you will pick the better performing variant if there is a true effect equivalent to the MDE.
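Here's a minimal simulation sketch of that claim. The numbers are illustrative: a 4% conversion rate for A, 5% for B, and 2,324 users per arm, which is roughly what an alpha of 0.5 and 95% power call for under a normal approximation. Ties are broken by a coin flip, and the function name is made up.

```python
import random

def chance_of_picking_better(rate_a, rate_b, n_per_arm,
                             simulations=1000, seed=7):
    """Estimate how often 'pick the higher observed rate' selects B,
    assuming B truly converts better than A."""
    rng = random.Random(seed)
    correct = 0.0
    for _ in range(simulations):
        conv_a = sum(rng.random() < rate_a for _ in range(n_per_arm))
        conv_b = sum(rng.random() < rate_b for _ in range(n_per_arm))
        if conv_b > conv_a:
            correct += 1.0
        elif conv_b == conv_a:
            correct += 0.5  # a tie is settled by a coin flip
    return correct / simulations

estimate = chance_of_picking_better(0.04, 0.05, 2324)
print(f"picked the better variant in about {estimate:.0%} of simulated tests")
```

The estimate should land close to the 95% power level, with a bit of simulation noise.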

So what's the benefit of the sneaky bandit? In Gershoff's example, with a baseline conversion rate of 4% and a target MDE of 25%, you would need a total sample size of 4,648 users. If you were to set this up as a more typical AB test with a confidence level of 95%, you'd need almost 4x as many users (18,598 to be exact). So that means you can run this type of test in a quarter the amount of time as a typical AB test.
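For the curious, here's a minimal sketch of the sample size arithmetic. It uses the textbook normal-approximation formula for comparing two proportions with a 1-tailed test and unpooled variances, which is an assumption on my part rather than Gershoff's exact calculator, and the function name is made up. The totals it produces land within a few users of the figures quoted above; the small gap comes down to rounding and formula variants.

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(baseline, relative_mde, alpha, power):
    """Users needed per arm for a 1-tailed test of two proportions
    (normal approximation, unpooled variances)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)  # equals 0 when alpha is 0.5
    z_beta = nd.inv_cdf(power)
    variant = baseline * (1 + relative_mde)
    variance_sum = baseline * (1 - baseline) + variant * (1 - variant)
    difference = variant - baseline
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / difference ** 2)

# Sneaky bandit settings: alpha 0.5, power 95%.
sneaky_total = 2 * users_per_arm(0.04, 0.25, alpha=0.5, power=0.95)
# More typical settings: alpha 0.05 (95% confidence), power kept at 95%.
typical_total = 2 * users_per_arm(0.04, 0.25, alpha=0.05, power=0.95)

print(f"sneaky bandit total users: {sneaky_total:,}")
print(f"typical test total users:  {typical_total:,}")
print(f"ratio: {typical_total / sneaky_total:.2f}x")
```

Setting alpha to 0.5 zeroes out the significance term in the formula, so only the power term is left, and that is where the roughly 4x saving comes from.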

Oh and if you're wondering why I called it sneaky bandit, it's because this approach is somewhat equivalent to an epsilon-first bandit algorithm.

Further reading: https://blog.conductrics.com/do-no-harm-or-ab-testing-without-p-values/


Now when you hear people say that there's no point in doing AB testing until you have a quarter million visitors per month, you know they just aren't being very creative about what questions they want the statistics to answer. By matching your approach to your context and being aware of the various assumptions and risk tradeoffs, you can drastically reduce the runtime of your experiments and make decisions faster without throwing stats out the window.