If you have plenty of data coming in (lucky you) but are still not sure whether you can trust the difference your test results show, this back-of-the-envelope calculation may help. It is especially useful if you don't have access to the individual data points and don't know the underlying distribution or the variance (which e.g. Google Analytics doesn't report).

If we measure something in a system and have a well-established mean value, and then we test a variant of the system and have \(n\) measurements from the variant, their mean can differ from the established mean merely because the measurements are random by nature. That is, it is possible that the variant system performs in exactly the same way as the original system, and the difference in means we see is due to randomness. (This is the null hypothesis.) We use Chebyshev's Inequality to put an upper bound on the probability of seeing a difference this large if that is the case.

In particular, we assume the measurements are independent random variables \(X_i\) which all have the same expectation \(\mathrm{E}(X_i)=\mu\) and variance \(\mathrm{Var}(X_i)=\sigma^2\). We assume that the mean of the measurements from the original system provides the expectation (but see below for comparing two variants with each other). From Chebyshev's Inequality (http://math.mit.edu/~goemans/18310S15/chernoff-notes.pdf) \[\mathrm{Prob}\left(\left|\frac{\sum X_i}{n}-\mu\right|\geq\epsilon\right)\leq\frac{\sigma^2}{n\epsilon^2},\] which is an upper bound on the probability that the mean of the \(n\) measurements differs from \(\mu\) by at least \(\epsilon\).

If \(r\) is the range of the data (the difference between the maximum and minimum values), we know \(\sigma^2\leq \frac{r^2}{4}\) (Popoviciu's inequality). If we also express the difference of the means as a proportion of the range (\(\epsilon=rd\)), we get \[\mathrm{Prob}\leq\frac{r^2}{4nr^2d^2}=\frac{1}{4nd^2}.\]

As usual, we may consider the difference significant and reject the null hypothesis if this probability is at most 5%, that is, \[\mathrm{Prob}\leq\frac{1}{4nd^2}\leq\frac{1}{20},\] from which \[n\geq\frac{5}{d^2}.\]
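As a rough illustration of the bound \(\frac{1}{4nd^2}\), the sketch below turns an observed difference in raw units into \(d\) and evaluates the bound. The function names and the sample numbers are hypothetical, and the range \(r\) is assumed to be known in advance (for example, 1 for a 0/1 measurement).

```python
def relative_difference(observed_mean, established_mean, value_range):
    """Difference of the means as a proportion d of the data range r."""
    if value_range <= 0:
        raise ValueError("value_range must be positive")
    return abs(observed_mean - established_mean) / value_range


def chebyshev_bound(n, d):
    """Upper bound 1 / (4 n d^2) on the probability under the null hypothesis.

    Capped at 1, since a probability cannot exceed 1.
    """
    if n <= 0 or d <= 0:
        return 1.0
    return min(1.0, 1.0 / (4 * n * d * d))


# Hypothetical example: a 0/1 measurement (range 1), established mean 0.20,
# and 2,000 measurements from the variant with mean 0.25.
d = relative_difference(0.25, 0.20, 1.0)
print(f"d = {d:.3f}, bound = {chebyshev_bound(2000, d):.3f}")
```

With these made-up numbers \(d\) is 5% and the bound comes out at about 5%, which is right at the usual significance threshold.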

In terms of concrete examples, this means:

\(d\) of this or more is significant with a confidence of 95% | if \(n\) is at least |
---|---|
5% | 2,000 |
1% | 50,000 |
0.5% | 200,000 |
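The table entries can be recomputed from \(n\geq\frac{5}{d^2}\). The sketch below does this for an arbitrary significance level \(\alpha\), solving \(\frac{1}{4nd^2}\leq\alpha\) for \(n\); the function name `required_n` and the `alpha` parameter are assumptions introduced here for illustration, with \(\alpha=0.05\) matching the 5% threshold used above. Exact fractions are used so that rounding up is not thrown off by floating-point error.

```python
import math
from fractions import Fraction


def required_n(d, alpha="0.05"):
    """Smallest n with 1 / (4 n d^2) <= alpha.

    d and alpha are parsed via Fraction(str(...)) so the arithmetic is exact.
    """
    d = Fraction(str(d))
    alpha = Fraction(str(alpha))
    if d <= 0 or not 0 < alpha <= 1:
        raise ValueError("need d > 0 and 0 < alpha <= 1")
    return math.ceil(1 / (4 * alpha * d**2))


# d is the difference of the means as a proportion of the range.
for d in ("0.05", "0.01", "0.005"):
    print(f"d = {d}: n >= {required_n(d):,}")
```

Run as written, this reproduces the three thresholds in the table (2,000, 50,000 and 200,000).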

If, instead of comparing one variant to the original system (which we assumed to provide the underlying distribution and expectation), we compare two variants in an A/B test, the probability we calculate is that of seeing a given difference between the means of the measurements if the measurements on the two sides have the same expectation, that is, if the two variants do not perform differently in reality. This is the probability that the mean on the A side, the mean on the B side, or both are sufficiently far from the common expectation.

We bound this probability by the sum of the two individual probabilities, which is never smaller. As \(d\) is the difference between the expectation and the mean, we use \(d=d_{AB}/2\), where \(d_{AB}\) is the difference seen between the means on the A and the B side (again as a proportion of the range). (We do this because in the worst case the common expectation is halfway between the two means, which allows both means to be as close to it as possible.) Assuming we have \(n\) measurements on each side, we get \[\mathrm{Prob}_{AB}\leq 2\cdot\frac{1}{4n\left(\frac{d_{AB}}{2}\right)^2} = \frac{2}{nd_{AB}^2}.\] For this probability to be at most 5%, we need \[\mathrm{Prob}_{AB}\leq\frac{2}{nd_{AB}^2}\leq\frac{1}{20},\] from which \[n\geq\frac{40}{d_{AB}^2}.\] In terms of some concrete numbers:

\(d_{AB}\) of this or more is significant with a confidence of 95% | if \(n\) is at least |
---|---|
5% | 16,000 |
1% | 400,000 |
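The A/B version of the calculation can be sketched the same way, using \(n\geq\frac{40}{d_{AB}^2}\) per side. The function names, the `alpha` parameter and the conversion-rate example are hypothetical and only meant to show how the formula might be applied; the range of a 0/1 conversion flag is taken to be 1.

```python
import math
from fractions import Fraction


def ab_bound(n, d_ab):
    """Upper bound 2 / (n d_AB^2) on the null-hypothesis probability, capped at 1."""
    if n <= 0 or d_ab <= 0:
        return 1.0
    return min(1.0, 2.0 / (n * d_ab * d_ab))


def required_n_per_side(d_ab, alpha="0.05"):
    """Smallest n per side with 2 / (n d_AB^2) <= alpha (exact arithmetic)."""
    d_ab = Fraction(str(d_ab))
    alpha = Fraction(str(alpha))
    if d_ab <= 0 or not 0 < alpha <= 1:
        raise ValueError("need d_ab > 0 and 0 < alpha <= 1")
    return math.ceil(2 / (alpha * d_ab**2))


# Hypothetical A/B test on a 0/1 conversion flag (range 1):
# conversion rates of 12% versus 13% give d_AB = 0.01.
print(required_n_per_side("0.01"))   # measurements needed on each side
print(ab_bound(400_000, 0.01))       # bound at that sample size
```

For a one-percentage-point difference in conversion rate, this gives the 400,000 measurements per side listed in the table, at which point the bound is about 5%.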