How to determine the sample size in AB Testing?

Yanlin Chen
4 min read · Sep 21, 2020

An important step before starting an AB test is determining the sample size of the experiment. This is part of the experiment design, and it comes after choosing the right performance metrics.

Is a larger Sample Size always better?

We all know that a sample that is too small is not representative. On the other hand, the larger the sample, the more users the experiment affects. Exploratory experiments may have negative effects, so exposing all users to them is clearly inappropriate. Moreover, endlessly increasing the sample size wastes traffic and resources.

It is more sensible to choose an appropriate sample size according to the needs of the experiment. So how do we determine the required sample size?

Sample Size also determines the Test Period

Imagine that your website is going to perform AB testing.

Q: The website has 100,000 unique visitors every day, and we take only 10% (10,000) of them into the experiment per day. We measure an estimated click-through probability of 10%, which is the performance metric for this test. How sure would we be about this estimate? In other words, if we repeated the measurement on a different 10,000 visitors, would they behave the same way? If 10,000 visitors are not enough, we can extend the testing period from 1 day to N days, so that the sample size grows from 10,000 to 10,000*N. But how do we determine N?
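
One way to put a number on "how sure" is the margin of error of the estimated proportion. The sketch below assumes a normal approximation for a binomial proportion and a 95% confidence level; the function name `margin_of_error` is hypothetical and only for illustration.

```python
import math
from statistics import NormalDist


def margin_of_error(p_hat, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# 10,000 visitors per day, observed click-through probability of 10%.
for days in (1, 2, 4, 7):
    n = 10_000 * days
    moe = margin_of_error(0.10, n)
    print(f"{days} day(s), n={n}: 0.10 +/- {moe:.4f}")
```

With one day of traffic the interval is roughly 0.10 ± 0.006. Running the test four times as long only halves the margin, because it shrinks with the square root of the sample size.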

Statistics behind

The required sample size depends on the following quantities. Making the test stricter (a lower significance level, a higher power) or the effect harder to detect (a smaller difference between the groups, a larger standard deviation) increases the required sample size, and vice versa.

  • Significance level (also denoted as alpha or α): a measure of how strong the evidence in your sample must be before you reject the null hypothesis and conclude that the effect is statistically significant. It governs the chance of a false positive (Type I error). The lower the significance level, the larger the sample size needed. In most cases, set α = 0.05.
  • Power (1-β): the probability that a hypothesis test finds an effect when there is an effect to be found. β is the probability of a false negative (Type II error), so power is the probability of avoiding one. The higher the power, the larger the sample size needed. A common choice is power = 0.8.
  • Mean Discrepancy (μA-μB): if the mean values of the two groups differ widely, we don’t need a very large sample size to achieve statistical significance.
  • Standard Deviation (σ): the smaller the standard deviation, the more stable the observed difference between the two groups, and the easier it is to obtain statistically significant results.

Here’s the formula to represent the relationship:
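
A commonly used form of this relationship for a two-sided, two-sample z-test is sketched here. It assumes both groups share the same standard deviation σ, and that z_{1-α/2} and z_{1-β} are the standard-normal quantiles for the significance level and the power. Treat it as an assumed textbook form, not necessarily the exact expression intended here.

```latex
n_A = K\,n_B, \qquad
n_B = \left(1 + \frac{1}{K}\right)
      \left(\sigma\,\frac{z_{1-\alpha/2} + z_{1-\beta}}{\mu_A - \mu_B}\right)^{2}
```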

  • nA, nB: the sample size of Group A and B
  • K: nA/nB (In most cases, the size of Group A and B should be equal. K = 1)
  • Z-score: the standard-normal quantile that corresponds to a given probability. It can be converted directly from a p-value (or from α and β).
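
The effect of each quantity on the sample size can be checked numerically. This is a minimal sketch that assumes the standard two-sided, equal-variance z-test relationship n_B = (1 + 1/K) * (σ * (z_{1-α/2} + z_{1-β}) / (μA-μB))², with n_A = K * n_B. The function name `sample_sizes` and the input numbers are hypothetical.

```python
import math
from statistics import NormalDist


def sample_sizes(mean_diff, sigma, alpha=0.05, power=0.8, k=1.0):
    """Return (n_a, n_b) for a two-sided two-sample z-test, with k = n_a / n_b."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # quantile for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the power (1 - beta)
    n_b = (1 + 1 / k) * (sigma * (z_alpha + z_beta) / mean_diff) ** 2
    return math.ceil(k * n_b), math.ceil(n_b)


# Hypothetical baseline: difference of 0.1 between the means, sigma of 1.0.
print("baseline           :", sample_sizes(mean_diff=0.1, sigma=1.0))
print("lower alpha (0.01) :", sample_sizes(0.1, 1.0, alpha=0.01))
print("higher power (0.9) :", sample_sizes(0.1, 1.0, power=0.9))
print("larger difference  :", sample_sizes(0.2, 1.0))
print("smaller sigma      :", sample_sizes(0.1, 0.5))
```

A lower α or a higher power pushes the required size up, while a larger difference between the means or a smaller σ pulls it down, matching the descriptions above.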

statsmodels

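statsmodels is a Python statistics library. Its NormalIndPower class performs power calculations for a z-test on two independent samples.

solve_power parameters (exactly one of effect_size, nobs1, alpha and power is passed as None, and solve_power solves for that one):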

  • effect_size: standardized effect size, the difference between the two means divided by the standard deviation, |μA-μB|/σ. If ratio=0, this is the standardized mean in the one-sample test.
  • nobs1: number of observations in sample 1. The number of observations in sample 2 is ratio times the size of sample 1, i.e. nobs2 = nobs1 * ratio. ratio can be set to zero to get the power for a one-sample test.
  • alpha: significance level, use 0.05 in most cases.
  • power: use 0.8 in most cases.
  • ratio: the ratio of the number of observations in sample 2 relative to sample 1. The default is 1.
  • alternative: [str, ‘two-sided’ (default), ‘larger’, ‘smaller’], whether the power is calculated for a two-sided (default) or one-sided test.

Here’s a simple example:

The current click-through rate (CTR) is 0.3, and we want to detect a 10% relative increase, to 0.33. The test group and the control group have the same sample size.
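
A minimal sketch of this calculation with NormalIndPower follows. The standard deviation is an assumption: it is taken from the baseline CTR as sqrt(p * (1 - p)), which is one common choice for a click-through metric and not the only one. The variable names are illustrative.

```python
import math

from statsmodels.stats.power import NormalIndPower

baseline_ctr = 0.30
target_ctr = 0.33

# Assumption: treat each click as a Bernoulli trial and use the baseline
# standard deviation sqrt(p * (1 - p)) to standardize the difference.
sigma = math.sqrt(baseline_ctr * (1 - baseline_ctr))
effect_size = (target_ctr - baseline_ctr) / sigma

# nobs1=None asks solve_power to solve for the size of group 1;
# ratio=1.0 means group 2 has the same size.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    nobs1=None,
    alpha=0.05,
    power=0.8,
    ratio=1.0,
    alternative="two-sided",
)
print(math.ceil(n_per_group))
```

Under these assumptions each group needs a few thousand users. statsmodels also provides proportion_effectsize in statsmodels.stats.proportion, which standardizes the difference with an arcsine transform and so gives a slightly different result.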

There are also many online calculators that can do this if you don’t want to write any code.

For example, suppose your current conversion rate, click-through rate, or subscription rate is 10%, and you want to detect an increase from 10% to 11% (the minimum detectable effect). With significance level = 5% and power = 80%, the result is 14751. This means that in this test, Group A and Group B should each contain at least 14751 users.
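
The calculator’s exact method is not stated here, but one common two-proportion formula, which uses the pooled variance under the null hypothesis, reproduces this number. The sketch below is written under that assumption; the function name is hypothetical. The last step reuses the 10,000 visitors per day from the earlier example and assumes they are split evenly between the two groups.

```python
import math
from statistics import NormalDist


def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2  # pooled proportion under the null hypothesis
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


n = n_per_group_two_proportions(0.10, 0.11)
print(n)  # 14751

# Both groups together need 2 * n users; at 10,000 users per day:
days = math.ceil(2 * n / 10_000)
print(days)  # 3
```

So, under these assumptions, the earlier question about N has a concrete answer: the test would need to run for about 3 days.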

Yanlin Chen

Data Analyst in Fintech: Data Science, Analytics, data visualization specialist. #Python #NLP #Hadoop #AWS