A/B Test review
Some notes from my review of A/B testing, plus some Bayesian A/B test articles for reference.
Here, A/B test refers specifically to experiments with binomial outcomes, such as conversion rate. For continuous outcomes, ANOVA does similar work.
Which test should be used, t-test or z-test?
Basically the t-test fits real-world problems better, since in many cases the population variance is unknown; however, if the sample size is large enough (e.g. larger than 50), the two tests give nearly identical results, as the sketch below shows.
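A quick way to see this is to run both tests on the same simulated data. A minimal sketch (not from the original notes; the 13%/15% rates and sample sizes are made up for illustration):

```python
# Minimal sketch: run both tests on the same simulated data to see how
# close they get at large n. Rates and sample sizes are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.binomial(1, 0.13, size=2000)  # 13% conversion rate
treat = rng.binomial(1, 0.15, size=2000)    # 15% conversion rate

# Welch t-test on the raw 0/1 outcomes
t_stat, t_p = stats.ttest_ind(control, treat, equal_var=False)

# Two-proportion z-test computed by hand with a pooled variance
p_pool = (control.sum() + treat.sum()) / (len(control) + len(treat))
se = np.sqrt(p_pool * (1 - p_pool) * (1 / len(control) + 1 / len(treat)))
z_stat = (treat.mean() - control.mean()) / se
z_p = 2 * stats.norm.sf(abs(z_stat))

print(f"t-test p = {t_p:.4f}, z-test p = {z_p:.4f}")  # nearly identical
```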
Quick view of A/B test
A/B test is an experiment that:
- Tests two or more variants against each other
- To evaluate which one performs best
- In the context of a randomized experiment
How to run a test:
- Choose the variant to test, and split subjects into control and treatment groups.
- Decide the target value (e.g. conversion rate improved from 0.13 → 0.15)
- Parameter setting
- alpha
- beta
- sample size
- Collect data
- Run a t-test / z-test and calculate the p-value / confidence interval to see whether the improvement is significant (see the sketch after this list)
- If p-value > alpha, there is no significant difference between the control and treatment groups
- Check whether the confidence interval covers the expected target
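A minimal sketch of this evaluation step, assuming a finished trial; the conversion counts (310/2000 in treatment vs. 260/2000 in control) are made up for illustration:

```python
# Minimal sketch of the evaluation step on made-up counts.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

successes = np.array([310, 260])  # conversions in treat, control
nobs = np.array([2000, 2000])     # subjects per group

z_stat, p_value = proportions_ztest(successes, nobs)
print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}")

# 95% Wald confidence interval for the difference in conversion rates
p_t, p_c = successes / nobs
diff = p_t - p_c
se = np.sqrt(p_t * (1 - p_t) / nobs[0] + p_c * (1 - p_c) / nobs[1])
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"diff = {diff:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
# p-value > alpha -> no significant difference; otherwise check whether
# the CI covers the expected improvement target.
```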
General parameter settings:
- alpha: false positive rate, typically 5% or 10%
- beta: false negative rate, typically 20%
- power: 1 - beta; if beta = 20%, then 80% of true positives would be detected
- sensitivity (improvement rate): the expected difference between the control and treatment groups
- sample size: needs to be calculated; given a target beta, beta decreases as the sample size increases (see the sketch below)
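A minimal sketch of the sample-size calculation, using the standard normal-approximation formula for two proportions; the 0.13 → 0.15 target comes from the example above, and alpha/beta use the typical values listed:

```python
# Minimal sketch: per-group sample size for a two-proportion test via the
# normal approximation. Target 0.13 -> 0.15; alpha = 0.05, beta = 0.20.
from scipy import stats

def sample_size_per_group(p1, p2, alpha=0.05, beta=0.20):
    """Approximate subjects per group to detect p1 -> p2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = stats.norm.ppf(1 - beta)        # power = 1 - beta
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

print(round(sample_size_per_group(0.13, 0.15)))  # roughly 4,700 per group
```

With these settings the formula gives roughly 4,700 subjects per group, which is why detecting small improvements makes trials long and expensive.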
A Few Things to Know About A/B Tests
- Do not stop the test before the sample size reaches the target; otherwise the statistical power is not sufficient for statistical inference. In other words, do not use the p-value as a significance measure during the experiment; the p-value should only be calculated after the trial ends.
- Design the parameters in advance, trading off the improvement percentage you want to detect against the time/cost of the trial.
Why use Bayesian A/B Test?
- The "slightly better" model problem
- What if the p-value equals 0.11? The variant seems different from the original, but we have no idea how much better it is.
- There is no p-value here; it is replaced by the expected loss.
- Since there is no p-value, the experiment can be "peeked" at any time, and if the expected loss is less than epsilon (user-defined, say we want at least a 5% improvement), the test can be terminated more quickly (see the sketch below).
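A minimal sketch of the expected-loss rule, assuming Beta posteriors for each group's conversion rate (the updating rule is covered in the next section); the counts and epsilon below are made up for illustration:

```python
# Minimal sketch of the expected-loss stopping rule on made-up counts.
import numpy as np

rng = np.random.default_rng(0)

# Posterior samples for each group's conversion rate (uniform Beta(1, 1) prior)
p_a = rng.beta(1 + 260, 1 + 1740, size=100_000)  # control: 260/2000 converted
p_b = rng.beta(1 + 310, 1 + 1690, size=100_000)  # treat:   310/2000 converted

prob_b_better = (p_b > p_a).mean()
# Expected loss of shipping B: conversion rate given up if A is actually better
expected_loss_b = np.maximum(p_a - p_b, 0).mean()

print(f"P(B > A) = {prob_b_better:.3f}")
print(f"expected loss of shipping B = {expected_loss_b:.5f}")

epsilon = 0.001  # user-defined tolerated loss in absolute conversion rate
if expected_loss_b < epsilon:
    print("expected loss below epsilon: stop the test and ship B")
```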
Basic Idea of Beta Distribution
- The posterior distribution can be evaluated with a Beta distribution.
- A useful feature of binary-outcome experiments is that each group's true probability can be captured by a Beta distribution at any step:
- P(A) ~ Beta(α_A, β_A)
- P(B) ~ Beta(α_B, β_B)
- Ex. if the control group first collects two points, 1 success and 1 failure, then P(A) ~ Beta(1, 1). After several more tests adding 40 successes and 38 failures, P(A) ~ Beta(41, 39) (see the sketch below).
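A minimal sketch of this updating rule, using scipy to summarize the resulting posterior:

```python
# Minimal sketch of the updating rule above: alpha counts successes and
# beta counts failures, so Beta(1, 1) after the first two points grows to
# Beta(41, 39) after 40 more successes and 38 more failures.
from scipy import stats

alpha, beta = 1, 1                    # 1 success, 1 failure so far
alpha, beta = alpha + 40, beta + 38   # add the later observations
posterior = stats.beta(alpha, beta)   # Beta(41, 39)

print(f"posterior mean = {posterior.mean():.3f}")          # ~0.513
low, high = posterior.interval(0.95)                       # credible interval
print(f"95% credible interval = [{low:.3f}, {high:.3f}]")
```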
[3] The power of Bayesian A/B test
[4] A/B testing – Is there a better way? An exploration of multi-armed bandits