Umbrellas often appear just before it pours, but banning them will not stop the rain; it will just make everyone wetter
What we want to know (Rubin Causal Model)
The fundamental problem of
causal inference
We cannot sample the entire population
We cannot expose the same unit to both treatments at the same time
We cannot directly observe underlying probabilities
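To make this concrete, here is a minimal Python sketch (mine, not from the slides) of Rubin's potential-outcomes framing: every unit has an outcome under treatment and an outcome under control, but any single assignment reveals only one of the two, so unit-level causal effects can never be computed directly. All rates and variable names are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5

    # Hypothetical potential outcomes; in reality we never get to see both columns.
    y0 = rng.binomial(1, 0.30, size=n)  # outcome if shown control (assumed rate 0.30)
    y1 = rng.binomial(1, 0.45, size=n)  # outcome if shown treatment (assumed rate 0.45)

    assignment = rng.binomial(1, 0.5, size=n)     # randomized: 1 = treatment, 0 = control
    observed = np.where(assignment == 1, y1, y0)  # only one potential outcome is revealed

    for i in range(n):
        arm = "T" if assignment[i] else "C"
        print(f"unit {i}: y0={y0[i]} y1={y1[i]} assigned={arm} observed={observed[i]}")

    # The unit-level effect y1 - y0 exists on paper but cannot be recovered from
    # the observed column alone: that is the fundamental problem.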
What we can measure (Rubin Causal Model)
We will try to answer two
key questions
Is there any causal effect?
What is the size of the causal effect?
(If the size is non-zero, there is an effect.)
What randomization gives us (Rubin Causal Model)
Repeating the same experiment (Expectation)
Our answers will be correct
in expectation
If we show A and B to random samples of the population, then on average the fraction of yes in each group will equal its true underlying mean, and on average across replications of the experiment the difference between the means will equal the average treatment effect
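A small simulation of that claim, with parameters assumed purely for illustration (a binary yes/no outcome, true rates 0.30 under A and 0.35 under B, so the true average treatment effect is 0.05): any single experiment is off by a random amount, but the difference in sample means averaged across replications recovers the true effect.

    import numpy as np

    rng = np.random.default_rng(1)

    p_a, p_b = 0.30, 0.35    # assumed true "yes" rates
    true_ate = p_b - p_a     # 0.05
    n_per_group = 1_000
    replications = 10_000

    estimates = []
    for _ in range(replications):
        a = rng.binomial(1, p_a, size=n_per_group)  # random sample shown A
        b = rng.binomial(1, p_b, size=n_per_group)  # random sample shown B
        estimates.append(b.mean() - a.mean())       # estimated treatment effect

    print(f"true ATE:            {true_ate:.4f}")
    print(f"mean of estimates:   {np.mean(estimates):.4f}")  # ~0.05: unbiased in expectation
    print(f"spread of estimates: {np.std(estimates):.4f}")   # any single experiment is noisy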
Up, down or neither? (I don't know either)
Randomization ensures only three things can
explain a difference
Causation resulted in people behaving differently when treatment was applied
Pure chance resulted in a difference between the two groups unrelated to the treatment
Mistakes resulted in an unintended difference in results unrelated to the treatment
Is this die
Fair?
⚅⚅⚅
Not fair; I cheated.
How did you decide you were confident
that I was cheating?
You did not need to know what to expect from a loaded die. Instead, you simply rejected the idea that it was fair.
We want to reject the
null hypothesis
The null hypothesis assumes there is no treatment effect for any unit; any difference we observe is simply due to chance
If we could reasonably rule out mistakes and chance, we might reject the null and consider this to be evidence for an alternative
We compute a
p-value
Assuming there is no effect, the p-value is the probability of seeing a result at least as extreme as the one observed purely by chance.
How likely is this result assuming the null is true?
(The probability of rolling three sixes on a fair die is 1/216 ≈ 0.00463.)
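As a quick check of the number in parentheses (my own arithmetic, not part of the slides): under the null hypothesis that the die is fair the three rolls are independent, and no result is more extreme than three sixes, so the p-value is simply (1/6)³.

    from fractions import Fraction

    # P(three sixes in three independent rolls of a fair die) under the null
    p_value = Fraction(1, 6) ** 3
    print(p_value, float(p_value))  # 1/216 ≈ 0.00463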
Randomization inference (Rubin Causal Model)
When the null is false, p-values
tend to be small
When the null is true, p-values will be
uniformly distributed
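A sketch of randomization inference under assumptions of my own choosing (binary outcomes, 200 units per arm, a simple permutation test on the difference in means): re-run many experiments with no true effect and the p-values spread roughly evenly over (0, 1); add a real effect and they pile up near zero.

    import numpy as np

    rng = np.random.default_rng(2)

    def permutation_p_value(a, b, n_perm=1_000):
        """Two-sided p-value obtained by re-randomizing the group labels."""
        observed = abs(b.mean() - a.mean())
        pooled = np.concatenate([a, b])
        n_a = len(a)
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)
            if abs(pooled[n_a:].mean() - pooled[:n_a].mean()) >= observed:
                hits += 1
        return (hits + 1) / (n_perm + 1)

    def simulate_p_values(p_a, p_b, n=200, experiments=200):
        return [permutation_p_value(rng.binomial(1, p_a, n), rng.binomial(1, p_b, n))
                for _ in range(experiments)]

    null_ps = simulate_p_values(0.30, 0.30)  # null true: no difference in rates
    alt_ps  = simulate_p_values(0.30, 0.45)  # null false: a real effect

    # Counts per decile of the p-value scale: flat-ish under the null, skewed low otherwise.
    print("null:", np.histogram(null_ps, bins=10, range=(0, 1))[0])
    print("alt: ", np.histogram(alt_ps,  bins=10, range=(0, 1))[0])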
We need to pick a
threshold
One swallow does not a summer make, nor one fine day, but how many swallows do we count before we pack away our umbrellas?
Scientific standard for significance: p < 0.05
Results under our threshold are called
statistically significant
Two types of
errors
Type-I is the incorrect rejection of a true null hypothesis; we cried foul when there was none
Type-II is the failure to reject a false null hypothesis; we failed to detect a real effect
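To turn those two definitions into rates, here is a small simulation with assumed parameters (a two-proportion z-test, 1,000 units per arm, alpha = 0.05): when there is no true effect every rejection is a type-I error, and when there is a real effect every failure to reject is a type-II error.

    import math
    import numpy as np

    rng = np.random.default_rng(3)

    def two_proportion_p_value(a, b):
        """Two-sided z-test for a difference in proportions (normal approximation)."""
        p_pool = (a.sum() + b.sum()) / (len(a) + len(b))
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / len(a) + 1 / len(b)))
        if se == 0:
            return 1.0
        z = (b.mean() - a.mean()) / se
        return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

    def rejection_rate(p_a, p_b, n=1_000, experiments=5_000, alpha=0.05):
        rejections = 0
        for _ in range(experiments):
            a = rng.binomial(1, p_a, n)
            b = rng.binomial(1, p_b, n)
            if two_proportion_p_value(a, b) < alpha:
                rejections += 1
        return rejections / experiments

    type_1 = rejection_rate(0.30, 0.30)      # null true: rejections are false alarms
    type_2 = 1 - rejection_rate(0.30, 0.35)  # null false: non-rejections are misses
    print(f"type-I error rate  ≈ {type_1:.3f}  (close to alpha = 0.05)")
    print(f"type-II error rate ≈ {type_2:.3f}")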
Repeating the same experiment (No effect)
Repeating the same experiment (Small effect)
The importance of
Statistical power
Statistical power is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true
Two main things affect statistical power:
Sample size (more is better)
Effect size (more is better)
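A rough normal-approximation power calculation (my own sketch, with assumed baseline conversion and effect sizes) showing both levers at once: power grows with the number of units per group and with the size of the effect being detected.

    import math

    def normal_cdf(x):
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))

    def approx_power(p_a, p_b, n_per_group, z_crit=1.96):
        """Approximate power of a two-sided two-proportion test at alpha = 0.05."""
        se = math.sqrt(p_a * (1 - p_a) / n_per_group + p_b * (1 - p_b) / n_per_group)
        return normal_cdf(abs(p_b - p_a) / se - z_crit)

    for n in (500, 1_000, 5_000, 10_000):
        small  = approx_power(0.30, 0.32, n)  # small effect: 30% -> 32%
        larger = approx_power(0.30, 0.35, n)  # larger effect: 30% -> 35%
        print(f"n = {n:>6} per group:  power {small:.2f} (small effect)  {larger:.2f} (larger effect)")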
Repeating the same experiment (More power)
Repeating the same experiment (MOAR POWER!!11)
Up, down or neither? (I don't know either)
The importance of sticking to
Protocol
The methods described assume strict adherence to protocol; violations of protocol such as peeking and multiple testing increase the type-I error rate
Can telekinesis influence
Three dice?
Fair die; I still cheated.
(The probability of rolling three sixes on a fair die, if you keep trying, approaches 1.)
Repeating the same experiment (Peeking twice)
Repeating the same experiment (Peeking 100x)
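A sketch of why peeking inflates the type-I error rate, under assumptions I have made up (no true effect at all, a two-proportion z-test, nominal alpha = 0.05): checking the p-value at several interim points and stopping at the first value below 0.05 makes the chance of at least one false positive climb well beyond 5%.

    import math
    import numpy as np

    rng = np.random.default_rng(4)

    def two_proportion_p_value(a, b):
        """Two-sided z-test for a difference in proportions (normal approximation)."""
        p_pool = (a.sum() + b.sum()) / (len(a) + len(b))
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / len(a) + 1 / len(b)))
        if se == 0:
            return 1.0
        z = (b.mean() - a.mean()) / se
        return math.erfc(abs(z) / math.sqrt(2))

    def false_positive_rate(n_looks, n_total=5_000, experiments=1_000, alpha=0.05):
        """Share of no-effect experiments declared significant at ANY interim look."""
        checkpoints = np.linspace(n_total // n_looks, n_total, n_looks, dtype=int)
        false_positives = 0
        for _ in range(experiments):
            a = rng.binomial(1, 0.30, n_total)  # both groups share the same true rate
            b = rng.binomial(1, 0.30, n_total)
            if any(two_proportion_p_value(a[:n], b[:n]) < alpha for n in checkpoints):
                false_positives += 1
        return false_positives / experiments

    for looks in (1, 2, 10, 100):
        print(f"{looks:>3} look(s): type-I error rate ≈ {false_positive_rate(looks):.3f}")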
Reasons for violating
Protocol
More flexible protocols may be desirable
early stopping rules to mitigate damage
early shipping to minimize opportunity cost
multiple variants to test several alternatives
multiple metrics to guard business KPIs
All these are possible[4], but require protocol adjustments
[4] Deng, Alex, Tianxi Li, and Yu Guo. 2014. “Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation.” WWW '14: 609–618.
Design of experiments
Statistical foundations for causal inference
Lukas Vermeer is Director of Experimentation at Booking.com
Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688–701. (link)
Goodman, Steven. 2008. “A Dirty Dozen: Twelve P-Value Misconceptions.” Seminars in Hematology 45: 135–140. (link)
Kohavi, Ron, Roger Longbotham, Dan Sommerfield, et al. 2009. “Controlled Experiments on the Web: Survey and Practical Guide.” Data Mining and Knowledge Discovery 18 (1): 140–181. (link)
Deng, Alex, Tianxi Li, and Yu Guo. 2014. “Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation.” WWW '14: 609–618. (link)