zalando / expan

Open-source Python library for statistical analysis of randomised control trials (A/B tests)
MIT License

Test differences date by date #206

Open alexisrosuel opened 6 years ago

alexisrosuel commented 6 years ago

Context

It is very useful when running an A/B test to see the evolution of the difference / p-values / credible intervals / etc. over time. For instance, if I start an experiment on 2018-04-01 and finish it on 2018-04-30, I would like to know what the state (in terms of p-value, etc.) was on each day. It helps to visualize whether the test has "converged" or not, as in the Airbnb example (source: https://medium.com/airbnb-engineering/experiments-at-airbnb-e2db3abf39e7).

Proposition

Would it be possible to apply the statistical analysis sequentially, date by date? It could run the analysis on the sequence `[df[df.date <= dt.datetime(2018, 4, 1) + dt.timedelta(days=i)] for i in range(30)]` and then report the same JSON, but with a date level at the top; a sketch of the looping logic is below. (Maybe there is a much cleaner architecture than this!)
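
A minimal sketch of that loop, assuming a hypothetical `run_analysis` callable standing in for whatever ExpAn entry point produces the per-experiment results (the real API may differ):

```python
import datetime as dt

def analyze_by_date(df, start, n_days, run_analysis):
    """Run the same analysis on growing daily slices of the data.

    `run_analysis` is a placeholder for an ExpAn call, not part of
    the library's actual API.
    """
    results = {}
    for i in range(n_days):
        cutoff = start + dt.timedelta(days=i)
        # Analyse all data observed up to and including `cutoff`.
        results[cutoff.isoformat()] = run_analysis(df[df.date <= cutoff])
    return results

# e.g. analyze_by_date(df, dt.datetime(2018, 4, 1), 30, run_analysis)
```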

Thanks

gbordyugov commented 6 years ago

Dear Alexis @alexisrosuel,

thanks a lot for the suggestion. What you're talking about here seems to me an instance of the 'early stopping' problem, which is subject to multiple hypothesis testing issues. The more often you look at your p-value, the higher the probability of seeing spurious significance by chance.

ExpAn kind of supports early stopping in a highly experimental mode and tries to mitigate the risk of spurious early stopping by applying a stricter p-value threshold when there is less data than expected. But it always consumes all the data present in the dataframe.

Let me know if I understood your question correctly.

Best, Grisha

alexisrosuel commented 6 years ago

Hi Grisha,

In fact, the idea behind this chart (and the whole Airbnb Medium article) is the opposite. They wanted to point out that the p-value can fluctuate over time, go below the significance threshold, and then either stay there forever or not.

The chart shows this: if you stop the experiment represented there around day 10, you commit a type I error. But if you let the experiment run for a few more days, you see that the p-value in fact "converges" around its true value.

To recap, this does not provide an early stopping criterion. It helps to monitor whether the p-value still behaves erratically (so we can't stop the experiment at that moment), or whether it hasn't changed for a "long time" (to be defined). For me the ideal criterion is:

What do you think of it?

gbordyugov commented 6 years ago

Please pardon my poor wording: what I meant in my first reply is exactly what you're talking about:

> The more often you look at your p-value, the higher the probability of seeing spurious significance by chance.

Our early stopping logic counteracts effects like this by reducing the alpha threshold at the beginning of the experiment (where you've got less data), so it's not 0.05 but much smaller for small quantities of data in the first days.

alexisrosuel commented 6 years ago

Oh indeed I see your point now too :)

Yes, ExpAn uses some kind of "dynamic p-value threshold", so we could plot this threshold day by day, along with the observed p-value?

shansfolder commented 6 years ago

Yes the "dynamic threshold" is based on information fraction, which is ratio of current sample size and estimated sample size for the experiment.

Here is the method we use: https://github.com/zalando/expan/blob/master/expan/core/early_stopping.py#L24-L36
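
For illustration, a hedged sketch of an O'Brien-Fleming-style alpha spending function of the kind the linked method implements; the exact formula in ExpAn may differ:

```python
from math import sqrt
from scipy.stats import norm

def obrien_fleming_threshold(information_fraction, alpha=0.05):
    """Stricter significance threshold while only a fraction of the
    estimated sample size has been collected (illustrative sketch).

    information_fraction: current sample size / estimated sample size.
    """
    # Inflate the two-sided critical z-value by 1/sqrt(fraction),
    # then convert it back to an alpha level.
    z = norm.ppf(1 - alpha / 2) / sqrt(information_fraction)
    return 2 * (1 - norm.cdf(z))

# With half of the data collected, the threshold is well below 0.05:
# obrien_fleming_threshold(0.5)  # ~0.0056
```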

shansfolder commented 6 years ago

Whether it is a day-by-day analysis or some other period depends on how your code calls ExpAn.