zalando / expan

Open-source Python library for statistical analysis of randomised control trials (A/B tests)
MIT License
331 stars 50 forks source link

Chi-squared test for the variant split check #228

Closed daryadedik closed 6 years ago

daryadedik commented 6 years ago

I have removed the previous chi-square test of homogeneity, since this is not exactly what we need.

A test of homogeneity tests whether two (or more) sets of multinomial counts come from different sets of population proportions. Or it tries to find out whether two different unknown populations have the same distribution as each other. A goodness of fit test is for testing whether a set of multinomial counts is distributed according to a pre-specified set of population proportions. The second is what we need: we try to answer the question whether our data or, more specifically, our variants splits consistent with the expected proportions we claim.

If you have comments or doubts feel free to approach and discuss.

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.06%) to 92.432% when pulling 99b333e6de6134fe8e1e7e62353dfd24b6221bc1 on chi-square-test-simplified into ab392e65bf86ef8cbc0d8a59071ff46e89a3a187 on master.

daryadedik commented 6 years ago

Note: I am still a bit concerned about the correction we need to perform for multiple testing again (since chi-square is also a statistical test). If we log and peek into the results again, we need to correct alpha again. Maybe we can avoid this by not conducting the statistical test, but rather by introducing the threshold. But I will investigate which threshold will be meaningful.

aaron-mcdaid-zalando commented 6 years ago

We could consider an alternative approach entirely. We could simply arrange to store the raw data (the number of assignments) and then do the statistical analysis in some external pipeline connected to our monitoring tool. We might want to do a lot of debugging and experimenting with different analysis methods for this data.

Then, in the longer term, when we understand this issue better and we understand the multiple testing issue, we can consider how much of the analysis can be added into expan.

In general, I think we should "gather good data first, ask questions later"

daryadedik commented 6 years ago

In decided to stick with the chi-square test goodness of fit test, because it's very straightforward (you just need a variant column and expected variant weights) and the other: if you test multiple times it's easy to correct with our correction methods: just collect p values and call method from out correction module and get the new alpha to make a decision.

daryadedik commented 6 years ago

@shansfolder @gbordyugov @aaron-mcdaid-zalando please review or approve, I need it for further development.

shansfolder commented 6 years ago

@daryadedik it looks good from my side, but I would suggest to wait for a second reviewer for such new functionality. ;)

daryadedik commented 6 years ago

@shansfolder sure.

aaron-mcdaid-zalando commented 6 years ago

Is this true?:

This adds two new function, chi_square_test_result_and_statistics and chi_square, but those tests are not called anywhere (except in the tests).

If that is true, then this can be safely merged. I'm not in a position to fully review the maths, but I'll give a thumbs up if the above is true

daryadedik commented 6 years ago

@aaron-mcdaid-zalando chi_square is called inside of chi_square_test_result_and_statistics. The letter is not called anywhere in expan, except tests

aaron-mcdaid-zalando commented 6 years ago

:+1:

(feel free to make more changes, and ping me again for a thumbs up)

daryadedik commented 6 years ago

@aaron-mcdaid-zalando let me know if you are ok with the changes.

aaron-mcdaid-zalando commented 6 years ago

:+1: