zalando / expan

Open-source Python library for statistical analysis of randomised control trials (A/B tests)
MIT License

Two-sided outlier filtering mode #249

Closed gbordyugov closed 5 years ago

gbordyugov commented 5 years ago

Two-sided outlier filtering mode

coveralls commented 5 years ago

Coverage increased (+0.1%) to 92.38% when pulling c01448e235e7d86512b720868b10717e454d80db on filter-negative-kpi-values into be6d6633c9b4f20ecc3f5eeef2d004c8ba7cd17d on master.

igusher commented 5 years ago

LGTM. Approve.

gbordyugov commented 5 years ago

@aaron-mcdaid-zalando what do you think?

aaron-mcdaid-zalando commented 5 years ago

Can you say more about what this is about? I recall us having a discussion internally some time ago, but I would like it specified somewhere, perhaps as documentation.

The default is to compute the X percentile (where X is 99.9 by default for some users) and to discard values above that threshold.

The change proposed here is:

  1. Compute percentile X for all non-negative values and discard above that.
  2. Compute percentile ~X~ (1-X) for all negative values and discard values below that.

This scheme is proposed in order to have backwards compatibility for datasets that do not have any negative values - correct?

gbordyugov commented 5 years ago

@aaron-mcdaid-zalando the three methods that are implemented do the following: 1) drop x% of the largest values, 2) drop x% of the smallest values, 3) drop x/2% on both sides of the distribution.

Plus there is a heuristic that goes for 1) if all values are non-negative, for 2) if all values are non-positive, and for 3) if the values are both positive and negative.
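To make the three modes and the heuristic concrete, here is a rough sketch in plain NumPy. It is not the code from this PR: the function name `filter_outliers`, the `percent` parameter and its default of 0.1 (mirroring the 99.9 percentile mentioned above) are assumptions made for illustration only, and the percentile threshold only approximately removes x% on small samples.

```python
import numpy as np


def filter_outliers(values, percent=0.1):
    """Hypothetical helper (not expan's API): drop roughly `percent`% of outliers.

    The mode is picked by the sign heuristic described in the thread.
    """
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return values

    if np.all(values >= 0):
        # Mode 1: all values non-negative, drop the largest `percent`%.
        upper = np.percentile(values, 100.0 - percent)
        return values[values <= upper]

    if np.all(values <= 0):
        # Mode 2: all values non-positive, drop the smallest `percent`%.
        lower = np.percentile(values, percent)
        return values[values >= lower]

    # Mode 3: mixed signs, drop `percent / 2`% on each side.
    lower, upper = np.percentile(values, [percent / 2.0, 100.0 - percent / 2.0])
    return values[(values >= lower) & (values <= upper)]


# Usage: each call below takes a different branch of the heuristic.
only_positive = filter_outliers(np.arange(1, 101), percent=10)   # mode 1
only_negative = filter_outliers(-np.arange(1, 101), percent=10)  # mode 2
mixed_signs = filter_outliers(np.arange(-50, 50), percent=10)    # mode 3
```

With this heuristic, datasets that contain no negative values go through mode 1 only, so they would be filtered the same way as with the previous one-sided default.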

gbordyugov commented 5 years ago

@aaron-mcdaid-zalando could you pls review the changes?

aaron-mcdaid-zalando commented 5 years ago

> @aaron-mcdaid-zalando could you pls review the changes?

Conversations resolved, and another positive review added. :+1: