yandex / rep

Machine Learning toolbox for Humans
http://yandex.github.io/rep/

FoldingClassifier: KFold vs StratifiedKFold #92

Closed jonas-eschle closed 8 years ago

jonas-eschle commented 8 years ago

Hey,

first of all, a compliment: I really like your repo and have built a lot of code on it; it's so useful! About the FoldingClassifier: there was already a request to implement a StratifiedKFold in addition to the "normal" KFold. I would be very glad to see this, but I'd go a step further: why not replace the KFold with a StratifiedKFold completely?

I think that, from an ML point of view, it is always better (or, in the best case, equally good) to use a stratified one. A normal KFold only introduces different class balances across the folds, which (usually) results in "shifted" probabilities among the different classifiers, whereas a stratified one does not and therefore makes each trained classifier's predictions "comparable".
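To make the class-balance argument concrete, here is a minimal sketch in plain Python. It is not REP's or scikit-learn's code: the helper names (`plain_folds`, `stratified_folds`, `train_positive_rates`) and the toy labels are assumptions made up for illustration. It compares the share of positives each fold's classifier would be trained on.

```python
import random

def plain_folds(n, n_folds, seed):
    """Fold id per sample from a seeded shuffle; labels are never consulted."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [0] * n
    for position, index in enumerate(order):
        folds[index] = position % n_folds
    return folds

def stratified_folds(labels, n_folds, seed):
    """Fold id per sample, dealing each class round-robin across the folds."""
    rng = random.Random(seed)
    folds = [0] * len(labels)
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[index] = position % n_folds
    return folds

def train_positive_rates(labels, folds, n_folds):
    """Fraction of positives in the training part (all folds except k), per k."""
    rates = []
    for k in range(n_folds):
        train = [y for y, f in zip(labels, folds) if f != k]
        rates.append(sum(train) / len(train))
    return rates

# Hypothetical toy data: 20 positives, 180 negatives (10% positives overall).
labels = [1] * 20 + [0] * 180
plain = train_positive_rates(labels, plain_folds(len(labels), 5, seed=0), 5)
strat = train_positive_rates(labels, stratified_folds(labels, 5, seed=0), 5)

print("plain     :", [round(r, 3) for r in plain])
print("stratified:", [round(r, 3) for r in strat])
```

In the stratified case every training part contains exactly 16 positives out of 160 samples, so each classifier sees the same 10% prior. With the plain shuffle the rate generally varies from fold to fold, which is the "shift" described above.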

Or in other words: I cannot think of any case where you want to have a non-stratified KFolding instead of a stratified one.

What do you think?

Best, Mayou

arogozhnikov commented 8 years ago

Hi Mayou, thanks for your feedback!

From a statistical point of view I'm completely in favor of stratified kFold; the problem is the implementation. Stratified kFold requires some magic to remember the targets of the original dataset, which is something I don't want to be the default behavior.

Simple kFold splitting requires only the length of the dataset (one number) and the random state (one number). With a fixed random state the splitting doesn't depend on the dataset, but the situation changes in the stratified case: the splitting has to depend on the dataset labels, so this information must somehow be kept inside the classifier (we shouldn't ask for labels during prediction, otherwise we break the interface). In rare (but still existing) situations this leads to much higher memory usage.
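The bookkeeping difference described here can be sketched as follows. This is a hypothetical helper written for illustration, not the actual FoldingClassifier code: the class name `FoldAssigner`, its methods, and the toy labels are all assumptions. The plain assignment is recomputed from two numbers at prediction time, while the stratified one has to be stored when the labels are seen.

```python
import random

class FoldAssigner:
    """Hypothetical helper (not REP's code): decides which fold a sample is in."""

    def __init__(self, n_folds, seed, stratified=False):
        self.n_folds = n_folds
        self.seed = seed
        self.stratified = stratified
        self.stored_folds = None  # only filled in the stratified case

    def fit(self, labels):
        if self.stratified:
            # Depends on the labels, so it must be remembered:
            # one integer per training sample.
            self.stored_folds = self._stratified(labels)
        return self

    def folds_for_training_data(self, n_samples):
        """Used at prediction time, when labels are not available."""
        if not self.stratified:
            # Recomputed from (n_samples, seed); nothing was cached.
            return self._plain(n_samples)
        if self.stored_folds is None or len(self.stored_folds) != n_samples:
            raise ValueError("stratified folds are known only for the fitted data")
        return self.stored_folds

    def _plain(self, n_samples):
        order = list(range(n_samples))
        random.Random(self.seed).shuffle(order)
        folds = [0] * n_samples
        for position, index in enumerate(order):
            folds[index] = position % self.n_folds
        return folds

    def _stratified(self, labels):
        rng = random.Random(self.seed)
        folds = [0] * len(labels)
        for cls in sorted(set(labels)):
            members = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(members)
            for position, index in enumerate(members):
                folds[index] = position % self.n_folds
        return folds

labels = [1] * 6 + [0] * 14  # hypothetical toy labels
plain = FoldAssigner(n_folds=2, seed=1).fit(labels)
strat = FoldAssigner(n_folds=2, seed=1, stratified=True).fit(labels)

print(plain.stored_folds)       # None: nothing had to be remembered
print(len(strat.stored_folds))  # 20: one fold id per training sample
```

The stored list grows with the training set, which is where the extra memory in the stratified case comes from in this sketch.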

So I'd like it to be an additional option for FoldingClassifier, but not the default behavior. Let me see what I can do about this.

arogozhnikov commented 8 years ago

@mayou36 I've added a test implementation of stratified kFolding; you can try it now:

pip uninstall rep
pip install https://github.com/yandex/rep/archive/stratifiedkfold.zip --no-dependencies

Please report any bugs you run into.

jonas-eschle commented 8 years ago

Thank you very much, checked it out and it looks good so far.

Sure, I'll let you know

jonas-eschle commented 8 years ago

Just an update: I've been using the stratified classifier without any problems; no bugs encountered so far.

arogozhnikov commented 8 years ago

@mayou36 Nice to know; then we'll have it in the next release. I'll close the issue and the PR.