yandex / rep

Machine Learning toolbox for Humans
http://yandex.github.io/rep/

Add ability to initialise FoldingBase objects with external parser #77

Closed mschlupp closed 8 years ago

mschlupp commented 8 years ago

If you would like to run rep with e.g. a StratifiedKFold instead of a normal KFold, this will be possible after this pull request. If no external folder object is passed, the default KFold algorithm is used.

jonas-eschle commented 8 years ago

Good idea, thank you! Hope it will be pulled in soon.

Btw it seems somewhat neglected... are you still actively maintaining this repo? I personally find it very useful and make a lot of use of it.

arogozhnikov commented 8 years ago

@mayou36

> are you still actively maintaining this repo?

Only rarely. Since the things we use work fine, there is no reason to spoil the situation :)

@mschlupp

Max, thanks for the idea, but this patch isn't going to work (even if the code were correct). The reason is that FoldingClassifier should not only work with the training dataset, but also be able to predict any arbitrary dataset, which may e.g. have a different size. That isn't going to work in your case.

So, it's more complicated. I'll think about how the case of stratified folding can be added.

mschlupp commented 8 years ago

I don't quite understand why this should be an issue. You can predict arbitrary datasets and you can validate via CV. The only change is that you replace the default KFold with a folder of your choice. What am I missing here?

arogozhnikov commented 8 years ago

@mschlupp

There are several basic things expected from FoldingClassifier:

  1. fitting on an arbitrary dataset
  2. correctly predicting the same dataset (meaning that the fold ids used here and in training must be the same)
  3. correctly predicting any other dataset with the same columns, but possibly a different length
  4. working as part of a more complex classifier (e.g. Bagging over FoldingClassifier)

At the same time, labels are available when fitting but not when predicting. Currently this is handled by fixing an internal random state, which is later used together with the dataset length to regenerate the correct fold indices.
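A minimal sketch of that idea, not REP's actual code: fold ids are derived only from a fixed seed and the dataset length, so the same length always reproduces the same assignment, and any other length still gets a valid one. The function name `fold_ids` and its parameters are hypothetical.

```python
import random


def fold_ids(length, n_folds, seed):
    """Assign each of `length` samples to one of `n_folds` folds.

    The result depends only on (length, n_folds, seed); labels are never needed.
    """
    # Balanced fold ids 0, 1, ..., n_folds - 1, 0, 1, ... of the requested length.
    ids = [i % n_folds for i in range(length)]
    # A fresh generator seeded identically on every call makes the shuffle reproducible.
    random.Random(seed).shuffle(ids)
    return ids


train_ids = fold_ids(length=10, n_folds=3, seed=42)

# Predicting the training dataset again: same length and seed give the same folds.
assert fold_ids(length=10, n_folds=3, seed=42) == train_ids

# Predicting a dataset of another length still yields a valid assignment.
other_ids = fold_ids(length=7, n_folds=3, seed=42)
```

Because nothing here looks at the labels, the same call works at fit time and at predict time.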

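When you pass StratifiedKFold, you can't fulfill points 3 and 4: stratified folds are computed from the labels, so a new test dataset would need a different folding, and there are no labels at prediction time to build it from.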
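To illustrate the problem, here is a hand-rolled stratified assignment. It is neither scikit-learn's StratifiedKFold implementation nor REP code, and the name `stratified_fold_ids` is hypothetical. The point is only that the labels are a required input.

```python
import random
from collections import defaultdict


def stratified_fold_ids(labels, n_folds, seed):
    """Assign fold ids so that each class is spread evenly across the folds."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    ids = [None] * len(labels)
    # Within each class, shuffle the indices and deal them out round-robin.
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            ids[index] = position % n_folds
    return ids


labels = [0] * 8 + [1] * 4
ids = stratified_fold_ids(labels, n_folds=2, seed=0)

# Count the positive samples that landed in each fold.
positives_per_fold = [
    sum(1 for fold_id, label in zip(ids, labels) if fold_id == fold and label == 1)
    for fold in range(2)
]
```

Without `labels` the function cannot be called at all, which is the situation a classifier is in when it is asked to predict a new dataset.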

arogozhnikov commented 8 years ago

closed in favor of #92 (stratified folding)