zachmayer / caretEnsemble

caret models all the way down :turtle:
http://zachmayer.github.io/caretEnsemble/

Add checks to extractBestPreds #45

Closed: zachmayer closed this issue 9 years ago

zachmayer commented 10 years ago

Todo in the code:

Insert checks here: observeds are all equal, row indexes are equal, Resamples are equal

Probably a matter of writing 3 functions: checkObserveds, checkRowIndexes, and checkResamples.

Each one should give an explanatory error message of what's wrong, which model(s) are the culprit, and why we can't make an ensemble in this situation.

This is pulled out of #3, which kind of grew into many separate bug reports.
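For illustration, here is a rough sketch of what one of these checks could look like. It assumes each component model is a `caret::train` object fit with `savePredictions` enabled, so `model$pred` carries `Resample` (and `rowIndex`) columns; the function name and body are hypothetical, not necessarily what eventually landed.

```r
# Hypothetical sketch of one of the three checks (not the shipped implementation).
# Assumes `models` is a named list of caret::train objects fit with
# trainControl(savePredictions = TRUE), so each model$pred has a Resample column.
checkResamples <- function(models) {
  resample_sets <- lapply(models, function(m) sort(unique(m$pred$Resample)))
  reference <- resample_sets[[1]]
  same <- vapply(resample_sets, identical, logical(1), y = reference)
  if (!all(same)) {
    stop(
      "All component models must be fit on identical resamples to be ensembled. ",
      "These models used different resamples than '", names(models)[1], "': ",
      paste(names(models)[!same], collapse = ", ")
    )
  }
  invisible(TRUE)
}
```

checkRowIndexes and checkObserveds would follow the same pattern, comparing `rowIndex` and `obs` within each resample instead.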

zachmayer commented 10 years ago

also checkIndexes

jknowles commented 9 years ago

@zachmayer Have you started tackling this or are you looking for a contribution here? Let me know. I want to help push caretEnsemble to CRAN before 2015 if possible.

zachmayer commented 9 years ago

I have not started on this yet. I think I have some dummy functions in helper_functions.R and caretList.R. I also want this out before 2015!

zachmayer commented 9 years ago

Closed by #106

nkurz commented 9 years ago

Is passing checkResamples() actually a requirement for building a valid stack or ensemble? Instinctively, I'd think that as long as the model is able to make a prediction for each value, it shouldn't matter how that model was arrived at. Presumably you wouldn't want a model that's internally snooping to make its predictions, but the exact method used for resampling should be immaterial to whether it can be combined with other models. Or do I misunderstand what's actually being checked here?

zachmayer commented 9 years ago

We use the predictions on the resamples to build the ensemble. Therefore, each model needs the same set of resamples.

nkurz commented 9 years ago

Yes, the cross-validated predictions of the "Resamples" are used to weight the models in the caretList to produce a caretStack or caretEnsemble. My confusion is whether this is checking that the same set of training data is being used (and hence the predictions are for the same data), or that the set was created in the same manner (4-fold vs 10-fold; bootstrap vs k-fold)? The first seems like a reasonable safety check, but the second seems like it would prevent some reasonable use cases.

zachmayer commented 9 years ago

We are testing that the test sets for each resample use the exact same data. The training sets could, in theory, be different.

caretStack and caretEnsemble do not do any cross-validation on their own. They merely exploit the existing cross-validation folds used to fit the original caret models. In order for this scheme to work (ensembling without re-fitting), each model in the ensemble needs to use the same number of resamples, with the same rows of data in the test set for each fold.

I don't think we'll ever support ensembling models with a different number of resamples or different test sets for each resample. We may eventually support models with different training sets for each resample so long as the test sets are identical.
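As a concrete (hedged) example of how to satisfy that requirement with the current caret/caretEnsemble APIs: build the resampling indexes once and pass them to every model through a shared `trainControl`, so every model's held-out rows per fold are identical. The data and model methods below are just placeholders.

```r
library(caret)
library(caretEnsemble)

# Two-class subset of iris, purely as placeholder data
dat <- iris[iris$Species != "setosa", ]
dat$Species <- factor(dat$Species)

# Build the fold indexes once; trainControl's `index` expects the
# training-set rows for each resample, hence returnTrain = TRUE.
set.seed(42)
folds <- createFolds(dat$Species, k = 5, returnTrain = TRUE)

ctrl <- trainControl(
  method = "cv",
  index = folds,                 # identical resamples for every model
  savePredictions = "final",
  classProbs = TRUE
)

# Every model in the list shares ctrl, so the test set for each fold
# is the same across models, which is the condition the checks enforce.
model_list <- caretList(
  Species ~ .,
  data = dat,
  trControl = ctrl,
  methodList = c("glm", "rpart")
)

# The ensemble reuses the saved out-of-fold predictions; nothing is re-fit.
ens <- caretEnsemble(model_list)
summary(ens)
```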