
Differences between caret and caretEnsemble accuracy reporting #245

Closed by EllenConsidine 5 years ago

EllenConsidine commented 5 years ago

Hello,

I am a student research assistant using caretEnsemble to estimate wildfire air pollution exposures across the western US from a suite of environmental variables (satellite imagery, temperature, humidity, etc.). I am hoping you can help me understand a perplexing difference I am seeing between individual and ensemble performance metrics. Specifically, when I run the exact same "ranger" algorithm individually and then in a caretStack with another algorithm ("glmnet", "xgbTree", etc.), the ensemble model reports a lower R^2 than the individual model, which seems wrong to both me and my mentor.

To get reliable test metrics, I hold out an entirely separate 10% of the data in addition to running 10-fold cross-validation via a caret/caretEnsemble trainControl object. When I calculate R^2 and RMSE on this separate 10% test set, the results are (a) worse than those reported by caret and (b) better for the ensemble than for the individual ranger model, which is what I'd expect.
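In simplified form, the setup looks roughly like this (an illustrative sketch, not my full script; the data frame `df` and outcome `pm25` are stand-ins for my actual data):

```r
library(caret)
library(caretEnsemble)

# `df` and `pm25` are placeholders for my real data and response variable.
set.seed(123)
holdout_idx <- createDataPartition(df$pm25, p = 0.9, list = FALSE)
train_df <- df[holdout_idx, ]   # 90% used for cross-validation
test_df  <- df[-holdout_idx, ]  # 10% held out as an independent test set

ctrl <- trainControl(method = "cv", number = 10, savePredictions = "final")

# Individual ranger model
rf_fit <- train(pm25 ~ ., data = train_df, method = "ranger", trControl = ctrl)

# Stack of ranger + xgbTree with a glm meta-model
model_list <- caretList(pm25 ~ ., data = train_df,
                        trControl = ctrl,
                        methodList = c("ranger", "xgbTree"))
stack_fit <- caretStack(model_list, method = "glm")

# "Testing" metrics on the 10% holdout
rf_pred    <- predict(rf_fit, newdata = test_df)
stack_pred <- predict(stack_fit, newdata = test_df)
postResample(rf_pred, test_df$pm25)     # RMSE, R^2, MAE
postResample(stack_pred, test_df$pm25)
```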

Here is a table summarizing these results. "Training" results are those reported by caret/caretEnsemble, and "Testing" results are those I calculate on the separate 10% test set. (Note: each model, the individual ranger and the caretStack of ranger + xgbTree, is run on the same four datasets.)

| Model | Dataset | Training R^2 | Training RMSE | Testing R^2 | Testing RMSE |
| --- | --- | --- | --- | --- | --- |
| Ranger | Fire 2010 | 0.964 | 1.389 | 0.771 | 3.177 |
| Ranger | Fire 2017 | 0.889 | 6.989 | 0.664 | 9.456 |
| Ranger | Not Fire 2010 | 0.958 | 1.639 | 0.781 | 3.591 |
| Ranger | Not Fire 2017 | 0.964 | 2.223 | 0.826 | 4.487 |
| Stack (RF + XGBT) | Fire 2010 | 0.778 | 3.028 | 0.797 | 2.139 |
| Stack (RF + XGBT) | Fire 2017 | 0.571 | 11.083 | 0.807 | 5.393 |
| Stack (RF + XGBT) | Not Fire 2010 | 0.727 | 3.665 | 0.796 | 2.400 |
| Stack (RF + XGBT) | Not Fire 2017 | 0.761 | 5.127 | 0.891 | 2.977 |

Can you help me diagnose why caret reports higher R^2 values (and lower RMSE values) for the individual random forest model than for the caretStack? I can include code if that would be useful.

Thank you!

zachmayer commented 5 years ago

Are you using the same trainControl object, with explicitly defined partitioning, for the models in the ensemble as for the single models?

Can you share a script to reproduce this table?
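For reference, explicitly defined partitioning looks something like this (a minimal sketch; `train_df` and `pm25` are illustrative names):

```r
library(caret)

# Define the CV folds once, up front, so every model sees identical resamples.
set.seed(42)
folds <- createFolds(train_df$pm25, k = 10, returnTrain = TRUE)

ctrl <- trainControl(method = "cv", index = folds,
                     savePredictions = "final")

# Pass this same `ctrl` to both the standalone train() call and to caretList();
# otherwise each call re-samples its own folds and the CV metrics aren't
# computed on the same data partitions.
```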

EllenConsidine commented 5 years ago

Thank you for your question about the partitions.

It turns out I was re-sampling the folds used for cross-validation, so the resamples weren't actually identical even though the trainControl parameters were. I also realized I was passing an extra argument (num.trees) to the train function for ranger that wasn't passed in the ensemble.

My apologies for not catching these details sooner. Thanks again.
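For anyone who runs into the same discrepancy, the corrected setup looks roughly like this (an illustrative sketch; names and values are stand-ins for my actual script):

```r
library(caret)
library(caretEnsemble)

# 1. One set of folds, shared by every model.
set.seed(42)
folds <- createFolds(train_df$pm25, k = 10, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds, savePredictions = "final")

# 2. Identical model arguments in both places (num.trees shown here).
rf_alone <- train(pm25 ~ ., data = train_df, method = "ranger",
                  trControl = ctrl, num.trees = 500)

model_list <- caretList(
  pm25 ~ ., data = train_df, trControl = ctrl,
  tuneList = list(
    ranger  = caretModelSpec(method = "ranger", num.trees = 500),
    xgbTree = caretModelSpec(method = "xgbTree")
  )
)
stack_fit <- caretStack(model_list, method = "glm")
```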

zachmayer commented 5 years ago

No worries! This always happens to me too!