
Differences between caret and caretEnsemble accuracy reporting #245

Closed by EllenConsidine 5 years ago

EllenConsidine commented 5 years ago

Hello,

I am a student research assistant using caretEnsemble to estimate wildfire air pollution exposures across the western US from a suite of environmental variables (satellite imagery, temperature, humidity, etc.). I am hoping you can help me understand a perplexing difference I am seeing between individual and ensemble performance metrics. Specifically, when I run the exact same "ranger" algorithm individually and then in a caretStack with another algorithm ("glmnet", "xgbTree", etc.), the ensemble model reports a lower R^2 than the individual model, which seems wrong to both me and my mentor.

To get reliable test metrics, I hold out an entirely separate 10% of the data in addition to running 10-fold cross-validation via a caret/caretEnsemble trainControl object. When I calculate R^2 and RMSE on this separate 10% test set, the results are (a) worse than those reported by caret and (b) better for the ensemble than for the individual ranger model, which is what I'd expect.
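In simplified form, the setup looks roughly like this (an illustrative sketch, not my full script; the data frame `df` and outcome `pm25` are stand-ins for my actual data):

```r
library(caret)
library(caretEnsemble)

# `df` and `pm25` are placeholders for my real data and response variable.
set.seed(123)
holdout_idx <- createDataPartition(df$pm25, p = 0.9, list = FALSE)
train_df <- df[holdout_idx, ]   # 90% used for cross-validation
test_df  <- df[-holdout_idx, ]  # 10% held out as an independent test set

ctrl <- trainControl(method = "cv", number = 10, savePredictions = "final")

# Individual ranger model
rf_fit <- train(pm25 ~ ., data = train_df, method = "ranger", trControl = ctrl)

# Stack of ranger + xgbTree with a glm meta-model
model_list <- caretList(pm25 ~ ., data = train_df,
                        trControl = ctrl,
                        methodList = c("ranger", "xgbTree"))
stack_fit <- caretStack(model_list, method = "glm")

# "Testing" metrics on the 10% holdout
rf_pred    <- predict(rf_fit, newdata = test_df)
stack_pred <- predict(stack_fit, newdata = test_df)
postResample(rf_pred, test_df$pm25)     # RMSE, R^2, MAE
postResample(stack_pred, test_df$pm25)
```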

Here is a table summarizing these results. "Training" results are those reported by caret/caretEnsemble, and "Testing" results are those I calculate on the separate 10% test set. (Note: each model, the individual ranger and the caretStack of ranger + xgbTree, is run on the same four datasets.)

| Model | Dataset | Training R^2 | Training RMSE | Testing R^2 | Testing RMSE |
| --- | --- | --- | --- | --- | --- |
| Ranger | Fire 2010 | 0.964 | 1.389 | 0.771 | 3.177 |
| Ranger | Fire 2017 | 0.889 | 6.989 | 0.664 | 9.456 |
| Ranger | Not Fire 2010 | 0.958 | 1.639 | 0.781 | 3.591 |
| Ranger | Not Fire 2017 | 0.964 | 2.223 | 0.826 | 4.487 |
| Stack (RF + XGBT) | Fire 2010 | 0.778 | 3.028 | 0.797 | 2.139 |
| Stack (RF + XGBT) | Fire 2017 | 0.571 | 11.083 | 0.807 | 5.393 |
| Stack (RF + XGBT) | Not Fire 2010 | 0.727 | 3.665 | 0.796 | 2.400 |
| Stack (RF + XGBT) | Not Fire 2017 | 0.761 | 5.127 | 0.891 | 2.977 |

Can you help me diagnose why caret reports higher R^2 values (and lower RMSE values) for the individual random forest model than for the caretStack? I can include code if that would be useful.

Thank you!

zachmayer commented 5 years ago

Are you using the same trainControl object, with explicitly defined partitioning, for the models in the ensemble as for the single models?

Can you share a script to reproduce this table?
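For reference, explicitly defined partitioning looks something like this (a minimal sketch; `train_df` and `pm25` are illustrative names):

```r
library(caret)

# Define the CV folds once, up front, so every model sees identical resamples.
set.seed(42)
folds <- createFolds(train_df$pm25, k = 10, returnTrain = TRUE)

ctrl <- trainControl(method = "cv", index = folds,
                     savePredictions = "final")

# Pass this same `ctrl` to both the standalone train() call and to caretList();
# otherwise each call re-samples its own folds and the CV metrics aren't
# computed on the same data partitions.
```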

EllenConsidine commented 5 years ago

Thank you for your question about the partitions.

It turns out I was re-sampling the folds used for cross-validation, so the resamples weren't actually identical even though the trainControl parameters were. I also realized I was passing an extra argument (num.trees) to the train function for ranger that wasn't passed in the ensemble.

My apologies for not catching these details sooner. Thanks again.
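For anyone who runs into the same discrepancy, the corrected setup looks roughly like this (an illustrative sketch; names and values are stand-ins for my actual script):

```r
library(caret)
library(caretEnsemble)

# 1. One set of folds, shared by every model.
set.seed(42)
folds <- createFolds(train_df$pm25, k = 10, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds, savePredictions = "final")

# 2. Identical model arguments in both places (num.trees shown here).
rf_alone <- train(pm25 ~ ., data = train_df, method = "ranger",
                  trControl = ctrl, num.trees = 500)

model_list <- caretList(
  pm25 ~ ., data = train_df, trControl = ctrl,
  tuneList = list(
    ranger  = caretModelSpec(method = "ranger", num.trees = 500),
    xgbTree = caretModelSpec(method = "xgbTree")
  )
)
stack_fit <- caretStack(model_list, method = "glm")
```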

zachmayer commented 5 years ago

No worries! This always happens to me too!