zachmayer / caretEnsemble

caret models all the way down :turtle:

Understanding caretEnsemble #217

Closed univ12 closed 7 years ago

univ12 commented 7 years ago

Hi, thanks for this nice package. From the documentation it is still unclear to me how the ensembling works, and I'm not good at reading source code, so could you please help me understand the principle? My understanding is as follows:

Let X be the data and y the outcome. Since bootstrap resampling is the default option, we sample with replacement from X. We now have a training and test set, X[train] and X[test]. We then train models A and B (e.g. linear model, random forest) on the training data and predict on the test data. But how do we proceed from here? Do we feed these test-data predictions into a linear model with y[test] as the outcome? Do we then predict on the test data again? That would be strange, since we used these data to build the linear model. And if we repeat the bootstrapping 25 times, we get 25 different models. How are those averaged?

Thanks

zachmayer commented 7 years ago

You use the out-of-sample predictions for the training set to train the ensemble. E.g. if you did cross-validation, you can use the out-of-sample predictions from each cross-validation fold.
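For example, caret's train() can keep the held-out prediction for every training row when you set savePredictions in trainControl. A minimal sketch (the data set and model choice here are just placeholders):

```r
library(caret)

# keep the held-out (out-of-fold) predictions from each CV fold
ctrl <- trainControl(method = "cv", number = 5, savePredictions = "final")

fit <- train(Sepal.Length ~ ., data = iris, method = "lm", trControl = ctrl)

# fit$pred holds one out-of-sample prediction per training row,
# made by the model that did NOT see that row in its fold
head(fit$pred[, c("rowIndex", "pred", "obs", "Resample")])
```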

zachmayer commented 7 years ago

And if we repeat the bootstrapping 25 times, we get 25 different models. How are those averaged?

You use the out-of-sample predictions from each of the 25 models. You stack them up, and then train a single model.
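This is roughly what caretList() plus caretStack() automate: every base model is trained on the same resamples, the out-of-sample predictions are kept, and one meta-model is trained on top of them. A sketch, with arbitrary placeholder data and base learners:

```r
library(caret)
library(caretEnsemble)

# 25 bootstrap resamples shared by every base model, keeping the
# out-of-sample predictions from each resample
ctrl <- trainControl(method = "boot", number = 25,
                     savePredictions = "final",
                     index = createResample(iris$Sepal.Length, 25))

# train the base models (A and B) on the same resamples
models <- caretList(Sepal.Length ~ ., data = iris,
                    trControl = ctrl,
                    methodList = c("lm", "rpart"))

# stack: train one meta-model (here a linear model) on the
# stacked out-of-sample predictions of the base models
stack <- caretStack(models, method = "glm")
```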

zachmayer commented 7 years ago

See also "stacked generalization" here: http://mlwave.com/kaggle-ensembling-guide/

univ12 commented 7 years ago

Thanks for the link, that was interesting. However, I still do not quite understand it. So:

  1. split data X into X[train] and X[test]
  2. bootstrap only X[train] into X_a and X_b
  3. train model A on X_a and predict on X_b, call the predictions A_pred_b
  4. train model B on X_a and predict on X_b, call the predictions B_pred_b
  5. train a linear model using A_pred_b and B_pred_b as covariates
  6. train A on complete X[train] and predict on X[test]
  7. train B on complete X[train] and predict on X[test]

and then?

  1. use the regression coefficients from 5. to re-weight both predictions from 6. and 7. into a single prediction?

or

  1. use the linear model from 5. to predict on X[test]?

zachmayer commented 7 years ago

  1. split data X into X[train] and X[test]
  2. bootstrap only X[train] into X_a and X_b (you can use cross-validation too)
  3. train model A on X_a and predict on X_b, call the predictions A_pred_b
  4. train model B on X_a and predict on X_b, call the predictions B_pred_b
  5. train model A on X_b and predict on X_a, call the predictions A_pred_a
  6. train model B on X_b and predict on X_a, call the predictions B_pred_a
  7. cbind A_pred_b and B_pred_b. Call it pred_b (2 columns in this dataset, A and B)
  8. cbind A_pred_a and B_pred_a. Call it pred_a (2 columns in this dataset, A and B)
  9. rbind pred_b and pred_a. Call it pred. (2 columns in this dataset, A and B)
  10. train a linear model on the pred dataset
  11. train A on complete X[train] and predict on X[test]. Call this A_pred_test
  12. train B on complete X[train] and predict on X[test]. Call this B_pred_test
  13. cbind A_pred_test and B_pred_test. Call it pred_test (2 columns in this dataset, A and B)
  14. Predict with your linear model from step 10, using pred_test as your input. These are your final predictions.

You only train the linear model once; you don't re-train it.
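A minimal base-R sketch of those steps, using lm and rpart as stand-ins for models A and B, a linear model as the stacker, and iris as placeholder data (in practice caretList() and caretStack() do this for you):

```r
library(rpart)   # model B; model A is a plain linear model

set.seed(1)
form <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width

# step 1: split the data into X[train] and X[test]
test_idx <- sample(nrow(iris), 50)
X_test  <- iris[test_idx, ]
X_train <- iris[-test_idx, ]

# step 2: split X[train] into two halves, X_a and X_b
idx_a <- sample(nrow(X_train), nrow(X_train) / 2)
X_a <- X_train[idx_a, ]
X_b <- X_train[-idx_a, ]

# steps 3-6: out-of-sample predictions for each half from each model
A_pred_b <- predict(lm(form, data = X_a),    X_b)
B_pred_b <- predict(rpart(form, data = X_a), X_b)
A_pred_a <- predict(lm(form, data = X_b),    X_a)
B_pred_a <- predict(rpart(form, data = X_b), X_a)

# steps 7-9: cbind within each half, then rbind the halves ("pred")
pred <- rbind(
  data.frame(A = A_pred_b, B = B_pred_b, y = X_b$Sepal.Length),
  data.frame(A = A_pred_a, B = B_pred_a, y = X_a$Sepal.Length)
)

# step 10: train the meta-model once, only on out-of-sample predictions
meta <- lm(y ~ A + B, data = pred)

# steps 11-13: refit A and B on all of X[train], predict on X[test]
pred_test <- data.frame(
  A = predict(lm(form, data = X_train),    X_test),
  B = predict(rpart(form, data = X_train), X_test)
)

# step 14: the meta-model turns the two columns into the final prediction
final_pred <- predict(meta, newdata = pred_test)
```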

univ12 commented 7 years ago

Now it's clear. So easy, thank you!