zachmayer / caretEnsemble

caret models all the way down :turtle:
http://zachmayer.github.io/caretEnsemble/

as.caretList method for lists of train models #104

Closed zachmayer closed 3 months ago

zachmayer commented 9 years ago

So if someone manually constructs some models, they can coerce them to caretList. This would run the check in #45 to validate the list of models.
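A minimal sketch of what such a coercion method might look like (this is a hypothetical illustration, not the package's implementation; the real method would also run the model-list validation from #45):

```r
# Hypothetical sketch of an as.caretList coercion (not the caretEnsemble implementation).
# It checks that every element is a caret "train" model before tagging the list.
as.caretList <- function(x) {
  stopifnot(is.list(x))
  is_train <- vapply(x, inherits, logical(1), what = "train")
  if (!all(is_train)) {
    stop("all elements must be caret 'train' models")
  }
  # The real method would also validate that all models share resamples (see #45)
  class(x) <- c("caretList", "list")
  x
}
```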

Laurae2 commented 8 years ago

Assuming the resampling strategy is the same (and the models produce class probabilities), it is possible to combine train models into a caretList.

The fold indexes must be stored in a variable; otherwise the coercion to caretList can fail (it fails if you re-use and re-assign the folds to another train model that must be coerced to caretList).

Example of cv:

CVfolds <- 5
CVrepeats <- 3
indexPreds <- createMultiFolds(train$Churn, CVfolds, CVrepeats) # for repeatedcv
# adaptive_cv uses createFolds(train$Churn, CVfolds)
# using Adaptive Cross-Validation with createMultiFolds does work
# all the folds must be stored in a variable BEFORE training, unless you use a method which creates new folds
# creating folds on the fly will NOT work, as the resamples will differ between models
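The point about storing folds up front can be seen with plain base R: two independent random fold assignments will not match, so each model would be trained on different resamples unless every model reuses the same stored object (a sketch using `sample()` as a stand-in for `createMultiFolds()`):

```r
# Illustration with base R: two independent random fold assignments disagree,
# so each train() call would resample differently unless folds are stored once.
set.seed(42)
folds_a <- sample(rep(1:5, length.out = 100)) # fold labels, first call
folds_b <- sample(rep(1:5, length.out = 100)) # second call: different labels

stored <- folds_a                # store once, reuse everywhere
identical(stored, folds_a)       # TRUE: reusing the stored object is safe
identical(folds_a, folds_b)      # FALSE: on-the-fly folds differ
```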

Then use this:

ctrl <- trainControl(method = "repeatedcv", repeats = CVrepeats, number = CVfolds, returnResamp = "all", savePredictions = "all", classProbs = TRUE, index = indexPreds)
# note that index points to a stored variable; this is the safest way to guarantee identical resamples across models

Run your training of each model:

tune <- c(10, 5) # example tuneLength values, one per model
model_list1 <- train(Churn ~ ., data = train, method = "rpart2", trControl = ctrl, tuneLength = tune[1])
model_list2 <- train(Churn ~ ., data = train, method = "gbm", trControl = ctrl, tuneLength = tune[2])

Coerce the trains into a list:

multimodel <- list(rpart2 = model_list1, gbm = model_list2)

Convert the freshly created list into caretList:

class(multimodel) <- "caretList"

And now you can run caretEnsemble or caretStack without any issues, example:

multiensemble <- caretEnsemble(multimodel)

And predict:

predictedValues <- predict(multiensemble, newdata = test)

Creating train models separately lets you build each model without having to set up a caretList first. This is useful when you have different train models to merge into a caretEnsemble/caretStack; however, they must use the same resamples, hence the need to store the indexes in a variable. Obvious advantages include evaluating each train model on a known held-out test sample, computing confusion matrix statistics, and merging the outputs into a single matrix. Once you have predicted values from a caretList, if you are predicting classes you have to use:

test$Churn[predictedValues < 0.50] <- 0
test$Churn[predictedValues >= 0.50] <- 1

in order to map the predicted probabilities back to binary class labels.
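The same thresholding can be written compactly with `ifelse()` (a base-R sketch; the 0.50 cutoff and the `probs` vector are illustrative stand-ins for the ensemble's probability output):

```r
# Map predicted probabilities to binary class labels at a 0.50 cutoff
probs <- c(0.12, 0.85, 0.50, 0.49)          # stand-in for predict() output
predictedClass <- ifelse(probs >= 0.50, 1, 0)
predictedClass                               # 0 1 1 0
```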

In fact, to solve this issue, what needs to be equal is not exactly the resamples but the folds from the indexes being used (or the bootstrap lists). If you have the same sampling strategy using the same indexes on different train models, you simply merge them using:

my_multimodel <- list(name1 = my_train1, name2 = my_train2) # keep adding named models if you have more
class(my_multimodel) <- "caretList"

If the indexes differ (or the required predictions were not stored because trainControl was mis-specified), caretEnsemble or caretStack will throw an error when you try to use them. Which is the expected behavior, obviously.

zachmayer commented 3 months ago

we now have as.caretList for lists
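With that method available, the manual `class()` assignment above should no longer be needed; something along these lines ought to work (a sketch, assuming `model_list1` and `model_list2` were trained with a shared `index` as described in this thread):

```r
library(caretEnsemble)

# Coerce a plain list of train models; as.caretList validates the models
model_list <- as.caretList(list(rpart2 = model_list1, gbm = model_list2))
ensemble <- caretEnsemble(model_list)
```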