Open thomaskern opened 10 years ago
Currently jknowles is leading the charge on this. If you would like to add code, please feel free to submit pull requests on the dev branch.
@thomaskern you can probably start by modifying the package to have a 3rd type of model type to identify -- "multiclass" or something like that. This would require modifying the checkModels...
function in the utilities.R
After you do that, you just need to add methods for the "multiclass" type to the stack
, ensemble
, and predict
functions and write some unit tests to make sure it is performing as you like.
What optimization function would you use in the multi-class classification setting?
@jknowles I am less worried about the R code and more worried about the optimization function ;) I have no OF in mind and was hoping for some pointers. Have you given this any thought by chance?
I have not. That's the reason I skipped multiclass models for now. It seemed non-trival to write a greedy selection function for them.
I think you should still be able to use another caret model for the ensembling though.
On Fri, Feb 28, 2014 at 2:56 PM, Thomas notifications@github.com wrote:
@jknowles https://github.com/jknowles I am less worried about the R code and more worried about the optimization function ;) I have no OF in mind and was hoping for some pointers. Have you given this any thought by chance?
Reply to this email directly or view it on GitHubhttps://github.com/zachmayer/caretEnsemble/issues/8#issuecomment-36388211 .
This is also post 1.0. Once we have a working, stable release on CRAN, we'll figure out the multi-class RMSE and AUC optimization functions.
We also need to add multiclass support to caretList and caretStack (caretStack might be easier to do), etc.
Is there any quick way to hack the multiclass classification with greedy optimization? I am participating a kaggle competition...
The quickest way would be to use caretStack
with glmnet
as the stacker.
In general, I find myself using caretStack
more than caretList
. It's a lot more flexible, and if you use method='glm'
you'll get almost the exact same results as with a greedy ensemble.
@ajing The real key is the optimization function. I haven't even thought of a hack that will work here.... Once someone's written the multiclass optimization function work, we can work on multiclass greedy ensembling.
In methodList=c('rpart', 'nnet'), could I specify a new model as a variable? Because, for xgboost, caret hasn't include that yet, but they can specific a variable modelInfo for xgboost for train function. Could I do something like methodList = c(modelInfo, 'nnet')?
We don't support that yet, but it's definitely on the to-do list.
For now, you can fit the model separately with a call to caret::train
, and then add it to the list. To quote from the vignette (please do read it!):
Finally, you should note that caretList does not support custom caret models. Fitting those models are beyond the scope of this vignette, but if you do so, you can manually add them to the model list (e.g.
model_list_big[['my_custom_model']] <- my_custom_model)
. Just be sure to use the same re-sampling indexes in trControl as you use in the caretList models!
Sorry to ask the question again... For multiclass classification, is that possible to build the ensemble model? I got the error for caretStack with glmnet also...
stack_model <- caretStack(model_list, method='glmnet')
Error in check_caretList_model_types(list_of_models) : Not yet implemented for multiclass problems
Darn. I thought I'd fixed that.
Ok, multi class for caretStack should be easier implement than for caretEnsemble. I'll take a look.
No promises on when I can get to it!
For adding a new customized model, In the example: "Just be sure to use the same re-sampling indexes in trControl as you use in the caretList models!"
What portion I need to change in modelInfo to use the same resampling indexes?
I use the modelInfo of xgboost provided by topepo.
Lets say x is your other model (the one you're ensembling xgboost with):
new_train_control <- trainControl(index=x$control$index, indexOut=x$control$indexOut)
Use the new control to fit your xgboost model.
Thanks! I was thinking about still using caretEnsemble framework. Can I do something like:
my_control <- trainControl(
method='boot',
number=25,
classProbs=TRUE,
index=createResample(training$Class, 25),
summaryFunction=twoClassSummary
)
model_list <- caretList(
Class~., data=training,
trControl=my_control,
tuneList=list(
rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=2)),
xgb = caretModelSpec(method=modelInfo, tuneLength = 3)
)
As explained in the vignette: "you should note that caretList does not support custom caret models"
You have to fit the second model with a second call to train. E.g.:
my_control <- trainControl(
method='boot',
number=25,
classProbs=TRUE,
index=createResample(training$Class, 25),
summaryFunction=twoClassSummary
)
model_list <- caretList(
Class~., data=training,
trControl=my_control,
tuneList=list(
rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=2))
)
new_model <- train(
Class~.,
data=training,
trControl=my_control,
method=modelInfo,
tuneLength = 3)
model_list$xgb <- new_model
Thanks for the code! Yes, it works now.
Good! Glad to help
Sorry to bother again.. I got the following error, because some assumption about method variable..
model_list$xgb <- new_model greedy_ensemble <- caretEnsemble(model_list) Error in paste("Model '", method, "' is not in the ", "set of existing models", : object 'method' not found
crap. that's a bug. If you can make a simple test case, I'll add it to the queue to fix.
This could be a test case:
training = data.frame(matrix(runif(120),nrow=20,ncol=6), Class = factor(sample(c("Yes", "No"), 20, replace=T)))
my_control <- trainControl(
method='boot',
number=5,
savePredictions=TRUE,
classProbs=TRUE,
index=createResample(training$Class, 5),
summaryFunction=twoClassSummary
)
model_list <- caretList(
Class~., data=training,
trControl=my_control,
methodList=c('glm', 'rpart')
)
new_model <- train(Class~., data=training, trControl=my_control, method=modelInfo, metric = "ROC", tuneGrid = expand.grid(eta = 0.01,max_depth = 5, nrounds=100))
model_list$xgb <- new_model
greedy_ensemble <- caretEnsemble(model_list)
modelInfo Is
modelInfo <- list(label = "eXtreme Gradient Boosting",
library = c("xgboost", "plyr"),
type = c("Regression", "Classification"),
parameters = data.frame(parameter = c('nrounds', 'max_depth', 'eta'),
class = rep("numeric", 3),
label = c('# Boosting Iterations', 'Max Tree Depth',
'Shrinkage')),
grid = function(x, y, len = NULL) expand.grid(max_depth = seq(1, len),
nrounds = floor((1:len) * 50),
eta = .3),
loop = function(grid) {
loop <- ddply(grid, c("eta", "max_depth"),
function(x) c(nrounds = max(x$nrounds)))
submodels <- vector(mode = "list", length = nrow(loop))
for(i in seq(along = loop$nrounds)) {
index <- which(grid$max_depth == loop$max_depth[i] &
grid$eta == loop$eta[i])
trees <- grid[index, "nrounds"]
submodels[[i]] <- data.frame(nrounds = trees[trees != loop$nrounds[i]])
}
list(loop = loop, submodels = submodels)
},
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
if(is.factor(y)) {
if(length(lev) == 2) {
y <- ifelse(y == lev[1], 1, 0)
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(eta = param$eta,
max_depth = param$max_depth),
data = dat,
nrounds = param$nrounds,
objective = "binary:logistic",
...)
} else {
y <- as.numeric(y) - 1
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(eta = param$eta,
max_depth = param$max_depth),
data = dat,
num_class = length(lev),
nrounds = param$nrounds,
objective = "multi:softprob",
...)
}
} else {
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(eta = param$eta,
max_depth = param$max_depth),
data = dat,
nrounds = param$nrounds,
objective = "reg:linear",
...)
}
out
},
predict = function(modelFit, newdata, submodels = NULL) {
newdata <- xgb.DMatrix(as.matrix(newdata))
out <- predict(modelFit, newdata)
if(modelFit$problemType == "Classification") {
if(length(modelFit$obsLevels) == 2) {
out <- ifelse(out >= .5,
modelFit$obsLevels[1],
modelFit$obsLevels[2])
} else {
out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
out <- modelFit$obsLevels[apply(out, 1, which.max)]
}
}
if(!is.null(submodels)) {
tmp <- vector(mode = "list", length = nrow(submodels) + 1)
tmp[[1]] <- out
for(j in seq(along = submodels$nrounds)) {
tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
if(modelFit$problemType == "Classification") {
if(length(modelFit$obsLevels) == 2) {
tmp_pred <- ifelse(tmp_pred >= .5,
modelFit$obsLevels[1],
modelFit$obsLevels[2])
} else {
tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
tmp_pred <- modelFit$obsLevels[apply(tmp_pred, 1, which.max)]
}
}
tmp[[j+1]] <- tmp_pred
}
out <- tmp
}
out
},
prob = function(modelFit, newdata, submodels = NULL) {
newdata <- xgb.DMatrix(as.matrix(newdata))
out <- predict(modelFit, newdata)
if(length(modelFit$obsLevels) == 2) {
out <- cbind(out, 1 - out)
colnames(out) <- modelFit$obsLevels
} else {
out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
colnames(out) <- modelFit$obsLevels
}
out <- as.data.frame(out)
if(!is.null(submodels)) {
tmp <- vector(mode = "list", length = nrow(submodels) + 1)
tmp[[1]] <- out
for(j in seq(along = submodels$nrounds)) {
tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
if(length(modelFit$obsLevels) == 2) {
tmp_pred <- cbind(tmp_pred, 1 - tmp_pred)
colnames(tmp_pred) <- modelFit$obsLevels
} else {
tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
colnames(tmp_pred) <- modelFit$obsLevels
}
tmp_pred <- as.data.frame(tmp_pred)
tmp[[j+1]] <- tmp_pred
}
out <- tmp
}
out
},
predictors = function(x, ...) {
imp <- xgb.importance(x$xNames, model = x)
x$xNames[x$xNames %in% imp$Feature]
},
varImp = function(object, numTrees = NULL, ...) {
imp <- xgb.importance(x$xNames, model = x)
imp <- as.data.frame(imp)[, 1:2]
rownames(imp) <- as.character(imp[,1])
imp <- imp[,2,drop = FALSE]
colnames(imp) <- "Overall"
imp
},
levels = function(x) x$obsLevels,
tags = c("Tree-Based Model", "Boosting", "Ensemble Model", "Implicit Feature Selection"),
sort = function(x) {
# This is a toss-up, but the # trees probably adds
# complexity faster than number of splits
x[order(x$nrounds, x$max_depth, x$eta),]
})
Is there any recommended way to add a new caret model to existing model_list?
Like, I have a model_list, and I want to add a new model:
I can do
model9 <- train(X[train,], Y[train], method='gam', trControl=myControl)
Then,
model_list$gam = model9
Is this right?
Yup!
@ajing I had the same issue. I have a dirty workaround that works for me. The function 'makePredObsMatrix' in the caretEnsemble package is throwing the error. Below is the source code of the function with the initial two lines commented (these are just checks. I assume if you know what you're doing, uncommenting this is fine - perhaps @zachmayer can throw more light). I sourced the modified function and added the following lines to my script. It worked. Hope it helps.
source('/media/3TB/kag/caretEnsemble-master/R/helper_functions.R')
assignInNamespace("makePredObsMatrix",makePredObsMatrix, ns="caretEnsemble")
#######
makePredObsMatrix <- function(list_of_models){
#caretList Checks
#check_caretList_classes(list_of_models)
#check_caretList_model_types(list_of_models)
#Make a list of models
modelLibrary <- extractBestPreds(list_of_models)
model_names <- names(modelLibrary)
#Model library checks
check_bestpreds_resamples(modelLibrary) #Re-write with data.table?
check_bestpreds_indexes(modelLibrary) #Re-write with data.table?
check_bestpreds_obs(modelLibrary) #Re-write with data.table?
check_bestpreds_preds(modelLibrary) #Re-write with data.table?
#Extract model type (class or reg)
type <- extractModelTypes(list_of_models)
#Add names column
for(i in seq_along(modelLibrary)){
set(modelLibrary[[i]], j="modelname", value=names(modelLibrary)[[i]])
}
#Remove parameter columns
keep <- Reduce(intersect, lapply(modelLibrary, names))
for(i in seq_along(modelLibrary)){
rem <- setdiff(names(modelLibrary[[i]]), keep)
if(length(rem) > 0){
for(r in rem){
set(modelLibrary[[i]], j=r, value=NULL)
}
}
}
modelLibrary <- rbindlist(modelLibrary, fill=TRUE)
#For classification models that produce probs, use the probs as preds
#Otherwise, just use class predictions
if (type=="Classification"){
positive <- as.character(unique(modelLibrary$obs)[2]) #IMPROVE THIS!
pos <- as.numeric(modelLibrary[[positive]])
good_pos_values <- which(is.finite(pos))
set(modelLibrary, j="pred", value=as.numeric(modelLibrary[["pred"]]))
set(modelLibrary, i=good_pos_values, j="pred", value=modelLibrary[good_pos_values,positive,with=FALSE])
}
#Reshape wide for meta-modeling
modelLibrary <- data.table::dcast.data.table(
modelLibrary,
obs + rowIndex + Resample ~ modelname,
value.var = "pred"
)
#Return
return(list(obs=modelLibrary$obs, preds=as.matrix(modelLibrary[,model_names,with=FALSE]), type=type))
}
@tsuresh83 Could you make a PR on github with the new version of your function so I can see what you changed? If the fix looks good to me and @jknowles I'll merge it into master.
@zachmayer I am afraid this is not a fix. From what I've seen the actual fix is the addition of an entry to the list of models in 'models.RData' from the caret package. What I suggested is only a stop gap measure until the actual fix.
Has there been any progress or workarounds developed for using caretStack for multi-class classfication?
Nope, but I welcome pull requests!
At the moment we're in the middle of some code cleanup and bug fixes, which will probably be followed by a CRAN release. Then I'll figure out how to get caretList and caretStack to work with multiclass.
Getting caretEnsemble to work with multiclass will be harder.
Thanks for the quick response @zachmayer! By the way, what is your reference material for the implementation of Stacking? Are you using "Elements of Statistical Learning" or something?
Hmmm, I don't really have a reference I recall. It's pretty simple actually: I just use the out-of-sample predictions from caret::train(..., trControl=trainControl(savePredictions=TRUE))
to train the next model.
It's not super sophisticated, but it works.
@zachmayer - thanks for the help. I just wanted to let you know that I managed to implement caretStack for a multi-class classification for a project that I'm working on. It's a bit hacky, but it works. I'd love to contribute it to your repository but I'm new to coding and this'd be my first contribution to a project. Could I chat with you or @jknowles at some point about how to best merge it with your codebase?
@ajing @nithum Could one of you checkout the latest master and confirm that this bug still occurs?
When I was trying to reinstall caretEnsemble, it says:
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
there is no package called 'pbkrtest'
ERROR: lazy loading failed for package 'caretEnsemble'
* removing 'C:/Users/ajing/Documents/R/win-library/3.2/caretEnsemble'
Restart R (and maybe restart your computer too), then:
remove.packages('caretEnsemble')
install.packages('caretEnsemble')
I don't know why it doesn't work on windows, but I get the installation right on ubuntu system. I have the following error, but I may do it in the wrong way.
> training = data.frame(matrix(runif(120),nrow=20,ncol=6), Class = factor(sample(c("Yes", "No"), 20, replace=T)))
> my_control <- trainControl(
+ method='boot',
+ number=5,
+ savePredictions=TRUE,
+ classProbs=TRUE,
+ index=createResample(training$Class, 5),
+ summaryFunction=twoClassSummary
+ )
> model_list <- caretList(
+ Class~., data=training,
+ trControl=my_control,
+ methodList=c('glm', 'rpart')
+ )
Loading required package: rpart
Warning messages:
1: In train.default(x, y, weights = w, ...) :
The metric "Accuracy" was not in the result set. ROC will be used instead.
2: glm.fit: algorithm did not converge
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: algorithm did not converge
5: glm.fit: fitted probabilities numerically 0 or 1 occurred
6: In train.default(x, y, weights = w, ...) :
The metric "Accuracy" was not in the result set. ROC will be used instead.
> new_model <- train(Class~., data=training, trControl=my_control, method=modelInfo, metric = "ROC", tuneGrid = expand.grid(eta = 0.01,max_depth = 5, nrounds=100))
Loading required package: xgboost
Loading required package: plyr
> model_list$xgb <- new_model
>
> greedy_ensemble <- caretEnsemble(model_list)
Error in paste("Model '", method, "' is not in the ", "set of existing models", :
object 'method' not found
That's a 2-class example. Here's a 3-class example, which currently fails:
library(caretEnsemble)
data(iris)
my_control <- caret::trainControl(
method='boot',
number=5,
savePredictions="final",
classProbs=TRUE,
index=caret::createResample(iris$Species, 5)
)
model_list <- caretList(
x=iris[,-5],
y=iris[,5],
trControl=my_control,
methodList=c('glmnet', 'rpart')
)
ens <- caretStack(model_list, method='rpart')
Yes.
Error in check_caretList_model_types(list_of_models) :
Not yet implemented for multiclass problems
Also, I cleaned up your example, and it runs fine for me:
library(caretEnsemble)
library(caret)
training = data.frame(matrix(runif(120),nrow=20,ncol=6), Class = factor(sample(c("Yes", "No"), 20, replace=T)))
my_control <- trainControl(
method='boot',
number=5,
savePredictions=TRUE,
classProbs=TRUE,
index=createResample(training$Class, 5),
summaryFunction=twoClassSummary
)
model_list <- caretList(
metric='ROC',
Class~., data=training,
trControl=my_control,
methodList=c('glm', 'rpart')
)
Make sure you're using the latest version of caretEnsemble.
That example is for adding a new model, which is not included in caret.
@ajing This issue is for running caretStack
with multiclass models. Custom model support in caretList is a different issue: https://github.com/zachmayer/caretEnsemble/issues/109
I began with asking multiclass, then asked something else... So, multi-class is working?
No, but it will be soon. Please use separate issues for separate discussions in the future.
Sure.
Any update on multiclass models support?
See pending PR here: https://github.com/zachmayer/caretEnsemble/pull/191
The basic concept works, but I still need to re-write all the intermediate functions to return (and operate) on matrices in the multiclass case.
Thank you, can't wait for the release!
Was looking for multiclass ensembl method, and google search led me here. Can't wait for multiclass ensemble as well!
Hi Zack, May I ask please , is there any way around that I can use CaretEnsemble with different training dataset ?
@zee86 It still doesn't work with multiclass
i see that branch Dev has some more progress regarding multi-class classification ensemble stacking but unfortunately it is not yet done. do you plan on implementing this and/or could you point me in the right direction so i might be able to finish it? I don't seem to understand what the problem/holdup is (no offense intended)