zachmayer / caretEnsemble

caret models all the way down :turtle:
Other
227 stars 75 forks source link

Multi-class classification greedy optimization #8

Open thomaskern opened 10 years ago

thomaskern commented 10 years ago

i see that branch Dev has some more progress regarding multi-class classification ensemble stacking but unfortunately it is not yet done. do you plan on implementing this and/or could you point me in the right direction so i might be able to finish it? I don't seem to understand what the problem/holdup is (no offense intended)

zachmayer commented 10 years ago

Currently jknowles is leading the charge on this. If you would like to add code, please feel free to submit pull requests on the dev branch.

jknowles commented 10 years ago

@thomaskern you can probably start by modifying the package to have a 3rd type of model type to identify -- "multiclass" or something like that. This would require modifying the checkModels... function in the utilities.R

After you do that, you just need to add methods for the "multiclass" type to the stack, ensemble, and predict functions and write some unit tests to make sure it is performing as you like.

What optimization function would you use in the multi-class classification setting?

thomaskern commented 10 years ago

@jknowles I am less worried about the R code and more worried about the optimization function ;) I have no OF in mind and was hoping for some pointers. Have you given this any thought by chance?

zachmayer commented 10 years ago

I have not. That's the reason I skipped multiclass models for now. It seemed non-trival to write a greedy selection function for them.

I think you should still be able to use another caret model for the ensembling though.

On Fri, Feb 28, 2014 at 2:56 PM, Thomas notifications@github.com wrote:

@jknowles https://github.com/jknowles I am less worried about the R code and more worried about the optimization function ;) I have no OF in mind and was hoping for some pointers. Have you given this any thought by chance?

Reply to this email directly or view it on GitHubhttps://github.com/zachmayer/caretEnsemble/issues/8#issuecomment-36388211 .

zachmayer commented 9 years ago

This is also post 1.0. Once we have a working, stable release on CRAN, we'll figure out the multi-class RMSE and AUC optimization functions.

zachmayer commented 9 years ago

We also need to add multiclass support to caretList and caretStack (caretStack might be easier to do), etc.

ajing commented 9 years ago

Is there any quick way to hack the multiclass classification with greedy optimization? I am participating a kaggle competition...

zachmayer commented 9 years ago

The quickest way would be to use caretStack with glmnet as the stacker.

In general, I find myself using caretStack more than caretList. It's a lot more flexible, and if you use method='glm' you'll get almost the exact same results as with a greedy ensemble.

Take a look at the vignette for some examples

zachmayer commented 9 years ago

@ajing The real key is the optimization function. I haven't even thought of a hack that will work here.... Once someone's written the multiclass optimization function work, we can work on multiclass greedy ensembling.

ajing commented 9 years ago

In methodList=c('rpart', 'nnet'), could I specify a new model as a variable? Because, for xgboost, caret hasn't include that yet, but they can specific a variable modelInfo for xgboost for train function. Could I do something like methodList = c(modelInfo, 'nnet')?

zachmayer commented 9 years ago

We don't support that yet, but it's definitely on the to-do list.

For now, you can fit the model separately with a call to caret::train, and then add it to the list. To quote from the vignette (please do read it!):

Finally, you should note that caretList does not support custom caret models. Fitting those models are beyond the scope of this vignette, but if you do so, you can manually add them to the model list (e.g. model_list_big[['my_custom_model']] <- my_custom_model). Just be sure to use the same re-sampling indexes in trControl as you use in the caretList models!

ajing commented 9 years ago

Sorry to ask the question again... For multiclass classification, is that possible to build the ensemble model? I got the error for caretStack with glmnet also...

stack_model <- caretStack(model_list, method='glmnet')
Error in check_caretList_model_types(list_of_models) : Not yet implemented for multiclass problems

zachmayer commented 9 years ago

Darn.  I thought I'd fixed that.

Ok, multi class for caretStack should be easier implement than for caretEnsemble.  I'll take a look.

No promises on when I can get to it!

ajing commented 8 years ago

For adding a new customized model, In the example: "Just be sure to use the same re-sampling indexes in trControl as you use in the caretList models!"

What portion I need to change in modelInfo to use the same resampling indexes?

I use the modelInfo of xgboost provided by topepo.

zachmayer commented 8 years ago

Lets say x is your other model (the one you're ensembling xgboost with):

new_train_control <- trainControl(index=x$control$index, indexOut=x$control$indexOut)

Use the new control to fit your xgboost model.

ajing commented 8 years ago

Thanks! I was thinking about still using caretEnsemble framework. Can I do something like:

my_control <- trainControl(
  method='boot',
  number=25,
  classProbs=TRUE,
  index=createResample(training$Class, 25),
  summaryFunction=twoClassSummary
)

model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  tuneList=list(
    rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=2)),
    xgb = caretModelSpec(method=modelInfo, tuneLength = 3)
  )
zachmayer commented 8 years ago

As explained in the vignette: "you should note that caretList does not support custom caret models"

You have to fit the second model with a second call to train. E.g.:

my_control <- trainControl(
  method='boot',
  number=25,
  classProbs=TRUE,
  index=createResample(training$Class, 25),
  summaryFunction=twoClassSummary
)

model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  tuneList=list(
    rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=2))
  )

new_model <- train(
  Class~., 
  data=training, 
  trControl=my_control, 
  method=modelInfo, 
  tuneLength = 3)

model_list$xgb <- new_model
ajing commented 8 years ago

Thanks for the code! Yes, it works now.

zachmayer commented 8 years ago

Good! Glad to help

ajing commented 8 years ago

Sorry to bother again.. I got the following error, because some assumption about method variable..

model_list$xgb <- new_model greedy_ensemble <- caretEnsemble(model_list) Error in paste("Model '", method, "' is not in the ", "set of existing models", : object 'method' not found

zachmayer commented 8 years ago

crap. that's a bug. If you can make a simple test case, I'll add it to the queue to fix.

ajing commented 8 years ago

This could be a test case:

training = data.frame(matrix(runif(120),nrow=20,ncol=6), Class = factor(sample(c("Yes", "No"), 20, replace=T)))
my_control <- trainControl(
  method='boot',
  number=5,
  savePredictions=TRUE,
  classProbs=TRUE,
  index=createResample(training$Class, 5),
  summaryFunction=twoClassSummary
)
model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  methodList=c('glm', 'rpart')
)
new_model <- train(Class~., data=training, trControl=my_control, method=modelInfo, metric = "ROC", tuneGrid = expand.grid(eta = 0.01,max_depth = 5, nrounds=100))
model_list$xgb <- new_model

greedy_ensemble <- caretEnsemble(model_list)

modelInfo Is

modelInfo <- list(label = "eXtreme Gradient Boosting",
                  library = c("xgboost", "plyr"),
                  type = c("Regression", "Classification"),
                  parameters = data.frame(parameter = c('nrounds', 'max_depth', 'eta'),
                                          class = rep("numeric", 3),
                                          label = c('# Boosting Iterations', 'Max Tree Depth', 
                                                    'Shrinkage')),
                  grid = function(x, y, len = NULL) expand.grid(max_depth = seq(1, len),
                                                                nrounds = floor((1:len) * 50),
                                                                eta = .3),
                  loop = function(grid) {     
                    loop <- ddply(grid, c("eta", "max_depth"),
                                  function(x) c(nrounds = max(x$nrounds)))
                    submodels <- vector(mode = "list", length = nrow(loop))
                    for(i in seq(along = loop$nrounds)) {
                      index <- which(grid$max_depth == loop$max_depth[i] & 
                                       grid$eta == loop$eta[i])
                      trees <- grid[index, "nrounds"] 
                      submodels[[i]] <- data.frame(nrounds = trees[trees != loop$nrounds[i]])
                    }    
                    list(loop = loop, submodels = submodels)
                  },
                  fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                    if(is.factor(y)) {
                      if(length(lev) == 2) {
                        y <- ifelse(y == lev[1], 1, 0) 
                        dat <- xgb.DMatrix(as.matrix(x), label = y)
                        out <- xgb.train(list(eta = param$eta, 
                                              max_depth = param$max_depth), 
                                         data = dat,
                                         nrounds = param$nrounds,
                                         objective = "binary:logistic",
                                         ...)
                      } else {
                        y <- as.numeric(y) - 1
                        dat <- xgb.DMatrix(as.matrix(x), label = y)
                        out <- xgb.train(list(eta = param$eta, 
                                              max_depth = param$max_depth), 
                                         data = dat,
                                         num_class = length(lev),
                                         nrounds = param$nrounds,
                                         objective = "multi:softprob",
                                         ...)
                      }     
                    } else {
                      dat <- xgb.DMatrix(as.matrix(x), label = y)
                      out <- xgb.train(list(eta = param$eta, 
                                            max_depth = param$max_depth), 
                                       data = dat,
                                       nrounds = param$nrounds,
                                       objective = "reg:linear",
                                       ...)
                    }
                    out

                  },
                  predict = function(modelFit, newdata, submodels = NULL) {
                    newdata <- xgb.DMatrix(as.matrix(newdata))
                    out <- predict(modelFit, newdata)
                    if(modelFit$problemType == "Classification") {
                      if(length(modelFit$obsLevels) == 2) {
                        out <- ifelse(out >= .5, 
                                      modelFit$obsLevels[1], 
                                      modelFit$obsLevels[2])
                      } else {
                        out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
                        out <- modelFit$obsLevels[apply(out, 1, which.max)]
                      }
                    }

                    if(!is.null(submodels)) {
                      tmp <- vector(mode = "list", length = nrow(submodels) + 1)
                      tmp[[1]] <- out
                      for(j in seq(along = submodels$nrounds)) {
                        tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
                        if(modelFit$problemType == "Classification") {
                          if(length(modelFit$obsLevels) == 2) {
                            tmp_pred <- ifelse(tmp_pred >= .5, 
                                               modelFit$obsLevels[1], 
                                               modelFit$obsLevels[2])
                          } else {
                            tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
                            tmp_pred <- modelFit$obsLevels[apply(tmp_pred, 1, which.max)]
                          }
                        }
                        tmp[[j+1]]  <- tmp_pred
                      }
                      out <- tmp
                    }
                    out  
                  },
                  prob = function(modelFit, newdata, submodels = NULL) {
                    newdata <- xgb.DMatrix(as.matrix(newdata))
                    out <- predict(modelFit, newdata)
                    if(length(modelFit$obsLevels) == 2) {
                      out <- cbind(out, 1 - out)
                      colnames(out) <- modelFit$obsLevels
                    } else {
                      out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
                      colnames(out) <- modelFit$obsLevels
                    }
                    out <- as.data.frame(out)

                    if(!is.null(submodels)) {
                      tmp <- vector(mode = "list", length = nrow(submodels) + 1)
                      tmp[[1]] <- out
                      for(j in seq(along = submodels$nrounds)) {
                        tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
                        if(length(modelFit$obsLevels) == 2) {
                          tmp_pred <- cbind(tmp_pred, 1 - tmp_pred)
                          colnames(tmp_pred) <- modelFit$obsLevels
                        } else {
                          tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
                          colnames(tmp_pred) <- modelFit$obsLevels
                        }
                        tmp_pred <- as.data.frame(tmp_pred)
                        tmp[[j+1]]  <- tmp_pred
                      }
                      out <- tmp
                    }
                    out  
                  },
                  predictors = function(x, ...) {
                    imp <- xgb.importance(x$xNames, model = x)
                    x$xNames[x$xNames %in% imp$Feature]
                  },
                  varImp = function(object, numTrees = NULL, ...) {
                    imp <- xgb.importance(x$xNames, model = x)
                    imp <- as.data.frame(imp)[, 1:2]
                    rownames(imp) <- as.character(imp[,1])
                    imp <- imp[,2,drop = FALSE]
                    colnames(imp) <- "Overall"
                    imp   
                  },
                  levels = function(x) x$obsLevels,
                  tags = c("Tree-Based Model", "Boosting", "Ensemble Model", "Implicit Feature Selection"),
                  sort = function(x) {
                    # This is a toss-up, but the # trees probably adds
                    # complexity faster than number of splits
                    x[order(x$nrounds, x$max_depth, x$eta),] 
                  })
ajing commented 8 years ago

Is there any recommended way to add a new caret model to existing model_list?

Like, I have a model_list, and I want to add a new model:

I can do

model9 <- train(X[train,], Y[train], method='gam', trControl=myControl)

Then,

model_list$gam = model9

Is this right?

zachmayer commented 8 years ago

Yup!

tsuresh83 commented 8 years ago

@ajing I had the same issue. I have a dirty workaround that works for me. The function 'makePredObsMatrix' in the caretEnsemble package is throwing the error. Below is the source code of the function with the initial two lines commented (these are just checks. I assume if you know what you're doing, uncommenting this is fine - perhaps @zachmayer can throw more light). I sourced the modified function and added the following lines to my script. It worked. Hope it helps.

source('/media/3TB/kag/caretEnsemble-master/R/helper_functions.R')
assignInNamespace("makePredObsMatrix",makePredObsMatrix, ns="caretEnsemble")

#######

modified the function in helper_function.R file from caretEnsemble repo
makePredObsMatrix <- function(list_of_models){

  #caretList Checks
  #check_caretList_classes(list_of_models)
  #check_caretList_model_types(list_of_models)

  #Make a list of models
  modelLibrary <- extractBestPreds(list_of_models)
  model_names <- names(modelLibrary)

  #Model library checks
  check_bestpreds_resamples(modelLibrary) #Re-write with data.table?
  check_bestpreds_indexes(modelLibrary) #Re-write with data.table?
  check_bestpreds_obs(modelLibrary) #Re-write with data.table?
  check_bestpreds_preds(modelLibrary) #Re-write with data.table?

  #Extract model type (class or reg)
  type <- extractModelTypes(list_of_models)

  #Add names column
  for(i in seq_along(modelLibrary)){
    set(modelLibrary[[i]], j="modelname", value=names(modelLibrary)[[i]])
  }

  #Remove parameter columns
  keep <- Reduce(intersect, lapply(modelLibrary, names))
  for(i in seq_along(modelLibrary)){
    rem <- setdiff(names(modelLibrary[[i]]), keep)
    if(length(rem) > 0){
      for(r in rem){
        set(modelLibrary[[i]], j=r, value=NULL)
      }
    }
  }
  modelLibrary <- rbindlist(modelLibrary, fill=TRUE)

  #For classification models that produce probs, use the probs as preds
  #Otherwise, just use class predictions
  if (type=="Classification"){
    positive <- as.character(unique(modelLibrary$obs)[2]) #IMPROVE THIS!
    pos <- as.numeric(modelLibrary[[positive]])
    good_pos_values <- which(is.finite(pos))
    set(modelLibrary, j="pred", value=as.numeric(modelLibrary[["pred"]]))
    set(modelLibrary, i=good_pos_values, j="pred", value=modelLibrary[good_pos_values,positive,with=FALSE])
  }

  #Reshape wide for meta-modeling
  modelLibrary <- data.table::dcast.data.table(
    modelLibrary,
    obs + rowIndex + Resample ~ modelname,
    value.var = "pred"
  )

  #Return
  return(list(obs=modelLibrary$obs, preds=as.matrix(modelLibrary[,model_names,with=FALSE]), type=type))
}
zachmayer commented 8 years ago

@tsuresh83 Could you make a PR on github with the new version of your function so I can see what you changed? If the fix looks good to me and @jknowles I'll merge it into master.

tsuresh83 commented 8 years ago

@zachmayer I am afraid this is not a fix. From what I've seen the actual fix is the addition of an entry to the list of models in 'models.RData' from the caret package. What I suggested is only a stop gap measure until the actual fix.

nithum commented 8 years ago

Has there been any progress or workarounds developed for using caretStack for multi-class classfication?

zachmayer commented 8 years ago

Nope, but I welcome pull requests!

At the moment we're in the middle of some code cleanup and bug fixes, which will probably be followed by a CRAN release. Then I'll figure out how to get caretList and caretStack to work with multiclass.

Getting caretEnsemble to work with multiclass will be harder.

nithum commented 8 years ago

Thanks for the quick response @zachmayer! By the way, what is your reference material for the implementation of Stacking? Are you using "Elements of Statistical Learning" or something?

zachmayer commented 8 years ago

Hmmm, I don't really have a reference I recall. It's pretty simple actually: I just use the out-of-sample predictions from caret::train(..., trControl=trainControl(savePredictions=TRUE)) to train the next model.

It's not super sophisticated, but it works.

nithum commented 8 years ago

@zachmayer - thanks for the help. I just wanted to let you know that I managed to implement caretStack for a multi-class classification for a project that I'm working on. It's a bit hacky, but it works. I'd love to contribute it to your repository but I'm new to coding and this'd be my first contribution to a project. Could I chat with you or @jknowles at some point about how to best merge it with your codebase?

zachmayer commented 8 years ago

@ajing @nithum Could one of you checkout the latest master and confirm that this bug still occurs?

ajing commented 8 years ago

When I was trying to reinstall caretEnsemble, it says:

Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : 
  there is no package called 'pbkrtest'
ERROR: lazy loading failed for package 'caretEnsemble'
* removing 'C:/Users/ajing/Documents/R/win-library/3.2/caretEnsemble'
zachmayer commented 8 years ago

Restart R (and maybe restart your computer too), then:

remove.packages('caretEnsemble')
install.packages('caretEnsemble')
ajing commented 8 years ago

I don't know why it doesn't work on windows, but I get the installation right on ubuntu system. I have the following error, but I may do it in the wrong way.

> training = data.frame(matrix(runif(120),nrow=20,ncol=6), Class = factor(sample(c("Yes", "No"), 20, replace=T)))
> my_control <- trainControl(
+   method='boot',
+   number=5,
+   savePredictions=TRUE,
+   classProbs=TRUE,
+   index=createResample(training$Class, 5),
+   summaryFunction=twoClassSummary
+ )
> model_list <- caretList(
+   Class~., data=training,
+   trControl=my_control,
+   methodList=c('glm', 'rpart')
+ )
Loading required package: rpart
Warning messages:
1: In train.default(x, y, weights = w, ...) :
  The metric "Accuracy" was not in the result set. ROC will be used instead.
2: glm.fit: algorithm did not converge 
3: glm.fit: fitted probabilities numerically 0 or 1 occurred 
4: glm.fit: algorithm did not converge 
5: glm.fit: fitted probabilities numerically 0 or 1 occurred 
6: In train.default(x, y, weights = w, ...) :
  The metric "Accuracy" was not in the result set. ROC will be used instead.
> new_model <- train(Class~., data=training, trControl=my_control, method=modelInfo, metric = "ROC", tuneGrid = expand.grid(eta = 0.01,max_depth = 5, nrounds=100))
Loading required package: xgboost
Loading required package: plyr
> model_list$xgb <- new_model
> 
> greedy_ensemble <- caretEnsemble(model_list)
Error in paste("Model '", method, "' is not in the ", "set of existing models",  : 
  object 'method' not found
zachmayer commented 8 years ago

That's a 2-class example. Here's a 3-class example, which currently fails:

library(caretEnsemble)
data(iris)
my_control <- caret::trainControl(
  method='boot',
  number=5,
  savePredictions="final",
  classProbs=TRUE,
  index=caret::createResample(iris$Species, 5)
)
model_list <- caretList(
  x=iris[,-5],
  y=iris[,5],
  trControl=my_control,
  methodList=c('glmnet', 'rpart')
)

ens <- caretStack(model_list, method='rpart')
ajing commented 8 years ago

Yes.

Error in check_caretList_model_types(list_of_models) : 
  Not yet implemented for multiclass problems
zachmayer commented 8 years ago

Also, I cleaned up your example, and it runs fine for me:

library(caretEnsemble)
library(caret)
training = data.frame(matrix(runif(120),nrow=20,ncol=6), Class = factor(sample(c("Yes", "No"), 20, replace=T)))
my_control <- trainControl(
  method='boot',
  number=5,
  savePredictions=TRUE,
  classProbs=TRUE,
  index=createResample(training$Class, 5),
  summaryFunction=twoClassSummary
)
model_list <- caretList(
  metric='ROC',
  Class~., data=training,
  trControl=my_control,
  methodList=c('glm', 'rpart')
)

Make sure you're using the latest version of caretEnsemble.

ajing commented 8 years ago

That example is for adding a new model, which is not included in caret.

zachmayer commented 8 years ago

@ajing This issue is for running caretStack with multiclass models. Custom model support in caretList is a different issue: https://github.com/zachmayer/caretEnsemble/issues/109

ajing commented 8 years ago

I began with asking multiclass, then asked something else... So, multi-class is working?

zachmayer commented 8 years ago

No, but it will be soon. Please use separate issues for separate discussions in the future.

ajing commented 8 years ago

Sure.

kprimice commented 8 years ago

Any update on multiclass models support?

zachmayer commented 8 years ago

See pending PR here: https://github.com/zachmayer/caretEnsemble/pull/191

The basic concept works, but I still need to re-write all the intermediate functions to return (and operate) on matrices in the multiclass case.

kprimice commented 8 years ago

Thank you, can't wait for the release!

grepinsight commented 8 years ago

Was looking for multiclass ensembl method, and google search led me here. Can't wait for multiclass ensemble as well!

zee86 commented 7 years ago

Hi Zack, May I ask please , is there any way around that I can use CaretEnsemble with different training dataset ?

zachmayer commented 7 years ago

@zee86 It still doesn't work with multiclass