zachmayer / caretEnsemble

caret models all the way down :turtle:
http://zachmayer.github.io/caretEnsemble/

Two errors depending on metric: auc_(actual, predicted, ranks) | evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels, #240

Closed: yogat3ch closed 6 years ago

yogat3ch commented 6 years ago

Issue

I'm attempting to classify a binary response variable with the code in the first reproducible example below. I would like to use ROC as the metric, but doing so produces the following error:

Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels, : train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()

When I leave metric out of the call (I assume it then defaults to Accuracy, since the response variable is a factor), I get a different error:

Error in auc_(actual, predicted, ranks) : Not compatible with requested type: [type=list; target=double].

While trying to create a minimal reproducible example with the UCI Breast Cancer dataset, I got an entirely different error. I've therefore opted to include the data I am actually using, since the two errors above are the higher priority to resolve. The code with the actual data is directly below, and the UCI minimal example is at the bottom of this post. All required packages are listed in the req.packages vector.

Any assistance is appreciated!

Exact, reproducible example with actual data:

set.seed(1)
req.packages <- c("doParallel","kernlab","caTools","C50","parallel","iterators","MASS","foreach","caret","tidyverse","dplyr","htmltools","magrittr")
for (q in seq_along(req.packages)) {
  suppressPackageStartupMessages(library(req.packages[q], character.only = TRUE))
}
repmis::source_data("https://github.com/yogat3ch/da5030/blob/master/matchedlevels.RData?raw=true")

# Resampling indices, then the trainControl object that uses them
data.train <- caret::createMultiFolds(matchedlevels$olddata[["Deductible"]], times = 2)
data.train <- caret::trainControl(method = "repeatedcv",
                                  index = data.train,
                                  number = 10,
                                  repeats = 1,
                                  search = "grid",
                                  allowParallel = TRUE,
                                  classProbs = TRUE,
                                  savePredictions = "all",
                                  summaryFunction = caret::twoClassSummary,
                                  returnResamp = "all")
form <- as.formula(paste0("Deductible", " ~ ."))

# Parallel backend
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
getDoParWorkers()

mod.list <- caretEnsemble::caretList(
  formula = form,
  data = matchedlevels$olddata,
  trControl = data.train,
  methodList = c("svmRadial", "LogitBoost", "adaboost", "C5.0"),
  tuneList = list(
    "svmRadial" = caretEnsemble::caretModelSpec(method = "svmRadial", tuneGrid = tuneGrids$svmRadial),
    "LogitBoost" = caretEnsemble::caretModelSpec(method = "LogitBoost", tuneGrid = tuneGrids$LogitBoost),
    "adaboost" = caretEnsemble::caretModelSpec(method = "adaboost", tuneLength = 10),
    "C5.0" = caretEnsemble::caretModelSpec(method = "C5.0", tuneGrid = tuneGrids$C5.0)
  )
)

stopCluster(cl)
registerDoSEQ()

Session Info:

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
 [1] caret_6.0-79      lattice_0.20-35   C50_0.1.1        
 [4] caTools_1.17.1    kernlab_0.9-25    klaR_0.6-14      
 [7] MASS_7.3-47       doParallel_1.0.11 iterators_1.0.9  
[10] foreach_1.4.4     magrittr_1.5      htmltools_0.3.6  
[13] forcats_0.3.0     stringr_1.3.0     dplyr_0.7.4      
[16] purrr_0.2.4       readr_1.1.1       tidyr_0.8.0      
[19] tibble_1.4.2      ggplot2_2.2.1     tidyverse_1.2.1  

loaded via a namespace (and not attached):
 [1] Cubist_0.2.1          colorspace_1.3-2      class_7.3-14         
 [4] rprojroot_1.3-2       rstudioapi_0.7.0-9000 DRR_0.0.3            
 [7] prodlim_1.6.1         mvtnorm_1.0-7         lubridate_1.7.3      
[10] xml2_1.2.0            R.methodsS3_1.7.1     codetools_0.2-15     
[13] splines_3.4.3         mnormt_1.5-5          robustbase_0.92-8    
[16] libcoin_1.0-1         knitr_1.20            RcppRoll_0.2.2       
[19] Formula_1.2-2         jsonlite_1.5          broom_0.4.3          
[22] ddalpha_1.3.1.1       R.oo_1.21.0           sfsmisc_1.1-1        
[25] shiny_1.0.5           compiler_3.4.3        httr_1.3.1           
[28] backports_1.1.2       assertthat_0.2.0      Matrix_1.2-12        
[31] lazyeval_0.2.1        cli_1.0.0             tools_3.4.3          
[34] bindrcpp_0.2          partykit_1.2-0        gtable_0.2.0         
[37] glue_1.2.0            reshape2_1.4.3        Rcpp_0.12.16         
[40] cellranger_1.1.0      nlme_3.1-131          repmis_0.5           
[43] psych_1.8.3.3         timeDate_3043.102     inum_1.0-0           
[46] gower_0.1.2           rvest_0.3.2           mime_0.5             
[49] miniUI_0.1.1          DEoptimR_1.0-8        scales_0.5.0         
[52] ipred_0.9-6           hms_0.4.1             RColorBrewer_1.1-2   
[55] curl_3.1              yaml_2.1.18           pbapply_1.3-4        
[58] gridExtra_2.3         rpart_4.1-11          stringi_1.1.7        
[61] highr_0.6             lava_1.6              bitops_1.0-6         
[64] rlang_0.2.0           pkgconfig_2.0.1       evaluate_0.10.1      
[67] bindr_0.1             recipes_0.1.2         CVST_0.2-1           
[70] tidyselect_0.2.4      plyr_1.8.4            R6_2.2.2             
[73] combinat_0.0-8        dimRed_0.1.0          pillar_1.1.0         
[76] haven_1.1.1           foreign_0.8-69        withr_2.1.1          
[79] RCurl_1.95-4.10       survival_2.41-3       nnet_7.3-12          
[82] modelr_0.1.1          crayon_1.3.4          questionr_0.6.2      
[85] rmarkdown_1.9         grid_3.4.3            readxl_1.0.0.9000    
[88] data.table_1.10.4-3   ModelMetrics_1.1.0    digest_0.6.15        
[91] R.cache_0.13.0        xtable_1.8-2          caretEnsemble_2.0.0  
[94] httpuv_1.3.5          R.utils_2.6.0         stats4_3.4.3         
[97] munsell_0.4.3  

Minimal, reproducible example with UCI data:

The code below results in the following error:

Error in names(res$trainingData) %in% as.character(form[[2]]) : argument "form" is missing, with no default

Running this code requires the libraries from the above example to be loaded.

# wdbc.data has no header row; header = FALSE keeps the first record as data
bc <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", header = FALSE)
nms <- readLines("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names")[110:126] %>%
  str_match("^\\\t?[a-z1-2]\\)\\s?(\\w+\\s?\\w+)") %>% na.omit %>% .[,2]
names(bc) <- c(nms[1:2], paste0(nms[3:12], ".mean"), paste0(nms[3:12], ".se"), paste0(nms[3:12], ".wst")) %>%
  gsub("\\s", "\\.", .)
rownames(bc) <- bc[, 1]
bc <- bc[, -1]
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
getDoParWorkers()

# Resampling indices, then the trainControl object that uses them
data.train <- caret::createMultiFolds(bc$Diagnosis, times = 2)
data.train <- caret::trainControl(method = "repeatedcv",
                                  index = data.train,
                                  number = 10,
                                  repeats = 1,
                                  search = "grid",
                                  allowParallel = TRUE,
                                  classProbs = TRUE,
                                  savePredictions = "all",
                                  returnResamp = "all",
                                  summaryFunction = caret::twoClassSummary)
f <- as.formula(paste0("Diagnosis", " ~ ."))

mod.list <- caretEnsemble::caretList(
  formula = f,
  data = bc,
  trControl = data.train,
  methodList = c("svmRadial", "LogitBoost", "adaboost", "C5.0"),
  metric = "ROC",
  tuneList = list(
    "svmRadial" = caretEnsemble::caretModelSpec(method = "svmRadial", tuneGrid = tuneGrids$svmRadial),
    "LogitBoost" = caretEnsemble::caretModelSpec(method = "LogitBoost", tuneGrid = tuneGrids$LogitBoost),
    "adaboost" = caretEnsemble::caretModelSpec(method = "adaboost", tuneLength = 10),
    "C5.0" = caretEnsemble::caretModelSpec(method = "C5.0", tuneGrid = tuneGrids$C5.0)
  )
)

stopCluster(cl)
registerDoSEQ()

yogat3ch commented 6 years ago

It looks like the formula argument is named form rather than formula (i.e. it should be form = form as the first argument of caretList). Because I passed formula = instead, the train function did not use the formula I provided and treated the first column of the data as the response variable. That column happened to be a numeric form of a date, so train attempted a regression, which explains the errors above.
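For anyone hitting the same thing, here is a minimal base-R sketch of the mechanism described above. It assumes only what this comment reports, namely that the formula argument is called form. `toy_train` is a hypothetical stand-in written for illustration, not caret's or caretEnsemble's actual code. It shows that an argument supplied as `formula =` does not match a formal named `form` and is silently collected by `...` instead.

```r
# Hypothetical stand-in for a modelling function whose formula argument is
# named `form`; this is NOT caret's implementation, only an illustration.
toy_train <- function(form, data, ...) {
  extras <- names(list(...))
  if (missing(form)) {
    # A supplied name must be a prefix of the formal name to match it.
    # "formula" is not a prefix of "form", so it lands in `...`.
    return(paste0("form is missing; swallowed by '...': ",
                  paste(extras, collapse = ", ")))
  }
  paste("response:", as.character(form[[2]]))
}

# Toy data: a numeric date-like first column and a factor response
d <- data.frame(date = 1:3, Deductible = factor(c("a", "b", "a")))

res_wrong <- toy_train(formula = Deductible ~ ., data = d)  # wrong argument name
res_right <- toy_train(form = Deductible ~ ., data = d)     # matches `form`

print(res_wrong)
print(res_right)
```

In the examples in this thread, the corresponding change would be to pass form = form (or form = f) to caretList in place of formula = form.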