zachmayer / caretEnsemble

caret models all the way down :turtle:
http://zachmayer.github.io/caretEnsemble/

Error when predicting on testset after caretstack #236

Closed stijnjas closed 6 years ago

stijnjas commented 6 years ago

Minimal, reproducible example:

Text and example code modified from the R FAQ on stackoverflow

Minimal dataset:

set.seed(1)
dat <- caret::twoClassSim(100)
X <- dat[,1:5]
y <- dat[["Class"]]

If you have some data that would be too difficult to construct using caret::twoClassSim or caret::SLC14_1, then you can always make a subset of your original data, using e.g. head(), subset() or the indices. Then use e.g. dput() to give us something that can be put in R immediately, e.g. dput(head(iris,4))

If you must use dput(head()), please first remove any columns from your dataset that are not necessary to reproduce the error.

If your data frame has a factor with many levels, the dput output can be unwieldy because it will still list all the possible factor levels even if they aren't present in the subset of your data. To solve this issue, you can use the droplevels() function. Notice below how Species is a factor with only one level: dput(droplevels(head(iris, 4)))
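
For instance, applied to iris (the stand-in dataset the template already uses), the whole recipe is a single call; for your own data, first subset to the columns needed to reproduce the error (the column names in the commented line are placeholders):

dput(droplevels(head(iris, 4)))                                   # prints a paste-able structure() call
# dput(droplevels(head(my_data[, c("outcome", "x1", "x2")], 4)))  # hypothetical names for your own data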

Minimal, runnable code:

library(caret)          # for trainControl()
library(caretEnsemble)
models <- caretList(
  X, y,
  methodList=c('glm', 'rpart'),
  trControl=trainControl(
    method="cv",
    number=5,
    classProbs=TRUE,
    savePredictions="final")
)
ens <- caretStack(models)
print(ens)
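
The reported error occurs at the predict step, which the snippet above stops just short of. A minimal sketch of that step, simulating a held-out set the same way as the training data (test_dat is an illustrative name, not from the original report):

set.seed(2)
test_dat <- caret::twoClassSim(50)                # hypothetical held-out set
preds <- predict(ens, newdata = test_dat[, 1:5])  # predict with the stacked ensemble
head(preds)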

Session Info:

sessionInfo()

R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] parallel  splines   grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] plyr_1.8.4          gbm_2.1.3           survival_2.40-1     randomForest_4.6-12 DMwR_0.4.1          kernlab_0.9-25     
 [7] caretEnsemble_2.0.0 topicmodels_0.2-6   caret_6.0-77        ggplot2_2.2.1       lattice_0.20-34     e1071_1.6-8        
[13] tm_0.7-1            NLP_0.1-11         

loaded via a namespace (and not attached):
 [1] ddalpha_1.3.1       sfsmisc_1.1-1       foreach_1.4.3       prodlim_1.6.1       gtools_3.5.0        assertthat_0.2.0   
 [7] TTR_0.23-1          stats4_3.3.3        DRR_0.0.2           robustbase_0.92-7   slam_0.1-40         ipred_0.9-6        
[13] glue_1.1.1          digest_0.6.12       colorspace_1.3-2    recipes_0.1.0       Matrix_1.2-8        timeDate_3012.100  
[19] pkgconfig_2.0.1     CVST_0.2-1          purrr_0.2.3         scales_0.5.0        gdata_2.18.0        gower_0.1.2        
[25] lava_1.5.1          tibble_1.3.4        withr_2.0.0         ROCR_1.0-7          pbapply_1.3-3       nnet_7.3-12        
[31] lazyeval_0.2.0      quantmod_0.4-11     magrittr_1.5        nlme_3.1-131        SnowballC_0.5.1     MASS_7.3-45        
[37] gplots_3.0.1        xts_0.9-7           dimRed_0.1.0        class_7.3-14        tools_3.3.3         data.table_1.10.4-2
[43] stringr_1.2.0       munsell_0.4.3       bindrcpp_0.2        compiler_3.3.3      RcppRoll_0.2.2      caTools_1.17.1     
[49] rlang_0.1.2.9000    iterators_1.0.8     bitops_1.0-6        gtable_0.2.0        ModelMetrics_1.1.0  codetools_0.2-15   
[55] abind_1.4-5         reshape2_1.4.2      R6_2.2.2            gridExtra_2.3       zoo_1.8-0           lubridate_1.6.0    
[61] dplyr_0.7.4         bindr_0.1           KernSmooth_2.23-15  modeltools_0.2-21   stringi_1.1.5       Rcpp_0.12.13       
[67] rpart_4.1-10        DEoptimR_1.0-8  

#############################################

Original Post

#############################################

Dear Zach, please find below a reproducible example of a prediction error I am getting after having created a stacked ensemble. The error was not there a week ago, so I have no idea what went wrong.


Example

Load packages

library(tm)
library(e1071)
library(caret)
library(topicmodels)
library(caretEnsemble)

abstracts_final <- read.csv("use_as_train.csv")

------------------------------------------------------------

Prepare data to be used as input to classification methods

------------------------------------------------------------

Create the corpus that will be used

review_corpus = VCorpus(VectorSource(abstracts_final$Abstract))

review_corpus = tm_map(review_corpus, content_transformer(tolower))                        # no uppercase
review_corpus = tm_map(review_corpus, removeNumbers)                                       # no numerical values
review_corpus = tm_map(review_corpus, removePunctuation)                                   # remove punctuation
review_corpus = tm_map(review_corpus, removeWords, c("the", "and", stopwords("english")))  # remove stopwords
review_corpus = tm_map(review_corpus, stripWhitespace)                                     # remove whitespace
review_corpus = tm_map(review_corpus, stemDocument, language = "english")                  # stemming (bring back to common base word)

Term - Document matrices

1. just using the frequencies of the words

review_dtm <- DocumentTermMatrix(review_corpus)
review_dtm = removeSparseTerms(review_dtm, 0.98)  # To reduce the dimension of the DTM, remove the less frequent terms so that the sparsity stays below the threshold

use_dtm<-review_dtm

abstracts_final$Abstract = NULL
names(abstracts_final)
abstracts_final = cbind(abstracts_final, as.matrix(use_dtm))  # select the desired term-document matrix
abstracts_final$Indicator <- as.factor(abstracts_final$Indicator)
abstracts_final <- abstracts_final[, c('Indicator', attr(as.matrix(use_dtm), "dimnames")$Terms)]
names(abstracts_final) = make.names(names(abstracts_final))
levels(abstracts_final$Indicator) <- c("irrelevant", "relevant")

Select training and test set

set.seed(1988)
splitIndex <- createDataPartition(abstracts_final$Indicator, p = .10, list = FALSE, times = 1)
trainSplit <- abstracts_final[ splitIndex, ]
testSplit  <- abstracts_final[-splitIndex, ]

###########################################

Fitting the different models

###########################################

Support Vector Machines

Linear

svm_grid_Linear<-expand.grid( C = c(0.00001,0.0001,0.001,0.01,1,10))

set.seed(5627)
svm_Linear_orig <- train(
  Indicator ~ ., data = trainSplit,
  method = "svmLinear", metric = "Kappa",
  tuneGrid = svm_grid_Linear,
  trControl = trainControl(
    method = "cv", number = 3, sampling = NULL,
    savePredictions = "final", classProbs = TRUE,
    index = createResample(trainSplit$Indicator, 3)))

set.seed(5627)
svm_Linear_smote <- train(
  Indicator ~ ., data = trainSplit,
  method = "svmLinear", metric = "Kappa",
  tuneGrid = svm_grid_Linear,
  trControl = trainControl(
    method = "cv", number = 3, sampling = "smote",
    savePredictions = "final", classProbs = TRUE,
    index = createResample(trainSplit$Indicator, 3)))

fitted_models <- list(
  SVM_Linear_Original = svm_Linear_orig,
  SVM_Linear_Smote    = svm_Linear_smote
)

class(fitted_models) <- "caretList"

stackControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE, classProbs = TRUE, sampling = "smote")
set.seed(1988)
garbage <- capture.output(stack.rf <- caretStack(fitted_models, method = "gbm", metric = "Kappa", trControl = stackControl))
predict(stack.rf, testSplit)

Thanks in advance, Stijn

zachmayer commented 6 years ago

You seem to have deleted the issue template when posting your issue. I re-added it. I put your session info output in the correct part of the issue template. Please reformat the rest of your question to fit the template, in particular:

  1. A minimal reproducible example. You posted a lot of code. Try to reduce it to 4 or 5 lines.
  2. A minimal dataset. Do you get the same error when you fit your models to the iris dataset? (A sketch follows at the end of this comment.)

Both of these things are explained in more detail in the template.
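
For point 2, a sketch of what that check might look like, assuming caretEnsemble 2.x and a two-class subset of iris (glm needs a binary outcome); names such as iris2 and models_iris are illustrative:

library(caret)
library(caretEnsemble)

# Two-class subset so glm can be used as a binary classifier
iris2 <- droplevels(subset(iris, Species != "setosa"))

models_iris <- caretList(
  Species ~ ., data = iris2,
  methodList = c("glm", "rpart"),
  trControl = trainControl(
    method = "cv",
    number = 5,
    classProbs = TRUE,
    savePredictions = "final")
)
ens_iris <- caretStack(models_iris, method = "glm")
predict(ens_iris, newdata = head(iris2))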

zachmayer commented 6 years ago

Be sure to update the checklist too!

stijnjas commented 6 years ago

Thanks Zach, I have reinstalled the package and removed the old versions, and it works again. I am sorry for the inconvenience. Stijn

zachmayer commented 6 years ago

No worries! That's why I have the checklist 😄