zachmayer / caretEnsemble

caret models all the way down :turtle:
http://zachmayer.github.io/caretEnsemble/

ens_preds <- predict(greedy_ensemble, newdata=test) failing #170

Closed cosmos2006 closed 9 years ago

cosmos2006 commented 9 years ago

Hi, I have tried testing two datasets with caretEnsemble by following https://cran.r-project.org/web/packages/caretEnsemble/vignettes/caretEnsemble-intro.html.

In both cases, I get the same error:

"Error in [.data.frame(out, , obsLevels, drop = FALSE) : undefined columns selected"

It is exactly the same error as reported in http://stackoverflow.com/questions/30522709/how-to-predict-on-a-new-dataset-using-caretensemble-package-in-r.

Please note that my test and training samples both have the same features.

Where am I going wrong?

jknowles commented 9 years ago

Is there missing data in your new dataset? Do you have new factor levels in your new data that were not observed in the training data?
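Both questions can be checked with a few lines of base R. The sketch below uses small made-up data frames (`train`, `test`, and the columns `x` and `grp` are hypothetical stand-ins, not the reporter's data): it counts missing values per column in the new data and lists factor levels that appear only in the new data.

```r
# Hypothetical toy data; replace with your own train/test data frames.
train <- data.frame(x = c(1, 2, 3), grp = factor(c("a", "b", "a")))
test  <- data.frame(x = c(1, NA, 3), grp = factor(c("a", "c", "b")))

# 1. Missing values per column in the new data.
na_counts <- colSums(is.na(test))
print(na_counts)

# 2. Factor levels present in the new data but never seen in training.
factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
new_levels <- lapply(factor_cols, function(col) {
  setdiff(levels(test[[col]]), levels(train[[col]]))
})
names(new_levels) <- factor_cols
print(new_levels)
```

Any non-zero count in `na_counts` or non-empty entry in `new_levels` is worth ruling out before looking further.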

zachmayer commented 9 years ago

Can you please provide a minimal reproducible example so we can debug? Thank you.

cosmos2006 commented 9 years ago

Hi Zach and Jared, Thanks for your prompt reply.

Please find below the dataset as well as code I am trying to run.

The training and test files are attached.

Training file = train2.csv

Test file = test3.csv

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
 [9] LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods 
[8] base    

other attached packages:
 [1] nnet_7.3-8          randomForest_4.6-10 caretEnsemble_1.0.0
 [4] rpart_4.1-8         pROC_1.7.3          caret_6.0-35      
 [7] ggplot2_1.0.0       lattice_0.20-29     doMC_1.3.3        
[10] iterators_1.0.7     foreach_1.4.2     

loaded via a namespace (and not attached):
 [1] bitops_1.0-6        BradleyTerry2_1.0-5 brglm_0.5-9       
 [4] car_2.0-20          caTools_1.17        codetools_0.2-8   
 [7] colorspace_1.2-4    compiler_3.1.1      digest_0.6.4      
[10] grid_3.1.1          gridExtra_2.0.0     gtable_0.1.2      
[13] gtools_3.4.1        lme4_1.1-7          MASS_7.3-33       
[16] Matrix_1.1-4        minqa_1.2.3         munsell_0.4.2     
[19] nlme_3.1-117        nloptr_1.0.4        pbapply_1.1-1     
[22] plyr_1.8.1          proto_0.3-10        Rcpp_0.11.2       
[25] reshape2_1.4        scales_0.2.4        splines_3.1.1     
[28] stringr_0.6.2     

Code:

library('doMC')
registerDoMC(cores=22)
library('caret')
library('pROC')
library('rpart')
library('caretEnsemble')
library('randomForest')

train<-read.csv("train2.csv")
test<-read.csv("test3.csv")

ctrl <- trainControl(method='cv', number= 2, savePredictions=TRUE, classProbs=TRUE, summaryFunction=twoClassSummary)
model_list_big <- caretList(as.factor(signal)~., data=train, trControl= ctrl, metric='ROC',tuneList=list(rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=6)) ))
greedy_ensemble <- caretEnsemble(model_list_big)
summary(greedy_ensemble)

Results

The following models were ensembled: rf1
They were weighted:
1
The resulting AUC is: 0.9306
The fit for each individual model on the AUC is:
 method    metric     metricSD
    rf1 0.9306432 3.657902e-05

ens_preds <- predict(greedy_ensemble, newdata=test)

Result
Error in `[.data.frame`(out, , obsLevels, drop = FALSE) :
  undefined columns selected

Another interesting piece of information: when I run the above snippet on R version 3.2.0 (2015-04-16), i.e.

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora 18 (Spherical Cow)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
 [9] LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods 
[8] base    

other attached packages:
 [1] nnet_7.3-11         randomForest_4.6-10 caretEnsemble_1.0.0
 [4] rpart_4.1-9         pROC_1.7.3          caret_6.0-52      
 [7] ggplot2_1.0.1       lattice_0.20-31     doMC_1.3.3        
[10] iterators_1.0.7     foreach_1.4.2     

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.2         splines_3.2.0       MASS_7.3-40       
 [4] munsell_0.4.2       colorspace_1.2-4    pbapply_1.1-1     
 [7] minqa_1.2.4         stringr_0.6.2       car_2.0-20        
[10] plyr_1.8.1          caTools_1.17.1      grid_3.2.0        
[13] gtable_0.1.2        nlme_3.1-120        gtools_3.5.0      
[16] lme4_1.1-9          digest_0.6.8        Matrix_1.2-0      
[19] gridExtra_0.9.1     nloptr_1.0.4        reshape2_1.4.1    
[22] bitops_1.0-6        codetools_0.2-11    BradleyTerry2_1.0-6
[25] scales_0.3.0        stats4_3.2.0        brglm_0.5-9       
[28] proto_0.3-10      

When I execute:

model_list_big <- caretList(as.factor(signal)~., data=train, trControl= ctrl, metric='ROC',tuneList=list(rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=6)) ))

I get the following error:

Error in train.default(x, y, weights = w, ...) :
  At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to  X0, X1 . Please use factor levels that can be used as valid R variable names  (see ?make.names for help)

But I have defined signal as a factor.

Am I missing something somewhere?

PS: The test and training files are being sent through Gmail.

jknowles commented 9 years ago

I didn't get the data, but the first thing you need to do is recode your variable signal so that its values 0 and 1 become values that do not start with a digit. As the error message points out, class levels must be valid R variable names, and those cannot begin with a digit. So try recoding the variable like this:

train$signal <- ifelse(train$signal == 1, "X", "Y")
test$signal <- ifelse(test$signal == 1, "X", "Y")
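
To see why levels such as 0 and 1 trip the check, `make.names()` shows what R turns them into. This is a minimal sketch on a made-up vector; the labels `bkg` and `sig` are hypothetical and only illustrate an alternative recoding with `factor()`.

```r
# Made-up 0/1 outcome standing in for the signal column.
signal <- c(0, 1, 1, 0)

raw_levels <- levels(as.factor(signal))            # "0" "1"
is_valid   <- raw_levels == make.names(raw_levels) # TRUE only for valid names
print(make.names(raw_levels))                      # "X0" "X1"
print(is_valid)

# Recode with labels that are already valid R names (hypothetical labels).
signal_f <- factor(signal, levels = c(0, 1), labels = c("bkg", "sig"))
print(levels(signal_f))
```

Whatever labels are chosen, the same recoding has to be applied to both the training and the test data.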

cosmos2006 commented 9 years ago

OK. By the way, I have sent the data to zach.mayer@gmail.com.

Regards Mradul

zachmayer commented 9 years ago

This is a great start: you've provided an example, but you need to do 2 more things:

  1. Make it minimal. This means ripping out all code that is unnecessary to creating the bug. For example, you can take out library('doMC'), registerDoMC(cores=22), savePredictions=TRUE, and test<-read.csv("test3.csv"). (There's a lot more you can take out too).
  2. Make it reproducible. This means providing the dataset as part of the code, e.g. train <- head(iris, 20) or train <- matrix(runif(100), ncol=5). Emailing datasets doesn't count. I'm lazy and paranoid, so I'm not going to download files you email to me. I should be able to copy/paste a short block of code into a fresh R session and get the same error you are reporting.

Please read the user guide on how to create a minimal, reproducible example: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610

Read it closely and take notes. Ask questions if you don't understand. This is one of the most important skills you have to develop as a data scientist, so it's important to take the time to learn how to do it right.
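
As an illustration of the copy/paste standard described above, here is a small self-contained sketch that needs no files and no packages beyond base R. It does not call caret or caretEnsemble. Instead it mimics one plausible mechanism for the reported error, which is an assumption and is not confirmed in this thread: a data frame of class probabilities whose column names were sanitized to X0 and X1, then indexed by the original levels "0" and "1". The names `obs_levels`, `probs`, and `out` are made up for the example.

```r
# Hypothetical stand-in for class-probability output; no caret needed.
obs_levels <- c("0", "1")  # levels as produced by as.factor() on a 0/1 column

# Made-up probabilities, one column per class level.
probs <- matrix(c(0.2, 0.8,
                  0.7, 0.3),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, obs_levels))

# data.frame() defaults to check.names = TRUE, which applies make.names().
out <- data.frame(probs)
print(names(out))  # "X0" "X1"

# Selecting columns by the original levels now fails; capture the message.
msg <- tryCatch({
  out[, obs_levels, drop = FALSE]
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)
```

A block like this can be pasted into a fresh R session as-is, which is the property a bug report should have.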

zachmayer commented 9 years ago

Also, if you put your code in between triple backticks it will show up correctly formatted. e.g.:

    ```{R}
    train <- head(iris, 20)
    ```

Will produce:

``` R
train <- head(iris, 20)
```

cosmos2006 commented 9 years ago

Hi, the following solved the problem in one go:

train$signal <- ifelse(train$signal == 1, "X", "Y")
test$signal <- ifelse(test$signal == 1, "X", "Y")

Thanks a lot for the prompt response in solving the issue.

Zach, Thanks a lot for your teachings about submitting issues in a proper format. It is highly appreciated.

zachmayer commented 9 years ago

Glad you figured it out!

cosmos2006 commented 9 years ago

It was jknowles's suggestion that worked. Thank you, guys.
