Closed cosmos2006 closed 9 years ago
Is there missing data in your new dataset? Do you have new factor levels in your new data that were not observed in the training data?
Can you please provide a minimal reproducible example so we can debug? Thank you.
Hi Zach and Jared, Thanks for your prompt reply.
Please find below the dataset as well as code I am trying to run.
The training and test files are attached. training file = train2.csv
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] nnet_7.3-8 randomForest_4.6-10 caretEnsemble_1.0.0
[4] rpart_4.1-8 pROC_1.7.3 caret_6.0-35
[7] ggplot2_1.0.0 lattice_0.20-29 doMC_1.3.3
[10] iterators_1.0.7 foreach_1.4.2
loaded via a namespace (and not attached):
[1] bitops_1.0-6 BradleyTerry2_1.0-5 brglm_0.5-9
[4] car_2.0-20 caTools_1.17 codetools_0.2-8
[7] colorspace_1.2-4 compiler_3.1.1 digest_0.6.4
[10] grid_3.1.1 gridExtra_2.0.0 gtable_0.1.2
[13] gtools_3.4.1 lme4_1.1-7 MASS_7.3-33
[16] Matrix_1.1-4 minqa_1.2.3 munsell_0.4.2
[19] nlme_3.1-117 nloptr_1.0.4 pbapply_1.1-1
[22] plyr_1.8.1 proto_0.3-10 Rcpp_0.11.2
[25] reshape2_1.4 scales_0.2.4 splines_3.1.1
[28] stringr_0.6.2
Code:
library('doMC')
registerDoMC(cores=22)
library('caret')
library('pROC')
library('rpart')
library('caretEnsemble')
library('randomForest')
train<-read.csv("train2.csv")
test<-read.csv("test3.csv")
ctrl <- trainControl(method='cv', number= 2, savePredictions=TRUE, classProbs=TRUE, summaryFunction=twoClassSummary)
model_list_big <- caretList(as.factor(signal)~., data=train, trControl= ctrl, metric='ROC',tuneList=list(rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=6)) ))
greedy_ensemble <- caretEnsemble(model_list_big)
summary(greedy_ensemble)
The following models were ensembled: rf1
They were weighted:
1
The resulting AUC is: 0.9306
The fit for each individual model on the AUC is:
method metric metricSD
rf1 0.9306432 3.657902e-05
ens_preds <- predict(greedy_ensemble, newdata=test)
Result
Error in `[.data.frame`(out, , obsLevels, drop = FALSE) :
undefined columns selected
Another interesting piece of information: when I run the above snippet on R version 3.2.0 (2015-04-16), i.e.
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora 18 (Spherical Cow)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] nnet_7.3-11 randomForest_4.6-10 caretEnsemble_1.0.0
[4] rpart_4.1-9 pROC_1.7.3 caret_6.0-52
[7] ggplot2_1.0.1 lattice_0.20-31 doMC_1.3.3
[10] iterators_1.0.7 foreach_1.4.2
loaded via a namespace (and not attached):
[1] Rcpp_0.11.2 splines_3.2.0 MASS_7.3-40
[4] munsell_0.4.2 colorspace_1.2-4 pbapply_1.1-1
[7] minqa_1.2.4 stringr_0.6.2 car_2.0-20
[10] plyr_1.8.1 caTools_1.17.1 grid_3.2.0
[13] gtable_0.1.2 nlme_3.1-120 gtools_3.5.0
[16] lme4_1.1-9 digest_0.6.8 Matrix_1.2-0
[19] gridExtra_0.9.1 nloptr_1.0.4 reshape2_1.4.1
[22] bitops_1.0-6 codetools_0.2-11 BradleyTerry2_1.0-6
[25] scales_0.3.0 stats4_3.2.0 brglm_0.5-9
[28] proto_0.3-10
When I execute:
model_list_big <- caretList(as.factor(signal)~., data=train, trControl= ctrl, metric='ROC',tuneList=list(rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=6)) ))
I get the following error:
Error in train.default(x, y, weights = w, ...) :
At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help)
But I have defined signal as a factor.
Am I missing something somewhere?
PS: The test and training files are being sent through Gmail.
I didn't get the data, but the first thing you need to do is recode your variable signal from 0 and 1 to values that do not start with a digit. As the error message points out, you cannot have variable names that begin with a digit. So try recoding the variable like this:
train$signal <- ifelse(train$signal == 1, "X", "Y")
test$signal <- ifelse(test$signal == 1, "X", "Y")
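To see why the 0/1 coding causes trouble, here is a minimal base-R sketch. It assumes, going only by the two error messages in this thread and not by the caret source, that class probabilities are held in a data frame whose columns are named after the class levels. The names `obs_levels`, `probs`, and the labels `neg`/`pos` are made up for illustration.

```r
# Hypothetical class levels, as produced by as.factor() on a 0/1 column
obs_levels <- levels(as.factor(c(0, 1, 1, 0)))   # "0" "1"

# Check each level against its make.names() version; a mismatch means
# the level is not a syntactically valid R name
obs_levels == make.names(obs_levels)

# data.frame() repairs invalid column names, so "0" and "1" get an X prefix
probs <- data.frame(`0` = c(0.8, 0.3), `1` = c(0.2, 0.7))
names(probs)

# Selecting columns by the original level names no longer finds them;
# capture the error message instead of stopping the script
msg <- tryCatch(
  probs[, obs_levels, drop = FALSE],
  error = function(e) conditionMessage(e)
)
msg

# Recoding to labels that are already valid names avoids the mismatch
signal <- factor(c(0, 1, 1, 0), levels = c(0, 1), labels = c("neg", "pos"))
all(levels(signal) == make.names(levels(signal)))
```

Any pair of labels that `make.names()` leaves unchanged should work equally well; "X" and "Y" above are one such choice.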
OK. By the way, I have sent the data to zach.mayer@gmail.com.
Regards Mradul
This is a great start: you've provided an example, but you need to do 2 more things:
1. Make it minimal: take out `library('doMC')`, `registerDoMC(cores=22)`, `savePredictions=TRUE`, and `test<-read.csv("test3.csv")`. (There's a lot more you can take out too).
2. Make it reproducible: use data that anyone can load or generate, e.g. `train <- head(iris, 20)` or `train <- matrix(runif(100), ncol=5)`.
Emailing datasets doesn't count. I'm lazy and paranoid, so I'm not going to download files you email to me. I should be able to copy/paste a short block of code into a fresh R session and get the same error you are reporting. Please read the user guide on how to create a minimal, reproducible example: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610
Read it closely and take notes. Ask questions if you don't understand. This is one of the most important skills you have to develop as a data scientist, so it's important to take the time to learn how to do it right.
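As a rough illustration of the advice above, here is a base-R sketch of a small dataset that anyone can regenerate by copy/pasting. The column names `signal`, `x1`, and `x2` and the sizes are hypothetical stand-ins for the real training file, which was never posted in the thread.

```r
# Fixed seed so every reader generates exactly the same numbers
set.seed(1)

# Small synthetic stand-in for train2.csv (hypothetical columns)
train <- data.frame(
  signal = rep(c(0, 1), each = 10),
  x1 = runif(20),
  x2 = rnorm(20)
)

# dput() prints R code that rebuilds the object, ready to paste into an issue
dput(head(train, 3))

# Version details belong next to the code, as in the original report
sessionInfo()
```

The modelling calls that trigger the error would then follow this block, so the whole report runs in a fresh session without any attached files.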
Also, if you put your code in between triple backticks it will show up correctly formatted. e.g.:
```{R}
train <- head(iris, 20)
```
Will produce:
``` R
train <- head(iris, 20)
```
Hi, the following solved the problem in one go:
train$signal <- ifelse(train$signal == 1, "X", "Y")
test$signal <- ifelse(test$signal == 1, "X", "Y")
Thanks a lot for the prompt response in solving the issue.
Zach, Thanks a lot for your teachings about submitting issues in a proper format. It is highly appreciated.
Glad you figured it out!
It was jknowles's suggestion that worked. Thank you, guys.
Hi, I have tried testing two datasets with caretEnsemble by following https://cran.r-project.org/web/packages/caretEnsemble/vignettes/caretEnsemble-intro.html.
In both cases, I get the same error:
"Error in `[.data.frame`(out, , obsLevels, drop = FALSE) : undefined columns selected"
It is exactly the same error as reported in http://stackoverflow.com/questions/30522709/how-to-predict-on-a-new-dataset-using-caretensemble-package-in-r.
Please note that my test and training samples both have the same features.
Where am I going wrong?