ens_preds <- predict(greedy_ensemble, newdata=test) failing #170

Closed cosmos2006 closed 9 years ago

cosmos2006 commented 9 years ago

Hi, I have tried testing two datasets with the caret ensemble by following

In both the cases, I get the same error:

"Error in `[.data.frame`(out, , obsLevels, drop = FALSE) : undefined columns selected"

It is exactly the same error reported in

Please note that my test and training samples both have the same features.

Where am I going wrong?

jknowles commented 9 years ago

Is there missing data in your new dataset? Do you have new factor levels in your new data that were not observed in the training data?
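Both questions can be checked directly before calling `predict()`. The following is a minimal sketch in base R, assuming two data frames named `train` and `test`; the toy columns `x1` and `grp` are hypothetical stand-ins, not the reporter's data.

```r
# Hypothetical stand-ins for the reporter's train/test data frames.
train <- data.frame(
  x1  = c(1.2, 3.4, 5.6, 7.8),
  grp = factor(c("a", "b", "a", "b"))
)
test <- data.frame(
  x1  = c(2.1, NA, 6.5),
  grp = factor(c("a", "c", "b"))
)

# Count missing values per column in the new data.
na_counts <- colSums(is.na(test))

# Find the factor columns of the training data that also exist in the new data.
factor_cols <- intersect(
  names(train)[vapply(train, is.factor, logical(1))],
  names(test)
)

# For each of them, list levels used in the new data but absent from training.
unseen_levels <- lapply(
  setNames(factor_cols, factor_cols),
  function(col) setdiff(levels(factor(test[[col]])), levels(train[[col]]))
)

print(na_counts)
print(unseen_levels)
```

Any non-zero entry in `na_counts` or non-empty entry in `unseen_levels` would be worth ruling out first.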

zachmayer commented 9 years ago

Can you please provide a minimal reproducible example so we can debug? Thank you.

cosmos2006 commented 9 years ago

Hi Zach and Jared, thanks for your prompt reply.

Please find below the dataset as well as code I am trying to run.

The training and test files are attached.

Training file = train2.csv

Test file = test3.csv

ctrl <- trainControl(method='cv', number= 2, savePredictions=TRUE, classProbs=TRUE, summaryFunction=twoClassSummary)
model_list_big <- caretList(as.factor(signal)~., data=train, trControl= ctrl, metric='ROC',tuneList=list(rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=6)) ))
greedy_ensemble <- caretEnsemble(model_list_big)


The following models were ensembled: rf1
They were weighted:
The resulting AUC is: 0.9306
The fit for each individual model on the AUC is:
 method    metric     metricSD
    rf1 0.9306432 3.657902e-05

ens_preds <- predict(greedy_ensemble, newdata=test)

Error in `[.data.frame`(out, , obsLevels, drop = FALSE) :
  undefined columns selected

Another interesting piece of information: when I run the above snippet on R version 3.2.0 (2015-04-16, i.e.

when I execute:

model_list_big <- caretList(as.factor(signal)~., data=train, trControl= ctrl, metric='ROC',tuneList=list(rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=6)) ))

I get the following error:

Error in train.default(x, y, weights = w, ...) :
  At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to  X0, X1 . Please use factor levels that can be used as valid R variable names  (see ?make.names for help)

But I have defined signal as a factor.

Am I missing something somewhere?

PS: The test and training files are being sent via Gmail.

jknowles commented 9 years ago

I didn't get the data, but first of all you need to recode your variable signal from 0 and 1 to values that do not start with a digit. As the error message points out, the class levels must be valid R variable names, and those cannot begin with a digit. So try recoding the variable like this:

train$signal <- ifelse(train$signal == 1, "X", "Y")
test$signal <- ifelse(test$signal == 1, "X", "Y")
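
To see why wrapping a 0/1 column in `as.factor()` is not enough, the levels can be compared against `make.names()`, which the error message refers to. This is a small base R sketch; the vector `signal` and the labels "yes"/"no" are illustrative assumptions, not taken from the reporter's data.

```r
# Hypothetical outcome coded as 0/1, as in the report.
signal <- c(0, 1, 1, 0, 1)

# The levels "0" and "1" are not syntactically valid R names, so
# make.names() has to alter them (the X0 / X1 mentioned in the error).
lv <- levels(as.factor(signal))
valid_before <- all(lv == make.names(lv))            # FALSE

# Recode to labels that are valid names, then convert to a factor.
# "yes"/"no" are illustrative labels only.
signal_fixed <- factor(ifelse(signal == 1, "yes", "no"),
                       levels = c("no", "yes"))
lv_fixed <- levels(signal_fixed)
valid_after <- all(lv_fixed == make.names(lv_fixed)) # TRUE
```

The same check can be run on the outcome column before training to catch the problem early.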

cosmos2006 commented 9 years ago

ok.. by the way, the data I have sent at

Regards Mradul

zachmayer commented 9 years ago

This is a great start: you've provided an example, but you need to do 2 more things:

  1. Make it minimal. This means ripping out all code that is unnecessary to creating the bug. For example, you can take out library('doMC'), registerDoMC(cores=22), savePredictions=TRUE, and test<-read.csv("test3.csv"). (There's a lot more you can take out too).
  2. Make it reproducible. This means providing the dataset as part of the code, e.g. train <- head(iris, 20) or train <- matrix(runif(100), ncol=5). Emailing datasets doesn't count. I'm lazy and paranoid, so I'm not going to download files you email to me. I should be able to copy/paste a short block of code into a fresh R session and get the same error you are reporting.

Please read the user guide on how to create a minimal, reproducible example:
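
As a rough illustration of the two points above, a report could generate its data inline and keep only the calls under investigation. This is a sketch under assumptions: the column names `x1` and `x2`, the sample size, and the `.mtry` value are made up for illustration, and the caret calls are copied from the report and left commented out so the block runs without caret, caretEnsemble, or randomForest installed.

```r
# Synthetic data generated inline, so nothing has to be downloaded.
set.seed(42)
n <- 40
train <- data.frame(
  x1     = runif(n),
  x2     = runif(n),
  signal = rep(c(0, 1), length.out = n)  # 0/1 outcome, as in the report
)
test <- train[1:10, ]

# Calls taken from the report above; uncomment to try to reproduce the error.
# They may need further trimming for a truly minimal example.
# library(caret)
# library(caretEnsemble)
# ctrl <- trainControl(method='cv', number=2, savePredictions=TRUE,
#                      classProbs=TRUE, summaryFunction=twoClassSummary)
# model_list_big <- caretList(as.factor(signal)~., data=train, trControl=ctrl,
#   metric='ROC',
#   tuneList=list(rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=2))))
# greedy_ensemble <- caretEnsemble(model_list_big)
# ens_preds <- predict(greedy_ensemble, newdata=test)

str(train)
```

Anyone can paste this into a fresh R session, which is the property being asked for.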

Read it closely and take notes. Ask questions if you don't understand. This is one of the most important skills you have to develop as a data scientist, so it's important to take the time to learn how to do it right.

zachmayer commented 9 years ago

Also, if you put your code in between triple backticks it will show up correctly formatted. e.g.:

    train <- head(iris, 20)

Will produce:

``` R
train <- head(iris, 20)
```

cosmos2006 commented 9 years ago

Hi, the following solved the problem in one go:

train$signal <- ifelse(train$signal == 1, "X", "Y")
test$signal <- ifelse(test$signal == 1, "X", "Y")

Thanks a lot for the prompt response in solving the issue.

Zach, thanks a lot for teaching me how to submit issues in the proper format. It is highly appreciated.

zachmayer commented 9 years ago

Glad you figured it out!

cosmos2006 commented 9 years ago

It was jknowles's suggestion that worked. Thank you, guys.

