zachmayer / caretEnsemble

caret models all the way down :turtle:

predict.caretEnsemble returns the same type of data structure, no matter the call #158

Closed zachmayer closed 8 years ago

zachmayer commented 8 years ago

predict.caretEnsemble is really hard to test, because it sometimes returns a list and sometimes a data.frame, and the structure of the list itself varies with the values of keepNA, se, and return_weights. The weights are also inconsistent: sometimes they have one row, and sometimes there's a weight for every row.

Let's make it always return a data.frame.
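
Something like this invariant, regardless of the flags (a sketch of the goal with placeholder ens / dat objects, not current behavior):

p1 <- predict(ens, newdata = dat)
p2 <- predict(ens, newdata = dat, se = TRUE, return_weights = TRUE, keepNA = FALSE)
stopifnot(is.data.frame(p1), is.data.frame(p2))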

jknowles commented 8 years ago

I have recently been digging back into caretEnsemble and hope to make some changes and improvements -- particularly here. I have been thinking more about the data structure and the need for it to be consistent.

Currently, most standard R predict methods, such as those for glm or lm, return either a numeric vector of predictions or a data.frame with fit, upr, and lwr values if the user requests standard errors. However, figuring out how to calculate a prediction interval is tricky, and it's trickier still in the caretEnsemble case because the predictions may be weighted differently if we allow the user to include cases where not all models produced a prediction.

What I've been thinking is that we could mimic the behavior of predict.lm: return a numeric vector of fit (or store it as a one-column data.frame) if no se is requested, and return the three-column data.frame with fit, upr, and lwr if the se is requested. We could then attach the weights to this data.frame as an attribute, mimicking (somewhat) the behavior of some of the objects within a merMod in the lme4 package.
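
Roughly, the contract would look like this (a sketch; ens / dat and the exact argument names are placeholders):

# se = FALSE (default): a plain numeric vector, as predict.lm returns
yhat <- predict(ens, newdata = dat)

# se = TRUE: a three-column data.frame with fit, lwr, upr
yhat <- predict(ens, newdata = dat, se = TRUE)

# model weights ride along as an attribute instead of extra rows or columns
attr(yhat, "weights")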

How does this sound?

zachmayer commented 8 years ago

I think that works. The default return is a vector, and if anything extra is requested a data.frame is returned. I like adding the weights as an attribute. That seems pretty clean to me.

jknowles commented 8 years ago

@zachmayer What do you think about making predict.caretStack and predict.caretEnsemble behave the same way? I think it's pretty easy to do. I noticed that you currently need different options to get the same output from these two functions. Would you agree it would make more sense to the user to have them be the same?

If so -- I was thinking both should return a factor by default for classification models, and the user should be allowed to request probabilities and standard errors, but neither should be the default.
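
For instance, the proposed defaults could look like this (argument names are illustrative only, not a final API):

predict(stack, newdata = dat)                  # factor of predicted classes (default)
predict(stack, newdata = dat, type = "prob")   # data.frame of class probabilities, opt-in
predict(stack, newdata = dat, se = TRUE)       # standard errors, also opt-in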

zachmayer commented 8 years ago

Could you give me an example of the different options to get the same predictions?

I was trying to keep predict.caretStack as similar to predict.train as possible.

jknowles commented 8 years ago

Well, what I came across today is that by default:

# predicting caretStack
yhat <- predict(myStack)
# predicting caretEnsemble
yhat2 <- predict(myEnsemble)

For a classification model, the first one will be a factor vector of predicted classes, and the second will be a numeric vector of class probabilities. I can straighten this out pretty easily by rewriting predict.caretEnsemble to match the default behavior of predict.caretStack. Just want to confirm that's a good idea.
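
Concretely, the mismatch looks something like this (hypothetical output for a two-class problem):

str(yhat)   # Factor w/ 2 levels "No","Yes": 1 2 2 1 ...   (caretStack)
str(yhat2)  # num [1:150] 0.12 0.87 0.91 0.08 ...          (caretEnsemble)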

zachmayer commented 8 years ago

Ahah, I see what you mean. Yes, that's a good idea. I think predict.train, predict.caretList, predict.caretEnsemble, and predict.caretStack should all have similar returns (as much as is possible).

jknowles commented 8 years ago

Makes sense to me -- I'll do that all together and rewrite the unit tests accordingly.

zachmayer commented 8 years ago

Awesome, thank you!

jknowles commented 8 years ago

@zachmayer -- an update on fixing the consistency of the predict method:

As I'm working on this, I'm uncovering some flaws in the way NA values are handled by predict.caretEnsemble and predict.caretList. This traces back to the way predict.train handles missing values in the new data passed to a fitted object. Consider the following case:

load(system.file("testdata/models.class.rda",
                 package = "caretEnsemble", mustWork = TRUE))
load(system.file("testdata/models.reg.rda",
                 package = "caretEnsemble", mustWork = TRUE))
load(system.file("testdata/X.class.rda",
                 package = "caretEnsemble", mustWork = TRUE))
load(system.file("testdata/Y.class.rda",
                 package = "caretEnsemble", mustWork = TRUE))
ens.reg <- caretEnsemble(models.reg, iter = 1000)
models.class2 <- models.class[c(2:5)]  # hack to cut out randomForest
class(models.class2) <- "caretList"
ens.class <- caretEnsemble(models.class2, iter = 1000)
newDat <- ens.class$models[[4]]$trainingData
newDat[2, 2] <- NA
newDat[3, 3] <- NA
newDat[4, 4] <- NA
newDat <- newDat[1:10, ]
p1 <- predict(ens.class, newdata = newDat)
p2 <- predict(ens.class, newdata = newDat[1, ])

With a normal predict method, like glm's, this works:

gmData <- cbind(Y.class, X.class)
gmData <- gmData[, -2]
gmData <- as.data.frame(gmData)
gmData$Y.class <- factor(gmData$Y.class)
gm1 <- glm(Y.class ~ ., data = gmData, family = "binomial")
newDat <- gmData[1:10, ]
newDat[2, 4] <- NA
preds <- predict(gm1, newdata = newDat, type = "response")

Results in:

preds
           1            2            3            4            5 
1.825548e-09           NA 7.789075e-10 1.636546e-09 2.094198e-09 
           6            7            8            9           10 
1.226207e-08 1.620147e-09 2.470597e-09 8.010239e-10 1.607467e-09 

Currently, we can't replicate this behavior because of the inconsistent way the methods wrapped by train may or may not handle NA values. This limitation should probably either be documented or be mitigated with some optional pre-processing of the newdata object.
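
If we document rather than fully fix it, that pre-processing could be as simple as warning and dropping incomplete rows before dispatching to the member models; a minimal sketch, not current behavior:

ok <- complete.cases(newdata)
if (!all(ok)) {
  warning(sum(!ok), " rows of newdata contain NAs and were dropped")
  newdata <- newdata[ok, , drop = FALSE]
}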

zachmayer commented 8 years ago

Hmm, on current master, I get the following:

> p1 <- predict(ens.class, newdata = newDat)
 Error in colnames(tempUnkProb)[tempUnkPred] : 
  invalid subscript type 'list' 
14 extractProb(list(object), unkX = newdata, unkOnly = TRUE, ...) 
13 predict.train(x, type = "prob", ...) 
12 predict(x, type = "prob", ...) 
11 predict(x, type = "prob", ...) 
10 FUN(X[[i]], ...) 
9 lapply(X, FUN, ...) 
8 pblapply(X, FUN, ...) 
7 pbsapply(object, function(x) {
    type <- x$modelType
    if (type == "Classification") {
        if (x$control$classProbs) { ... 
6 predict.caretList(object$models, ...) 
5 predict(object$models, ...) 
4 predict(object$models, ...) 
3 predict.caretEnsemble(ens.class, newdata = newDat) 
2 predict(ens.class, newdata = newDat) 
1 predict(ens.class, newdata = newDat) 

And then:

> p2 <- predict(ens.class, newdata = newDat[1, ])
 Error in `colnames<-`(`*tmp*`, value = c("glm", "svmRadial", "nnet", "treebag" : 
  attempt to set 'colnames' on an object with less than two dimensions 
8 stop("attempt to set 'colnames' on an object with less than two dimensions") 
7 `colnames<-`(`*tmp*`, value = c("glm", "svmRadial", "nnet", "treebag")) 
6 predict.caretList(object$models, ...) 
5 predict(object$models, ...) 
4 predict(object$models, ...) 
3 predict.caretEnsemble(ens.class, newdata = newDat[1, ]) 
2 predict(ens.class, newdata = newDat[1, ]) 
1 predict(ens.class, newdata = newDat[1, ]) 

But the GBM code does work. So yeah, we should add tests for this and fix it.

Also, it looks like our builds are failing on master, so I need to dig in and fix that too.

jknowles commented 8 years ago

Oh yeah, sorry I wasn't clear -- the error is the problem. It's not consistent, though: glm handles prediction with NA values while in a caretList. That's why I missed the inconsistent behavior before; I had only tested with a list of different glm models.

I can see a few different options:

  1. We na.omit newdata in predict and document that.
  2. We wrap the individual model predictions in try and use only predictions from models that return one, with a warning (this would mean rf contributes nothing to a prediction if even one observation has missing values).
  3. We predict row-wise for each model using try and put an NA in for each observation that fails -- see the sketch below.
  4. We wait for a solution to be proposed and accepted in caret.
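
A minimal sketch of option 3, assuming a single fitted model and a newdata data.frame (the real version would loop over every model in the caretList):

# Predict each row under try(); failures become NA.
row_preds <- vapply(seq_len(nrow(newdata)), function(i) {
  p <- try(predict(model, newdata = newdata[i, , drop = FALSE]), silent = TRUE)
  if (inherits(p, "try-error")) NA_real_ else as.numeric(p)
}, numeric(1))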

What do you think?

zachmayer commented 8 years ago

Ok, I suspected that was the case, but wasn't sure. I don't really like option 2. Option 3 is worth investigating, and I think option 4 is the best bet, but who knows how long that will take =D

jknowles commented 8 years ago

Good, I'll investigate option 3 -- it's my favorite too if we can do it reasonably quickly. In the meantime, maybe we can ask @topepo if caret is going to take this up in the future.

zachmayer commented 8 years ago

Sounds good. Can you open an issue on github?

jknowles commented 8 years ago

On it.

topepo commented 8 years ago

Does this cover it?

jknowles commented 8 years ago

@topepo I think so, but in the example I was working with, rf just crashes if newdata has missing values for the predictors. I didn't check many other methods, but I know this problem is not unique to rf. At any rate, that would be ideal: just return a vector of predictions, with missing values for observations with incomplete data.
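
In other words, the hoped-for behavior is roughly this padding logic, pushed down into caret itself (a sketch under that assumption, for a regression model):

# Predict on complete rows only, pad the rest with NA.
ok  <- complete.cases(newDat)
out <- rep(NA_real_, nrow(newDat))
out[ok] <- predict(model, newdata = newDat[ok, , drop = FALSE])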

topepo commented 8 years ago

Got it. That won't be in the next version (hopefully sent to CRAN early next week) but since this is a common issue, I'll work on it next.

jknowles commented 8 years ago

This can be closed now by #167 and others.