Closed: zachmayer closed this issue 8 years ago
I have recently been digging back into caretEnsemble and hope to make some changes and improvements, particularly here. I have been thinking more about the data structure and the need for it to be consistent.
Currently, most standard R functions with a predict method, like `glm` or `lm`, return either a vector of numeric predictions, or a data.frame with values `fit`, `upr`, and `lwr` if the user requests standard errors. However, figuring out how to calculate a prediction interval is tricky, and in the caretEnsemble case it's trickier still, because the predictions may be weighted differently if we allow the user to include cases where not all models produced a prediction.
What I've been thinking is that we could mimic the behavior of `predict.lm` and return a numeric vector (or store it as a one-column data.frame) of `fit` if no `se` is requested, and return the three-column data.frame with `fit`, `upr`, and `lwr` if the se is requested. We could then attach the weights to this data.frame as an attribute, mimicking (somewhat) the behavior of some of the objects within a `merMod` in the `lme4` package.
How does this sound?
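For concreteness, here is a minimal sketch of that proposed return shape; the numbers, the 1.96 normal-approximation interval, and the model weights are all made up for illustration:

```r
# Hypothetical sketch: when se is requested, return a data.frame of
# fit/lwr/upr and attach the ensemble weights as an attribute.
fit <- c(0.2, 0.5, 0.9)
se  <- c(0.05, 0.10, 0.07)
out <- data.frame(fit = fit,
                  lwr = fit - 1.96 * se,
                  upr = fit + 1.96 * se)
attr(out, "weights") <- c(glm = 0.6, rpart = 0.4)  # assumed model weights
str(out)
attr(out, "weights")
```

Without `se`, the same function would just return `fit` as a plain numeric vector.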
I think that works. The default return is a vector, and if anything extra is requested a data.frame is returned. I like adding the weights as an attribute. That seems pretty clean to me.
@zachmayer What do you think about making `predict.caretStack` and `predict.caretEnsemble` behave the same way? I think it's pretty easy to do. I noticed that you currently need different options to get the same output from these functions. Would you agree it would make more sense to the user to have them be the same?
If so, I was thinking both should return a factor for classification models by default, and the user should be allowed to specify whether they want probabilities and whether they want standard errors, but neither should be the default.
Could you give me an example of the different options to get the same predictions?
I was trying to keep `predict.caretStack` as similar to `predict.train` as possible.
Well, what I came across today is that by default:
```r
# predicting caretStack
yhat <- predict(myStack)

# predicting caretEnsemble
yhat2 <- predict(myEnsemble)
```
For a classification model type, the first one will be a factor vector and the second one will be a numeric vector of class probabilities. I can straighten this out pretty easily by rewriting `predict.caretEnsemble` to match the default behavior of `predict.caretStack`. Just want to confirm that's a good idea.
Ahah, I see what you mean. Yes, that's a good idea. I think `predict.train`, `predict.caretList`, `predict.caretEnsemble`, and `predict.caretStack` should all have similar returns (as much as is possible).
Makes sense to me -- I'll do that all together and rewrite the unit tests accordingly.
Awesome, thank you!
@zachmayer, an update on fixing the consistency of the predict method:
As I'm working on this, I'm uncovering some flaws in the way NA values are handled by `predict.caretEnsemble` and `predict.caretList`. This relates back to the way `predict.train` handles missing values in new data passed to an object. Consider the following case:
```r
load(system.file("testdata/models.class.rda",
                 package = "caretEnsemble", mustWork = TRUE))
load(system.file("testdata/models.reg.rda",
                 package = "caretEnsemble", mustWork = TRUE))
load(system.file("testdata/X.class.rda",
                 package = "caretEnsemble", mustWork = TRUE))
load(system.file("testdata/Y.class.rda",
                 package = "caretEnsemble", mustWork = TRUE))

ens.reg <- caretEnsemble(models.reg, iter = 1000)
models.class2 <- models.class[c(2:5)]  # hack to cut out randomForest
class(models.class2) <- "caretList"
ens.class <- caretEnsemble(models.class2, iter = 1000)

newDat <- ens.class$models[[4]]$trainingData
newDat[2, 2] <- NA
newDat[3, 3] <- NA
newDat[4, 4] <- NA
newDat <- newDat[1:10, ]

p1 <- predict(ens.class, newdata = newDat)
p2 <- predict(ens.class, newdata = newDat[1, ])
```
In a normal predict method, like the one for glm, this works:
```r
gmData <- cbind(Y.class, X.class)
gmData <- gmData[, -2]
gmData <- as.data.frame(gmData)
gmData$Y.class <- factor(gmData$Y.class)
gm1 <- glm(Y.class ~ ., data = gmData, family = "binomial")
newDat <- gmData[1:10, ]
newDat[2, 4] <- NA
preds <- predict(gm1, newdata = newDat, type = "response")
```
Results in:
```
> preds
           1            2            3            4            5
1.825548e-09           NA 7.789075e-10 1.636546e-09 2.094198e-09
           6            7            8            9           10
1.226207e-08 1.620147e-09 2.470597e-09 8.010239e-10 1.607467e-09
```
Currently, we can't replicate this behavior because of the inconsistent way the methods in `train` may or may not handle NA values. This limitation should probably be either documented or mitigated with some optional pre-processing of the `newdata` object.
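One way that optional pre-processing could look, sketched with a hypothetical helper (the function name, argument name, and behavior are assumptions, not existing caretEnsemble API):

```r
# Hypothetical pre-processing of newdata before prediction: either fail
# loudly on missing values, or drop incomplete rows with a warning.
preprocess_newdata <- function(newdata, na.action = c("fail", "omit")) {
  na.action <- match.arg(na.action)
  if (anyNA(newdata)) {
    if (na.action == "fail") {
      stop("newdata contains missing values")
    }
    keep <- stats::complete.cases(newdata)
    warning("Dropping ", sum(!keep), " incomplete rows from newdata")
    newdata <- newdata[keep, , drop = FALSE]
  }
  newdata
}
```

The "fail" default keeps the current strictness; "omit" would at least make the row-dropping explicit instead of letting individual `train` methods decide.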
Hmm, on current master, I get the following:
```
> p1 <- predict(ens.class, newdata = newDat)
Error in colnames(tempUnkProb)[tempUnkPred] :
  invalid subscript type 'list'
14 extractProb(list(object), unkX = newdata, unkOnly = TRUE, ...)
13 predict.train(x, type = "prob", ...)
12 predict(x, type = "prob", ...)
11 predict(x, type = "prob", ...)
10 FUN(X[[i]], ...)
9 lapply(X, FUN, ...)
8 pblapply(X, FUN, ...)
7 pbsapply(object, function(x) {
      type <- x$modelType
      if (type == "Classification") {
          if (x$control$classProbs) { ...
6 predict.caretList(object$models, ...)
5 predict(object$models, ...)
4 predict(object$models, ...)
3 predict.caretEnsemble(ens.class, newdata = newDat)
2 predict(ens.class, newdata = newDat)
1 predict(ens.class, newdata = newDat)
```
And then:
```
> p2 <- predict(ens.class, newdata = newDat[1, ])
Error in `colnames<-`(`*tmp*`, value = c("glm", "svmRadial", "nnet", "treebag")) :
  attempt to set 'colnames' on an object with less than two dimensions
8 stop("attempt to set 'colnames' on an object with less than two dimensions")
7 `colnames<-`(`*tmp*`, value = c("glm", "svmRadial", "nnet", "treebag"))
6 predict.caretList(object$models, ...)
5 predict(object$models, ...)
4 predict(object$models, ...)
3 predict.caretEnsemble(ens.class, newdata = newDat[1, ])
2 predict(ens.class, newdata = newDat[1, ])
1 predict(ens.class, newdata = newDat[1, ])
```
But the GBM code does work. So yeah, we should add tests for this and fix it.
Also, it looks like our builds are failing on master, so I need to dig into that and fix it too.
Oh yeah, sorry I wasn't clear: the error is the problem. It is not consistent, though; `glm` handles prediction with NA values while in a `caretList`. This is why I missed the inconsistent behavior before, I only tested it with a list of different `glm` models.
I can see a few different options:

1. Document the limitation.
2. `na.omit` the newdata in `predict` and document that.
3. Wrap each model's prediction in `try` and then only use predictions from models that return a prediction (this would mean `rf` would not be used in a prediction if only one observation had missing values), with a warning.
4. Ask for this to be handled upstream in `caret`.

What do you think?
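The `try`-based approach could be sketched roughly like this for regression-style numeric predictions (the function name and the re-normalization of weights are my assumptions, not existing code):

```r
# Hypothetical sketch: wrap each model's predict in try(), keep only the
# models that successfully return predictions, re-normalize the remaining
# weights, and warn about the models that were dropped.
safe_ensemble_predict <- function(models, weights, newdata) {
  preds <- lapply(models, function(m) {
    try(predict(m, newdata = newdata), silent = TRUE)
  })
  ok <- !vapply(preds, inherits, logical(1), what = "try-error")
  if (!any(ok)) stop("No model could produce predictions for newdata")
  if (!all(ok)) {
    warning("Dropping models that failed to predict: ",
            paste(names(models)[!ok], collapse = ", "))
  }
  w <- weights[ok] / sum(weights[ok])     # re-normalize surviving weights
  Reduce(`+`, Map(`*`, preds[ok], w))     # weighted average of predictions
}
```

The open question with this design is whether silently re-weighting is better or worse for the user than just returning NA for the affected rows.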
Ok, I suspected that was the case, but wasn't sure. I don't really like option 2. Option 3 is worth investigating, and I think option 4 is the best bet, but who knows how long that will take =D
Good, I'll investigate option 3; it's my favorite too, if we can do it reasonably quickly. In the meantime, maybe we can ask @topepo if `caret` is going to take this up in the future.
Sounds good. Can you open an issue on github?
On it.
@topepo I think so, but in the example I was working with, `rf` just crashes if newdata has missing values for the predictors. I didn't check a lot of other methods, but I know this problem is not unique to `rf`. At any rate, that would be ideal: just return a vector with predictions and missing values for observations with incomplete data.
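That desired behavior, a missing component prediction propagating into an NA ensemble prediction for that row, can be shown with toy numbers (the model names and weights here are made up):

```r
# Sketch: an NA in any component model's prediction makes the weighted
# ensemble prediction NA for that observation, matching glm's behavior.
preds <- cbind(glm   = c(0.10, NA, 0.30),
               rpart = c(0.20, 0.40, 0.50))
w <- c(0.7, 0.3)
drop(preds %*% w)  # second element is NA because glm could not predict it
```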
Got it. That won't be in the next version (hopefully sent to CRAN early next week) but since this is a common issue, I'll work on it next.
This can be closed now by #167 and others.
It's really hard to test `predict.caretEnsemble`, because it sometimes returns a list, sometimes a data.frame, and sometimes the structure of the list is different, depending on the values of `keepNA`, `se`, and `return_weights`. Also, sometimes the weights have one row, and sometimes there's a weight for every row. Let's make it always return a `data.frame`.
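One way to normalize all of those shapes, sketched as a hypothetical internal helper (column names, the per-row weights convention, and the `weights` attribute are assumptions about the eventual design):

```r
# Hypothetical helper: coerce every prediction result to a data.frame with
# one row per observation, and attach weights with one row per observation.
as_ensemble_pred <- function(fit, lwr = NA_real_, upr = NA_real_,
                             weights = NULL) {
  out <- data.frame(fit = fit, lwr = lwr, upr = upr)
  if (!is.null(weights)) {
    # Recycle a single weight vector into one row of weights per observation
    if (is.null(dim(weights))) {
      weights <- matrix(weights, nrow = nrow(out),
                        ncol = length(weights), byrow = TRUE)
    }
    attr(out, "weights") <- weights
  }
  out
}
```

With one canonical shape like this, the unit tests only ever have to assert on a data.frame plus one optional attribute, regardless of `keepNA`, `se`, or `return_weights`.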