zmjones / edarf

exploratory data analysis using random forests
MIT License
68 stars, 11 forks

potential bug in pd inner #5

Closed: flinder closed this issue 10 years ago

flinder commented 10 years ago

I'm not 100% sure, but I think that in

pred <- predict(fit, newdata = df, outcome = "test")$predicted.oob

the outcome argument should be "train". The relevant part of the documentation is a bit cryptic:

"If outcome="test", the predictor is calculated by using y-outcomes from the test data (outcome information must be present). In this case, the terminal nodes from the grow-forest are recalculated using the y-outcomes from the test set. This yields a modified predictor in which the topology of the forest is based solely on the training data, but where the predicted value is based on the test data. Error rates and VIMP are calculated by bootstrapping the test data and using out-of-bagging to ensure unbiased estimates. See the examples for illustration"

But I don't think we want to recalculate the terminal nodes of the original forest; we just want to drop the test data down the forest and get predictions. I noticed this because I got different and counterintuitive results with outcome = "test", but matching and intuitive results with outcome = "train".

I changed it in the package code to do some tests.
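To make the suggestion concrete, here is a minimal sketch of what the pd inner step would look like with that change, for a regression forest. The helper name pd_point and the variables df and var are hypothetical illustrations, not edarf internals:

```r
library(randomForestSRC)

# Hypothetical helper: one point of the partial dependence curve.
# Overwrite the target column with a single value, drop the data down
# the unmodified grow-forest (outcome = "train", as suggested above),
# and average the OOB predictions.
pd_point <- function(fit, df, var, value) {
  df[[var]] <- value
  pred <- predict(fit, newdata = df, outcome = "train")$predicted.oob
  mean(pred)
}

# Trace out the curve by evaluating at each unique value of var.
vals <- sort(unique(df[[var]]))
curve <- sapply(vals, function(v) pd_point(fit, df, var, v))
```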

zmjones commented 10 years ago

Yeah, I've thought about it a bit now and I think you're right that we should not use outcome = "test". However, it's unclear to me from the docs whether predict(fit, newdata = df, outcome = "train") is the same as predict(fit, newdata = df). For one, predicted.oob is not available with the latter option, which suggests to me that they are dropping the unmodified training data down the trees. Does the pd actually work in that case? It should be easy to detect, since you wouldn't be setting the data frame to its unique values.
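One way to probe that question directly is to compare the two calls on the same data. This is a sketch using iris as a stand-in dataset; whether the predictions actually coincide is exactly the open question, per the docs quoted above:

```r
library(randomForestSRC)

# Fit on a stand-in dataset.
fit <- rfsrc(Species ~ ., data = iris)

# Plain prediction: drops newdata down the grown forest.
p_plain <- predict(fit, newdata = iris)

# outcome = "train": terminal nodes keep the training y-outcomes;
# predicted.oob is available here but not in the plain call.
p_train <- predict(fit, newdata = iris, outcome = "train")

# If these match, both calls drop the data down the same
# unmodified forest.
all.equal(p_plain$predicted, p_train$predicted)
```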

zmjones commented 10 years ago

wow my commits got super messed up there for a minute.