zmjones / edarf

exploratory data analysis using random forests
MIT License

Possible bug: Edarf functions give error when using a ranger object with categorical variables #56

Closed cjvanlissa closed 7 years ago

cjvanlissa commented 7 years ago

Dear Mr. Jones,

Thank you for developing edarf; it seems very useful but I am getting errors with both variable_importance and partial_dependence when using a model with categorical variables. It is possible that the problem has to do with the fact that several of the levels of these categorical variables are dropped from the model, because they do not have any observations in the training data. For variable_importance the error is:

variable_importance(ranger_model, vars=c("paper", "genus"), data=training_data)
Error in variable_importance(ranger_model, vars = c("paper", "genus"), :
  Assertion on 'y' failed: Must have length 1, but has length 3.

For partial_dependence, the error is:

Warning messages:
1: In names(mp)[ncol(mp)] = target :
  number of items to replace is not a multiple of replacement length
2: In names(mp)[ncol(mp)] = target :
  number of items to replace is not a multiple of replacement length

The graphical functions do not work in the presence of these errors. Is there anything I can do to prevent them?

Sincerely, Caspar

cjvanlissa commented 7 years ago

Figured out that the first error is a result of this piece of code in variable_importance:

"y" = names(data)[!names(data) %in% fit$forest$independent.variable.names],

It expects the dataframe to contain only the outcome and predictor variables. In my case, the dataframe also contains a vector of case weights, which leads to this error. As a temporary fix, I replaced it to get the y-variable from the formula in the original function call:

"y" = strsplit(strsplit(as.character(tree_model$call), "formula")[[2]], " ~")[[1]][[1]],

EDIT: Both variable_importance and partial_dependence work as normal if I pass them a data.frame consisting of only the variables mentioned in the ranger formula. I have circumvented the errors I got before by separating the vector of case weights from the data.frame.
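
For anyone hitting the same assertion, here is a minimal base-R sketch of why the extra column matters and of the two workarounds described above. It does not call ranger or edarf: `dat`, `model_formula`, and `predictors` are hypothetical stand-ins (the last one for `fit$forest$independent.variable.names`), so treat it as an illustration under those assumptions rather than the package's actual code path.

```r
# Toy data: outcome y, two predictors, plus a case-weight column w
# that is not part of the model (all names are hypothetical).
dat <- data.frame(
  y  = c(1.2, 3.4, 2.2, 4.1),
  x1 = c(0, 1, 0, 1),
  x2 = factor(c("a", "b", "a", "b")),
  w  = c(1, 2, 1, 2)
)
model_formula <- y ~ x1 + x2    # stand-in for the formula given to ranger()
predictors <- c("x1", "x2")     # stand-in for fit$forest$independent.variable.names

# Exclusion approach: every non-predictor column is treated as the outcome,
# so the weight column slips in and the result has length 2 instead of 1.
y_by_exclusion <- names(dat)[!names(dat) %in% predictors]

# Formula approach: the outcome is the left-hand side of the formula.
y_by_formula <- all.vars(model_formula[[2]])

# Workaround from the EDIT: keep only the columns named in the formula.
model_data <- dat[, all.vars(model_formula), drop = FALSE]

print(y_by_exclusion)
print(y_by_formula)
print(names(model_data))
```

Note that `all.vars()` does not expand a `.` on the right-hand side; a formula such as `Fertility ~ .` would first need to be expanded, for example with `terms(model_formula, data = dat)`.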

zmjones commented 7 years ago

thanks for the report. at the very least i need to make the error messages more informative and improve the documentation.

what do you think the default behavior for ranger should be?

i am thinking that i should maybe have, for both variable importance and partial dependence, an argument x and/or y (y is not necessary for partial dependence) which contains only the covariates or the outcome, along with better docs and error messages.

one of the problems is that some of the packages edarf supports store all of this internally, and i can just extract it from the model, while others do not. perhaps i should just ignore these cases and make the user always input the data.

cjvanlissa commented 7 years ago

Dear Zach, thank you for your swift response!

> what do you think the default behavior for ranger should be?

I would suggest that, at least as a default, variable_importance should take the y and x variables from the formula in the ranger object. This way, you can pass variable_importance the same dataframe you pass to the ranger function.

I am having a third issue: although partial_dependence is working now, I still get errors using plot_pd. I traced the errors to this line:

if (is.character(dat$value)) { dat$value = as.numeric(dat$value)

This returns a vector of NAs. If you substitute the following line, it plots partial dependence plots for categorical variables:

dat$value = as.numeric(as.factor(dat$value))
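
To make the difference between the two conversions concrete, here is a small base-R sketch. The `value` vector is made-up illustration data, not output from partial_dependence.

```r
# Hypothetical category labels, as they might appear in a character column.
value <- c("low", "med", "high", "med")

# Direct coercion: labels that are not numbers become NA (with a warning,
# suppressed here so the sketch runs quietly).
direct <- suppressWarnings(as.numeric(value))

# Going through a factor first yields integer codes instead of NA.
# Levels of a character vector are sorted alphabetically: high = 1, low = 2, med = 3.
codes <- as.numeric(as.factor(value))

print(direct)
print(codes)
```

One caveat with this substitution: the codes follow the alphabetical order of the labels, so positions on the plot axis may not reflect any meaningful ordering of the categories.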

Sincerely, Caspar

zmjones commented 7 years ago

Yea I need to rewrite some of the plotting functions I think. I haven't maintained them very well. I'll try to get to that this week. If you can come up with a small reproducible example and open another issue for it that would be helpful.

I'll implement the formula extraction bits you suggested today, and improve the docs a bit as well. I probably won't have time to improve the error handling until next week though.

Just to be sure, you are using the GitHub version of edarf and mmpf? If not could you try using the latest versions of both?

cjvanlissa commented 7 years ago

Dear Zach, I tried installing using devtools and am using the following versions:

packageVersion("mmpf")
[1] ‘0.0.3’
packageVersion("edarf")
[1] ‘1.1.0’

Regarding plot_pd, the error I'm getting when trying to plot categorical variables is:

plot_pd(partial_dependence(tree_model, vars=c("units", "xtrt"), data=treedata))
Warning messages:
1: attributes are not identical across measure variables; they will be dropped
2: In plot_pd(partial_dependence(tree_model, vars = c("units", "xtrt"), :
  NAs introduced by coercion
3: Removed 13 rows containing missing values (geom_point).

I tried to make a reproducible example, but that leads to a different error:

require(edarf)
require(ranger)
data(swiss)
# Bin Education into three groups at its 33% and 66% quantiles
cuts <- quantile(swiss$Education, c(.33, .66))
swiss$Education <- as.factor(ifelse(swiss$Education < min(cuts), 1,
                              ifelse(swiss$Education < max(cuts), 2, 3)))
levels(swiss$Education) <- c("low", "med", "high")
rangermodel <- ranger(Fertility ~ ., swiss, importance = "permutation")
# Note: swiss[, -1] drops the outcome (Fertility is the first column)
pd = partial_dependence(rangermodel, "Education", data = swiss[, -1])

I get:

Error in names(pd)[ncol(pd)] = target : replacement has length zero

Not sure why this gives a different error than my own data.

zmjones commented 7 years ago

the error was basically the same, just being caught in a different place.
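
A rough base-R sketch of how the two messages can share one cause, assuming `target` is derived the same way as `y` in variable_importance (every column of `data` that is not a predictor). The helpers `target_of` and `rename_last` are hypothetical and only mimic the `names(pd)[ncol(pd)] = target` assignment quoted in the messages above; this is not edarf's actual code.

```r
predictors <- c("x1", "x2")

# Hypothetical helper: columns of `data` that are not predictors.
target_of <- function(data) names(data)[!names(data) %in% predictors]

# One data frame with an extra weight column, one with the outcome dropped.
full <- data.frame(y = 1:2, x1 = 3:4, x2 = 5:6, w = c(1, 1))
no_outcome <- full[, c("x1", "x2")]

# Hypothetical helper mimicking the assignment seen in the messages.
rename_last <- function(target) {
  out <- data.frame(a = 1:2, b = 3:4)
  names(out)[ncol(out)] <- target
  out
}

# Two candidate targets: R warns that the replacement length does not fit.
too_many <- tryCatch(rename_last(target_of(full)),
                     warning = function(w) conditionMessage(w))

# Zero candidate targets: R stops with an error instead.
none <- tryCatch(rename_last(target_of(no_outcome)),
                 error = function(e) conditionMessage(e))

print(too_many)
print(none)
```

In both cases the underlying problem is that the set of non-predictor columns does not have length exactly one; only the place where R complains differs.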

i'll try to come up with a way that makes this more sensible. i'm going to close this for now. if you want to open an issue for the plotting stuff please do, but otherwise i'll get to it when i can, hopefully relatively soon.

thanks again for the report, hope the rest of it works well.