zmjones / edarf

exploratory data analysis using random forests
MIT License

Fixed Unnecessary duplication when using factors #44

Closed reuning closed 8 years ago

reuning commented 8 years ago

This checks for factor variables and adjusts the cutoffs so that it doesn't create duplicate predictions, which is especially useful when checking interactions. Example: in the previous version, if you generated PD for a continuous variable Z=1:5 and a factor X=(Yes, No) with a cutoff of 5, it would create 5^2 = 25 predictions, because it tried to use 5 levels for X even though there were only 2 proper levels. This version will automatically create only 10 predictions in such a case: Z=1:5 when X=Yes, and Z=1:5 when X=No.

It isn't exactly pretty. But it does work with the cases I tried.
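To make the counts above concrete, here is a minimal base R sketch. The helper `grid_points` and the toy data are hypothetical; this is not edarf's `.ivar_points`, only an illustration of capping a factor at its number of levels. The "naive" grid imitates the effect described (a two-level factor stretched to 5 points) and is an assumption about how the duplication arises.

```r
# Hypothetical helper: choose up to `cutoff` evaluation points for one feature.
grid_points <- function(x, cutoff) {
  if (is.factor(x)) {
    # A factor can contribute at most nlevels(x) distinct points.
    head(levels(x), cutoff)
  } else {
    # Evenly spaced points over the observed range, without duplicates.
    unique(seq(min(x), max(x), length.out = cutoff))
  }
}

Z <- 1:5
X <- factor(c("Yes", "No", "Yes", "No", "Yes"))

# Naive grid: 5 points requested for both features, so X's 2 levels are recycled.
naive <- expand.grid(Z = grid_points(Z, 5), X = rep(levels(X), length.out = 5))

# Factor-aware grid: X contributes only its 2 levels.
aware <- expand.grid(Z = grid_points(Z, 5), X = grid_points(X, 5))

nrow(naive)  # 25
nrow(aware)  # 10
```

The factor-aware grid avoids the redundant rows up front instead of predicting on them.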

zmjones commented 8 years ago

Good catch. I think I solved this in 4ce5429561289021b192835be79fcbab95d9a1c4 by just changing one line in .ivar_points. Let me know if that solves the problem. Then I'll close this.

reuning commented 8 years ago

I think it still has a problem, specifically on line 197: rng <- as.data.frame(rng)

as.data.frame does not accept rng when its list elements have different lengths. I am not sure what the simplest way to fix this is.
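A small illustration of this failure mode, using made-up values rather than the package's code: as.data.frame() raises an error on a list whose element lengths cannot be recycled to a common length, while expand.grid() accepts the same list and builds every combination. Whether expand.grid() is the right replacement at line 197 is an assumption here, not something confirmed in this thread.

```r
# Made-up stand-in for rng: 5 points for a numeric feature, 2 levels for a factor.
rng <- list(x = 1:5, z = c("a", "b"))

# Lengths 5 and 2 cannot be recycled to a common length, so this errors.
res <- tryCatch(as.data.frame(rng), error = function(e) e)
inherits(res, "error")

# expand.grid() takes the same list and returns the Cartesian product.
grid <- expand.grid(rng, stringsAsFactors = TRUE)
dim(grid)
```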

I fixed it and uploaded it to my fork. I haven't run the tests yet, though, and I see that they failed last time, so it might have issues.

Also the plot_pd has issues now.

Two steps forward 5 backwards?

zmjones commented 8 years ago

Ah OK. I will look at this some more tonight.

zmjones commented 8 years ago

I made another simple change: just drop duplicate "observations" in the prediction grid. It worked with the simple example I have (see below). As you noted, the plot is broken when interaction = FALSE. I think this should work, though; it doesn't seem unreasonable to me to request bivariate partial dependence for a set of features of mixed type. Unfortunately, I think this will require me to coerce the factor to an integer, and (also unfortunately) ggplot2 won't allow me to disable lines being drawn for unordered factors, but I could at least generate a warning for this case.

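# simulate a numeric feature x, a two-level factor z, and an outcome with an x:z interaction
n = 100
x = sample(1:10, n, TRUE)
z = as.factor(sample(letters[1:2], n, TRUE))
y = rowSums(model.matrix(~ x + z + x * z)) + rnorm(n)

library(randomForest)
library(edarf)  # provides partial_dependence and plot_pd

fit = randomForest(y ~ x + z)
# bivariate partial dependence for x and z with cutoff = 5
pd = partial_dependence(fit, data.frame(x, z, y), c("x", "z"), interaction = TRUE, cutoff = 5)
plot_pd(pd)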
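
The "drop duplicate observations" step can be sketched in base R with unique() applied to the prediction grid. The grid construction below (a two-level factor recycled to the cutoff length) is an illustrative assumption, not the package's internals.

```r
cutoff <- 5

# Evaluation points: 5 for the numeric feature, 2 levels recycled to 5 for the factor.
x_pts <- seq(1, 10, length.out = cutoff)
z <- factor(c("a", "b"))
z_pts <- rep(levels(z), length.out = cutoff)

# Full grid contains repeated (x, z) pairs because z_pts repeats its levels.
grid <- expand.grid(x = x_pts, z = z_pts)

# Keep each (x, z) combination once and reset the row names.
dedup <- unique(grid)
rownames(dedup) <- NULL

nrow(grid)   # 25
nrow(dedup)  # 10
```

Deduplicating after the fact keeps the point-selection code unchanged, at the cost of building the larger grid first.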

zmjones commented 8 years ago

ping!

reuning commented 8 years ago

I am waiting until after I have my paper drafted to get back to this. I need to finish it up and would rather not mess with things until after that :P

zmjones commented 8 years ago

pishaw

zmjones commented 8 years ago

this is fixed