wilkelab / ggridges

Ridgeline plots in ggplot2
https://wilkelab.org/ggridges
GNU General Public License v2.0

Error when using object of `class` Date on `y-axis` #22

Closed HanjoStudy closed 6 years ago

HanjoStudy commented 6 years ago

Error when using variable of class Date on y-axis plotting densities.

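library(tidyverse)  # attaches ggplot2, purrr, dplyr and tibble
library(ggridges)

# Pair one date with 1000 random draws
gen_date_dist <- function(df_date){
  data.frame(df_date, out = rnorm(1000, 1, 100))
}

## Generate random samples: 20 monthly dates, 1000 draws each
df_ridge <- seq(as.Date("2010-01-01"), by = "month", length.out = 20) %>% 
  map(~.x %>% gen_date_dist) %>% 
  reduce(rbind) %>% 
  as_tibble()  # tbl_df() is deprecated in current dplyr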

head(df_ridge)

# A tibble: 6 x 2
  df_date       out
  <date>      <dbl>
1 2010-01-01 -126. 
2 2010-01-01  -67.2
3 2010-01-01  -24.4
4 2010-01-01  203. 
5 2010-01-01  -56.5
6 2010-01-01  129. 

## Plot densities using ggridges
df_ridge %>% 
  mutate(df_date = as.character(df_date)) %>%  #coercing to character avoids the error
  ggplot(., aes(x = out , y = df_date)) +
  stat_density_ridges(quantile_lines = TRUE, alpha = 0.7)

#>Picking joint bandwidth of 22.3

Without transformation we get an error:

## Reproduce error
df_ridge %>% 
  ggplot(., aes(x = out , y = df_date)) +
  stat_density_ridges(quantile_lines = TRUE, alpha = 0.7)

#>Picking joint bandwidth of 12.4
#>Error in eval(substitute(list(...)), `_data`, parent.frame()) :   object 'y' not found
#> In addition: Warning messages:
#> 1: In max(data$y) : no non-missing arguments to max; returning -Inf
#> 2: In min(data$y) : no non-missing arguments to min; returning Inf

Coercing to character is not ideal, though: when we have a very dense y-axis we will want to keep the Date class and scale the axis using scale_y_date(date_labels = "%Y").

P.S. I tried reprex, but it failed multiple times even after reinstalling knitr and reprex. I hope this is OK.

clauswilke commented 6 years ago

For some reason it needs a group aesthetic.

library(tidyverse)
library(ggridges)

gen_date_dist <- function(df_date){
  data.frame(df_date, out =  rnorm(1000,1,100))
}

## Generate random samples
df_ridge <- seq(as.Date("2010-01-01"), by = "month", length.out = 20) %>% 
  map(~.x %>% gen_date_dist) %>% 
  reduce(rbind) %>% 
  tbl_df

df_ridge %>% 
  ggplot(., aes(x = out, y = df_date, group = df_date)) +
  geom_density_ridges()
#> Picking joint bandwidth of 22.3

Created on 2018-05-15 by the reprex package (v0.2.0).

clauswilke commented 6 years ago

Actually, this is in the documentation:

The grouping aesthetic does not need to be provided if a categorical variable is mapped onto the y axis, but it does need to be provided if the variable is numerical.
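
As a minimal sketch of what that means for the original use case, the example below keeps df_date as a Date, supplies the group aesthetic explicitly, and formats the axis with scale_y_date() as requested in the first post. The data construction with rep() and set.seed() is an assumption made here for brevity; it is not taken from the reprex above.

```r
library(ggplot2)
library(ggridges)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# 20 monthly dates, 1000 random draws per date
dates <- seq(as.Date("2010-01-01"), by = "month", length.out = 20)
df_ridge <- data.frame(
  df_date = rep(dates, each = 1000),
  out     = rnorm(20 * 1000, mean = 1, sd = 100)
)

# y stays a Date; group tells ggplot2 to compute one density per date
p <- ggplot(df_ridge, aes(x = out, y = df_date, group = df_date)) +
  geom_density_ridges(alpha = 0.7) +
  scale_y_date(date_labels = "%Y")

print(p)
```

Because y is continuous here, the ridge heights are drawn on the date scale, so the scale argument of geom_density_ridges() may need tuning for the amount of overlap you want.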

treysp commented 6 years ago

Hello, and thank you so much for ggridges, cowplot, and all your contributions to the R ecosystem!

It sounds like a numeric y will always generate an error - is that correct? If so, would it make sense to check for that condition and provide an error message telling the user that they need to convert y or specify group if they want to use the numeric y?

If you think this is a good idea, I'm happy to take a stab if you can point me to the file where it should belong.

Thanks!

clauswilke commented 6 years ago

The first step would be to investigate what exactly the cause of the error is.

treysp commented 6 years ago

Good point! I think the fundamental issue is that ggplot needs to know that density should be calculated separately for each "group." Normally, it infers that groupiness based on whether the relevant variable is.discrete(), which in our case is not true (see last bullet below).
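
To illustrate the discreteness point, here is a small sketch. The helper looks_discrete() is a hypothetical stand-in written for this example, on the assumption that "discrete" means factor, character, or logical; it is not ggplot2's internal code.

```r
# Hypothetical stand-in for a discreteness check (assumption, not ggplot2 code):
# only factors, characters and logicals count as discrete.
looks_discrete <- function(x) {
  is.factor(x) || is.character(x) || is.logical(x)
}

d <- as.Date("2010-01-01")

looks_discrete(d)                # FALSE: a Date is stored as a number
looks_discrete(as.character(d))  # TRUE: the workaround from the first post
looks_discrete(factor(d))        # TRUE
```

Under that assumption, a Date mapped to y would not trigger automatic grouping, while the same values coerced to character would.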

For the details, here's what I've tracked down:

If that mapply() is what's adding y back in, it's not clear to me why it works when there are multiple values for data$group but not when there is a single value. I also can't figure out how to drop into that compute_panel() call to inspect further (presumably because of my lack of experience debugging ggproto methods).

For determining whether stats should be calculated separately by group:

clauswilke commented 6 years ago

This can probably be solved somehow by reimplementing compute_panel() in StatDensityRidges. To debug, you could copy that function over from ggplot2 into ggridges and then just add print statements to see what happens.

treysp commented 6 years ago

Ok, so the mapply() call in compute_panel() is adding back in variables that are constant within groups but were removed by compute_group(). Since there aren't any groups when y is numeric, its behavior isn't relevant for resolving this problem.
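
The "add back constant variables" behavior can be sketched in base R. This is a simplified illustration under stated assumptions, not ggplot2's actual code: add_back_constant() is a hypothetical helper, and the data frames are made up.

```r
# Hypothetical helper: re-attach columns of `old` that hold a single value
# and are missing from `new` (the per-group result).
add_back_constant <- function(new, old) {
  is_constant <- vapply(old, function(col) length(unique(col)) == 1, logical(1))
  for (nm in setdiff(names(old)[is_constant], names(new))) {
    new[[nm]] <- old[[nm]][1]  # recycled to every row of `new`
  }
  new
}

# Made-up per-group result that has lost its y column
new <- data.frame(density = c(0.1, 0.2))

old_grouped <- data.frame(x = c(1, 2, 3), y = c(5, 5, 5))  # one y value in the group
old_single  <- data.frame(x = c(1, 2, 3), y = c(5, 6, 7))  # many y values in one group

res_grouped <- add_back_constant(new, old_grouped)
res_single  <- add_back_constant(new, old_single)

names(res_grouped)  # "density" "y"
names(res_single)   # "density"
```

In this sketch, y is restored only when it is constant within the group; when a single group holds many y values, the result has no y column.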

I think a solution will require explicitly making the code treat unique y values as groups. For ridgeline plots y must be categorical, so doing that doesn't require making assumptions about the user's intent.

I think the implementation options are more or less:

  1. Provide informative error and force user to specify the group aesthetic
  2. Add a group aesthetic based on the y values very early in the code and let the plotting machinery work as normal
  3. Modify compute_panel() to induce group-wise processing without using the group aesthetic directly

I'm not sure which of these is the best option.

clauswilke commented 6 years ago

I'm hesitant to introduce code that works around assumptions made in the bowels of ggplot2. They might change at some point and then it's difficult to fix. And ignoring the group aesthetic would also be bad. We still need to be able to group within y values, for example.

So all considered, I think adding an informative error is the right way to go at this time. It's easy to add a group aesthetic if we know we need one.

treysp commented 6 years ago

Sounds good - I will double-check whether it needs to be added for the other geoms as well.

Do you want me to submit a pull request, or would you prefer to add it yourself? If the former, where should the error check go - setup_data()?

clauswilke commented 6 years ago

You can submit a pull request. The check should be right where the error occurs. Check whether there's a y column in the data, and if not throw an error which includes a sentence such as "Did you forget to specify a group aesthetic?"
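
A minimal sketch of such a check is below. The function name check_has_y() and the exact wording of the first sentence of the message are assumptions for illustration; where the check would be called inside ggridges is not settled in this thread.

```r
# Hypothetical helper (not part of ggridges): stop with an informative
# message if the computed data has no y column.
check_has_y <- function(data) {
  if (!"y" %in% names(data)) {
    stop(
      "No `y` column found in the computed data. ",
      "Did you forget to specify a group aesthetic?",
      call. = FALSE
    )
  }
  invisible(data)
}

# Usage: passes silently when y is present, errors otherwise
check_has_y(data.frame(x = 1:3, y = 1))
msg <- tryCatch(
  check_has_y(data.frame(x = 1:3)),
  error = function(e) conditionMessage(e)
)
msg
```

Returning the data invisibly lets the check sit in the middle of existing code without changing what flows through it.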