wilkelab / ggridges

Ridgeline plots in ggplot2
https://wilkelab.org/ggridges
GNU General Public License v2.0

quantiles missing from exam score plots #35

Closed MCMaurer closed 5 years ago

MCMaurer commented 5 years ago

Hi there,

I've been making some ggridges plots of score distributions for each question on an exam, and I noticed that some of the questions end up having missing quantiles. Here's a plot that shows what I'm talking about:

[Image: midterm2_scores_ridges]

It's important to note that while the scores for each question are shown on a 0-100 scale (score percentage), the actual questions are graded on a 0-11 scale (except for Q10 and Total).
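To make the discreteness concrete, here is a minimal sketch of the rescaling described above. The variable names are illustrative, and it assumes the percentage is simply the raw score divided by the 11-point maximum:

```r
# Hypothetical illustration: every possible raw score on the 0-11 grading scale
raw_scores <- 0:11

# Rescale to the 0-100 percentage axis used in the plot
# (assumption: percentage = raw / 11 * 100)
pct_scores <- raw_scores / 11 * 100

# Only a dozen distinct x positions are possible, however many students there are
round(pct_scores, 1)
length(unique(pct_scores))
```

With so few possible values, many students necessarily share the same score, which is what lets several quantile boundaries land on one value.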

A lot of the questions have a fair number of perfect scores, so the density is pushed up against 100, and some of those are missing their 5th quantile, which seems to make sense. However, something like Q1 has me perplexed. The 5th quantile is displayed, but the 2nd and 4th are missing.

I put together a reproducible example here:

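library(tidyverse)

# Draw 250 integer scores (0-11), weighted toward high scores via the Beta(5, 1) CDF
test <- sample(seq(from = 0, to = 11, by = 1), size = 250, replace = TRUE, prob = pbeta(seq(0, 1, 1/11), 5, 1)) %>% 
  enframe(name = "student") %>% 
  mutate(question = 1)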

hist(test$value)

test %>% 
  ggplot(aes(x = value, y = question, fill = factor(..quantile.., levels = c("1", "2", "3", "4", "5")))) +
  ggridges::stat_density_ridges(geom = "density_ridges_gradient", quantile_lines = TRUE, quantiles = 5, from = 0, to = 11) +
  theme_bw() +
  scale_fill_viridis_d(name = "Quantiles") +
  scale_x_continuous(breaks = seq(from = 0, by = 1, to = 11), minor_breaks = NULL)

Note that the scores are sampled at random with weights taken from a beta CDF (pbeta) and no seed is set, so running it a few times in a row can generate slightly different results. Sometimes you get quantiles 1, 2, and 3, sometimes you get 1, 2, and 4, but I have never gotten all 5 quantiles. I have also tried removing the from = 0, to = 11 arguments, and that usually leaves me with 4 quantiles shown, with either 3 or 4 being skipped.

I'm guessing this has to do with the scores falling into such discrete bins and being heavily skewed to the left, but I don't have a great sense of what exactly is going on. What I'd ultimately like is to have all 5 quantiles displayed on all the exam questions. Any insight you might have would be wonderful! Also, I just wanted to say thanks for the package; I really love making these plots to look at how our grading is going!

session_info
```
─ Session info ───────────────────────────────────────────────
 setting  value
 version  R version 3.6.0 (2019-04-26)
 os       macOS Mojave 10.14.5
 system   x86_64, darwin15.6.0
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2019-06-03

─ Packages ───────────────────────────────────────────────────
 package     * version    date       lib source
 assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.6.0)
 backports     1.1.4      2019-04-10 [1] CRAN (R 3.6.0)
 broom         0.5.2      2019-04-07 [1] CRAN (R 3.6.0)
 callr         3.2.0      2019-03-15 [1] CRAN (R 3.6.0)
 cellranger    1.1.0      2016-07-27 [1] CRAN (R 3.6.0)
 cli           1.1.0      2019-03-19 [1] CRAN (R 3.6.0)
 clipr         0.6.0      2019-04-15 [1] CRAN (R 3.6.0)
 colorspace    1.4-1      2019-03-18 [1] CRAN (R 3.6.0)
 crayon        1.3.4      2017-09-16 [1] CRAN (R 3.6.0)
 desc          1.2.0      2018-05-01 [1] CRAN (R 3.6.0)
 devtools      2.0.2      2019-04-08 [1] CRAN (R 3.6.0)
 digest        0.6.18     2018-10-10 [1] CRAN (R 3.6.0)
 dplyr       * 0.8.1      2019-05-14 [1] CRAN (R 3.6.0)
 fansi         0.4.0      2018-10-05 [1] CRAN (R 3.6.0)
 forcats     * 0.4.0      2019-02-17 [1] CRAN (R 3.6.0)
 fs            1.3.1      2019-05-06 [1] CRAN (R 3.6.0)
 generics      0.0.2      2018-11-29 [1] CRAN (R 3.6.0)
 ggplot2     * 3.1.1      2019-04-07 [1] CRAN (R 3.6.0)
 ggridges      0.5.1      2018-09-27 [1] CRAN (R 3.6.0)
 glue          1.3.1.9000 2019-05-18 [1] Github (tidyverse/glue@ea0edcb)
 gtable        0.3.0      2019-03-25 [1] CRAN (R 3.6.0)
 haven         2.1.0      2019-02-19 [1] CRAN (R 3.6.0)
 hms           0.4.2      2018-03-10 [1] CRAN (R 3.6.0)
 httr          1.4.0      2018-12-11 [1] CRAN (R 3.6.0)
 jsonlite      1.6        2018-12-07 [1] CRAN (R 3.6.0)
 labeling      0.3        2014-08-23 [1] CRAN (R 3.6.0)
 lattice       0.20-38    2018-11-04 [1] CRAN (R 3.6.0)
 lazyeval      0.2.2      2019-03-15 [1] CRAN (R 3.6.0)
 lubridate     1.7.4      2018-04-11 [1] CRAN (R 3.6.0)
 magrittr      1.5        2014-11-22 [1] CRAN (R 3.6.0)
 MCMsBasics    0.1.0      2019-05-18 [1] Github (MCMaurer/MCMsBasics@a63998c)
 memoise       1.1.0      2017-04-21 [1] CRAN (R 3.6.0)
 modelr        0.1.4      2019-02-18 [1] CRAN (R 3.6.0)
 munsell       0.5.0      2018-06-12 [1] CRAN (R 3.6.0)
 nlme          3.1-140    2019-05-12 [1] CRAN (R 3.6.0)
 pillar        1.4.0      2019-05-11 [1] CRAN (R 3.6.0)
 pkgbuild      1.0.3      2019-03-20 [1] CRAN (R 3.6.0)
 pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.6.0)
 pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.6.0)
 plyr          1.8.4      2016-06-08 [1] CRAN (R 3.6.0)
 prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.6.0)
 processx      3.3.1      2019-05-08 [1] CRAN (R 3.6.0)
 ps            1.3.0      2018-12-21 [1] CRAN (R 3.6.0)
 purrr       * 0.3.2      2019-03-15 [1] CRAN (R 3.6.0)
 R6            2.4.0      2019-02-14 [1] CRAN (R 3.6.0)
 Rcpp          1.0.1      2019-03-17 [1] CRAN (R 3.6.0)
 readr       * 1.3.1      2018-12-21 [1] CRAN (R 3.6.0)
 readxl        1.3.1      2019-03-13 [1] CRAN (R 3.6.0)
 remotes       2.0.4      2019-04-10 [1] CRAN (R 3.6.0)
 rlang         0.3.4      2019-04-07 [1] CRAN (R 3.6.0)
 rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.6.0)
 rstudioapi    0.10       2019-03-19 [1] CRAN (R 3.6.0)
 rvest         0.3.4      2019-05-15 [1] CRAN (R 3.6.0)
 scales        1.0.0      2018-08-09 [1] CRAN (R 3.6.0)
 sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.6.0)
 stringi       1.4.3      2019-03-12 [1] CRAN (R 3.6.0)
 stringr     * 1.4.0      2019-02-10 [1] CRAN (R 3.6.0)
 testthat      2.1.1      2019-04-23 [1] CRAN (R 3.6.0)
 tibble      * 2.1.1      2019-03-16 [1] CRAN (R 3.6.0)
 tidyr       * 0.8.3      2019-03-01 [1] CRAN (R 3.6.0)
 tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.6.0)
 tidyverse   * 1.2.1      2017-11-14 [1] CRAN (R 3.6.0)
 usethis       1.5.0      2019-04-07 [1] CRAN (R 3.6.0)
 utf8          1.1.4      2018-05-24 [1] CRAN (R 3.6.0)
 vctrs         0.1.0      2018-11-29 [1] CRAN (R 3.6.0)
 viridisLite   0.3.0      2018-02-01 [1] CRAN (R 3.6.0)
 withr         2.1.2      2018-03-15 [1] CRAN (R 3.6.0)
 xml2          1.2.0      2018-01-24 [1] CRAN (R 3.6.0)
 yaml          2.2.0      2018-07-25 [1] CRAN (R 3.6.0)
 zeallot       0.1.0      2018-01-28 [1] CRAN (R 3.6.0)

[1] /Users/MJ/R_Packages_3.6
[2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
```

clauswilke commented 5 years ago

The plot accurately represents your data. Because your data values are integers, it is possible to have multiple quantiles all bunched up at one end of the scale.

library(tidyverse)

set.seed(25)

test <- sample(seq(from = 0, to = 11, by = 1), size = 250, replace = TRUE, prob = pbeta(seq(0, 1, 1/11), 5, 1)) %>% 
  enframe(name = "student") %>% 
  mutate(question = 1)

test %>% 
  ggplot(aes(x = value, y = question, fill = factor(..quantile.., levels = c("1", "2", "3", "4", "5")))) +
  ggridges::stat_density_ridges(geom = "density_ridges_gradient", quantile_lines = TRUE, quantiles = 5, from = 0, to = 11) +
  theme_bw() +
  scale_fill_viridis_d(name = "Quantiles") +
  scale_x_continuous(breaks = seq(from = 0, by = 1, to = 11), minor_breaks = NULL)
#> Picking joint bandwidth of 0.403


quantile(test$value, seq(0.2, 0.8, by = 0.2))
#> 20% 40% 60% 80% 
#>   9  10  11  11

Created on 2019-06-03 by the reprex package (v0.2.1)
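
The tied boundaries can be checked directly. A minimal sketch, using only the four values printed by `quantile()` above (the variable names are illustrative, not part of ggridges):

```r
# Quantile boundaries copied from the quantile() output above (20%, 40%, 60%, 80%)
breaks <- c(9, 10, 11, 11)

# Width of the band between each pair of consecutive boundaries
widths <- diff(breaks)
widths

# Index of any band with zero width; index i is the band between boundary i and i + 1
zero_width <- which(widths == 0)
zero_width
```

The last gap is zero: the 60% and 80% boundaries coincide at 11, so the band between them (the 4th quantile) has no width on the x axis and there is nothing to fill.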

MCMaurer commented 5 years ago

Ahhhhhhhh ok, this makes sense! I just need to be sure to communicate that when presenting these plots. Is there a particular way that one gets shown over the other? In this case, the 3rd quantile is shown over the 4th, so I'm guessing that the default fill is always the lower of the "tied" quantiles?

Thanks for such a speedy response!
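
One way to reason about which label a tied band receives is to bin x positions against the boundaries with base R's `findInterval()`. This is only a sketch: it assumes the fill is assigned by this kind of interval lookup, which is not confirmed anywhere in this thread, and the sample x positions are made up for illustration.

```r
# Quantile boundaries from the reprex output (20%, 40%, 60%, 80%)
breaks <- c(9, 10, 11, 11)

# Hypothetical x positions along the density curve, ending at the upper limit (to = 11)
x <- c(8.5, 9.5, 10.5, 11)

# Bands closed on the left, [b_i, b_{i+1}): a point exactly on a tied boundary
# skips past both copies of the boundary
labels_left_closed <- findInterval(x, breaks) + 1
labels_left_closed

# Bands closed on the right, (b_i, b_{i+1}]: a point exactly on a tied boundary
# stays in the lower band
labels_right_closed <- findInterval(x, breaks, left.open = TRUE) + 1
labels_right_closed
```

Under either convention the label 4 is never produced, because no x value can fall strictly between two identical boundaries. Which neighbouring label appears at the boundary itself depends on which side of the interval is treated as closed.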