wlandau / crew

A distributed worker launcher
https://wlandau.github.io/crew/

Use vctrs package for bind rows #121

Closed shikokuchuo closed 1 year ago

shikokuchuo commented 1 year ago

Description

vctrs::vec_rbind() offers a 2-3x speedup over data.table::rbindlist(). It should be safe here, given that the format of monad_tibble is fixed in the first place.

Will post a PR.

wlandau commented 1 year ago

As much as I would like to speed up aggregation in controller$map() and avoid the heavy dependency of data.table, my own benchmarks appear to mildly favor rbindlist(). In addition, the default name repair policy in vec_rbind() is an extreme bottleneck, so I needed .name_repair = "universal_quiet" for a fair comparison.

result <- crew:::monad_tibble(crew::crew_eval(12))
list <- replicate(1e6, result, simplify = FALSE)
system.time(data.table::rbindlist(list))
#>    user  system elapsed 
#>   1.130   0.026   1.156
system.time(vctrs::vec_rbind(list, .name_repair = "universal_quiet"))
#>    user  system elapsed 
#>   1.244   0.048   1.292

Created on 2023-09-18 with reprex v2.0.2
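
Before comparing timings, it can help to confirm that both binders produce the same rows. The following is a minimal sketch under stated assumptions: `one_row` is a hypothetical stand-in with made-up columns (it is not the real output of crew:::monad_tibble()), and the list is spliced into vec_rbind() with `!!!` so that each data frame is passed as a separate input. Whether the unspliced call in the benchmark above behaves the same way is not verified here.

```r
# Hypothetical stand-in for one result row; the real monad_tibble() has other columns.
one_row <- data.frame(name = "task", seconds = 0.5, error = NA_character_)
frames <- rep(list(one_row), 1000L)

# Bind by position with data.table.
bound_dt <- data.table::rbindlist(frames, use.names = FALSE)

# Bind with vctrs; `!!!` splices the list so each data frame is its own input.
bound_vc <- vctrs::vec_rbind(!!!frames, .name_repair = "universal_quiet")

# Compare shape, column names, and one column's values.
same_shape <- identical(dim(bound_dt), dim(bound_vc))
same_names <- identical(names(bound_dt), names(bound_vc))
same_values <- identical(bound_dt$seconds, bound_vc$seconds)
print(c(shape = same_shape, names = same_names, values = same_values))
```

If any of the three checks is FALSE, the two calls are not doing the same work and their timings are not directly comparable.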

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.0 (2023-04-21)
#>  os       macOS Ventura 13.5.2
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Indiana/Indianapolis
#>  date     2023-09-18
#>  pandoc   3.1.2 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>  crew          0.4.1   2023-09-15 [1] local
#>  data.table    1.14.8  2023-02-17 [1] CRAN (R 4.3.0)
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.0)
#>  fansi         1.0.4   2023-01-22 [1] CRAN (R 4.3.0)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.0)
#>  getip         0.1-3   2023-01-25 [1] CRAN (R 4.3.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.6   2023-08-10 [1] CRAN (R 4.3.0)
#>  knitr         1.43    2023-05-25 [1] CRAN (R 4.3.0)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  mirai         0.10.0  2023-09-16 [1] CRAN (R 4.3.1)
#>  nanonext      0.10.0  2023-08-31 [1] CRAN (R 4.3.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
#>  processx      3.8.2   2023-06-30 [1] CRAN (R 4.3.0)
#>  ps            1.7.5   2023-04-18 [1] CRAN (R 4.3.0)
#>  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils       2.12.2  2022-11-11 [1] CRAN (R 4.3.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
#>  rmarkdown     2.24    2023-08-14 [1] CRAN (R 4.3.0)
#>  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  styler        1.10.2  2023-08-29 [1] CRAN (R 4.3.0)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
#>  utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
#>  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
#>  xfun          0.40    2023-08-09 [1] CRAN (R 4.3.0)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

wlandau commented 1 year ago

I can even shave off a bit more time in rbindlist() using use.names = FALSE. That comparison seems apples-to-apples.

result <- crew:::monad_tibble(crew::crew_eval(12))
list <- replicate(1e6, result, simplify = FALSE)
system.time(data.table::rbindlist(list, use.names = TRUE))
#>    user  system elapsed 
#>   1.154   0.028   1.183
system.time(data.table::rbindlist(list, use.names = FALSE))
#>    user  system elapsed 
#>   0.924   0.014   0.940
system.time(vctrs::vec_rbind(list, .name_repair = "universal_quiet"))
#>    user  system elapsed 
#>   1.338   0.061   1.400

Created on 2023-09-18 with reprex v2.0.2
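
The use.names = FALSE shortcut relies on every data frame having its columns in the same order, which is the "fixed format" condition mentioned at the top of the thread. A small sketch of what changes between the two settings, using two made-up data frames (`a` and `b` are illustrative and unrelated to crew's internals):

```r
# Two frames with the same columns in a different order (illustrative data).
a <- data.frame(x = 1, y = 10)
b <- data.frame(y = 20, x = 2)

# use.names = TRUE matches columns by name.
by_name <- data.table::rbindlist(list(a, b), use.names = TRUE)

# use.names = FALSE matches columns by position and keeps the first frame's names.
by_position <- data.table::rbindlist(list(a, b), use.names = FALSE)

print(by_name$x)      # 1 2
print(by_position$x)  # 1 20
```

When the column order is guaranteed, both settings give the same result and the positional bind simply skips the name matching. When it is not guaranteed, the positional bind silently mixes columns.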

shikokuchuo commented 1 year ago

I think the difference is that I tested inside a browser() session in your actual map method while running an example. There, more columns are filled out with different types, and I guess this favours vctrs. As I mentioned, it was not even close.

However, even using your above test, these are my results for a size of 1e5 for the list.

> microbenchmark(data.table::rbindlist(list, use.names = FALSE), vctrs::vec_rbind(list, .name_repair = "universal_quiet"))
Unit: milliseconds
                                                     expr      min       lq     mean   median       uq       max neval
           data.table::rbindlist(list, use.names = FALSE) 97.81084 98.56401 99.50163 99.18741 99.87714 114.37894   100
 vctrs::vec_rbind(list, .name_repair = "universal_quiet") 79.48315 80.53932 82.23909 81.00991 81.52945  98.07469   100

shikokuchuo commented 1 year ago

Investigating a bit more, it does seem that rbindlist() makes up ground as the list gets larger. The examples I was working with only had a few rows (< 20). So I guess it's up to you to make the call. data.table just seemed a bit heavy, as you mention.

Just FYI: vec_rbind() is the workhorse behind dplyr::bind_rows().
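
The size effect described above could be checked with a sketch along these lines. Everything here is an assumption for illustration: `one_row` is a made-up stand-in rather than crew's real result format, `time_bind()` is a hypothetical helper, the list sizes are arbitrary, and the list is spliced into vec_rbind() with `!!!` so that each data frame is a separate input. Timings will vary by machine, so no expected numbers are shown.

```r
# Hypothetical stand-in for one result row (not crew's real schema).
one_row <- data.frame(name = "task", seconds = 0.1, error = NA_character_)

# Median elapsed time of each binder for a list of `n` identical frames.
time_bind <- function(n, reps = 5L) {
  frames <- rep(list(one_row), n)
  elapsed <- function(bind) {
    times <- vapply(
      seq_len(reps),
      function(i) system.time(bind())[["elapsed"]],
      numeric(1)
    )
    median(times)
  }
  data.frame(
    n = n,
    rbindlist = elapsed(function() {
      data.table::rbindlist(frames, use.names = FALSE)
    }),
    vec_rbind = elapsed(function() {
      vctrs::vec_rbind(!!!frames, .name_repair = "universal_quiet")
    })
  )
}

# One row of timings per list size.
timings <- do.call(rbind, lapply(c(10L, 1000L, 10000L), time_bind))
print(timings)
```

For the very small lists mentioned above (fewer than 20 elements), system.time() is too coarse to resolve the difference, so a tool such as microbenchmark would be a better fit at that scale.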

wlandau commented 1 year ago

Thanks for digging into this more, and thanks for the original suggestion. I am not sure exactly how the lightness of vctrs weighs against the potential performance gains of data.table. I will convert this to a discussion and think on it more.