Closed shikokuchuo closed 1 year ago
As much as I would like to speed up aggregation in controller$map() and avoid the heavy dependency of data.table, my own benchmarks appear to mildly favor rbindlist(). In addition, the default name repair policy in vec_rbind() is an extreme bottleneck, so I needed .name_repair = "universal_quiet" for a fair comparison.
result <- crew:::monad_tibble(crew::crew_eval(12))
list <- replicate(1e6, result, simplify = FALSE)
system.time(data.table::rbindlist(list))
#> user system elapsed
#> 1.130 0.026 1.156
system.time(vctrs::vec_rbind(list, .name_repair = "universal_quiet"))
#> user system elapsed
#> 1.244 0.048 1.292
Created on 2023-09-18 with reprex v2.0.2
I can even shave off a bit more time in rbindlist() using use.names = FALSE. That comparison seems apples-to-apples.
result <- crew:::monad_tibble(crew::crew_eval(12))
list <- replicate(1e6, result, simplify = FALSE)
system.time(data.table::rbindlist(list, use.names = TRUE))
#> user system elapsed
#> 1.154 0.028 1.183
system.time(data.table::rbindlist(list, use.names = FALSE))
#> user system elapsed
#> 0.924 0.014 0.940
system.time(vctrs::vec_rbind(list, .name_repair = "universal_quiet"))
#> user system elapsed
#> 1.338 0.061 1.400
Created on 2023-09-18 with reprex v2.0.2
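For context on why use.names = FALSE is cheaper: it binds columns by position rather than matching them by name, so it is only equivalent when every element has the same column order. A minimal sketch of the difference, using two small hypothetical data frames (x and y are made up for illustration and are not crew output):

```r
# Hypothetical inputs: the same two columns, in a different order.
x <- data.frame(a = 1, b = 2)
y <- data.frame(b = 20, a = 10)

# Binding by position ignores the names of y, so its "b" lands under "a".
by_position <- data.table::rbindlist(list(x, y), use.names = FALSE)

# Binding by name lines the columns up before stacking.
by_name <- data.table::rbindlist(list(x, y), use.names = TRUE)

# vec_rbind() also matches data frame columns by name.
by_name_vctrs <- vctrs::vec_rbind(x, y)

print(by_position)
print(by_name)
print(by_name_vctrs)
```

If every element really does share one fixed layout, the positional and by-name results should coincide, which is the case the benchmark above relies on.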
I think the difference is that I tested with a browser() instance inside your actual map method while running an example. There, more columns are filled out with different types, and I guess this favours vctrs - as I mentioned, it was not even close.
However, even using your above test, these are my results for a size of 1e5 for the list.
> microbenchmark(data.table::rbindlist(list, use.names = FALSE), vctrs::vec_rbind(list, .name_repair = "universal_quiet"))
Unit: milliseconds
                                                     expr      min       lq     mean   median       uq       max neval
           data.table::rbindlist(list, use.names = FALSE) 97.81084 98.56401 99.50163 99.18741 99.87714 114.37894   100
 vctrs::vec_rbind(list, .name_repair = "universal_quiet") 79.48315 80.53932 82.23909 81.00991 81.52945  98.07469   100
Investigating a bit more, it does seem that rbindlist() makes up speed as the list gets larger. The examples I was working with only had a few rows (< 20). So I guess it's up to you to make the call - data.table just seemed a bit heavy, as you mention.
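One way to check how the comparison shifts with list size is to time both functions over a range of sizes. The sketch below is only an illustration under stated assumptions: the row data frame is a hypothetical stand-in for a crew result (the real monad_tibble() output has more columns and types), time_bind() is a made-up helper, and the list is spliced into vec_rbind() with !!! so that each data frame is passed as a separate input. Timings will vary by machine and package version.

```r
# Hypothetical stand-in for a single result row; not the real crew format.
row <- data.frame(
  name = "task",
  seconds = 0.1,
  error = NA_character_,
  stringsAsFactors = FALSE
)

# Made-up helper: average elapsed seconds per call over `reps` repetitions.
time_bind <- function(n, reps = 5) {
  rows <- replicate(n, row, simplify = FALSE)
  per_call <- function(fun) {
    system.time(for (i in seq_len(reps)) fun())[["elapsed"]] / reps
  }
  c(
    n = n,
    rbindlist = per_call(function() {
      data.table::rbindlist(rows, use.names = FALSE)
    }),
    vec_rbind = per_call(function() {
      vctrs::vec_rbind(!!!rows)
    })
  )
}

# Small, medium, and larger lists; extend the sizes as needed.
timings <- do.call(rbind, lapply(c(10, 1000, 10000), time_bind))
print(timings)
```

For very small lists the per-call time is close to the resolution of system.time(), so a tool such as microbenchmark (as used above) is the better choice there.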
Just FYI: vec_rbind() is the workhorse behind dplyr::bind_rows().
Thanks for digging into this more, and thanks for the original suggestion. I am not sure exactly how the lightness of vctrs weighs against the potential performance gains of data.table. I will convert this to a discussion and think on it more.
Prework
Description
vctrs::vec_rbind() offers a 2-3x speedup vs data.table::rbindlist(). Safe given your format is fixed for monad_tibble in the first place. Will post a PR.