Closed by wkumler 10 months ago
See https://stackoverflow.com/q/77204216 for implementation advice, timings below
library(arrow)
library(dplyr)
library(data.table) # for %between% operator syntax

# One million rows held as an in-memory Arrow table
artab <- arrow_table(int=1:1000000, value=runif(1000000))

# 200 narrow windows (width 0.0002) to request from the table
req_vals <- data.frame(lower_bound=runif(200)) %>%
  mutate(upper_bound=lower_bound+0.0002)

# Method 1: paste all windows into a single filter() call and eval() it
eval_method <- function(){
  window_req_string <- req_vals %>%
    summarise(req_cmd=paste0("value%between%c(", lower_bound, ", ", upper_bound,
                             ")", collapse="|")) %>%
    pull(req_cmd)
  full_arrow_req <- paste0(
    'artab %>% filter(', window_req_string, ') %>% dplyr::collect()'
  )
  eval(parse(text=full_arrow_req)) %>%
    arrange(int)
}

# Method 2: one filter() + collect() per window, then combine and deduplicate
lapply_method <- function(){
  req_list <- split(req_vals, seq_len(nrow(req_vals))) %>%
    lapply(unlist)
  lapply(req_list, function(window) {
    artab %>%
      filter(value%between%window) %>%
      collect()
  }) %>%
    bind_rows() %>%
    distinct() %>%
    arrange(int)
}

# Confirm both methods return the same rows before timing them
arrow_output_eval <- eval_method()
arrow_output_lapply <- lapply_method()
identical(arrow_output_eval, arrow_output_lapply)
microbenchmark::microbenchmark(eval_method(), lapply_method(), times = 3, check = "identical")
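A possible third variant, not taken from the linked answer: build the same OR-of-windows filter as an R expression object instead of pasting and parsing a string. This is a minimal sketch on a smaller table, assuming arrow's `filter()` accepts an injected expression via `!!` the way dplyr does. The names `window_exprs`, `any_window` and `expr_output` are made up for illustration, and it has not been timed against the two methods above.

```r
library(arrow)
library(dplyr)

set.seed(1)
artab <- arrow_table(int = 1:100000, value = runif(100000))
req_vals <- data.frame(lower_bound = runif(20)) %>%
  mutate(upper_bound = lower_bound + 0.0002)

# One unevaluated comparison per window, with the bounds inlined as constants
window_exprs <- Map(
  function(lo, hi) bquote(value >= .(lo) & value <= .(hi)),
  req_vals$lower_bound, req_vals$upper_bound
)

# Fold the per-window expressions into a single `a | b | c | ...` expression
any_window <- Reduce(function(a, b) bquote(.(a) | .(b)), window_exprs)

# Inject the combined expression into one filter() call on the Arrow table
expr_output <- artab %>%
  filter(!!any_window) %>%
  collect() %>%
  arrange(int)
```

This keeps the single-query shape of `eval_method()` while avoiding `eval(parse(text=...))`, which may be easier to debug when the window table changes.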
See the new vignette introduced in #24 for timing and implementation of arrow
and its comparison to other packages.
At the R Cascadia conference this past weekend, Cari Gostic gave an excellent talk on the interface between tidy data and the `arrow` package, which handles input/output from Apache Arrow parquet files and datasets. This seems to be a direct upgrade to the tmzML document type and is at least an order of magnitude faster in both creation and retrieval.

Notes:
- `write_dataset` and `open_dataset` plus `collect` from the `dplyr` package.
- `dplyr` commands can be passed directly to an open dataset object, but computations are trickier:
  - `mutate(samp_type=str_extract(filename, "Blk|175m|15m|DCM|Poo|Std"))` needs to be done in R
  - `filter(mz%between%pmppm(76.039854+1.003355, 5))` computation needs to be done in R (why does `pmppm` work but not simple addition?)
- `str_detect` seems to be very slow if called before `collect`? E.g. `filter(str_detect(filename, "Smp"))`
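The notes above could be sketched roughly as follows: write a small table to disk, reopen it as a dataset, and do the mass-window arithmetic in R before the filter reaches arrow. The toy data, the temporary directory, and the `ppm_window()` helper are all made up for illustration (the helper is a stand-in for a ppm-window function such as `pmppm`, not its actual implementation), and the string extraction is deferred until after `collect()` as the notes suggest.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for a ppm-window helper: returns c(lower, upper)
ppm_window <- function(mass, ppm) mass * (1 + c(-1, 1) * ppm / 1e6)

# Toy data, invented for this sketch
msdata <- data.frame(
  filename = rep(c("Smp_01.mzML", "Blk_01.mzML"), each = 5),
  mz = c(77.0431, 77.0433, 77.0500, 90.0550, 118.0865,
         77.0432, 77.0432, 80.0000, 90.0550, 118.0865),
  int = c(1200, 900, 50, 3000, 450, 100, 80, 20, 2500, 300)
)

# Write to disk (parquet by default) and reopen lazily as a dataset
dataset_dir <- tempfile("msdata_arrow_")
write_dataset(msdata, dataset_dir)
ds <- open_dataset(dataset_dir)

# Do the arithmetic in R first so arrow only sees plain numeric constants
bounds <- ppm_window(76.039854 + 1.003355, 5)
lower_mz <- bounds[1]
upper_mz <- bounds[2]

hits <- ds %>%
  filter(mz >= lower_mz, mz <= upper_mz) %>%
  collect() %>%
  # String extraction happens after collect(), on an ordinary data frame
  mutate(samp_type = regmatches(filename, regexpr("Blk|Smp", filename))) %>%
  arrange(filename, mz)
```

Pushing only simple comparisons into the dataset query and leaving string handling for the collected result is one way to sidestep the slow or unsupported operations listed above, at the cost of pulling slightly more data into memory.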