Notes on `arrow`

wkumler commented 1 year ago

At the R Cascadia conference this past weekend, Cari Gostic gave an excellent talk on the interface between tidy data and the `arrow` package, which reads and writes Parquet files and multi-file Arrow datasets. This seems to be a direct upgrade to the tmzML document type and is at least an order of magnitude faster in both creation and retrieval.

Notes:

Super easy to implement: the only new functions are `write_dataset` and `open_dataset` from `arrow`, plus `collect` from the `dplyr` package.
`dplyr` commands can be passed directly to an open dataset object, but computations are trickier:
- `mutate(samp_type=str_extract(filename, "Blk|175m|15m|DCM|Poo|Std"))` needs to be done in R rather than inside the arrow query
- the computation in `filter(mz%between%pmppm(76.039854+1.003355, 5))` needs to be done in R (why does `pmppm` work but not simple addition?)
- `str_detect` seems to be very slow if called before `collect`? E.g. `filter(str_detect(filename, "Smp"))`
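A minimal sketch of the workflow described in these notes, assuming a small made-up table standing in for MS data (the column names `filename`, `rt`, `mz`, `int`, the file names, and the temporary directory are all hypothetical). The ppm window is computed by hand in R before the query, as a stand-in for `pmppm`, and the `str_extract` step runs after `collect`:

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in data; real MS data would come from the mzML files
msdata <- data.frame(
  filename = rep(c("Blk_1.mzML", "Smp_15m_1.mzML", "Std_1.mzML"), each = 3),
  rt = rep(c(1, 2, 3), times = 3),
  mz = rep(c(77.043209, 77.0440, 118.0865), times = 3),
  int = c(10, 20, 30, 40, 50, 60, 70, 80, 90)
)

# Write the table out as a parquet dataset, then open it lazily
pq_dir <- file.path(tempdir(), "msdata_parquet")
write_dataset(msdata, pq_dir, format = "parquet")
ds <- open_dataset(pq_dir)

# Do the arithmetic in R first so the query only sees plain numbers
# (assumption: a +/- 5 ppm window around the target m/z)
target_mz <- 76.039854 + 1.003355
ppm <- 5
mz_lower <- target_mz * (1 - ppm / 1e6)
mz_upper <- target_mz * (1 + ppm / 1e6)

hits <- ds %>%
  filter(mz >= mz_lower, mz <= mz_upper) %>%  # evaluated by arrow
  collect() %>%                               # pulled into R here
  mutate(samp_type = stringr::str_extract(filename, "Blk|175m|15m|DCM|Poo|Std"))
```

Keeping the filter to simple comparisons against precomputed numbers sidesteps the question of which R expressions arrow can translate, at the cost of a couple of extra lines before the query.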
I'm probably going to implement this in the RaMS-and-friends vignette but it feels powerful enough that the functionality may eventually make it into the package itself. I think I'd eventually like to switch from tmzML over to parquet entirely but then I'll have to figure out how to keep the original syntax, which sounds like a headache(?).

wkumler commented 1 year ago

See https://stackoverflow.com/q/77204216 for implementation advice, timings below

```r
library(arrow)
artab <- arrow_table(int=1:1000000, value=runif(1000000))

library(dplyr)
library(data.table) # for %between% operator syntax

# 200 random query windows, each 0.0002 wide
req_vals <- data.frame(lower_bound=runif(200)) %>%
  mutate(upper_bound=lower_bound+0.0002)

# Method 1: paste every window into a single filter() call and evaluate it as text
eval_method <- function(){
  window_req_string <- req_vals %>%
    summarise(req_cmd=paste0("value%between%c(", lower_bound, ", ", upper_bound,
                             ")", collapse="|")) %>%
    pull(req_cmd)
  full_arrow_req <- paste0(
    'artab %>% filter(', window_req_string, ') %>% dplyr::collect()'
  )
  eval(parse(text=full_arrow_req)) %>%
    arrange(int)
}

# Method 2: run one filter() per window, then bind and de-duplicate the results
lapply_method <- function(){
  req_list <- split(req_vals, seq_len(nrow(req_vals))) %>%
    lapply(unlist)
  lapply(req_list, function(window) {
    artab %>%
      filter(value%between%window) %>%
      collect()
  }) %>%
    bind_rows() %>%
    distinct() %>%
    arrange(int)
}

# Check that both methods return the same rows before timing them
arrow_output_eval <- eval_method()
arrow_output_lapply <- lapply_method()
identical(arrow_output_eval, arrow_output_lapply)
microbenchmark::microbenchmark(eval_method(), lapply_method(), times = 3, check = "identical")
```
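A possible third variant, not timed here and not one of the two methods above: build the combined filter as an R expression with `bquote` and `Reduce` instead of pasting text, then splice it into `filter` with `!!`. This is only a sketch under the assumption that arrow's `filter` accepts injected expressions made of plain comparisons; the table and window count are shrunk so it runs quickly, and the names `window_exprs`, `any_window`, and `expr_output` are made up for illustration.

```r
library(arrow)
library(dplyr)

set.seed(42)
artab <- arrow_table(int = 1:100000, value = runif(100000))
req_vals <- data.frame(lower_bound = runif(20)) %>%
  mutate(upper_bound = lower_bound + 0.0002)

# One unevaluated comparison per window, with the bounds inlined as numbers
window_exprs <- Map(
  function(lo, hi) bquote(value >= .(lo) & value <= .(hi)),
  req_vals$lower_bound, req_vals$upper_bound
)

# Chain the per-window expressions together with |
any_window <- Reduce(function(a, b) bquote(.(a) | .(b)), window_exprs)

# Splice the combined expression into a single arrow query
expr_output <- artab %>%
  filter(!!any_window) %>%
  collect() %>%
  arrange(int)
```

This keeps the single-query shape of `eval_method` without `eval(parse(text=...))`, so the bounds are never converted to text and back.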

wkumler commented 10 months ago

See the new vignette introduced in #24 for timing and implementation of arrow and its comparison to other packages.

wkumler / RaMS

Notes on `arrow` #19