wkumler / RaMS

R-based access to Mass-Spectrometry data

Notes on `arrow` #19

Closed wkumler closed 10 months ago

wkumler commented 1 year ago

At the R Cascadia conference this past weekend, Cari Gostic gave an excellent talk on the interface between tidy data and the arrow package, which handles input/output for Apache Arrow parquet files and datasets. This seems to be a direct upgrade to the tmzML document type, and it is at least an order of magnitude faster in both creation and retrieval.
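
As a rough illustration of the parquet write/read workflow described above, here is a minimal sketch. The data frame, its column names (`filename`, `rt`, `mz`, `int`), and the m/z window are made up for the example and are not RaMS's actual layout; only `write_parquet()`, `open_dataset()`, `filter()` and `collect()` are real arrow/dplyr calls.

```r
library(arrow)
library(dplyr)

# Hypothetical long-format MS data; column names are placeholders
set.seed(1)
ms_data <- data.frame(
  filename = rep(c("a.mzML", "b.mzML"), each = 500),
  rt = rep(seq(0, 10, length.out = 500), times = 2),
  mz = runif(1000, min = 100, max = 200),
  int = rexp(1000)
)

# Write the table to a single parquet file in a temporary location
pq_path <- tempfile(fileext = ".parquet")
write_parquet(ms_data, pq_path)

# Hypothetical m/z window of interest
mz_lo <- 118
mz_hi <- 120

# Open the file lazily, filter to the window, and only then pull rows into R
eic <- open_dataset(pq_path) %>%
  filter(mz >= mz_lo, mz <= mz_hi) %>%
  collect()
```

Because the filter is applied before `collect()`, only the matching rows should need to be materialized in R, which is presumably where much of the retrieval speed comes from.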

Notes:

wkumler commented 1 year ago

See https://stackoverflow.com/q/77204216 for implementation advice; timings are below.

library(arrow)
artab <- arrow_table(int=1:1000000, value=runif(1000000))

library(dplyr)
library(data.table) # for %between% operator syntax

req_vals <- data.frame(lower_bound=runif(200)) %>%
  mutate(upper_bound=lower_bound+0.0002)

eval_method <- function(){
  window_req_string <- req_vals %>%
    summarise(req_cmd=paste0("value%between%c(", lower_bound, ", ", upper_bound, 
                             ")", collapse="|")) %>%
    pull(req_cmd)
  full_arrow_req <- paste0(
    'artab %>% filter(', window_req_string, ') %>% dplyr::collect()'
  )
  eval(parse(text=full_arrow_req)) %>%
    arrange(int)
}

lapply_method <- function(){
  req_list <- split(req_vals, seq_len(nrow(req_vals))) %>%
    lapply(unlist)
  lapply(req_list, function(window) {
    artab %>%
      filter(value%between%window) %>%
      collect()
  }) %>%
    bind_rows() %>%
    distinct() %>%
    arrange(int)
}

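# Run each method once so the two outputs exist before comparing them
arrow_output_eval <- eval_method()
arrow_output_lapply <- lapply_method()
identical(arrow_output_eval, arrow_output_lapply)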
microbenchmark::microbenchmark(eval_method(), lapply_method(), times = 3, check = "identical")
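
A possible third option, sketched here as an assumption rather than something timed in this thread, is to build the combined filter as an R expression instead of pasting and parsing text. The name `expr_method` and the smaller table are hypothetical. `bquote()` is base R, and the sketch relies on arrow's dplyr `filter()` accepting an expression injected with `!!`.

```r
library(arrow)
library(dplyr)

set.seed(1)
artab <- arrow_table(int = 1:100000, value = runif(100000))
req_vals <- data.frame(lower_bound = runif(20)) %>%
  mutate(upper_bound = lower_bound + 0.0002)

# Build one `value >= lo & value <= hi` call per requested window
window_exprs <- Map(
  function(lo, hi) bquote(value >= .(lo) & value <= .(hi)),
  req_vals$lower_bound, req_vals$upper_bound
)

# Chain the per-window calls together with `|`
combined_expr <- Reduce(function(a, b) bquote(.(a) | .(b)), window_exprs)

# Hypothetical counterpart to eval_method() and lapply_method()
expr_method <- function() {
  artab %>%
    filter(!!combined_expr) %>%
    collect() %>%
    arrange(int)
}

expr_output <- expr_method()
```

One reason to consider this form is that the numeric bounds are embedded in the expression as doubles, whereas `paste0()` converts them to text at 15 significant digits before they are parsed back.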
wkumler commented 10 months ago

See the new vignette introduced in #24 for timing and implementation of arrow and its comparison to other packages.