zhengxwen / gdsfmt

R Interface to CoreArray Genomic Data Structure (GDS) Files (Development version only)
http://www.bioconductor.org/packages/gdsfmt
18 stars 4 forks source link

gdsSubset and other gds object interactions are not practial for loops/parallel #37

Open rbutleriii opened 4 months ago

rbutleriii commented 4 months ago

If you want to gdsSubset your object, you cannot have it open, which is incredibly impractical for downstream use if I need information from it, for example sample IDs.

# prune to the relevant biomarker overlap
gds <- GdsGenotypeReader(gds_fn)
sample.sel <- getScanID(gds)

# for each biomarker make grm
lapply(bios, function(i) {
  # intersection for variable of interest
  both <- intersect(a[DS1V == 0 & !is.na(get(i)), PIDN], sample.sel)
  x <- tempfile()
  gdsSubset(gds_fn, x, sample.include = both)

  # run PC-Relate
  x <- GenotypeBlockIterator(x)
  mypcrelate <- pcrelate(x, 
    pcs = mypcair$vectors[both, 1:8], 
    training.set = intersect(mypcair$unrels, both),
    BPPARAM=BiocParallel::SerialParam()
  )

  # write pcrelate RDS and GRM to file
  saveRDS(mypcrelate, file = paste("mypcrelate", i, "rds", sep = "."))
})

This will fail unless I close the object first, and would be a major chokepoint if I had to access elements inside the gds object in the lapply, as it would open and close the gds file for each iteration. Not to mention impossible to use with something like future_lapply trying to parallelize the process since it only allows the file to be open once. Is there a reason it has to lock the file for use?

smgogarten commented 4 months ago

This should really be an issue in the GWASTools package, since gdsSubset is a GWASTools function. But this is not the intended use of gdsSubset. You don't need to write a new GDS file for every sample set; instead, use the sample.include argument to pcrelate to select samples for every iteration.

If you really want to do it this way, set allow.fork=TRUE in your call to gdsSubset.

rbutleriii commented 4 months ago

If you do pass it sample.include, it will error without the training.set being subset, but it runs with the full set of pcs got the gds. Does the function handle the pc subset, or are the rows mismatched without pcs = mypcair$vectors[both, 1:8]?

plan("multicore", workers = 8)
options(future.globals.maxSize = 4 * 1e9)
# prune to the relevant biomarker overlap
gds <- GenotypeBlockIterator(GenotypeData(GdsGenotypeReader(gds_fn, allow.fork = TRUE)))
sample.sel <- getScanID(gds)

# for each biomarker make grm
future_lapply(bios, function(i) {
  # intersection for variable of interest
  both <- intersect(a[DS1V == 0 & !is.na(get(i)), PIDN], sample.sel)

  # run PC-Relate
  mypcrelate <- pcrelate(gds, 
    pcs = mypcair$vectors[, 1:8], 
    training.set = intersect(mypcair$unrels, both),
    BPPARAM = BiocParallel::SerialParam(),
    sample.include = both
  )

  # write pcrelate RDS and GRM to file
  saveRDS(mypcrelate, file = paste("mypcrelate", i, "rds", sep = "."))
}, future.seed = TRUE)
smgogarten commented 4 months ago

I don't entirely understand your question. If you're getting an error or unexpected results using sample.include with pcrelate, please post a reproducible example on the GENESIS issues page.