zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray
44 stars 12 forks source link

seqBlockApply() is unexpectedly slow for "annotation/info/VARIABLE" #59

Closed zhengxwen closed 4 years ago

zhengxwen commented 4 years ago
library(SeqArray)
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] SeqArray_1.26.2 gdsfmt_1.23.6

1KG chrX gds file is downloaded from gds-stat.

f <- seqOpen("1KG_ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.gds")
f
## Object of class "SeqVarGDSClass"
## File: 1KG_ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.gds (94.1M)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { Str8 2504 LZMA_ra(7.66%), 1.5K } *
## |--+ variant.id   { Int32 3468093 LZMA_ra(3.56%), 483.0K } *
## |--+ position   { Int32 3468093 LZMA_ra(29.4%), 3.9M } *
## |--+ chromosome   { Str8 3468093 LZMA_ra(0.02%), 1.1K } *
## |--+ allele   { Str8 3468093 LZMA_ra(16.2%), 2.3M } *
## |--+ genotype   [  ] *
## |  |--+ data   { Bit2 2x2504x3471055 LZMA_ra(1.49%), 61.8M } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Int16 0 LZMA_ra, 18B }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 2504x3468093 LZMA_ra(0.01%), 154.3K } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Bit1 0 LZMA_ra, 18B }
## |--+ annotation   [  ]
## |  |--+ id   { Str8 3468093 LZMA_ra(33.0%), 6.0M } *
## |  |--+ qual   { Float32 3468093 LZMA_ra(0.02%), 2.1K } *
## |  |--+ filter   { Int32,factor 3468093 LZMA_ra(0.02%), 2.1K } *
## |  |--+ info   [  ]
## |  |  |--+ CIEND   { Int32 2x3468093 LZMA_ra(0.04%), 9.9K } *
## |  |  |--+ CIPOS   { Int32 2x3468093 LZMA_ra(0.04%), 9.8K } *
## |  |  |--+ CS   { Str8 3468093 LZMA_ra(0.23%), 8.0K } *
## |  |  |--+ END   { Int32 3468093 LZMA_ra(0.09%), 11.7K } *
## |  |  |--+ IMPRECISE   { Bit1 3468093 LZMA_ra(0.05%), 217B } *
## |  |  |--+ MC   { Str8 994 LZMA_ra(21.8%), 3.5K } *
## ...
system.time({ seqBlockApply(f, "position", function(x) {}) })
# 4.6 seconds, look OK
system.time({ seqBlockApply(f, "variant.id", function(x) {}) })
# 11 minutes!!
system.time({ seqBlockApply(f, "annotation/qual", function(x) {}) })
# 1 minutes!!
system.time({ seqBlockApply(f, "annotation/info/DP", function(x) {}) })
# 23 minutes!!!
system.time({ seqBlockApply(f, "annotation/info/AA", function(x) {}) })
# 6 minutes!!

seqBlockApply() calls seqGetData() internally.

zhengxwen commented 4 years ago

SeqArray_1.27.8 fixes the problems.

> system.time({ seqBlockApply(f, "position", function(x) {}) })
   user  system elapsed
  0.367   0.011   0.381
> system.time({ seqBlockApply(f, "variant.id", function(x) {}) })
   user  system elapsed
  0.214   0.004   0.218
> system.time({ seqBlockApply(f, "annotation/qual", function(x) {}) })
   user  system elapsed
  0.038   0.005   0.042
> system.time({ seqBlockApply(f, "annotation/info/DP", function(x) {}) })
   user  system elapsed
  0.390   0.003   0.394
> system.time({ seqBlockApply(f, "annotation/info/AA", function(x) {}) })
   user  system elapsed
  0.255   0.003   0.258

Direct read data without blocking:

> system.time(z <- seqGetData(f, "position"))
   user  system elapsed
  0.006   0.003   0.009
> system.time(z <- seqGetData(f, "variant.id"))
   user  system elapsed
  0.209   0.004   0.216
> system.time(z <- seqGetData(f, "annotation/qual"))
   user  system elapsed
  0.032   0.009   0.041
> system.time(z <- seqGetData(f, "annotation/info/DP"))
   user  system elapsed
  0.388   0.002   0.391
> system.time(z <- seqGetData(f, "annotation/info/AA"))
   user  system elapsed
  0.259   0.028   0.290