zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray

problem importing very large VCF file #30

Closed thierrygosselin closed 6 years ago

thierrygosselin commented 6 years ago

Hi Xiuwen,

RStudio crashes when I try to import a very large VCF file (33 GB) with the command below (I'm sending the link to the file by email):

vcf.connection <- SeqArray::seqVCF2GDS(
    vcf.fn = "populations.snps.vcf",
    out.fn = "data.gds",
    parallel = 4L,
    verbose = TRUE)

I'm able to import the same file with SNPRelate and then convert the result to the SeqArray format:

vcf.connection <- SNPRelate::snpgdsVCF2GDS(
    vcf.fn = "populations.snps.vcf",
    out.fn = "data.gds",
    method = "biallelic.only",
    verbose = TRUE)
vcf.connection.seq <- SeqArray::seqSNP2GDS(
    gds.fn = "data.gds",
    out.fn = "data.seq.gds",
    verbose = TRUE)

My session info:

devtools::session_info()
Session info ----------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.0 (2018-04-23)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.1.453)           
 language (EN)                        
 collate  en_CA.UTF-8                 
 tz       Australia/Hobart            
 date     2018-06-07                  

Packages --------------------------------------------------------------------------------------------------------
 package   * version date       source        
 base      * 3.5.0   2018-04-24 local         
 compiler    3.5.0   2018-04-24 local         
 datasets  * 3.5.0   2018-04-24 local         
 devtools    1.13.5  2018-02-18 CRAN (R 3.5.0)
 digest      0.6.15  2018-01-28 CRAN (R 3.5.0)
 graphics  * 3.5.0   2018-04-24 local         
 grDevices * 3.5.0   2018-04-24 local         
 memoise     1.1.0   2017-04-21 CRAN (R 3.5.0)
 methods   * 3.5.0   2018-04-24 local         
 stats     * 3.5.0   2018-04-24 local         
 tools       3.5.0   2018-04-24 local         
 utils     * 3.5.0   2018-04-24 local         
 withr       2.1.2   2018-03-15 CRAN (R 3.5.0)
 yaml        2.1.19  2018-05-01 CRAN (R 3.5.0)

Thanks,
Thierry

zhengxwen commented 6 years ago

The problem is not related to the large file size. When I only import the first 25 lines of your VCF file, I see the same problem:

 *** caught segfault ***
address (nil), cause 'memory not mapped'

I have added a bug label to this issue. Thanks for your patience.
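For readers who want to reproduce a crash like this on a small input, here is a minimal sketch in base R. The helper `vcf_head()` and the output file names are hypothetical (they are not part of SeqArray); only `seqVCF2GDS()` and the input file name come from the report above. It assumes the first lines of the file include the complete VCF header, otherwise the subset is not a valid VCF.

```r
# Hypothetical helper (not part of SeqArray): copy the first n lines of a
# text file into a smaller file so that a failing import runs quickly.
vcf_head <- function(vcf.fn, out.fn, n = 25L) {
    lines <- readLines(vcf.fn, n = n)
    writeLines(lines, out.fn)
    invisible(length(lines))
}

# "populations.snps.vcf" is the file name from the report; the subset and
# GDS file names below are assumptions chosen for illustration.
if (file.exists("populations.snps.vcf") &&
    requireNamespace("SeqArray", quietly = TRUE)) {
    vcf_head("populations.snps.vcf", "populations.head25.vcf", n = 25L)
    SeqArray::seqVCF2GDS(
        vcf.fn = "populations.head25.vcf",
        out.fn = "head25.gds",
        verbose = TRUE)
}
```

Running the import from a plain R session in a terminal, rather than inside RStudio, makes it easier to see a segfault message like the one above.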

zhengxwen commented 6 years ago

Please install the latest version from GitHub:

library("devtools")
install_github("zhengxwen/SeqArray")
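After reinstalling, it can help to restart R and confirm which version is actually installed. The sketch below uses only base R; `installed_version()` is a hypothetical helper name, not a SeqArray or devtools function.

```r
# Hypothetical helper: return the installed version of a package as a
# string, or NA if the package is not installed. Does not load the package.
installed_version <- function(pkg) {
    if (!nzchar(system.file(package = pkg))) {
        return(NA_character_)
    }
    as.character(utils::packageVersion(pkg))
}

print(installed_version("SeqArray"))
```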

thierrygosselin commented 6 years ago

Thanks, it's working!