zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray
43 stars 12 forks source link

problem with SeqArray::seqVCF2GDS #36

Closed thierrygosselin closed 5 years ago

thierrygosselin commented 5 years ago

Using the vcf file provided in the email, I get two problems using SeqArray::seqVCF2GDS differently:

1. without parallel

test <- SeqArray::seqVCF2GDS(
  vcf.fn = "test.vcf",
  out.fn = "test",
  parallel = FALSE,
  verbose = TRUE)
Mon Sep 24 10:23:07 2018
Variant Call Format (VCF) Import:
    file(s):
        test.vcf (55.9M)
    file format: VCFv4.2
    the number of sets of chromosomes (ploidy): 2
    the number of samples: 511
    genotype storage: bit2
    compression method: LZMA_RA
    # of samples: 511
Output:
    test
Parsing 'test.vcf':
Error in SeqArray::seqVCF2GDS(vcf.fn = "test.vcf", out.fn = "test", parallel = FALSE,  : 
  FORMAT ID 'AD' should have 1 value(s), but receives 2.
FILE: ~test.vcf
LINE: 12, COLUMN: 10, 0/0:24:24,0

2. with parallel

test <- SeqArray::seqVCF2GDS(
  vcf.fn = "test.vcf",
  out.fn = "test",
  parallel = 8,
  verbose = TRUE)
Mon Sep 24 10:50:53 2018
Variant Call Format (VCF) Import:
    file(s):
        test.vcf (55.9M)
    file format: VCFv4.2
    the number of sets of chromosomes (ploidy): 2
    the number of samples: 511
    genotype storage: bit2
    compression method: LZMA_RA
    # of samples: 511
    calculating the total number of variants ...
    the total number of variants for import: 10,170
    Writing to 8 files:
        test_tmp01_83181905b707 [1..1,272]
        test_tmp02_831846333966 [1,273..2,544]
        test_tmp03_83184cfd7b8b [2,545..3,816]
        test_tmp04_831815ba0a2a [3,817..5,088]
        test_tmp05_831868f1568a [5,089..6,360]
        test_tmp06_83183c68b3d9 [6,361..7,632]
        test_tmp07_83187df78a8a [7,633..8,904]
        test_tmp08_83186a8aaa2 [8,905..10,170]
    Done (Mon Sep 24 10:50:54 2018).
Output:
    test
Merging:
    opening 'test_tmp01_83181905b707' ... [done]
    opening 'test_tmp02_831846333966' ... [done]
    opening 'test_tmp03_83184cfd7b8b' ... [done]
    opening 'test_tmp04_831815ba0a2a' ... [done]
    opening 'test_tmp05_831868f1568a' ... [done]
    opening 'test_tmp06_83183c68b3d9' ... [done]
    opening 'test_tmp07_83187df78a8a' ... [done]
    opening 'test_tmp08_83186a8aaa2' ... [done]
Digests:
    sample.id  [md5: 04d1e6a0b6c74c7f27b0d240722647d2]
    variant.id  [md5: 542bdf96296222b1721f6dafbf84aed0]
    position  [md5: 70ac4989105257e5e3e685982392bbee]
    chromosome  [md5: 32a3e2c9caf1f42f64f5d5b9049d07a6]
    allele  [md5: 1dc1172743276e95593e72e55fa736db]
    genotype  [md5: 6e6dcb39f4b68c5a7dc6ad670c6ff366]
    phase  [md5: 6e6dcb39f4b68c5a7dc6ad670c6ff366]
    annotation/id  [md5: a36cde20b263a017a84b811b1aeda9ff]
    annotation/qual  [md5: 5d49cf128b1bc2003d7de302af063392]
    annotation/filter  [md5: b712b811fcc1eae98db00d18dbfa896d]
    annotation/info/NS  [md5: a9ab378ade0e37dbdb1de6334ca37b00]
    annotation/info/AF  [md5: d827b6659689b4212dfe4a8f3f5d3046]
    annotation/info/locori  [md5: b986643e8ad8df11c7851c3c9691970d]
    annotation/format/DP  [md5: 6e6dcb39f4b68c5a7dc6ad670c6ff366]
    annotation/format/AD  [md5: 6e6dcb39f4b68c5a7dc6ad670c6ff366]
    annotation/format/GL  [md5: 6e6dcb39f4b68c5a7dc6ad670c6ff366]
Done.
Mon Sep 24 10:50:55 2018
Optimize the access efficiency ...
Clean up the fragments of GDS file:
    open the file 'test' (11.7K)
    # of fragments: 143
    save to 'test.tmp'
    rename 'test.tmp' (10.9K, reduced: 852B)
    # of fragments: 72
Mon Sep 24 10:50:55 2018
Warning message:
In mclapply(seq_len(njobs), mc.preschedule = FALSE, mc.cores = njobs,  :
  8 function calls resulted in an error

And FYI the result of my devtools::session_info()

Session info --------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.2.1007)          
 language (EN)                        
 collate  en_CA.UTF-8                 
 tz       America/Montreal            
 date     2018-09-24                  

Packages ------------------------------------------------------------
 package          * version   date      
 amap               0.8-16    2018-05-14
 assertthat         0.2.0     2017-04-11
 base             * 3.5.1     2018-07-05
 bindr              0.1.1     2018-03-13
 bindrcpp           0.2.2     2018-03-29
 BiocGenerics       0.26.0    2018-05-01
 Biostrings         2.48.0    2018-05-01
 bitops             1.0-6     2013-08-17
 codetools          0.2-15    2016-10-05
 colorspace         1.3-2     2016-12-14
 compiler           3.5.1     2018-07-05
 crayon             1.3.4     2017-09-16
 data.table         1.11.6    2018-09-19
 datasets         * 3.5.1     2018-07-05
 devtools           1.13.6    2018-06-27
 digest             0.6.17    2018-09-12
 dplyr              0.7.6     2018-06-29
 fst                0.8.8     2018-06-07
 future             1.9.0     2018-07-23
 gdsfmt             1.16.0    2018-05-01
 GenomeInfoDb       1.16.0    2018-05-01
 GenomeInfoDbData   1.1.0     2018-06-06
 GenomicRanges      1.32.7    2018-09-20
 ggplot2            3.0.0     2018-07-03
 globals            0.12.3    2018-09-17
 glue               1.3.0     2018-07-17
 graphics         * 3.5.1     2018-07-05
 grDevices        * 3.5.1     2018-07-05
 grid               3.5.1     2018-07-05
 gtable             0.2.0     2016-02-26
 hms                0.4.2     2018-03-10
 IRanges            2.14.12   2018-09-20
 lazyeval           0.2.1     2017-10-29
 listenv            0.7.0     2018-01-21
 magrittr           1.5       2014-11-22
 memoise            1.1.0     2017-04-21
 methods          * 3.5.1     2018-07-05
 munsell            0.5.0     2018-06-12
 parallel           3.5.1     2018-07-05
 pbmcapply          1.2.5     2018-05-19
 pillar             1.3.0     2018-07-14
 pkgconfig          2.0.2     2018-08-16
 plyr               1.8.4     2016-06-08
 purrr              0.2.5     2018-05-29
 R6                 2.2.2     2017-06-17
 radiator         * 0.0.16    2018-09-07
 Rcpp               0.12.18   2018-07-23
 RCurl              1.95-4.11 2018-07-15
 readr              1.1.1     2017-05-16
 rlang              0.2.2     2018-08-16
 rstudioapi         0.7       2017-09-07
 S4Vectors          0.18.3    2018-06-08
 scales             1.0.0     2018-08-09
 SeqArray           1.21.6    2018-09-24
 stats            * 3.5.1     2018-07-05
 stats4             3.5.1     2018-07-05
 stringi            1.2.4     2018-07-20
 tibble             1.4.2     2018-01-22
 tidyr              0.8.1     2018-05-18
 tidyselect         0.2.4     2018-02-26
 tools              3.5.1     2018-07-05
 utils            * 3.5.1     2018-07-05
 withr              2.1.2     2018-03-15
 XVector            0.20.0    2018-05-01
 zlibbioc           1.26.0    2018-05-01
 source                             
 CRAN (R 3.5.0)                     
 cran (@0.2.0)                      
 local                              
 cran (@0.1.1)                      
 cran (@0.2.2)                      
 Bioconductor                       
 Bioconductor                       
 CRAN (R 3.5.0)                     
 CRAN (R 3.5.1)                     
 cran (@1.3-2)                      
 local                              
 CRAN (R 3.5.0)                     
 CRAN (R 3.5.1)                     
 local                              
 CRAN (R 3.5.0)                     
 CRAN (R 3.5.1)                     
 CRAN (R 3.5.1)                     
 CRAN (R 3.5.0)                     
 CRAN (R 3.5.0)                     
 Bioconductor                       
 Bioconductor                       
 Bioconductor                       
 cran (@1.32.7)                     
 CRAN (R 3.5.0)                     
 CRAN (R 3.5.1)                     
 CRAN (R 3.5.1)                     
 local                              
 local                              
 local                              
 cran (@0.2.0)                      
 cran (@0.4.2)                      
 cran (@2.14.12)                    
 cran (@0.2.1)                      
 cran (@0.7.0)                      
 cran (@1.5)                        
 CRAN (R 3.5.0)                     
 local                              
 CRAN (R 3.5.0)                     
 local                              
 CRAN (R 3.5.0)                     
 CRAN (R 3.5.0)                     
 CRAN (R 3.5.0)                     
 cran (@1.8.4)                      
 CRAN (R 3.5.0)                     
 CRAN (R 3.5.0)                     
 local                              
 CRAN (R 3.5.0)                     
 CRAN (R 3.5.0)                     
 cran (@1.1.1)                      
 CRAN (R 3.5.0)                     
 CRAN (R 3.5.0)                     
 cran (@0.18.3)                     
 CRAN (R 3.5.0)                     
 Github (zhengxwen/SeqArray@7966999)
 local                              
 local                              
 CRAN (R 3.5.0)                     
 cran (@1.4.2)                      
 CRAN (R 3.5.0)                     
 cran (@0.2.4)                      
 local                              
 local                              
 CRAN (R 3.5.0)                     
 Bioconductor                       
 Bioconductor

Note: that I get the same problem with Bioconductor version 1.20.2 and GitHub dev version 1.21.6 of SeqArray

zhengxwen commented 5 years ago

Next time, please send me a compressed file instead of an uncompressed VCF file. Your raw VCF file crashes my email account, and I have to delete your email after downloading your test file.

thierrygosselin commented 5 years ago

sorry about that, new email with dropbox link working

thierrygosselin commented 5 years ago

is it a problem with the VCF file ?

zhengxwen commented 5 years ago
# modify the header 
h <- seqVCF_Header("test.vcf")
h$format
h$format$Number[h$format$ID=="AD"] <- "."

seqVCF2GDS("test.vcf", "test.gds", header=h)

This works.

thierrygosselin commented 5 years ago

I understand the problem now... Stacks as consistency Issues with vcf, they say things in the headers that is not always there... or the wrong class ... GL and AD is just a glimpse

Thanks for pointing for an easy solution