zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray
44 stars 12 forks source link

problem with VCF #40

Closed thierrygosselin closed 6 years ago

thierrygosselin commented 6 years ago

Hi Xiuwen,

I have another VCF file generating an error with SeqArray. This time, the vcf file was produced with freeBayes v1.1.0-4-gb6041c6

Here's the command:

require(magrittr)
check <- SeqArray::seqVCF2GDS(
  vcf.fn = "test.vcf",
  out.fn = "testing",
  parallel = 12,
  storage.option = "ZIP_RA",
  verbose = FALSE) %>%
  SeqArray::seqOpen(gds.fn = ., readonly = FALSE)

The error:

Warning message:
In mclapply(seq_len(njobs), mc.preschedule = FALSE, mc.cores = njobs,  :
  10 function calls resulted in an error

The link to get the file test.vcf was sent by email.

results of my devtools::session_info():

Session info --------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.1.456)           
 language (EN)                        
 collate  en_CA.UTF-8                 
 tz       America/Montreal            
 date     2018-10-30                  

Packages ------------------------------------------------------------
 package          * version   date       source        
 base             * 3.5.1     2018-07-05 local         
 BiocGenerics       0.26.0    2018-05-01 Bioconductor  
 Biostrings         2.48.0    2018-05-01 Bioconductor  
 bitops             1.0-6     2013-08-17 CRAN (R 3.5.0)
 compiler           3.5.1     2018-07-05 local         
 crayon             1.3.4     2017-09-16 CRAN (R 3.5.0)
 datasets         * 3.5.1     2018-07-05 local         
 devtools           1.13.6    2018-06-27 CRAN (R 3.5.0)
 digest             0.6.18    2018-10-10 CRAN (R 3.5.0)
 gdsfmt             1.16.0    2018-05-01 Bioconductor  
 GenomeInfoDb       1.16.0    2018-05-01 Bioconductor  
 GenomeInfoDbData   1.1.0     2018-10-02 Bioconductor  
 GenomicRanges      1.32.7    2018-09-20 Bioconductor  
 graphics         * 3.5.1     2018-07-05 local         
 grDevices        * 3.5.1     2018-07-05 local         
 IRanges            2.14.12   2018-09-20 Bioconductor  
 magrittr         * 1.5       2014-11-22 CRAN (R 3.5.0)
 memoise            1.1.0     2017-04-21 CRAN (R 3.5.0)
 methods          * 3.5.1     2018-07-05 local         
 parallel           3.5.1     2018-07-05 local         
 RCurl              1.95-4.11 2018-07-15 CRAN (R 3.5.0)
 rstudioapi         0.8       2018-10-02 CRAN (R 3.5.0)
 S4Vectors          0.18.3    2018-06-08 Bioconductor  
 SeqArray           1.20.2    2018-09-10 Bioconductor  
 stats            * 3.5.1     2018-07-05 local         
 stats4             3.5.1     2018-07-05 local         
 tools              3.5.1     2018-07-05 local         
 utils            * 3.5.1     2018-07-05 local         
 withr              2.1.2     2018-03-15 CRAN (R 3.5.0)
 XVector            0.20.0    2018-05-01 Bioconductor  
 yaml               2.2.0     2018-07-25 CRAN (R 3.5.0)
 zlibbioc           1.26.0    2018-05-01 Bioconductor  

Cheers Thierry

thierrygosselin commented 6 years ago

Setting parallel = 1 reveal more info on the error:

Error in SeqArray::seqVCF2GDS(vcf.fn = "test.vcf", out.fn = "testing",  : 
  FORMAT ID 'AD' (Number=R) should have 2 value(s), but receives 3.
FILE: test.vcf
LINE: 87, COLUMN: 17, ./.:1:1,0,0:1:22:0,0:0,0:0,-0.30103,-2.19784,-0.30103,-2.19784,-2.19784

The vcf is multi-allelic (haplotypes), however the variant in line 87 should be bi-allelic (C/T), so I supposed it comes down to how loose you want to be in your parsing interpretation of the VCF spec 4.2... I totally understand if you don't want to work around those issues.

zhengxwen commented 6 years ago

See:

library(SeqArray)

seqVCF2GDS("test.vcf", "test.gds")  # fails

hd <- seqVCF_Header("test.vcf")
hd$format
hd$format[hd$format$ID=="AD", ]
#   ID Number    Type                           Description
# 5 AD      R Integer Number of observation for each allele

# change it
hd$format[hd$format$ID=="AD", "Number"] <- "."
hd$format[hd$format$ID=="AO", "Number"] <- "."
hd$format[hd$format$ID=="QA", "Number"] <- "."
hd$format[hd$format$ID=="QA", "Number"] <- "."
hd$format[hd$format$ID=="GL", "Number"] <- "."

seqVCF2GDS("test.vcf", "test.gds", header=hd)  # works
thierrygosselin commented 6 years ago

much appreciated! Is there a way to cancel the import of some (e.g. AO or QA) so that it's not integrated in the GDS from the start ?

zhengxwen commented 6 years ago

See:

nm <- hd$format$ID
nm <- setdiff(nm, c("AO", "QA"))
seqVCF2GDS("test.vcf", "test.gds", header=hd, fmt.import=nm)
thierrygosselin commented 6 years ago

thanks!