zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray
43 stars 12 forks source link

Problem in VCF AD field imported by SeqArray #60

Closed thierrygosselin closed 4 years ago

thierrygosselin commented 4 years ago

Dear Xiuwen,

The original data is a VCF file generated by the RADseq pipeline called dDocent. The import with SeqArray is fine.

Problem: I'm having a problem with the GDS. The GT and DP length are fine, but it's the AD length that doesn't match.

Question: Could you tell be if it's a SeqArray bug during import or an upstream problem (dDocent and/or FreeBayes) ?

I'm sending you the VCF file by email.

Best Thierry

zhengxwen commented 4 years ago

The error:

Error in seqVCF2GDS(fn, "t.gds") :
  FORMAT ID 'AD' (Number=R) should have 2 value(s), but receives 4.
FILE: /Users/zhengxx1/Downloads/thierry_gosselin_20200127_freebayes.vcf
LINE: 75, COLUMN: 10, 0/0:12:12,0,0,0:12:432:0,0,0:0,0,0:0,-3.61236,-39.2188,-3.61236,-39.2188,-39.2188,-3.61236,-39.2188,-39.2188,-39.2188

You should modify the vcf file and change ID=AD,Number=R to ID=AD,Number=.. Please also modify other lines of header of VCF, to allow variable-length data.

thierrygosselin commented 4 years ago

Thanks for the quick answer!

zhengxwen commented 4 years ago

Or revise the VCF header in R:

library(SeqArray)

fn <- "thierry_gosselin_20200127_freebayes.vcf.gz"
hdr <- seqVCF_Header(fn)
hdr$format$Number[hdr$format$ID=="AD"] <- "."
hdr$format$Number[hdr$format$ID=="AO"] <- "."
hdr$format$Number[hdr$format$ID=="QA"] <- "."
hdr$format$Number[hdr$format$ID=="GL"] <- "."

seqVCF2GDS(fn, "t.gds", header=hdr)
thierrygosselin commented 4 years ago

Yes I usually do those.

But this type of VCF, where it's supposed to be haplotypes, but only showing just 2 alleles and 4 AD values...