zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray
44 stars 12 forks source link

seqGDS2VCF creates an inverted GT on the VCF file #22

Closed AAvalos82 closed 7 years ago

AAvalos82 commented 7 years ago

Likely a minor bug, but the seqGDS2VCF function is effectively inverting the genotype when it writes out a VCF.

The function uses the allele coding:

2 = two copies of A/REF, 1= A/B, 0= two copies of B/ALT, when producing a VCF file.

Thus the VCF output becomes:

1/1 for the REF in the GT field and 0/0 for the ALT in the GT field of the VCF.

However, the VCF format treats 0 = REF and 1 = ALT (see "1.4.2 Genotype fields" section in https://samtools.github.io/hts-specs/VCFv4.2.pdf, and also http://www.internationalgenome.org/wiki/Analysis/vcf4.0/). As per this format a REF homozygous should be 0/0 and ALT homozygous should be 1/1 in the VCF.

This may be specific to my case, where I converted a SNP GDS file to a Seq GDS file first using the seqSNP2GDS, then wrote the Seq GDS into a VCF. But if that is the case then the inversion is occurring at the seqSNP2GDS step.

zhengxwen commented 7 years ago

The inversion is occurring at the seqSNP2GDS step.

A new option major.ref is added to `seqSNP2GDS:

seqSNP2GDS(gds.fn, out.fn, storage.option="LZMA_RA", major.ref=TRUE,
    optimize=TRUE, digest=TRUE, verbose=TRUE)

if major.ref=TRUE, use the major allele as a reference allele.