zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray
43 stars 12 forks source link

Missing VCF INFO field causes seqVCF2GDS error #70

Closed kaskarn closed 3 years ago

kaskarn commented 3 years ago

VCF format specs allow variants to only list a subset of the INFO fields appearing in the header. In my VCF files, SNPs are optionally assigned RSID= in the last INFO field.

##INFO=<ID=RSID,Number=0,Type=String,Description="rs id number, extracted from  https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz">

Most variants list an RSID (first line), but others do not (second line):

chr10   10654   chr10:10654:TGCAGAGAAGAACGCA:T  TGCAGAGAAGAACGCA        T       .       PASS    AF=0.00035;MAF=0.00035;R2=0.3532;IMPUTED;R2_HAT=0.399014;RSID=rs1211140797
chr10   10709   chr10:10709:G:A G       A       .       PASS    AF=0.01458;MAF=0.01458;R2=0.3914;IMPUTED;R2_HAT=0.37429

When I attempt to call seqVCF2GDS one these data, I get the following error:

...
file format: VCFv4.1
    genome reference: <unknown>
    the number of sets of chromosomes (ploidy): 2
    the number of samples: 2,013
    genotype storage: bit2
    compression method: LZ4_RA
    # of samples: 2013
    scenario: imputation
        annotation/format/DS: packedreal16
        annotation/format/GP: packedreal16
Output:
    ./file6a133133ba4b.gds
    ID Number   Type
9 RSID      0 String
                                                                                                                Description
9 rs id number, extracted from  https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz
  Source Version
9   <NA>    <NA>
Error in seqVCF2GDS(vcf_fn, tmpf, verbose = TRUE, parallel = ntasks, scenario = gtype,  : 
  The length should be >0.
Execution halted

The expected behavior would be to set missing INFO fields to the missing value for their type, e.g. bcftools:

bcftools query -f "%CHROM %POS %RSID\n" $f | head 
chr10 10654 rs1211140797
chr10 10709 .
chr10 10859 rs1288337901
chr10 10905 rs1385633825
chr10 10943 .
chr10 10966 rs1432747350
chr10 10988 .
chr10 10993 rs1223450248
chr10 11818 .
chr10 11991 rs1413065739
kaskarn commented 3 years ago

It turns out the issue was something else (Number= in header was incorrect). Closing