will-rowe / artic-tools

a set of tools for viral amplicon schemes
MIT License
6 stars 2 forks source link

--strict fails #12

Closed MattBashton closed 3 years ago

MattBashton commented 3 years ago

--strict fails for every sample in a recent run, looks like a bug with the contig missing from the vcf header which is causing vcf_parse to fail.

Content of vcfreport.txt:

[16:35:15] [artic-tools::check_vcf] starting VCF checker
[16:35:15] [artic-tools::check_vcf] reading scheme
[16:35:15] [artic-tools::check_vcf] collecting scheme stats
[16:35:15] [artic-tools::check_vcf]   primer scheme file:   ./primer-schemes/scov2/V3/scov2.scheme.bed
[16:35:15] [artic-tools::check_vcf]   reference sequence:   MN908947.3
[16:35:15] [artic-tools::check_vcf]   number of pools:    2
[16:35:15] [artic-tools::check_vcf]   number of primers:   218 (includes 22 alts)
[16:35:15] [artic-tools::check_vcf]   minimum primer size:  22
[16:35:15] [artic-tools::check_vcf]   maximum primer size:  57
[16:35:15] [artic-tools::check_vcf]   number of amplicons:  98
[16:35:15] [artic-tools::check_vcf]   mean amplicon size:   343
[16:35:15] [artic-tools::check_vcf]   maximum amplicon size: 375
[16:35:15] [artic-tools::check_vcf]   scheme ref. span:    30-29866
[16:35:15] [artic-tools::check_vcf]   scheme overlaps:    12.850247%
[16:35:15] [artic-tools::check_vcf] checking VCF file
[16:35:15] [artic-tools::check_vcf]   filtering variants: true
[16:35:15] [artic-tools::check_vcf]   output file: SAMP_ID.merged.filtered.vcf
[16:35:15] [artic-tools::check_vcf]   discard primer site vars: true
[16:35:15] [artic-tools::check_vcf]   discard overlap fail vars: true
[W::vcf_parse] Contig 'MN908947.3' is not defined in the header. (Quick workaround: index the file with tabix.)
error--> refusing to process VCF records

Specifically the command failing on standard error is reported by artic minion as:

Command failed:artic-tools check_vcf --dropPrimerVars --dropOverlapFails --vcfOut SAMP_ID.merged.filtered.vcf SAMP_ID.merged.vcf ./primer-schemes/scov2/V3/scov2.scheme.bed 2> SAMP_ID.vcfreport.txt

will-rowe commented 3 years ago

Thanks for posting the issue here - I'll take a look. Hopefully before the end of the week!

MattBashton commented 3 years ago

OKay so did some digging, I'm using the default nanopolish option, and the file consumed at this stage SAMP_ID.merged.filtered.vcf looks to be missing the canonical vcf spec ##contig= line specifically the VCF header looks like this:

##fileformat=VCFv4.2
##nanopolish_window=MN908947.3:1-29902
##INFO=<ID=TotalReads,Number=1,Type=Integer,Description="The number of event-space reads used to call the variant">
##INFO=<ID=SupportFraction,Number=1,Type=Float,Description="The fraction of event-space reads that support the variant">
##INFO=<ID=SupportFractionByStrand,Number=2,Type=Float,Description="Fraction of event-space reads that support the variant for each strand">
##INFO=<ID=BaseCalledReadsWithVariant,Number=1,Type=Integer,Description="The number of base-space reads that support the variant">
##INFO=<ID=BaseCalledFraction,Number=1,Type=Float,Description="The fraction of base-space reads that support the variant">
##INFO=<ID=AlleleCount,Number=1,Type=Integer,Description="The inferred number of copies of the allele">
##INFO=<ID=StrandSupport,Number=4,Type=Integer,Description="Number of reads supporting the REF and ALT allele, by strand">
##INFO=<ID=StrandFisherTest,Number=1,Type=Integer,Description="Strand bias fisher test">
##INFO=<ID=SOR,Number=1,Type=Float,Description="StrandOddsRatio test from GATK">
##INFO=<ID=RefContext,Number=1,Type=String,Description="The reference sequence context surrounding the variant call">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

I think valid looking contig line in the header should look like this:

##contig=<ID=MN908947.3,length=29903,URL=https://raw.githubusercontent.com/artic-network/primer-schemes/master/nCoV-2019/V3/nCoV-2019.reference.fasta,species="Severe acute respiratory syndrome coronavirus 2">

I guess if nanopolish is running, details from the primer scheme ref file can be used to complete these details and make sure at very least the ID line and length are completed. Which should hopefully get vcf_parse working.

will-rowe commented 3 years ago

Looks like the problem was an unindexed VCF. It didn't seem to be an issue in my older testing environment but appears with the latest env release - my bad for not testing this feature properly before deploying!

Closing this issue in favour of the fix to the artic pipeline over in https://github.com/artic-network/fieldbioinformatics/pull/68 .

This leaves artic-tools requiring an indexed VCF and adds the required indexing step to the artic pipeline. Decision for this is because the indexed VCF will be useful later on when I get round to adding more features to the vcf checker.

Thanks again for raising this issue @MattBashton!