Closed ACEnglish closed 2 weeks ago
That's an interesting artifact.
Trimming the fileformat line sets a precedent that all lines should be trimmed, and I'm not sure that's a desirable outcome.
For now, I would suggest the following workaround, which manually preprocesses lines and builds a VCF header.
vcf_308.rs
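The attached file is not reproduced here. As a rough sketch of the preprocessing idea only, and not the contents of vcf_308.rs, the following uses just the Rust standard library to read the header lines (those starting with `#`), strip trailing whitespace from each, and stop before the first record. The function name `read_trimmed_header` is hypothetical. It assumes the cleaned text would then be handed to whatever header parser is in use.

```rust
use std::io::{self, BufRead};

/// Reads the leading header lines (those starting with `#`) from `reader`,
/// strips trailing whitespace from each one, and returns them joined with
/// newlines. The reader is left positioned at the first record line.
///
/// `read_trimmed_header` is a hypothetical helper for illustration only.
fn read_trimmed_header<R: BufRead>(reader: &mut R) -> io::Result<String> {
    let mut header = String::new();
    let mut line = String::new();

    loop {
        // Peek at the next byte without consuming it, so the first record
        // line stays in the reader. An empty buffer (EOF) also ends the loop.
        let buf = reader.fill_buf()?;
        if buf.first() != Some(&b'#') {
            break;
        }

        line.clear();
        reader.read_line(&mut line)?;

        // `trim_end` removes the line terminator and any trailing tabs or spaces.
        header.push_str(line.trim_end());
        header.push('\n');
    }

    Ok(header)
}
```

Because the trimming happens before parsing, the parser itself can stay strict. Note that this also strips trailing whitespace from every other header line, which may or may not be wanted.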
> Since htslib seems to be robust to this error
But perhaps overly robust. It not only accepts invalid inputs but also produces wrong results, e.g.,
$ echo "##fileformat=VCFv4.65538" | htsfile -
-: VCF version 4.2 variant calling text
$ echo "##fileformat=VCFvN/A" | htsfile -
-: VCF version 0.0 variant calling text
When building a VCF header, htslib doesn't parse unstructured lines, including the value of fileformat (samtools/htslib/vcf.c#L1189-L1195). It copies the input as-is (samtools/htslib/vcf.c#L583-L592), so the trailing characters even remain after a rewrite.
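For contrast, here is a minimal sketch of what strict validation of the fileformat value could look like. It is not noodles' actual implementation, and `parse_file_format` is a hypothetical name. It accepts only `VCFv<major>.<minor>` with plain decimal digits, so a trailing tab or a value like `VCFvN/A` is rejected rather than silently reinterpreted.

```rust
/// Parses a fileformat value such as "VCFv4.3" into `(major, minor)`.
///
/// Returns `None` for anything that is not exactly `VCFv<digits>.<digits>`.
/// `parse_file_format` is a hypothetical helper for illustration only.
fn parse_file_format(value: &str) -> Option<(u32, u32)> {
    let version = value.strip_prefix("VCFv")?;
    let (major, minor) = version.split_once('.')?;

    // Require at least one character and only ASCII digits. This rejects
    // signs, whitespace (including a trailing tab), and empty components.
    let is_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());

    if !is_digits(major) || !is_digits(minor) {
        return None;
    }

    // `parse` fails on overflow instead of wrapping, so oversized numbers
    // are rejected rather than truncated.
    Some((major.parse().ok()?, minor.parse().ok()?))
}
```

With this sketch, `VCFv4.65538` parses as minor version 65538 rather than 2. The `4.2` reported by htsfile above would be consistent with the minor version wrapping in a 16-bit field (65538 mod 65536 = 2), though that is an inference from the output and not something confirmed in this thread.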
That's interesting about htslib. In that case, I'm even more on board with noodles continuing to raise the error. And if this particular VCF wasn't a one-off anomaly, we'll be better prepared to investigate future problems. Thanks!
Hello,
I've come across a VCF which has extra whitespace at the end of the format header line:
Where <tab> is an actual tab. This is causing noodles to raise an error:
This is clearly an error in the user's VCF, but nothing in the spec explicitly says that there can't be trailing whitespace. Since htslib seems to be robust to this error, I wanted to let you know in case you want to add, e.g.,
src = src.trim_end();
to the parser. If you don't think this should be added, I support that decision, because this is a (hopefully) rare, weird outlier VCF.
Have a great day,
~/Adam English