zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
515 stars 53 forks source link

InvalidVersion error #308

Closed ACEnglish closed 2 weeks ago

ACEnglish commented 2 weeks ago

Hello,

I've come across a VCF which has extra whitespace at the end of the format header line:

##fileformat=VCFv4.2<tab><tab>

Where <tab> is an actual tab. This is causing noodles to raise an error:

Custom { kind: InvalidData, error: InvalidRecord(InvalidValue(InvalidFileFormat(InvalidVersion))) }

This is clearly an error in the user's VCF, except there's nothing explicitly stated in the specs that there can't be whitespaces. Since htslib seems to be robust to this error, I wanted to inform you in case you wanted to add e.g. src = src.trim_end(); to the parser. If you don't think this should be added, I support that decision because this is a (hopefully) rare, weird outlier vcf.

Have a great day, ~/Adam English

zaeleus commented 2 weeks ago

That's an interesting artifact.

Trimming the fileformat line sets a precedent that all lines should be trimmed, and I'm not sure that's a desirable outcome.

For now, I would suggest the following workaround, which manually preprocesses lines and builds a VCF header.

vcf_308.rs ```rust use std::io::{self, BufRead}; use noodles_vcf as vcf; const DATA: &[u8] = b"\ ##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO "; fn main() -> Result<(), Box> { let mut reader = vcf::io::Reader::new(DATA); let header = read_header(reader.get_mut())?; dbg!(header); Ok(()) } fn read_header(reader: &mut R) -> io::Result where R: BufRead, { const PREFIX: u8 = b'#'; const LINE_FEED: u8 = b'\n'; let mut parser = vcf::header::Parser::default(); let mut line = Vec::new(); loop { let src = reader.fill_buf()?; if src.is_empty() || src[0] != PREFIX { break; } line.clear(); reader.read_until(LINE_FEED, &mut line)?; parser .parse_partial(line.trim_ascii_end()) .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?; } parser .finish() .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e)) } ```

Since htslib seems to be robust to this error

But perhaps overly robust. It not only accepts invalid inputs but produces wrong results, e.g.,

$ echo "##fileformat=VCFv4.65538" | htsfile -
-:      VCF version 4.2 variant calling text
$ echo "##fileformat=VCFvN/A" | htsfile -
-:      VCF version 0.0 variant calling text

When building a VCF header, htslib doesn't parse unstructured lines, including the value of fileformat (samtools/htslib/vcf.c#L1189-L1195). It copies the input as-is (samtools/htslib/vcf.c#L583-L592), so the trailing characters even remain after a rewrite.

ACEnglish commented 2 weeks ago

That's interesting about htslib. In that case, I'm even more on-board with noodles continuing to raise the error. And if this particular vcf wasn't a one-off anomaly, we'll be better prepared to investigate future problems. Thanks!