zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
477 stars 53 forks source link

IUPAC codes in VCF REF field #268

Closed ACEnglish closed 2 months ago

ACEnglish commented 3 months ago

Hello,

First of all, thank you for supporting this package. I've had fun using it.

I received an issue on my tool that uses noodles-vcf in which variant records with IUPAC codes in the REF were being thrown out. In an attempt to fix the problem I started by moving to the latest version of noodles-vcf (0.57) and found that the reading error was no longer being thrown.

However, when attempting to write the record back out, the error is being thrown. This would be fine except noodles writes partial lines in the output VCF which are truncated at the offending bases. For example:

chr20   24408074    .   GTATTTCCAGCCTTGGCCATAGchr20 24408077    .   TTTCCAGCCTTGGCC

(note that the rest of the entry @ 24408077 was written, just truncated here for readability)

Is it possible for noodles-vcf to raise the error before writing in order to prevent the corrupted lines? My idea is that I can catch the error, attempt to fix the VCF entry, and retry writing.

I think no is a more than fair answer to this request. If so, I'd just need to validate all entries before trying to write, which would incur some overhead doing redundant checks to noodles.

If the answer is yes, but you don't have time, I'm willing to try creating a pull request. My idea would be to edit io/writer/record so that the write_* calls are sent a temporary buffer instead of the final writer. Then, at the end of the method the buffer is sent to the final writer. However, I'm not familiar enough with the code base to know if this would be sufficient. So any guidance on what all would need to change would be helpful.

zaeleus commented 2 months ago

As you noted in https://github.com/ACEnglish/kanpig/issues/1, the error is being triggered because the input is invalid. Reference bases must be {A, C, G, T, N}. In previous versions of noodles, this was validated when parsed, but it was changed to when serializing to allow at least reading such cases.

I think the solution is to integrate an ambiguity base resolver into noodles-vcf's record reference bases writer. This likely covers the vast majority of "invalid" reference bases. Although this behavior is undefined in VCF < 4.3, it is valid to apply to those versions.

Is it possible for noodles-vcf to raise the error before writing in order to prevent the corrupted lines?

This isn't really ideal because the given record's in-memory format is opaque, i.e., records are implementations of vcf::variant::Record. It's the reason why validation is done inline; the record is not guaranteed to have been previously parsed (e.g., vcf::Record and bcf::Record).

My idea would be to edit io/writer/record so that the write_* calls are sent a temporary buffer instead of the final writer.

For this approach, my recommendation would be to create a wrapper around vcf::io::Writer, e.g.,

main.rs ```rust // cargo add noodles@0.74.0 --features core,vcf use std::io::{self, Write}; use noodles::{ core::Position, vcf::{ self as vcf, header::record::value::{map::Contig, Map}, variant::io::Write as _, }, }; fn main() -> io::Result<()> { let stdout = io::stdout().lock(); let mut writer = VcfLineWriter::new(vcf::io::Writer::new(stdout)); let header = vcf::Header::builder() .add_contig("sq0", Map::::new()) .build(); writer.write_header(&header)?; let record = vcf::variant::RecordBuf::builder() .set_reference_sequence_name("sq0") .set_variant_start(Position::MIN) .set_reference_bases("R") .build(); writer.write_variant_record(&header, &record)?; Ok(()) } struct VcfLineWriter { inner: vcf::io::Writer, buf: Vec, } impl VcfLineWriter where W: Write, { fn new(inner: vcf::io::Writer) -> Self { Self { inner, buf: Vec::new(), } } fn write_header(&mut self, header: &vcf::Header) -> io::Result<()> { self.buf.clear(); let mut writer = vcf::io::Writer::new(&mut self.buf); writer.write_header(header)?; self.inner.get_mut().write_all(&self.buf) } fn write_variant_record(&mut self, header: &vcf::Header, record: &R) -> io::Result<()> where R: vcf::variant::Record, { self.buf.clear(); let mut writer = vcf::io::Writer::new(&mut self.buf); writer.write_variant_record(header, record)?; self.inner.get_mut().write_all(&self.buf) } } ```

Thanks for the report, and let me know if you have further questions or other issues with migrating to noodles-vcf >= 0.52.0.