Closed ACEnglish closed 2 months ago
As you noted in https://github.com/ACEnglish/kanpig/issues/1, the error is being triggered because the input is invalid. Reference bases must be {A, C, G, T, N}
. In previous versions of noodles, this was validated when parsed, but it was changed to when serializing to allow at least reading such cases.
I think the solution is to integrate an ambiguity base resolver into noodles-vcf's record reference bases writer. This likely covers the vast majority of "invalid" reference bases. Although this behavior is undefined in VCF < 4.3, it is valid to apply to those versions.
Is it possible for noodles-vcf to raise the error before writing in order to prevent the corrupted lines?
This isn't really ideal because the given record's in-memory format is opaque, i.e., records are implementations of vcf::variant::Record
. It's the reason why validation is done inline; the record is not guaranteed to have been previously parsed (e.g., vcf::Record
and bcf::Record
).
My idea would be to edit
io/writer/record
so that thewrite_*
calls are sent a temporary buffer instead of the final writer.
For this approach, my recommendation would be to create a wrapper around vcf::io::Writer
, e.g.,
main.rs
Thanks for the report, and let me know if you have further questions or other issues with migrating to noodles-vcf >= 0.52.0.
Hello,
First of all, thank you for supporting this package. I've had fun using it.
I received an issue on my tool that uses
noodles-vcf
in which variant records with IUPAC codes in the REF were being thrown out. In an attempt to fix the problem I started by moving to the latest version ofnoodles-vcf
(0.57) and found that the reading error was no longer being thrown.However, when attempting to write the record back out, the error is being thrown. This would be fine except noodles writes partial lines in the output VCF which are truncated at the offending bases. For example:
(note that the rest of the entry @ 24408077 was written, just truncated here for readability)
Is it possible for noodles-vcf to raise the error before writing in order to prevent the corrupted lines? My idea is that I can catch the error, attempt to fix the VCF entry, and retry writing.
I think no is a more than fair answer to this request. If so, I'd just need to validate all entries before trying to write, which would incur some overhead doing redundant checks to noodles.
If the answer is yes, but you don't have time, I'm willing to try creating a pull request. My idea would be to edit
io/writer/record
so that thewrite_*
calls are sent a temporary buffer instead of the final writer. Then, at the end of the method the buffer is sent to the final writer. However, I'm not familiar enough with the code base to know if this would be sufficient. So any guidance on what all would need to change would be helpful.