zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
512 stars 53 forks source link

fix: lazy vcf overreading into next record #224

Closed tshauck closed 10 months ago

tshauck commented 10 months ago

Hi, using the lazy vcf reader, I think I found an issue w.r.t. how the fields are being parsed. In cases where not all fields are present, read_field will read into the next line causing record.buf to contain the record in question as it was parsed and the "raw" next record (because of the last read_line in read_lazy_record).

For example, if you modify test_read_lazy_record test to have two records, &b"sq0\t1\t.\tA\t.\t.\tPASS\t.\nsq0\t1\t.\tA\t.\t.\tPASS\t."[..];, the test fails because record.buf is actually sq01.A..PASS.\nsq01\t.\tA\t.\t.\tPASS\t. (the samples field has the next record).

    #[tokio::test]
    async fn test_read_lazy_record() -> io::Result<()> {
        let mut src = &b"sq0\t1\t.\tA\t.\t.\tPASS\t.\nsq0\t1\t.\tA\t.\t.\tPASS\t."[..];

        let mut record = lazy::Record::default();
        read_lazy_record(&mut src, &mut record).await?;

        assert_eq!(record.buf, "sq01.A..PASS.");

        assert_eq!(record.bounds.chromosome_end, 3);
        assert_eq!(record.bounds.position_end, 4);
        assert_eq!(record.bounds.ids_end, 5);
        assert_eq!(record.bounds.reference_bases_end, 6);
        assert_eq!(record.bounds.alternate_bases_end, 7);
        assert_eq!(record.bounds.quality_score_end, 8);
        assert_eq!(record.bounds.filters_end, 12);
        assert_eq!(record.bounds.info_end, 13);

        Ok(())
    }

To attempt a fix, I took perhaps the dumbest and least idiomatic approach :), but I thought I'd open it up to see if you agree there's an issue and if so, thoughts on the best way to fix it.

Thanks,

zaeleus commented 10 months ago

I've been reworking the alignment formats, and I actually very recently fixed this in SAM: https://github.com/zaeleus/noodles/commit/92c51198fac5dc6710fcabc4ddda929f9ff9a6e0. The current reader fails when optional fields aren't present, but I haven't started reworking the variant formats yet.

The first 8 fields are required, so I don't think it's valid to return success if a line feed is reached.

tshauck commented 10 months ago

Cool... do you have plans to work on the variant format soon / want me to hold off, or mind if I adapted your approach w/ SAM files to VCFs? It's causing those files to give incorrect results, so I'd like to get it resolved.

zaeleus commented 10 months ago

Allow me to fix the blocking reader and reuse that in a buffered async read. I'll let you know when that's done.

The alignment format rework is close to a checkpoint, but I don't have a precise timeline. I will use my findings from that to improve the variant crates.

tshauck commented 10 months ago

Sounds good to me, thanks.

zaeleus commented 10 months ago

Thanks for using and reporting issues with the VCF lazy records!