zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
477 stars 53 forks source link

fix: lazy vcf overreading into next record #224

Closed tshauck closed 7 months ago

tshauck commented 7 months ago

Hi, using the lazy vcf reader, I think I found an issue w.r.t. how the fields are being parsed. In cases where not all fields are present, read_field will read into the next line causing record.buf to contain the record in question as it was parsed and the "raw" next record (because of the last read_line in read_lazy_record).

For example, if you modify test_read_lazy_record test to have two records, &b"sq0\t1\t.\tA\t.\t.\tPASS\t.\nsq0\t1\t.\tA\t.\t.\tPASS\t."[..];, the test fails because record.buf is actually sq01.A..PASS.\nsq01\t.\tA\t.\t.\tPASS\t. (the samples field has the next record).

    #[tokio::test]
    async fn test_read_lazy_record() -> io::Result<()> {
        let mut src = &b"sq0\t1\t.\tA\t.\t.\tPASS\t.\nsq0\t1\t.\tA\t.\t.\tPASS\t."[..];

        let mut record = lazy::Record::default();
        read_lazy_record(&mut src, &mut record).await?;

        assert_eq!(record.buf, "sq01.A..PASS.");

        assert_eq!(record.bounds.chromosome_end, 3);
        assert_eq!(record.bounds.position_end, 4);
        assert_eq!(record.bounds.ids_end, 5);
        assert_eq!(record.bounds.reference_bases_end, 6);
        assert_eq!(record.bounds.alternate_bases_end, 7);
        assert_eq!(record.bounds.quality_score_end, 8);
        assert_eq!(record.bounds.filters_end, 12);
        assert_eq!(record.bounds.info_end, 13);

        Ok(())
    }

To attempt a fix, I took perhaps the dumbest and least idiomatic approach :), but I thought I'd open it up to see if you agree there's an issue and if so, thoughts on the best way to fix it.

Thanks,

zaeleus commented 7 months ago

I've been reworking the alignment formats, and I actually very recently fixed this in SAM: https://github.com/zaeleus/noodles/commit/92c51198fac5dc6710fcabc4ddda929f9ff9a6e0. The current reader fails when optional fields aren't present, but I haven't started reworking the variant formats yet.

The first 8 fields are required, so I don't think it's valid to return success if a line feed is reached.

tshauck commented 7 months ago

Cool... do you have plans to work on the variant format soon / want me to hold off, or mind if I adapted your approach w/ SAM files to VCFs? It's causing those files to give incorrect results, so I'd like to get it resolved.

zaeleus commented 7 months ago

Allow me to fix the blocking reader and reuse that in a buffered async read. I'll let you know when that's done.

The alignment format rework is close to a checkpoint, but I don't have a precise timeline. I will use my findings from that to improve the variant crates.

tshauck commented 7 months ago

Sounds good to me, thanks.

zaeleus commented 7 months ago

Thanks for using and reporting issues with the VCF lazy records!