zaeleus / noodles

Bioinformatics I/O libraries in Rust

Async BGZF readers eagerly evaluate blocks #243

Closed mmalenic closed 7 months ago

mmalenic commented 8 months ago

This is more to clarify functionality than any sort of bug.

When using the async BGZF readers, multiple blocks are eagerly decoded in advance, with the number controlled by the worker count. This means that if you attempt to read a file that is only valid up to a certain point, reading fails before all possible records have been returned. For example, consider reading a BAM header from a stream that has arbitrary data appended after the header:

use std::num::NonZeroUsize;
use noodles::{bam, sam};
use noodles::sam::header::record::value::Map;
use noodles::sam::header::record::value::map::ReferenceSequence;

#[tokio::main]
async fn main() {
    // Create header
    let header = sam::Header::builder().add_reference_sequence(
        "sq0",
        Map::<ReferenceSequence>::new(NonZeroUsize::try_from(13).unwrap()),
    ).build();
    let mut writer = bam::AsyncWriter::new(vec![]);
    writer.write_header(&header).await.unwrap();
    writer.shutdown().await.unwrap();

    // Unwrap the BAM writer, then the BGZF writer, to get the raw bytes.
    let mut data = writer.into_inner().into_inner();

    // Add junk data to the end.
    data.push(0);

    // Try to read header
    let mut reader = bam::AsyncReader::new(data.as_slice());
    let header = reader.read_header().await;
    println!("{:#?}", header);
}

If the appended data does not align to a BGZF block boundary, even the previously valid blocks fail to be read, and an error is returned from read_header:

Err(
    Custom {
        kind: Other,
        error: "bytes remaining on stream",
    },
)

For some reason, I was expecting read_header to succeed and ignore any data past the end of the header, although I could be wrong in this assumption. If the worker count is set to 1, or if the non-async bam::io::Reader is used, this code does not produce an error (a sketch of both workarounds follows below).
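
For reference, a minimal sketch of both workarounds, reusing the data buffer from the example above. The synchronous path uses bam::io::Reader as mentioned; the single-worker async path assumes a noodles-bgzf async reader builder with set_worker_count and build_with_reader, so those names (and the From impl) may differ between versions:

use std::{io, num::NonZeroUsize};

use noodles::{bam, bgzf};

async fn read_header_lazily(data: &[u8]) -> io::Result<()> {
    // Workaround 1: the synchronous reader inflates blocks on demand, so
    // the junk byte after the header is never decoded.
    let mut reader = bam::io::Reader::new(data);
    let header = reader.read_header()?;
    println!("{header:#?}");

    // Workaround 2: restrict the async reader to a single worker so it does
    // not decode blocks ahead of the current position. NOTE: the builder
    // method names (set_worker_count, build_with_reader) and the From impl
    // are assumptions and may differ between noodles versions.
    let inner = bgzf::r#async::reader::Builder::default()
        .set_worker_count(NonZeroUsize::MIN) // worker count of 1
        .build_with_reader(data);
    let mut reader = bam::AsyncReader::from(inner);
    let header = reader.read_header().await?;
    println!("{header:#?}");

    Ok(())
}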

Is this the intended functionality? I guess by nature of it being an asynchronous reader, it's meant to eagerly evaluate records in advance?

zaeleus commented 8 months ago

For concurrent decoding, the reader buffers future blocks and expects them to be complete, so yes, it was intentionally designed that way.

The error could be deferred until access if incomplete blocks are allowed and validation occurs on inflate instead. Would this be more useful than completely erroring before access on an invalid stream? I'm assuming yes, given your question, but I'm just confirming whether this is theoretical or not. :)
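
To illustrate the distinction, here is a toy model of the two strategies (not the actual noodles implementation): with eager evaluation, one bad block anywhere in the read-ahead window fails the whole read; with deferred validation, buffered results only surface an error when the bad block itself is accessed.

use std::{collections::VecDeque, io};

// A decoded block: either its inflated payload or a decoding error.
type Block = io::Result<Vec<u8>>;

// Eager strategy: decode the whole read-ahead window up front. Any bad
// block fails the read, even if earlier blocks were valid.
fn read_eager(blocks: impl Iterator<Item = Block>, worker_count: usize) -> io::Result<Vec<Vec<u8>>> {
    let mut window = VecDeque::new();

    for block in blocks.take(worker_count) {
        window.push_back(block?); // fails here on the first bad block
    }

    Ok(window.into())
}

// Deferred strategy: buffer raw results and let the caller hit the error
// only when the bad block itself is accessed.
fn read_deferred(blocks: impl Iterator<Item = Block>, worker_count: usize) -> Vec<Block> {
    blocks.take(worker_count).collect()
}

fn main() {
    let stream = || {
        vec![
            Ok(b"valid block".to_vec()),
            Err(io::Error::new(io::ErrorKind::Other, "bytes remaining on stream")),
        ]
        .into_iter()
    };

    // Eager: the valid block is lost; the whole read errors.
    assert!(read_eager(stream(), 4).is_err());

    // Deferred: the valid block is still readable; the error is on access.
    let buffered = read_deferred(stream(), 4);
    assert!(buffered[0].is_ok());
    assert!(buffered[1].is_err());
}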

mmalenic commented 7 months ago

Okay, that makes sense, thanks.

Would this be more useful than completely erroring before access on an invalid stream?

Yes, I think so, i.e. if on-access means that valid blocks are read and returned successfully before the invalid ones are reached. This comes up in htsget-rs when reading header data from a partial stream of a file, e.g. with Crypt4GH blocks, because those blocks don't necessarily align to BGZF block boundaries.