zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
477 stars 53 forks source link

Lazy reading of BAM records requires reading of the header first #221

Closed markjschreiber closed 8 months ago

markjschreiber commented 8 months ago

First of all. Very cool to see this library! Great work!

Reading records using bam_reader.lazy_records() doesn't require a reference to the header whereas a non lazy read does. Given that it doesn't use the header I thought I could just start reading records. Unfortunately that results in an unexpected EOF error.

Having to read the header first isn't a big deal in terms of performance hit but it would be cool if the lazy_records() could automatically skip over the header if it hasn't been read already.

use std::env::args;
use std::io;
use noodles::bam as bam;

fn main() -> io::Result<()> {
    let source = args().nth(1).expect("supply an input path");
    let mut bam_reader = bam::reader::Builder.build_from_path(source)?;
    let mut count = 0;

    // If I remove this line reading will fail on unexpected end of file
    let _header = bam_reader.read_header()?;

    for result in bam_reader.lazy_records() {
        let _record = result?;
        count +=1;

        if count % 1_000_000 == 0 {
            eprintln!("read {count} records");
        }
    }

    println!("{count}");
    Ok(())
}
zaeleus commented 8 months ago

Thanks, @markjschreiber!

The architecture is intentional, and there are a few reasons why the caller is required to read the header manually.

  1. Readers should not perform I/O on construction.
  2. The caller should own the header, not the reader.
  3. Reader::read_lazy_record is in the hot path and should avoid header checks on each call. (There would also be a concern of what to do after a seek.)

There are no plans to automatically discard the header, but you can always create a wrapper yourself, e.g.,

struct LazyBamReader<R>(bam::Reader<R>);

impl<R> LazyBamReader<R>
where
    R: Read,
{
    fn new(mut inner: bam::Reader<R>) -> io::Result<Self> {
        inner.read_header()?;
        Ok(Self(inner))
    }

    fn read_lazy_record(&mut self, record: &mut bam::lazy::Record) -> io::Result<usize> {
        self.0.read_lazy_record(record)
    }
}