zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

Reading entries of file into preallocated buffer #260

Closed rickymagner closed 3 months ago

rickymagner commented 3 months ago

Hi, I was wondering if it's possible (or on the roadmap) to iterate over the lines of a file and store each result in a preallocated buffer? For example, the rust_htslib library has a method for reading from a BAM directly into a Record object that's been preallocated. For something simple like taking the sum of all TLENs in a file, this seemed to be 3x faster than noodles on a small BAM I tried it on. It'd be great if noodles could offer a similar API for iterating over the entries of a file so new memory doesn't need to be allocated on every line.

zaeleus commented 3 months ago

Yes, it's possible to reuse a bam::Record buffer via Reader::read_record, e.g.,

let mut record = bam::Record::default();

while reader.read_record(&mut record)? != 0 {
    // ...
}
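For readers unfamiliar with this pattern, the same buffer-reuse idea can be sketched with only the standard library. This is an analogy, not noodles code: `BufRead::read_line` also fills a caller-owned buffer and returns `0` at end of input. The `sum_lines` helper and the sample data are hypothetical, and unlike the `read_record` loop above, `read_line` appends to the buffer, so it has to be cleared by hand on each iteration.

```rust
use std::io::{self, BufRead, Cursor};

/// Sums one integer per line, reusing a single `String` as the line buffer.
/// (Hypothetical helper for illustration; not part of noodles.)
fn sum_lines<R: BufRead>(mut reader: R) -> io::Result<i64> {
    let mut line = String::new(); // allocated once, reused for every line
    let mut total = 0i64;

    // `read_line` appends to `line` and returns the number of bytes read;
    // a return value of 0 signals end of input.
    while reader.read_line(&mut line)? != 0 {
        let value: i64 = line
            .trim()
            .parse()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        total += value;
        line.clear(); // drop the contents but keep the allocated capacity
    }

    Ok(total)
}

fn main() -> io::Result<()> {
    // Made-up stand-in data: one template length per line.
    let data = Cursor::new("100\n-100\n250\n");
    let total = sum_lines(data)?;
    println!("{total}");
    Ok(())
}
```

The loop has the same shape as the `read_record` loop: one buffer is created outside the loop, and the `!= 0` check ends iteration at end of input.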

rickymagner commented 3 months ago

Great, thanks! Also, I think the difference in speed was mostly due to the extra verification noodles does (which I've seen discussed in other issues). I tried the hts version without the preallocated buffer and it was still much faster, but I understand this library strives for accuracy in handling the format over speed in some cases (which is a valid choice).

zaeleus commented 3 months ago

I would expect noodles to be quite competitive in the summation of a single field. Feel free to post the benchmark in a discussion if you'd like to investigate further.
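As a starting point for such a comparison, a rough timing harness might look like the sketch below. It uses only the standard library and synthetic text lines rather than a real BAM, so it measures only the allocate-per-item versus reuse-one-buffer difference, not noodles or rust_htslib themselves. The function names and the generated data are assumptions made for illustration.

```rust
use std::io::{BufRead, Cursor};
use std::time::Instant;

// Allocates a fresh `String` for every line via the `lines()` iterator.
fn sum_allocating(data: &str) -> i64 {
    Cursor::new(data)
        .lines()
        .map(|line| line.unwrap().trim().parse::<i64>().unwrap())
        .sum()
}

// Reuses a single `String` buffer across all lines.
fn sum_reusing(data: &str) -> i64 {
    let mut reader = Cursor::new(data);
    let mut line = String::new();
    let mut total = 0;
    while reader.read_line(&mut line).unwrap() != 0 {
        total += line.trim().parse::<i64>().unwrap();
        line.clear();
    }
    total
}

fn main() {
    // Synthetic input: one small integer per line (made-up data).
    let data: String = (0..1_000_000).map(|i| format!("{}\n", i % 500)).collect();

    let start = Instant::now();
    let a = sum_allocating(&data);
    let t_alloc = start.elapsed();

    let start = Instant::now();
    let b = sum_reusing(&data);
    let t_reuse = start.elapsed();

    // Both strategies must agree before the timings mean anything.
    assert_eq!(a, b);
    println!("allocating: {t_alloc:?}, reusing: {t_reuse:?}");
}
```

Timings like these are only meaningful in a release build (`cargo run --release`), and a single run is noisy. For a benchmark worth posting, repeated runs on the actual BAM file with both libraries would be more convincing.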