zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
482 stars 52 forks source link

noodles implementation is slower than rust-htslib, did I do something suboptimal? #218

Closed bwlang closed 7 months ago

bwlang commented 9 months ago

rust-htslib:

cat   0.00s user 0.01s system 1% cpu 0.568 total
sudo cargo run --release -- > tandem_reads.bam  0.43s user 0.04s system 82% cpu 0.570 total

noodles:

cat   0.00s user 0.01s system 0% cpu 1.192 total
sudo cargo run --release -- > tandem_reads.bam  1.04s user 0.05s system 91% cpu 1.195 total

i first implemented with htslib, then tried switching to noodles with no architectural changes.

image

for noodles i'm opening like this:

    let mut bam_reader = bam::reader::Builder.build_from_path(&opt.input_bam_file)?;
    let header = bam_reader.read_header()?;
    let mut bam_writer = bam::writer::Builder.build_from_path(&opt.bam_output_file)?;
    for r in bam_reader.records(&header) {
        ...
    }

for htlslib like this:

    let mut bam = bam::Reader::from_path(&opt.input_bam_file).unwrap();
    let header = bam::Header::from_template(bam.header());
    let mut bam_out = bam::Writer::from_path(&opt.bam_output_file, &header, bam::Format::Bam).unwrap();
    for r in bam.records() {
        ...
    }

The internal filtering logic is the same , reasoning about flags insert size , etc - writing out some records.

Any suggestions?

bwlang commented 9 months ago

this is on a macbook arm64 computer.

zaeleus commented 9 months ago

noodles tends to do more record validation and tries to ensure data is spec compliant. However, in your flamegraphs, it seems the DEFLATE implementation htslib linked to is using SIMD intrinsics (see inflate_fast_neon and longest_match_neon) on your CPU. By default, noodles uses the pure Rust implementation miniz_oxide, which can't compete in this case.

Can you retime your tests using libdeflate for both? I.e., cargo add noodles-bgzf --features libdeflate for noodles and cargo add rust-htslib --features libdeflate for rust-htslib. libdeflate is much faster for this application, regardless, and it would be better than comparing differing DEFLATE implementations.

bwlang commented 9 months ago

with htslib libdeflate:

cat   0.00s user 0.01s system 1% cpu 0.574 total
sudo cargo run --release -- > tandem_reads.bam  0.45s user 0.03s system 83% cpu 0.577 total

with noodles

cat   0.00s user 0.01s system 1% cpu 0.740 total
sudo cargo run --release -- > tandem_reads.bam  0.61s user 0.05s system 89% cpu 0.743 total
image

much closer... thanks for the suggestion! Anything else I should try?

zaeleus commented 9 months ago

If you don't need owned records, you can reuse the record buffer.

let mut record = sam::alignment::Record::default();

while bam_reader.read_record(&header, &mut record)? != 0 {
    // ...
}

When doing this, you may also want to use Reader::rc_records in your rust-htslib test for a more fair comparison.

zaeleus commented 7 months ago

noodles 0.61.0 now allows alignment format records to read and written, but I'm not sure if this will help in a passthrough case like in your tests. Feel free to open a new discussion if you have any further questions.