zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

Concurrent reading/decompression of BAM files. #219

Closed Adoni5 closed 9 months ago

Adoni5 commented 9 months ago

Hi @zaeleus ,

Thank you very much for this solid set of crates! I have a question, specifically about the fastest way to iterate BAM records. I've written the following code in rust_htslib:

    let mapq_filter = mapq_filter.unwrap_or(60);
    let mut bam_reader = bam::IndexedReader::from_path(&bam_file.path).unwrap_or_else(|_| {
        panic!(
            "Error creating bam reader with file {}",
            bam_file.path.display()
        )
    });
    bam_reader
        .set_threads(threads.unwrap_or(num_cpus::get()))
        .expect("Error setting threads");
    let header = bam_reader.header().to_owned();
    let mut record = Record::new();
    while let Some(result) = bam_reader.read(&mut record) {
        match result {
            Ok(_) => {
                if !record.is_unmapped()
                    & (record.mapq() >= mapq_filter)
                    & !record.is_secondary()
                    & !record.is_supplementary()
                {
                    let target_name =
                        std::str::from_utf8(header.tid2name(record.tid() as u32)).unwrap();

                    results.get_mut(target_name).unwrap()[record.pos() as usize / BIN_SIZE] += 1;
                    valid_number_reads += 1;
                }
            }
            Err(_) => panic!("BAM parsing failed..."),
        }
        bar.inc(1);
    }

This runs approximately 6 times faster than the equivalent noodles_bam implementation. I suspect this is due to the rust_htslib thread pool, and was wondering how I can efficiently reimplement this in noodles. This is what I have so far:

    use fnv::FnvHashMap;

    use noodles::bam::indexed_reader::Builder;
    use noodles::sam::alignment::Record;
    let mapq_filter = 60;

    let mut bam_reader = Builder::default().build_from_path(bam_file.path)?;

    let header = bam_reader.read_header().unwrap().to_owned();
    let mut results: FnvHashMap<String, Vec<u16>> = FnvHashMap::default();
    let mut record = Record::default();
    while bam_reader.read_record(&header, &mut record)? != 0 {
        let flags = record.flags();
        if !flags.is_unmapped()
            & (record.mapping_quality().unwrap().get() >= mapq_filter)
            & !flags.is_secondary()
            & !flags.is_supplementary()
        {
            let target_name = record
                .reference_sequence(&header)
                .unwrap()
                .unwrap()
                .0
                .as_str();
            results.get_mut(target_name).unwrap()
                [record.alignment_start().unwrap().get() / BIN_SIZE] += 1;
        }
    }

Many Thanks, Rory

zaeleus commented 9 months ago

When reading through a file like this (no seeking and few fields), you can 1) use bgzf::MultithreadedReader for parallel block decoding and 2) read lazy BAM records to avoid decoding all the fields in a record.
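
For orientation, here is a minimal sketch of just those two points, using the same calls as the full example below (`count_records` is only an illustrative wrapper that counts the records it reads):

```rust
use std::{fs::File, io, num::NonZeroUsize, thread};

use noodles::{bam, bgzf};

// Illustrative helper: stream a BAM file with a multithreaded BGZF decoder and
// lazy records, returning the number of records read.
fn count_records(src: &str) -> io::Result<u64> {
    let file = File::open(src)?;

    // 1) decode BGZF blocks on a pool of worker threads
    let worker_count = thread::available_parallelism().unwrap_or(NonZeroUsize::MIN);
    let decoder = bgzf::MultithreadedReader::with_worker_count(worker_count, file);

    let mut reader = bam::Reader::from(decoder);
    let _header = reader.read_header()?;

    // 2) lazy records: fields stay encoded until they are accessed
    let mut record = bam::lazy::Record::default();
    let mut n = 0;

    while reader.read_lazy_record(&mut record)? != 0 {
        n += 1;
    }

    Ok(n)
}
```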

Here's a full example for noodles.

$ cargo new issue_219
$ cd issue_219
$ cargo add noodles --features bam,bgzf,sam
$ cargo add noodles-bgzf --features libdeflate
main.rs

```rust
use std::{collections::HashMap, env, fs::File, io, num::NonZeroUsize, thread};

use noodles::{
    bam, bgzf,
    sam::{self, record::MappingQuality},
};

const BIN_SIZE: usize = 1 << 14;

fn main() -> io::Result<()> {
    let src = env::args().nth(1).expect("missing src");

    let file = File::open(src)?;

    let worker_count = thread::available_parallelism().unwrap_or(NonZeroUsize::MIN);
    let decoder = bgzf::MultithreadedReader::with_worker_count(worker_count, file);

    let mut reader = bam::Reader::from(decoder);
    let header = reader.read_header()?;

    let mut frequencies: HashMap<String, Vec<u32>> = HashMap::new();

    for (name, reference_sequence) in header.reference_sequences() {
        let len = (usize::from(reference_sequence.length()) / BIN_SIZE) + 1;
        let bins = vec![0; len];
        frequencies.insert(name.to_string(), bins);
    }

    let mut record = bam::lazy::Record::default();

    while reader.read_lazy_record(&mut record)? != 0 {
        if filter(&record) {
            let reference_sequence_name =
                reference_sequence_name(&record, &header)?.expect("missing reference sequence ID");

            let start = record.alignment_start()?.expect("missing alignment start");

            let bins = frequencies
                .get_mut(reference_sequence_name)
                .expect("missing frequencies");

            let bin = (usize::from(start) - 1) / BIN_SIZE;
            bins[bin] += 1;
        }
    }

    Ok(())
}

fn filter(record: &bam::lazy::Record) -> bool {
    const MIN_MAPPING_QUALITY: MappingQuality = match MappingQuality::new(60) {
        Some(mapq) => mapq,
        None => unreachable!(),
    };

    let flags = record.flags();

    !flags.is_unmapped()
        && record
            .mapping_quality()
            .map(|mapq| mapq >= MIN_MAPPING_QUALITY)
            .unwrap_or(true)
        && !flags.is_secondary()
        && !flags.is_supplementary()
}

fn reference_sequence_name<'a>(
    record: &bam::lazy::Record,
    header: &'a sam::Header,
) -> io::Result<Option<&'a str>> {
    record
        .reference_sequence_id()?
        .map(|id| {
            header
                .reference_sequences()
                .get_index(id)
                .map(|(name, _)| name.as_ref())
                .ok_or_else(|| {
                    io::Error::new(io::ErrorKind::InvalidData, "invalid reference sequence ID")
                })
        })
        .transpose()
}
```
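
The program takes the path to a BAM file as its first argument, so after the `cargo add` steps above it can be run with something like `cargo run --release -- sample.bam` (the filename here is just a placeholder).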
Adoni5 commented 9 months ago

Thanks @zaeleus for the quick and detailed reply - will give this a try tomorrow and report back!

Adoni5 commented 9 months ago

Can confirm that this is twice as fast as the implementation I wrote in rust_htslib 🥇. Whilst there may be some under-the-hood compiler optimisations on the filter and reference_sequence_name functions (which I haven't done in my htslib-based code), and the thread count could be set more cleverly than just the number of available cores, I'm very happy I don't have to mess around cross-compiling the htslib C bindings anymore!

Thanks very much @zaeleus!

Edit

Using available_parallelism in the rust_htslib version makes a big difference, with it now being only ~2 seconds slower than noodles over a 2.8-million-record BAM file! That is a good trick to know. There are probably some optimisations I could still make to the htslib code, but I'm very happy with this. Thanks again!
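
For reference, here is a sketch of what that change looks like on the rust_htslib side, assuming the same IndexedReader setup as the original snippet (`open_reader` is just an illustrative helper, not code from this thread):

```rust
use std::thread;

use rust_htslib::bam::{self, Read};

// Illustrative helper: open an indexed BAM reader and size its decompression
// thread pool from available_parallelism instead of num_cpus::get().
fn open_reader(path: &str) -> Result<bam::IndexedReader, rust_htslib::errors::Error> {
    let mut reader = bam::IndexedReader::from_path(path)?;

    let worker_count = thread::available_parallelism()
        .map(usize::from)
        .unwrap_or(1);
    reader.set_threads(worker_count)?;

    Ok(reader)
}
```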

Noodles

----------------------------------------------- benchmark: 1 tests -----------------------------------------------
Name (time in s)              Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------
test_iterate_bam_file     44.8837  46.5181  45.7009  1.1557  45.7009  1.6344       0;0  0.0219       2           2
------------------------------------------------------------------------------------------------------------------

Rust-htslib


----------------------------------------------- benchmark: 1 tests -----------------------------------------------
Name (time in s)              Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------
test_iterate_bam_file     47.3635  47.4921  47.4278  0.0909  47.4278  0.1286       0;0  0.0211       2           2
------------------------------------------------------------------------------------------------------------------