zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
477 stars 53 forks source link

Seamless parsing of BAM/SAM #227

Closed mbhall88 closed 6 months ago

mbhall88 commented 6 months ago

I was wondering if there is a way to seamlessly parse SAM or BAM.

It seems not from the API...

Alternatively, it would be nice to have a method which returns the file type (i.e., BAM/SAM) for an alignment file.

zaeleus commented 6 months ago

For an opaque alignment reader (SAM/BAM/CRAM), I would recommend using alignment::io::Reader in noodles-util with the alignment feature. E.g.,

// cargo add noodles-util --features alignment

use std::{env, io};

use noodles_util::alignment;

fn main() -> io::Result<()> {
    let src = env::args().nth(1).expect("missing src");

    let mut reader = alignment::io::reader::Builder::default().build_from_path(src)?;
    let header = reader.read_header()?;

    for result in reader.records(&header) {
        let record = result?;
        // ...
    }

    Ok(())
}

This automatically detects the file format and compression method of the input. The iterator returns records as noodles::sam::alignment::Record trait objects.

The alignment readers implement noodles::sam::alignment::io::Read, so alternatively, you can also use your own heuristics to determine the file format, e.g., by extension:

// cargo add noodles --features bam,bgzf,sam

use std::{
    env,
    fs::File,
    io::{self, BufRead, BufReader},
    path::PathBuf,
};

use noodles::{bam, bgzf, sam};

fn main() -> io::Result<()> {
    let src = env::args().nth(1).map(PathBuf::from).expect("missing src");

    let file = File::open(&src)?;

    let mut reader: Box<dyn sam::alignment::io::Read<_>> =
        match src.extension().and_then(|ext| ext.to_str()) {
            Some("sam") => {
                let inner: Box<dyn BufRead> = Box::new(BufReader::new(file));
                Box::new(sam::io::Reader::from(inner))
            }
            Some("bam") => {
                let decoder: Box<dyn BufRead> = Box::new(bgzf::Reader::new(file));
                Box::new(bam::io::Reader::from(decoder))
            }
            _ => panic!("unsupported file format"),
        };

    let header = reader.read_alignment_header()?;

    for result in reader.alignment_records(&header) {
        let record = result?;
        // ...
    }

    Ok(())
}
mbhall88 commented 6 months ago

Oh very cool. Sorry I missed that.

Out of interest, how does it detect the file type exactly?

zaeleus commented 6 months ago

It first detects whether the input is (b)gzip-compressed, i.e., read the first two bytes and check the magic number for 1f 8b. If it is, use a gzip decoder to read the magic number of the decoded stream. This can be BAM (BAM\x01) or, otherwise, bgzipped-compressed SAM. If not, check the magic number of the raw input stream for BAM or CRAM (CRAM). If all of those checks fail, it falls back to uncompressed, plain-text SAM.

See the following for the implementation.

https://github.com/zaeleus/noodles/blob/f8a3f7d35b99123a31db81006e2260b231d894cb/noodles-util/src/alignment/io/reader/builder.rs#L169-L217

mbhall88 commented 6 months ago

Neat. Thanks for all of the info