zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

Transparently read/write SAM vs. BAM vs. CRAM #78

Closed: tfenne closed this issue 9 months ago

tfenne commented 2 years ago

I'm curious about the design choices for the sam/bam/cram modules and whether there is a relatively easy (though perhaps less efficient) way to work with the three formats transparently and interchangeably?

For context I'm one of the primary authors of HTSJDK, Picard and latterly fgbio, where we are used to writing code and tools that can accept any of the *AM formats as both input and output - we have factory/builder classes that return appropriate reader/writer instances, and the Record classes flowing into/out of them are the same regardless of the file type in use.

I'm newer to Rust, but have a few projects under my belt, and wanted to use noodles' SAM/BAM/CRAM support for a project recently but stopped once I realized I would have to write quite a bit of code to handle SAM vs. BAM everywhere. I know folks writing Rust tend to value performance (and correctness) extremely highly, and the easier solutions might involve some dynamic dispatch or extra function call overhead, but I'm wondering why this design choice was made, and if there is an easy way (that I'm not seeing) to generalize over SAM/BAM/CRAM reading/writing at the cost of some performance?
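
Concretely, the shape I have in mind is roughly the sketch below - purely hypothetical, not existing noodles code, and every name in it is invented for illustration: a single trait over the three formats plus a factory that hands back a boxed reader, so downstream code sees one record type.

use std::io;

// Purely hypothetical sketch: one trait over SAM/BAM/CRAM plus a factory that
// returns a boxed reader, so calling code stays format-agnostic.
struct Record; // stand-in for a shared alignment record type

trait AlignmentRead {
    /// Reads the next record, returning Ok(0) at EOF.
    fn read_record(&mut self, record: &mut Record) -> io::Result<usize>;
}

fn open(path: &str) -> io::Result<Box<dyn AlignmentRead>> {
    // A real factory would sniff the magic number; matching on the extension
    // keeps the sketch short.
    match path.rsplit('.').next() {
        Some("bam") => todo!("return a BAM-backed reader"),
        Some("cram") => todo!("return a CRAM-backed reader"),
        _ => todo!("fall back to a SAM-backed reader"),
    }
}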

jkbonfield commented 2 years ago

+1 to this. I believe the transparent nature of reading *AM in htslib was one of the primary reasons that CRAM managed to get off the ground as it broke the catch-22 bootstrapping problem of people not wanting to support it until there are lots of files out there to support.

I doubt it's a significant overhead either, not even for BAM, as one extra layer of function redirection is nothing compared to the complexities of LZ encoding and decoding.

cmnbroad commented 2 years ago

FWIW, I went through the same cycle recently when I started to experiment with Rust in genomics - I started with noodles, but abandoned it for the same reason.

zaeleus commented 2 years ago

Thanks for bringing up this discussion, @tfenne.

The current API assumes that each crate is a distinct format, which is why you don't see generalizations (only conversions) across similar ones. This focuses development on that particular format and prevents types from leaking into upstream dependencies.

I do agree that the alignment formats can be generalized, at least for ease of use. There is actually existing evidence of this—albeit of little progress—in the sam::RecordExt trait, which is implemented for {sam,bam,cram}::Record. I was deterred in the past due to naming conflicts (always the hard part!), and it was put on the backburner.

I'll prioritize this and first work on an alignment reader API this week.

tfenne commented 2 years ago

@zaeleus Thanks so much for the response. I'm sure that both myself and @sstadick would be happy to help if there is a way for us to do so. @sstadick is a more seasoned Rust programmer than me, but we're both capable and have spent a lot of time with various APIs to SAM/BAM/etc.

rob-p commented 2 years ago

I just want to say that I'd also strongly support this feature! There are several places where I'm currently using rust-htslib where I might consider / prefer noodles instead, but having to factor out all of that code to handle SAM/BAM/CRAM differently is sort of a blocker.

tfenne commented 2 years ago

I would also suggest (and am curious whether others on this thread agree/disagree) that a solution that more or less maintained performance for BAM but reduced performance a bit for SAM and CRAM would be a good trade-off. My sense is that most of the time, folks are working with BAM if performance is the primary concern. CRAM is just slower in my experience, both on encoding and decoding, due to the design decisions / trade-offs made for it. And I always expect parsing of text-based formats like SAM to be slower - the only place I think most users use SAM is coming directly out of bwa and similar aligners, where it's immediately processed into BAM or CRAM.

rob-p commented 2 years ago

I think this is reasonable if such a tradeoff is necessary. However, it's not immediately obvious to me that generalizing the reader/writer/record types implies a speed hit to processing.

zaeleus commented 2 years ago

Good progress has been made on this. Both read and read/write examples can be viewed and tested.

# SAM/BAM/CRAM => SAM. The output is written to stdout.
cargo run --release --all-features --example util_alignment_view -- <src> [fasta-src]

# SAM/BAM/CRAM => SAM/BAM/CRAM. The output is written to a file, and the format
# is determined by the extension of dst.
cargo run --release --all-features --example util_alignment_rewrite -- <src> <dst> [fasta-src]

(fasta-src is only required if CRAM needs it. An associated index (*.fai) must be in the same directory.)

By default, the alignment reader attempts to detect the format by magic number, falling back to SAM. The writer always needs a format specified. All alignment records now implement sam::AlignmentRecord, which covers the typical alignment record fields.
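
For illustration, the detection step amounts to something like the following (a sketch of the idea, not the actual implementation): CRAM files begin with the bytes "CRAM", BAM is BGZF (a gzip variant, magic 0x1f 0x8b), and anything else falls back to SAM, which is plain text and has no magic number.

/// Best-guess format detection by magic number (illustrative sketch only). A
/// real check would also decompress the first BGZF block and confirm the
/// inner "BAM\x01" magic rather than assuming any gzip data is BAM.
fn detect_format(prefix: &[u8]) -> &'static str {
    if prefix.starts_with(b"CRAM") {
        "CRAM"
    } else if prefix.starts_with(&[0x1f, 0x8b]) {
        "BAM"
    } else {
        "SAM"
    }
}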

The notable performance impact of the current design is that all record fields are parsed as domain models, which, of course, incurs a cost. While this significantly improves confidence in the validity of the data, it does make noodles benchmark much slower than, e.g., htslib. (htslib is overly liberal and can read and write invalid/nonsensical records, so I'm not sure it's actually a good comparison.) I'm curious whether this design decision is a deterrent or not.

rob-p commented 2 years ago

Hi @zaeleus,

The progress on this is great news! I'm excited to try it out.

Regarding the thoughts on the performance impact of the current design, I'm afraid I'm going to have to ask for a bit of clarification on what you mean by "parsed as domain models". Do you mean there is, e.g., a formal grammar and some sort of parsing and verification that happens to populate the records from the file? In general, I'm a big fan of the idea of having a safe and well-behaved default codepath, with an exit hatch for those situations where you need peak performance and are willing to run the risk of bad behavior on non-standard input. For example, something like having an unchecked variant of the parsing or iterator methods that uses a fast but non-validating parser, etc. I don't know how others feel about this, but I think having a way to optionally drop down to the faster (but less safe) parsing might be pretty important in certain use cases.

claymcleod commented 2 years ago

Regarding the thoughts on the performance impact of the current design, I'm afraid I'm going to have to ask for a bit of clarification on what you mean by "parsed as domain models". Do you mean there is, e.g., a formal grammar and some sort of parsing and verification that happens to populate the records from the file?

I could be incorrect (let me know if I am, @zaeleus), but I interpreted his comment to mean that the current design adds a level of abstraction above the specific *AM file formats themselves that captures the commonality between them. When he says "domain model", I think he's just referring to this abstraction layer of an alignment record and perhaps to the builder classes introduced for this purpose (though I'd expect the performance hit and memory footprint introduced by those to be negligible).

So, for instance, he mentions the sam::AlignmentRecord trait, which doesn't include any specifics about how to serialize the data. That is left to the reader/writer classes, which are initialized through a builder and then serialize according to the builder implementation. This abstraction is what causes the overhead, if I understand correctly.

In general, I'm a big fan of the idea of having a safe and well-behaved default codepath, with an exit hatch for those situations where you need peak performance and are willing to run the risk of bad behavior on non-standard input. For example, something like having an unchecked variant of the parsing or iterator methods that uses a fast but non-validating parser etc. I don't know how others feel about this, but I think having a way to optionally drop down to the faster (but less safe) parsing might be pretty important in certain use cases.

I agree with this approach, and it's not clear to me whether the original implementation path will still be available to those interested in squeaking out the best performance (@zaeleus, could you comment on this?).

zaeleus commented 2 years ago

@claymcleod describes the architecture accurately.

The overhead of the I/O abstraction is not significant, but the deserialization into the common record fields (i.e., the domain models) can be. By domain models, I mean types that are guaranteed to be represented within the constraints of their specification definitions. The grammar for SAM fields is more formally defined in the spec (see § 1.4).
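
For example, the spec restricts a read name (QNAME) to the pattern [!-?A-~]{1,254}, i.e., 1 to 254 printable characters excluding '@', so a domain-model type only constructs successfully from input in that set. An illustrative sketch (not the actual noodles type):

/// Illustrative validated read name (QNAME): 1..=254 characters in the range
/// '!'..='~', excluding '@', per SAM § 1.4.
struct ReadName(String);

impl ReadName {
    fn new(s: &str) -> Result<Self, &'static str> {
        let is_valid = |c: char| ('!'..='~').contains(&c) && c != '@';

        if (1..=254).contains(&s.len()) && s.chars().all(is_valid) {
            Ok(Self(s.into()))
        } else {
            Err("invalid QNAME")
        }
    }
}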

A performance example: for a sequence, SAM (and CRAM) support the full alphabet ([A-Za-z=.]), and this is what the common model describes. In BAM, the sequence is encoded using 4-bit bases (supporting the set [=ACMGRSVTWYHKDBN]), which allows two bases to be packed into one byte. Therefore, when deserializing, each 4-bit pair must be expanded, and while this operation is very fast, it is done for every base, which is a lot slower than, e.g., a memcpy.
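
Roughly, that expansion step looks like this (an illustrative sketch, not the actual noodles code):

/// BAM's 4-bit base alphabet, indexed by nibble value.
const BASES: &[u8; 16] = b"=ACMGRSVTWYHKDBN";

/// Expands a packed BAM sequence into one byte per base, high nibble first.
fn unpack_sequence(packed: &[u8], base_count: usize) -> Vec<u8> {
    let mut seq = Vec::with_capacity(base_count + 1);

    for &byte in packed {
        seq.push(BASES[usize::from(byte >> 4)]);   // first base of the pair
        seq.push(BASES[usize::from(byte & 0x0f)]); // second base of the pair
    }

    seq.truncate(base_count); // drop the padding nibble when base_count is odd
    seq
}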

I'm still experimenting with optimizations, but I made early mention of performance because readers and writers now have different characteristics due to the generalization changes.

I have considered (re)adding lazy records, which would parse fields upon first use. It would help read-only applications that only use a subset of fields, but it would not help round trips. Lazy records, however, pose the question: if you're reading an infallible (to parse) field, e.g., the flags, but every field after that is unknowingly invalid, do you still trust the data?
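
For reference, the general shape of a lazy record - raw bytes kept verbatim, each field parsed and memoized on first access - might look like this (a sketch of the idea, not a proposed design):

use std::cell::OnceCell;

/// Sketch of a lazily parsed, SAM-like record. The raw line is kept verbatim,
/// and a field is parsed and validated the first time it is requested.
struct LazyRecord {
    raw: String,
    flags: OnceCell<Result<u16, String>>,
}

impl LazyRecord {
    fn new(raw: String) -> Self {
        Self { raw, flags: OnceCell::new() }
    }

    /// Parses FLAG (the second tab-delimited field) on first use only.
    fn flags(&self) -> &Result<u16, String> {
        self.flags.get_or_init(|| {
            self.raw
                .split('\t')
                .nth(1)
                .ok_or_else(|| String::from("missing FLAG field"))?
                .parse::<u16>()
                .map_err(|e| e.to_string())
        })
    }
}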

brainstorm commented 2 years ago

If it helps, @mmalenic wrote some traits to tackle this noodles format generalization issue in htsget-rs over here:

https://github.com/umccr/htsget-rs/blob/main/htsget-search/src/htsget/search.rs

Disclaimer: We've not run benchmarks on top of that yet, though, and I'm not sure if that particular abstraction will incur a ton of overhead or if it can be generalized further for non-htsget use cases.

claymcleod commented 2 years ago

I have considered (re)adding lazy records, which would parse fields upon first use. It would help read-only applications that only use a subset of fields, but it would not help round trips. Lazy records, however, pose the question: if you're reading an infallible (to parse) field, e.g., the flags, but every field after that is unknowingly invalid, do you still trust the data?

This leads to a philosophical question that noodles will likely need to pick a side on: by default, does the library value speed and ease of use (a la the changes discussed above), or does it value correctness? That will obviously have to be answered by you, @zaeleus, but I can see arguments for both sides.

I would personally suggest this approach as a middle ground: allow the user of the library to decide whether to prioritize speed or correctness. This could be accomplished by having two run modes for the readers (e.g., ReadMode::LAZY and ReadMode::EAGER) that could be specified either at initialization or through a builder-style interface. If this were the direction, you could use the eager mode as the default, which would preserve the prevailing spirit of the library thus far of valuing correctness.
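
To make that concrete, something like the following is what I have in mind (hypothetical names, not noodles types):

/// Hypothetical knob for how strictly records are parsed.
#[derive(Clone, Copy, Default)]
enum ReadMode {
    /// Validate every field while reading (the correctness-first default).
    #[default]
    Eager,
    /// Defer parsing/validation of fields until they are first accessed.
    Lazy,
}

/// Hypothetical reader builder showing where such an option could live.
#[derive(Default)]
struct ReaderBuilder {
    read_mode: ReadMode,
}

impl ReaderBuilder {
    /// Opting into the fast path stays explicit, e.g.,
    /// `ReaderBuilder::default().set_read_mode(ReadMode::Lazy)`.
    fn set_read_mode(mut self, read_mode: ReadMode) -> Self {
        self.read_mode = read_mode;
        self
    }
}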

rob-p commented 2 years ago

This is very well stated @claymcleod. I would submit, @zaeleus, that a large speed deficit compared to htslib (or, rather, rust-htslib) would unfortunately probably deter a non-trivial fraction of folks who might otherwise prefer noodles. I agree that whether or not the stricter (and slower) behavior is default would not matter much (Rust users are smart people and can figure out what makes the most sense for their use case ;P), but if a fast path was not possible, I think that might pose an adoption barrier.

claymcleod commented 2 years ago

I just tried Michael's latest changes (at https://github.com/zaeleus/noodles/commit/b6f24b78a0ef4e1299908f6cd82edbed3c4dbe26), and the speedups appear to be huge. For instance, I'm rewriting part of our ngsderive application in Rust, and it ran consistently between 1.5x and 2.5x quicker. Further, our flagstat implementation improved between 2x and 6x. Nice work, @zaeleus!

jkbonfield commented 2 years ago

Good progress has been made on this. Both read and read/write examples can be viewed and tested.

# SAM/BAM/CRAM => SAM. The output is written to stdout.
cargo run --release --all-features --example util_alignment_view -- <src> [fasta-src]

# SAM/BAM/CRAM => SAM/BAM/CRAM. The output is written to a file, and the format
# is determined by the extension of dst.
cargo run --release --all-features --example util_alignment_rewrite -- <src> <dst> [fasta-src]

Thanks for this. I can confirm I've converted from BAM to CRAM using this, but I have some queries.

Specifically, I'm wondering at what point we can consider noodles to be the second implementation of the draft CRAM 3.1 standard (and a way to move it forward). I think it'd need to demonstrate more of the features.

Thanks.

zaeleus commented 2 years ago

Thanks for testing, @jkbonfield.

1. There's been no work to select more appropriate/optimal codecs for data series, so the current implementation will simply use gzip for all block data. There is no option for a user-defined data series-compression method map yet.

2. There is only CRAM 3.1 read support, i.e., only decoders for the new codecs have been implemented.

For further discussion on CRAM 3.1 write support, I created tracking issue #107.

zaeleus commented 9 months ago

While a unified alignment reader and writer have been available in noodles-util for quite some time, the underlying implementations were required to fully decode records. noodles 0.61.0 / noodles-sam 0.50.0 now provides a trait for alignment records, greatly improving read performance, as format-specific records can now be returned directly.
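
Very roughly, reading through the unified interface now looks like the sketch below. Treat it as an approximation rather than a documented example: module paths such as alignment::io::reader and the exact method names have shifted between noodles-util releases, so check the docs for the version in use.

use noodles_util::alignment;

// Approximate sketch only: open any of SAM/BAM/CRAM by path, let the builder
// detect the format, and iterate records behind the alignment record trait.
// Module paths and method names may differ by noodles-util version.
fn count_records(src: &str) -> std::io::Result<u64> {
    let mut reader = alignment::io::reader::Builder::default().build_from_path(src)?;
    let header = reader.read_header()?;

    let mut n = 0;

    for result in reader.records(&header) {
        let _record = result?;
        n += 1;
    }

    Ok(n)
}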

I believe this feature is now better implemented, so I'll be closing this issue. Please open a new issue if there are any problems.