zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
477 stars 53 forks source link

noodles bam failled to write record with large mapping position. #229

Closed natir closed 6 months ago

natir commented 6 months ago

The bam writer cannot write a record with a position greater than 997,048,321, whereas the specification indicates that the max value is 2^31 - 1.

Col Field Type Regexp/Range
4 POS Int [0, $2^{31} - 1$]

There's no problem for the sam writer here's a small example:

use noodles_bam as bam;
use noodles_sam as sam;
use noodles_util::alignment;

fn main() {
    let mut writer = bam::io::Writer::new(Vec::new());
    let header = sam::Header::default();

    let record1 = sam::alignment::RecordBuf::builder()
        .set_alignment_start(
            noodles_core::Position::new(997048321).expect("Position conversion failled"),
        )
        .build();
    let record2 = sam::alignment::RecordBuf::builder()
        .set_alignment_start(
            noodles_core::Position::new(997048322).expect("Position conversion failled"),
        )
        .build();

    let mut sam_writer = alignment::io::writer::builder::Builder::default()
        .set_format(alignment::io::Format::Sam)
        .build_from_writer(std::io::stdout().lock())
        .unwrap();
    sam_writer
        .write_record(&header, &record1)
        .expect("sam write record1 failed");
    sam_writer
        .write_record(&header, &record2)
        .expect("sam write record2 failed");
    println!("Write sam record success");

    let mut bam_writer = alignment::io::writer::builder::Builder::default()
        .set_format(alignment::io::Format::Bam)
        .build_from_writer(std::io::stdout().lock())
        .unwrap();
    bam_writer
        .write_record(&header, &record1)
        .expect("bam write record1 failed");
    bam_writer
        .write_record(&header, &record2)
        .expect("bam write record2 failed");
    println!("Write bam record success");
}
zaeleus commented 6 months ago

Oh, that's an interesting case!

Writing the position isn't failing but calculating the BAM index bin ID is. While the max position for BAM is indeed 231 - 1 (0-based), the max supported position in BAI is 229 - 1 (0-based). Each BAM record does include its supposed bin ID if it were to be placed in an index, calculated from its position, and 997048321 is the first position that overflows the field. (This position works in your example due to the alignment end being incorrectly calculated for records with no reference-consuming CIGAR operations. This is fixed in d19eb8e487440a444884088537f74ec881e87a71.)

The specification doesn't say what to do with this field when you have a nonsensical value (for BAM, > 37449) or when the bin ID overflows. I ended up taking the same approach as htslib and using the overflowing values.

Thanks for the report and test case! This fix is released in noodles 0.62.0.


If you're interested in the math, binning indices are m-ary trees, where $m$ = 8. For BAI, depth $d$ of the tree is 5, and the window size $w$ is $2^{s}$, where $s$ is 14. The number of nodes in the last level of the tree $L$ is $m^{d}$. A BAM bin ID is 2 bytes, meaning there are $u = 2^{16}$ values. The total number of nodes in the tree $N$, which is also the maximum bin ID, is $\frac{m * L - 1}{m - 1}$. The max position $P$ is $Lw$.

The last position that does not overflow the bin ID is given by

\begin{align}
p_{last} &= P + w(u - N) \\
         &= Lw + (2^{s})(u - \frac{m * L - 1}{m - 1}) \\
         &= (m^{d})(2^{s}) + (2^{s})(u - \frac{m * m^{d} - 1}{m - 1}) \\
         &= (8^5)(2^{14}) + (2^{14})(2^{16} - \frac{8 * 8^{5} - 1}{8 - 1}) \\
         &= 997048320
\end{align}

Therefore, 997048320 + 1 = 997048321 (1-based) will overflow the bin ID calculation.