zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
512 stars 53 forks source link

Alignment span calculation (bug or misunderstanding) #230

Closed rob-p closed 9 months ago

rob-p commented 9 months ago

Hi @zaeleus,

I'm attempting to parse some alignments using the newest version of noodles, and I'm running into some strange unexpected behavior. I've attached a BAM file here that has the records that reproduce this behavior.

Basically, I have a series of alignments (in this case 6, but it's happening a lot in my data) for a read. All of them have the read aligned to reference sequence, one alignment is primary (as required) and the rest are secondary. This data is generated using minimap2.

What I observe is that, although all of these records represent alignments and have valid CIGAR strings, only one of them is producing a valid (correct) alignment_span. Specifically this is the last alignment, whose span is matching it's CIGAR string 12S80M24S (span of 80).

The rest of the alignments all have a span of None (implying a calculated value of 0), despite having valid CIGAR strings. The records are below (these are the exact ones encoded in the BAM along with the header).

SRR14286054.2187        0       ENST00000435275 527     1       5S111M  *       0       0       GGGCAAAAAAGAAGTGAGCTGGAGATTGGATCACAGCCGAAGGAGTAAAGGTGCTGCAATGATGTTAGCTGTGGCCACTGTGGATTTTTCGCAAGAACATTAATAAACTAAAAACT       13:;>?:75--:<42215<:@=<7-,.=>C@=111-*+*+...,'56921<87/1.-.-;?>@@>@-<@?6**-136:>378;;QJNLFB8;QH?EB8ACDEDCIA:19A=<7)%-    NM:i:1  ms:i:216        AS:i:216        nn:i:0  tp:A:P     cm:i:17 s1:i:97 s2:i:98 de:f:0.009      rl:i:0
SRR14286054.2187        256     ENST00000360830 407     0       11S105M *       0       0       *       *       NM:i:0  ms:i:210        AS:i:210        nn:i:0  tp:A:S  cm:i:18 s1:i:98 de:f:0  rl:i:0
SRR14286054.2187        256     ENST00000480662 605     0       12S104M *       0       0       *       *       NM:i:0  ms:i:208        AS:i:208        nn:i:0  tp:A:S  cm:i:18 s1:i:98 de:f:0  rl:i:0
SRR14286054.2187        256     ENST00000372360 426     0       34S82M  *       0       0       *       *       NM:i:0  ms:i:164        AS:i:164        nn:i:0  tp:A:S  cm:i:12 s1:i:72 de:f:0  rl:i:0
SRR14286054.2187        256     ENST00000482069 456     0       34S82M  *       0       0       *       *       NM:i:0  ms:i:164        AS:i:164        nn:i:0  tp:A:S  cm:i:12 s1:i:72 de:f:0  rl:i:0
SRR14286054.2187        256     ENST00000485708 465     0       12S80M24S       *       0       0       *       *       NM:i:0  ms:i:160        AS:i:160        nn:i:0  tp:A:S  cm:i:13 s1:i:71 de:f:0     rl:i:0

I have minimal code to reproduce this; it's just:

use noodles_bam as bam;
use noodles_sam::alignment::record::Record;

fn main() -> Result<(), std::io::Error>{
    let mut reader = bam::io::reader::Builder::default().build_from_path("SAMPLE.bam")?;
    let header = reader.read_header()?;

    for result in reader.records() {
        let record = result?;
        let aspan = record.alignment_span().expect("should have alignment span");
        println!("record = {:?}\nalignment span = {:?}\n\n", record, aspan);
    }
    Ok(())
}

and the output is

record = Record { reference_sequence_id: Some(Ok(98279)), alignment_start: Some(Ok(Position(527))), mapping_quality: Some(MappingQuality(1)), flags: Flags(0x0), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55, 0])), cigar: Cigar([]), sequence: Sequence { src: [68, 66, 17, 17, 17, 65, 20, 132, 20, 40, 68, 20, 24, 132, 65, 130, 18, 20, 34, 65, 20, 65, 72, 17, 20, 72, 66, 132, 33, 24, 65, 132, 136, 20, 40, 72, 68, 34, 18, 132, 132, 65, 136, 136, 130, 66, 17, 65, 18, 24, 129, 24, 17, 18, 129, 17, 17, 40], base_count: 116 }, quality_scores: QualityScores([16, 18, 25, 26, 29, 30, 25, 22, 20, 12, 12, 25, 27, 19, 17, 17, 16, 20, 27, 25, 31, 28, 27, 22, 12, 11, 13, 28, 29, 34, 31, 28, 16, 16, 16, 12, 9, 10, 9, 10, 13, 13, 13, 11, 6, 20, 21, 24, 17, 16, 27, 23, 22, 14, 16, 13, 12, 13, 12, 26, 30, 29, 31, 31, 29, 31, 12, 27, 31, 30, 21, 9, 9, 12, 16, 18, 21, 25, 29, 18, 22, 23, 26, 26, 48, 41, 45, 43, 37, 33, 23, 26, 48, 39, 30, 36, 33, 23, 32, 34, 35, 36, 35, 34, 40, 32, 25, 16, 24, 32, 28, 27, 22, 8, 4, 12]), data: {Tag("NM"): UInt8(1), Tag("ms"): UInt8(216), Tag("AS"): UInt8(216), Tag("nn"): UInt8(0), Tag("tp"): Character(80), Tag("cm"): UInt8(17), Tag("s1"): UInt8(97), Tag("s2"): UInt8(98), Tag("de"): Float(0.009), Tag("rl"): UInt8(0)} }
alignment span = None

record = Record { reference_sequence_id: Some(Ok(98286)), alignment_start: Some(Ok(Position(407))), mapping_quality: Some(MappingQuality(0)), flags: Flags(SECONDARY), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55, 0])), cigar: Cigar([]), sequence: Sequence { src: [], base_count: 0 }, quality_scores: QualityScores([]), data: {Tag("NM"): UInt8(0), Tag("ms"): UInt8(210), Tag("AS"): UInt8(210), Tag("nn"): UInt8(0), Tag("tp"): Character(83), Tag("cm"): UInt8(18), Tag("s1"): UInt8(98), Tag("de"): Float(0.0), Tag("rl"): UInt8(0)} }
alignment span = None

record = Record { reference_sequence_id: Some(Ok(98291)), alignment_start: Some(Ok(Position(605))), mapping_quality: Some(MappingQuality(0)), flags: Flags(SECONDARY), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55, 0])), cigar: Cigar([]), sequence: Sequence { src: [], base_count: 0 }, quality_scores: QualityScores([]), data: {Tag("NM"): UInt8(0), Tag("ms"): UInt8(208), Tag("AS"): UInt8(208), Tag("nn"): UInt8(0), Tag("tp"): Character(83), Tag("cm"): UInt8(18), Tag("s1"): UInt8(98), Tag("de"): Float(0.0), Tag("rl"): UInt8(0)} }
alignment span = None

record = Record { reference_sequence_id: Some(Ok(98283)), alignment_start: Some(Ok(Position(426))), mapping_quality: Some(MappingQuality(0)), flags: Flags(SECONDARY), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55, 0])), cigar: Cigar([]), sequence: Sequence { src: [], base_count: 0 }, quality_scores: QualityScores([]), data: {Tag("NM"): UInt8(0), Tag("ms"): UInt8(164), Tag("AS"): UInt8(164), Tag("nn"): UInt8(0), Tag("tp"): Character(83), Tag("cm"): UInt8(12), Tag("s1"): UInt8(72), Tag("de"): Float(0.0), Tag("rl"): UInt8(0)} }
alignment span = None

record = Record { reference_sequence_id: Some(Ok(98289)), alignment_start: Some(Ok(Position(456))), mapping_quality: Some(MappingQuality(0)), flags: Flags(SECONDARY), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55, 0])), cigar: Cigar([]), sequence: Sequence { src: [], base_count: 0 }, quality_scores: QualityScores([]), data: {Tag("NM"): UInt8(0), Tag("ms"): UInt8(164), Tag("AS"): UInt8(164), Tag("nn"): UInt8(0), Tag("tp"): Character(83), Tag("cm"): UInt8(12), Tag("s1"): UInt8(72), Tag("de"): Float(0.0), Tag("rl"): UInt8(0)} }
alignment span = None

record = Record { reference_sequence_id: Some(Ok(98281)), alignment_start: Some(Ok(Position(465))), mapping_quality: Some(MappingQuality(0)), flags: Flags(SECONDARY), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55, 0])), cigar: Cigar([196, 0, 0, 0, 0, 5, 0, 0, 132, 1, 0, 0]), sequence: Sequence { src: [], base_count: 0 }, quality_scores: QualityScores([]), data: {Tag("NM"): UInt8(0), Tag("ms"): UInt8(160), Tag("AS"): UInt8(160), Tag("nn"): UInt8(0), Tag("tp"): Character(83), Tag("cm"): UInt8(13), Tag("s1"): UInt8(71), Tag("de"): Float(0.0), Tag("rl"): UInt8(0)} }
alignment span = Some(80)

Any idea what might be going on here? In some sense, it's even stranger to me that one of the spans is correct, becuase I don't understand why the rest are empty.

The bam file used here (had to additionally zip it to satisfy github to upload): SAMPLE.bam.zip.

Thanks! Rob

P.S. As a follow up, if I request the CIGAR strings from each of these records using the cigar method, I get the following — which matches with the incorrect spans. I note however, that there is both the cigar record on the BAM record and a cigar method under the sam::alignment::Record impl for the bam::Record. It's not unclear to me why the direct method fails to return the proper data.

cigar = Cigar([])
cigar = Cigar([])
cigar = Cigar([])
cigar = Cigar([])
cigar = Cigar([])
cigar = Cigar([196, 0, 0, 0, 0, 5, 0, 0, 132, 1, 0, 0])
rob-p commented 9 months ago

The plot thickens (at least for me)!

If I use the buffered (eager) instead of lazy iterator, the fields seem to be correct. That is, if I use

 for result in reader.record_bufs(&header) {

instead of

 for result in reader.records() {

to iterate over the records, the cigar strings and the other fields seem ok --- that is, I get

record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(0x0), reference_sequence_id: Some(98279), alignment_start: Some(Position(527)), mapping_quality: Some(MappingQuality(1)), cigar: Cigar([Op { kind: SoftClip, len: 5 }, Op { kind: Match, len: 111 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([71, 71, 71, 67, 65, 65, 65, 65, 65, 65, 71, 65, 65, 71, 84, 71, 65, 71, 67, 84, 71, 71, 65, 71, 65, 84, 84, 71, 71, 65, 84, 67, 65, 67, 65, 71, 67, 67, 71, 65, 65, 71, 71, 65, 71, 84, 65, 65, 65, 71, 71, 84, 71, 67, 84, 71, 67, 65, 65, 84, 71, 65, 84, 71, 84, 84, 65, 71, 67, 84, 71, 84, 71, 71, 67, 67, 65, 67, 84, 71, 84, 71, 71, 65, 84, 84, 84, 84, 84, 67, 71, 67, 65, 65, 71, 65, 65, 67, 65, 84, 84, 65, 65, 84, 65, 65, 65, 67, 84, 65, 65, 65, 65, 65, 67, 84]), quality_scores: QualityScores([16, 18, 25, 26, 29, 30, 25, 22, 20, 12, 12, 25, 27, 19, 17, 17, 16, 20, 27, 25, 31, 28, 27, 22, 12, 11, 13, 28, 29, 34, 31, 28, 16, 16, 16, 12, 9, 10, 9, 10, 13, 13, 13, 11, 6, 20, 21, 24, 17, 16, 27, 23, 22, 14, 16, 13, 12, 13, 12, 26, 30, 29, 31, 31, 29, 31, 12, 27, 31, 30, 21, 9, 9, 12, 16, 18, 21, 25, 29, 18, 22, 23, 26, 26, 48, 41, 45, 43, 37, 33, 23, 26, 48, 39, 30, 36, 33, 23, 32, 34, 35, 36, 35, 34, 40, 32, 25, 16, 24, 32, 28, 27, 22, 8, 4, 12]), data: Data([(Tag("NM"), UInt8(1)), (Tag("ms"), UInt8(216)), (Tag("AS"), UInt8(216)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(80)), (Tag("cm"), UInt8(17)), (Tag("s1"), UInt8(97)), (Tag("s2"), UInt8(98)), (Tag("de"), Float(0.009)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(111)

record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98286), alignment_start: Some(Position(407)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 11 }, Op { kind: Match, len: 105 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(210)), (Tag("AS"), UInt8(210)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(18)), (Tag("s1"), UInt8(98)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(105)

record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98291), alignment_start: Some(Position(605)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 12 }, Op { kind: Match, len: 104 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(208)), (Tag("AS"), UInt8(208)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(18)), (Tag("s1"), UInt8(98)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(104)

record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98283), alignment_start: Some(Position(426)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 34 }, Op { kind: Match, len: 82 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(164)), (Tag("AS"), UInt8(164)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(12)), (Tag("s1"), UInt8(72)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(82)

record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98289), alignment_start: Some(Position(456)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 34 }, Op { kind: Match, len: 82 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(164)), (Tag("AS"), UInt8(164)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(12)), (Tag("s1"), UInt8(72)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(82)

record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98281), alignment_start: Some(Position(465)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 12 }, Op { kind: Match, len: 80 }, Op { kind: SoftClip, len: 24 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(160)), (Tag("AS"), UInt8(160)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(13)), (Tag("s1"), UInt8(71)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(80)

Presumably, the precise method of iteration shouldn't matter, right — or at least one might expect any differences in behavior as critical as this to be loudly documented ;P.

Thanks! Rob

zaeleus commented 9 months ago

Thanks for using/testing the new BAM implementation and reporting this!

The BAM format is limited to 216-1 CIGAR operations but can overflow to an auxiliary data field if necessary. This is done by setting the record CIGAR operations buffer to a flag equal to kSmN, where k is the base count and m is the reference sequence length.[^1] The lazy decoder was accidentally consuming the CIGAR operations buffer when doing this check, which is only done for CIGARs with two operations. This is why the first five records in your example return empty operations, and the last one returns the correct value.

The RecordBuf reader fully decodes the record CIGAR operations and then tries to resolve whether it's an overflow flag, which is why it's not affected.

Sorry for the inconvenience. This is fixed in noodles 0.62.1 / noodles-bam 0.54.1.

[^1]: § 4.2.2 "N_CIGAR_OP field" (2023-05-24).

rob-p commented 9 months ago

Thanks for the super quick-fix and for the explanation. The pointer regarding the overflow field is also very useful. In this case, what would the cigar() function return? I must admit when I see Result<Option<T>> I immediately want to understand the error condition versus the normal empty condition, and where different information will be propagated back to the caller.

Thanks again for noodles. We've been using it across different projects and it's awesome to have a library for this in pure Rust, for so many different formats, and with such a responsive developer!

zaeleus commented 9 months ago

In this case, what would the cigar() function return?

bam::record::Cigar wraps a raw BAM CIGAR operations buffer. It can be either the cigar field or, if the flags as described above is set, the value of the CG data field.

Take, for example, the regression tests that were added for this issue.

https://github.com/zaeleus/noodles/blob/be7d7bdad2a6207784e725d2cfdc14074f7910ac/noodles-bam/src/record/fields.rs#L225-L305

test_cigar is the typical case; it wraps the cigar field. test_cigar_with_2_cigar_ops has two CIGAR operations that do not match the overflow flag, and it, again, wraps the cigar field. test_cigar_with_overflowing_cigar has two CIGAR operations that do match the overflow flag, but instead of returning the cigar field, it searches the record auxiliary data for a CG tag and returns that value.

I must admit when I see Result<Option<T>> I immediately want to understand the error condition versus the normal empty condition, and where different information will be propagated back to the caller.

In the case of Record::alignment_span, it's fallible because it requires decoding the CIGAR operations, which can fail (e.g., an invalid operation type; see also Cigar::alignment_span). When an alignment span is calculated to be 0, we consider the record alignment span to be missing: the record could be unmapped (therefore, no CIGAR operations), it has no reference-consuming operations (which may be nonsensical), etc. There could certainly be a counterexample that I'm not aware of, and if that's the case, please let me know!