Closed rob-p closed 9 months ago
The plot thickens (at least for me)!
If I use the buffered (eager) instead of lazy iterator, the fields seem to be correct. That is, if I use
for result in reader.record_bufs(&header) {
instead of
for result in reader.records() {
to iterate over the records, the cigar strings and the other fields seem ok --- that is, I get
record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(0x0), reference_sequence_id: Some(98279), alignment_start: Some(Position(527)), mapping_quality: Some(MappingQuality(1)), cigar: Cigar([Op { kind: SoftClip, len: 5 }, Op { kind: Match, len: 111 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([71, 71, 71, 67, 65, 65, 65, 65, 65, 65, 71, 65, 65, 71, 84, 71, 65, 71, 67, 84, 71, 71, 65, 71, 65, 84, 84, 71, 71, 65, 84, 67, 65, 67, 65, 71, 67, 67, 71, 65, 65, 71, 71, 65, 71, 84, 65, 65, 65, 71, 71, 84, 71, 67, 84, 71, 67, 65, 65, 84, 71, 65, 84, 71, 84, 84, 65, 71, 67, 84, 71, 84, 71, 71, 67, 67, 65, 67, 84, 71, 84, 71, 71, 65, 84, 84, 84, 84, 84, 67, 71, 67, 65, 65, 71, 65, 65, 67, 65, 84, 84, 65, 65, 84, 65, 65, 65, 67, 84, 65, 65, 65, 65, 65, 67, 84]), quality_scores: QualityScores([16, 18, 25, 26, 29, 30, 25, 22, 20, 12, 12, 25, 27, 19, 17, 17, 16, 20, 27, 25, 31, 28, 27, 22, 12, 11, 13, 28, 29, 34, 31, 28, 16, 16, 16, 12, 9, 10, 9, 10, 13, 13, 13, 11, 6, 20, 21, 24, 17, 16, 27, 23, 22, 14, 16, 13, 12, 13, 12, 26, 30, 29, 31, 31, 29, 31, 12, 27, 31, 30, 21, 9, 9, 12, 16, 18, 21, 25, 29, 18, 22, 23, 26, 26, 48, 41, 45, 43, 37, 33, 23, 26, 48, 39, 30, 36, 33, 23, 32, 34, 35, 36, 35, 34, 40, 32, 25, 16, 24, 32, 28, 27, 22, 8, 4, 12]), data: Data([(Tag("NM"), UInt8(1)), (Tag("ms"), UInt8(216)), (Tag("AS"), UInt8(216)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(80)), (Tag("cm"), UInt8(17)), (Tag("s1"), UInt8(97)), (Tag("s2"), UInt8(98)), (Tag("de"), Float(0.009)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(111)
record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98286), alignment_start: Some(Position(407)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 11 }, Op { kind: Match, len: 105 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(210)), (Tag("AS"), UInt8(210)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(18)), (Tag("s1"), UInt8(98)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(105)
record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98291), alignment_start: Some(Position(605)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 12 }, Op { kind: Match, len: 104 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(208)), (Tag("AS"), UInt8(208)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(18)), (Tag("s1"), UInt8(98)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(104)
record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98283), alignment_start: Some(Position(426)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 34 }, Op { kind: Match, len: 82 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(164)), (Tag("AS"), UInt8(164)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(12)), (Tag("s1"), UInt8(72)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(82)
record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98289), alignment_start: Some(Position(456)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 34 }, Op { kind: Match, len: 82 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(164)), (Tag("AS"), UInt8(164)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(12)), (Tag("s1"), UInt8(72)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(82)
record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98281), alignment_start: Some(Position(465)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 12 }, Op { kind: Match, len: 80 }, Op { kind: SoftClip, len: 24 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(160)), (Tag("AS"), UInt8(160)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(13)), (Tag("s1"), UInt8(71)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(80)
Presumably, the precise method of iteration shouldn't matter, right — or at least one might expect any differences in behavior as critical as this to be loudly documented ;P.
Thanks! Rob
Thanks for using/testing the new BAM implementation and reporting this!
The BAM format is limited to 216-1 CIGAR operations but can overflow to an auxiliary data field if necessary. This is done by setting the record CIGAR operations buffer to a flag equal to kSmN, where k is the base count and m is the reference sequence length.[^1] The lazy decoder was accidentally consuming the CIGAR operations buffer when doing this check, which is only done for CIGARs with two operations. This is why the first five records in your example return empty operations, and the last one returns the correct value.
The RecordBuf
reader fully decodes the record CIGAR operations and then tries to resolve whether it's an overflow flag, which is why it's not affected.
Sorry for the inconvenience. This is fixed in noodles 0.62.1 / noodles-bam 0.54.1.
[^1]: § 4.2.2 "N_CIGAR_OP field" (2023-05-24).
Thanks for the super quick-fix and for the explanation. The pointer regarding the overflow field is also very useful. In this case, what would the cigar()
function return? I must admit when I see Result<Option<T>>
I immediately want to understand the error condition versus the normal empty condition, and where different information will be propagated back to the caller.
Thanks again for noodles
. We've been using it across different projects and it's awesome to have a library for this in pure Rust, for so many different formats, and with such a responsive developer!
In this case, what would the
cigar()
function return?
bam::record::Cigar
wraps a raw BAM CIGAR operations buffer. It can be either the cigar
field or, if the flags as described above is set, the value of the CG
data field.
Take, for example, the regression tests that were added for this issue.
test_cigar
is the typical case; it wraps the cigar
field. test_cigar_with_2_cigar_ops
has two CIGAR operations that do not match the overflow flag, and it, again, wraps the cigar
field. test_cigar_with_overflowing_cigar
has two CIGAR operations that do match the overflow flag, but instead of returning the cigar
field, it searches the record auxiliary data for a CG
tag and returns that value.
I must admit when I see
Result<Option<T>>
I immediately want to understand the error condition versus the normal empty condition, and where different information will be propagated back to the caller.
In the case of Record::alignment_span
, it's fallible because it requires decoding the CIGAR operations, which can fail (e.g., an invalid operation type; see also Cigar::alignment_span
). When an alignment span is calculated to be 0, we consider the record alignment span to be missing: the record could be unmapped (therefore, no CIGAR operations), it has no reference-consuming operations (which may be nonsensical), etc. There could certainly be a counterexample that I'm not aware of, and if that's the case, please let me know!
Hi @zaeleus,
I'm attempting to parse some alignments using the newest version of
noodles
, and I'm running into some strange unexpected behavior. I've attached a BAM file here that has the records that reproduce this behavior.Basically, I have a series of alignments (in this case 6, but it's happening a lot in my data) for a read. All of them have the read aligned to reference sequence, one alignment is primary (as required) and the rest are secondary. This data is generated using
minimap2
.What I observe is that, although all of these records represent alignments and have valid CIGAR strings, only one of them is producing a valid (correct)
alignment_span
. Specifically this is the last alignment, whose span is matching it's CIGAR string12S80M24S
(span of 80).The rest of the alignments all have a span of
None
(implying a calculated value of 0), despite having valid CIGAR strings. The records are below (these are the exact ones encoded in the BAM along with the header).I have minimal code to reproduce this; it's just:
and the output is
Any idea what might be going on here? In some sense, it's even stranger to me that one of the spans is correct, becuase I don't understand why the rest are empty.
The bam file used here (had to additionally zip it to satisfy github to upload): SAMPLE.bam.zip.
Thanks! Rob
P.S. As a follow up, if I request the CIGAR strings from each of these records using the
cigar
method, I get the following — which matches with the incorrect spans. I note however, that there is both thecigar
record on the BAM record and acigar
method under thesam::alignment::Record
impl for thebam::Record
. It's not unclear to me why the direct method fails to return the proper data.