zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
482 stars 52 forks source link

query_unmapped fails #200

Closed drtconway closed 12 months ago

drtconway commented 1 year ago

Hi.

Again thank you for the library - I'm loving it!

My code does approximately the following:

let mut reader = bam::indexed_reader::Builder::default().build_from_path(bam)?;
let hdr = reader.read_header()?;
let unmapped = reader.query_unmapped()?;
while let Some(rec_res) = unmapped.next() {
    let rec = rec_res?;
    ...
}

And the let in the while loop fails with:

Custom { kind: InvalidData, error: InvalidReferenceSequenceId(MissingReferenceSequenceDictionaryEntry { actual: 3356, expected: 0 }) }

Examination of the BAM shows that for reads with one end mapped and the other end unmapped, it stores the unmapped end with the same RNAME and POS as the mapped end, which is obviously convenient for a number of use cases.

(The reference sequence ID is valid - it's a hg38 BAM with all the HLA, and other bits and pieces in it - 3366 reference sequences.)

When I realised this, I was shocked, but I then went and read the spec, and it appears to be legal:

Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800.

Am I right that this an over zealous check in the parsing code?

zaeleus commented 1 year ago

Are you able to update to noodles 0.50.0 / noodles-bam 0.44.0? Reader::query_unmapped now requires the header, which should fix this issue.

  • bam/reader: Require a SAM header when querying for unmapped records (Reader::query_unmapped).

    It's possible for a chunk to include mapped records, which are subsequently filtered out, but they do require the associated header to decode.

drtconway commented 1 year ago

Great! That appears to work, thank you!

While I have your attention, can I ask a related question?

So by my reading of the spec, I think it is allowable, for an unmapped read, to have RNAME set, but POS as 0.

Does the query range allow 0 to be part of the interval? My understanding is that Position starts at 1.

I guess the simplest form of query to achieve my goals, would be to do a query that just iterates over anything with the given RNAME.

zaeleus commented 1 year ago

So by my reading of the spec, I think it is allowable, for an unmapped read, to have RNAME set, but POS as 0.

This is not allowed in by the spec. An unmapped record with no coordinate (i.e., POS = 0) does not have a reference sequence name (i.e., RNAME = *). See § 1.4.3 (2022-08-22): "An unmapped segment without coordinate has a * at [RNAME]."

Does the query range allow 0 to be part of the interval?

No, 0 is not a coordinate that can be queried from the index. To query a full region, use an unbounded range, e.g., Region::new("sq0", ..).