zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
477 stars 53 forks source link

migrating from 0.49 to 0.65: generating invalid BAM files #244

Closed jamestwebber closed 5 months ago

jamestwebber commented 5 months ago

Every once in a while I touch this code and find that noodles has gone up 20 versions 😁 This time, I'm going from noodles 0.49 / noodles-bam 0.43 to 0.65 / 0.56 respectively. I was able to get my code running by looking through the changelog--I just needed to not read/write reference sequences, and change how I parsed the tag I need. Now my code runs, but the output appears to be invalid for some reason.

Trying samtools view on the result gets me [E::sam_format1_append] Corrupted aux data for read [read name]. I'm not sure how to diagnose what's wrong here. If I try to read the output with pysam it gives me KeyError: "unknown type '71'" in the tags, so I'm guessing the tags are being mangled somehow.

I'm just filtering a BAM file based on a tag (an array of two uint16) and separating the reads into different output BAMs. I use async readers and writers to speed things up.

Code example ```rust info!("Reading from {}", input_bam.display()); let (mut reader, header) = make_reader(&input_bam).await?; ... let mut writers = make_writers(&header, output_bams).await?; debug!("Reading records from BAM"); let mut records = reader.records(); while let Some(record) = records.try_next().await? { if let Some(Ok(Value::Array(Array::UInt16(bc_val)))) = record.data().get(&BC) { let bc_val: Vec<_> = bc_val.iter().filter_map(|s| s.ok()).collect(); if bc_val.len() != 2 { debug!("bc array with length {}, that's weird!", bc_val.len()); } else { if let Some(p) = barcode_map.get(&bc_val) { if let Some(writer) = writers.get_mut(p) { writer.write_record(&header, &record).await?; } } } } } for writer in writers.values_mut() { writer.shutdown().await?; } ```

That code is slightly modified from before to extract the tag correctly, but not much has changed. What did I miss here?

zaeleus commented 5 months ago

It's a bug, sorry! There was a typo in how value counts were calculated. It affects noodles 0.61.0-0.65.0/noodles-bam 0.53.0-0.56.0 when using BAM format records (bam::Record) that have data array fields.

This is now fixed in noodles 0.66.0/noodles-bam 0.57.0. Thanks for reporting!

jamestwebber commented 5 months ago

Thanks for the ultra-fast bugfix as usual