zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
482 stars 52 forks source link

How to (async) write secondary BAM records #176

Closed ArtRand closed 1 year ago

ArtRand commented 1 year ago

Hello,

Firstly, I'm really excited about using this library. My question is probably of the "holding it wrong" variety. I am investigating using the bam::AsyncReader and bam::AsyncWriter, when I encounter a secondary/reverse-complement secondary alignment I get this error from write_record: read length-sequence length mismatch. I couldn't tell from the documentation if there is something I need to do in order to write these records. Thanks for the help.

simplified code example

async fn main() -> anyhow::Result<()> {
    // setup..

    let mut reader = File::open(&cli.in_bam).await.map(bam::AsyncReader::new)?;
    let header = reader
        .read_header()
        .await?
        .parse()?;

    let _reference_sequences = reader
        .read_reference_sequences()
        .await?;

    let mut writer = File::create(&cli.out_bam)
        .await
        .map(|fh| bam::AsyncWriter::new(fh))?;
    writer
        .write_header(&header)
        .await?;
    writer
        .write_reference_sequences(header.reference_sequences())
        .await?;

    // 
    // ..snip...
    // 

    let mut records = reader.records(&header);
    while let Some(result) = records.next().await {
        match result {
            Ok(mut record) => {
                record.data_mut().insert(tag, tag_value.clone());
                let ok = writer.write_record(&header, &record).await;
                match ok {
                    Ok(_) => pb.inc(1),
                    Err(e) => {
                        let record_name = record.read_name().map(|n| n.to_string());
                        println!(
                            "there was an err {} with {:?}, flags {:?}",
                            e.to_string(),
                            record_name,
                            record.flags()
                        );
                    }
                }
            }
            Err(_) => erred += 1,
        }
    }
    writer.shutdown().await?;
    Ok(())
}
zaeleus commented 1 year ago

Hi @ArtRand,

That error means the CIGAR is invalid for the given sequence. Are you able to provide an example input or the (SAM) record that causes this?

ArtRand commented 1 year ago

Hello @zaeleus thanks for the quick response.

Here's an example of an offending record:

900b1b9d-9282-43a3-abe8-2d80bbd0530a    256     chr1    10001   0       726S10M5I23M1I28M2I2M2I45M1I38M1D12M1I17M4I6M2I65M1D11M1I32M1I38M1D6M1I16M1I31M1I6M1D6M1D6M1I65M2276S   *       0       0       *       *       NM:i:41 ms:i:731        AS:i:720        nn:i:0  tp:A:S  cm:i:16 s1:i:118        de:f:0.0643     rl:i:781      qs:i:9  du:f:13.799     ns:i:55196      ts:i:10 mx:i:3  ch:i:448        st:Z:2022-04-11T19:12:20.240+00:00      rn:i:4913       f5:Z:chr20-chr20.pod5   sm:f:84.638     sd:f:22.55      sv:Z:quantile   MM:Z:C+h?,4,76,52,672,157,7,183,34,0,114,3,14,1,10,5,0,3,1,6,7,0,1,0,0,2,9,8,3,14,14,2,2,3,3,2,2,0,0,2,2,1,0,0,0,1,0,1,1,0,1,0,1,1,0,0,1,0,1,1,0,0,1,0,1,1,0,1,0,0,0,0,1,0,1,1,0,0,1,0,1,1,0,0,1,1,1,1,1,0,1,1,0,0,1,0,0,1,0,2,0,1,1,1,1,0,1,1,0,1,0,1,1,0,0,0,1,0,1;C+m?,4,76,52,672,157,7,183,34,0,114,3,14,1,10,5,0,3,1,6,7,0,1,0,0,2,9,8,3,14,14,2,2,3,3,2,2,0,0,2,2,1,0,0,0,1,0,1,1,0,1,0,1,1,0,0,1,0,1,1,0,0,1,0,1,1,0,1,0,0,0,0,1,0,1,1,0,0,1,0,1,1,0,0,1,1,1,1,1,0,1,1,0,0,1,0,0,1,0,2,0,1,1,1,1,0,1,1,0,1,0,1,1,0,0,0,1,0,1;       ML:B:C,13,10,253,22,18,4,6,102,224,8,3,253,8,3,43,49,20,112,232,3,31,136,72,27,80,3,81,0,4,42,12,4,40,185,38,188,145,56,0,14,31,6,113,0,0,3,82,37,157,1,1,53,124,2,2,0,45,51,197,45,245,0,21,38,157,138,232,104,30,25,195,2,2,12,21,14,240,255,46,80,16,28,162,4,1,52,241,45,10,10,103,30,71,1,29,204,185,115,0,72,30,52,0,0,12,31,30,55,175,6,12,112,179,9,0,6,4,143,0,0,1,173,0,0,240,153,4,0,0,2,0,252,210,206,234,143,21,18,223,116,183,227,170,252,173,255,0,213,243,0,214,70,216,37,110,197,255,240,224,249,142,255,255,252,173,218,98,254,253,202,130,253,253,255,210,204,58,210,10,255,234,217,96,100,22,147,225,228,60,253,253,243,233,241,15,0,203,175,239,226,93,251,0,203,14,210,245,245,152,224,184,253,226,46,70,140,255,183,225,203,255,254,242,224,225,200,79,249,243,143,76,246,255,249,251,112

I think the problem originates from comes from put_sequence where the logic may need to be adjusted to

     // § 1.4.10 "`SEQ`" (2022-08-22): "If not a '*', the length of the sequence must equal the sum
     // of lengths of `M`/`I`/`S`/`=`/`X` operations in `CIGAR`."
-    if read_length > 0 && sequence.len() != read_length {
+    if sequence.len() > 0 && sequence.len() != read_length {
         return Err(io::Error::new(
             io::ErrorKind::InvalidInput,
             "read length-sequence length mismatch",

since secondary alignments will have * for the SEQ field but non-zero CIGAR lengths. The above diff causes some of the tests to fail. I can chase those down and submit a PR if you think it's the correct approach.

zaeleus commented 1 year ago

Thanks for the example and investigation! This is indeed a bug.

Yes, that's the function that's failing. In this case, the correct approach would be to simply return if the sequence is empty. See the SAM sequence writer for a example (though note you don't have to write any output for BAM). Feel free to submit a patch!

ArtRand commented 1 year ago

Great. PR is open. Happy to accept any comments and suggestions.