zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
482 stars 52 forks source link

Invalid data when reading a BAM file #179

Closed bistace closed 1 year ago

bistace commented 1 year ago

Hello and thank you for developing noodles!

This is probably a very silly question and I must have missed something obvious but I am met with this error when trying to read a bam file:

Custom { kind: InvalidData, error: InvalidReadName(Invalid) }

Here is a MVP of what I did to encounter this error:

use noodles::sam::Header;
use noodles_util::alignment;
use std::io::BufRead;

pub fn get_reader_and_header(path: &str) -> (alignment::Reader<Box<dyn BufRead>>, Header) {
    let mut reader = get_reader(path);
    let header = reader
        .read_header()
        .unwrap_or_else(|e| panic!("Error reading the header of '{path}': {e:?}"));
    (reader, header)
}

fn read_bam(bam: &str) {
    let (mut reader, header) = get_reader_and_header(bam);
    for record in reader.records(&header) {
        let record = record.unwrap_or_else(|e| panic!("Failed to read record: {e:?}"));
    }
}

fn main() {
    read_bam("test.bam");
}

Could you please help me understand what is happening? Samtools view does not give any error when reading this BAM file.

zaeleus commented 1 year ago

Hi, @bistace.

Can you provide the read name of the record that causes this error? A small example input would also be helpful.

Your code seems fine, particularly since the header is being read and parsed correctly. You can try running the util_alignment_view example to confirm the error.

bistace commented 1 year ago

The code (mine and the one in the util_alignment_view example) fails to read even the first record, which looks like that with samtools view | head -n 1:

18460451        163     5       10      0       47M     =       10      47      ATTATTCATTAAATAATACTTACCTGTATCTCAAAAACTTCTTGTAT        @CCFFFFFHHHHHJJIJJJJJJJJJJHJJJJJJJJJJJJJJJJJHGI NM:i:0  ms:i:94 AS:i:94nn:i:0  tp:A:P  cm:i:2  s1:i:43 s2:i:43 de:f:0  rl:i:0

Read and reference names were renamed to numbers prior to the mapping step. I forgot to mention that the file is read without any problem when given a SAM as input instead of a BAM.

I have tried to make a smaller bam file so you can test it out but for some reason, it works if the bam file is smaller so I guess it may have something to do with the header but I can't find what. I have uploaded the full bam file (around 16GB) here if you want to test with my file.

zaeleus commented 1 year ago

This is a great example input, thanks! The issue actually stems from how the alignment reader detects the format.

The first block of this file exceeds the buffer size of the inspector, which causes the BGZF decoder to fail. Unfortunately, this error was discarded, and a fallback format was chosen (bgzipped SAM rather than BAM), which in turn opened the wrong reader.

This is fixed in bf4f47d63b8d28d42d2e47113c4a640ad3161741, which swaps the block-based decoder to a streaming one for format detection. It is also now available in noodles-util 0.15.0.