zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
477 stars 53 forks source link

BGZF invalid header really strict #237

Closed andypexai closed 5 months ago

andypexai commented 5 months ago

My noodles program processing some .fastq.gz files would work really well if I could bypass the error if this isn't valid. Could it be possible to make it a warning or an optionally-bypassed requirement? My files are generated with Illumina's "bcl-convert" and have the right gzip magic numbers and the CM value is also 8, but the other numbers are off. I get a hard error here, but HTSlib reads them just fine. Even bgzip --test my.fastq.gz runs without a warning or anyting.

zaeleus commented 5 months ago

It sounds like your files are gzip-compressed, not bgzip-compressed. Can you share the first 18 bytes of an input, e.g., head -c 18 <src> | xxd?

andypexai commented 5 months ago
00000000: 1f8b 0800 0000 0000 04ff b4bd 498e 2b49  ............I.+I
00000010: 1260                                     .`

After bgzip -d and bgzip

00000000: 1f8b 0804 0000 0000 00ff 0600 4243 0200  ............BC..
00000010: 5700                                     W.

I think it'd be strange if they're regular gzipped. In the C++ predecessor program we have, using hts_set_threads() had a dramatic benefit for us. I think they're are a sorta botched Illumina bgzipping that still is functional. I wish I had more details about what bcl-convert 4.0.3 is attempting to do. I think the previous bcl2fastq also had issues. I haven't seen the bgzip people discuss this particular issue but it's not making errors there, just here.

Update: Maybe you're right, I'm not as confident anymore that it's an attempted bgzipping. Maybe it is plain-old gzip. If that's the case, can you offer advice how you'd change my reader setup? Is there a Gzip-style AsyncReader somewhere I can slot in there in place of the bgzf one?

    let mut reader = File::open(input_file)
        .await
        .map(BufReader::new)
        .map(bgzf::AsyncReader::new)
        .map(fastq::AsyncReader::new)?;

    let mut records = reader.records();
    // loop through records
    while let Some(mut record) = records.try_next().await? {
        // ...
    }
ghuls commented 5 months ago

You can run file or htsfile from HTSlib on your file.

# bgzipped files have the "extra field" section compared with most normal gzipped files.
$ file test.fastq.gz
test.fastq.gz: gzip compressed data, extra field, original size 23124

$ htsfile test.fastq.gz
test.fastq.gz:   FASTQ BGZF-compressed sequence data
andypexai commented 5 months ago

OK that's pretty definitive then. It just says "gzipped-compressed". I suppose the C++ program I was talking about the gains were for compression-only and I had this confused. Thanks.

zaeleus commented 5 months ago

Thanks for the output, @andypexai. It does show that your inputs are gzip-compressed, not bgzip-compressed. The fourth byte FLG.FEXTRA is not set, meaning the frame metadata is missing.

I also confirmed that bcl-convert 4.0.3 does not write bgzip-compressed output.[^1]

$ bcl-convert --version
bcl-convert Version 00.000.000.4.0.3
Copyright (c) 2014-2018 Illumina, Inc.

$ htsfile results/*.fastq.gz
results/Undetermined_S0_L001_R1_001.fastq.gz:   FASTQ gzip-compressed sequence data
results/Undetermined_S0_L001_R2_001.fastq.gz:   FASTQ gzip-compressed sequence data
results/iseq-DI_S1_L001_R1_001.fastq.gz:    FASTQ gzip-compressed sequence data
results/iseq-DI_S1_L001_R2_001.fastq.gz:    FASTQ gzip-compressed sequence data

Instead of using noodles-bgzf, use a general-purpose decoder like async-compression.

// cargo add async-compression flate2 --features async-compression/gzip,flate2/zlib-ng

use tokio::io::{AsyncBufRead, BufReader};
use async_compression::futures::bufread::GzipDecoder;

let mut reader = File::open(input_file)
    .await
    .map(BufReader::new)
    .map(GzipDecoder::new)
    .map(BufReader::new)
    .map(fastq::AsyncReader::new)?;

Note that async-compression uses flate2. I would recommend linking to zlib-ng for better performance,

[^1]: "Generating FASTQs with BCL Convert (Illumina Software)". Accessed 2024-02-28.

andypexai commented 5 months ago

@zaeleus That's actually exactly what I tried, but now it's a

Error: Kind(UnexpectedEof)

and now I'm hunting down something else that probably isn't related to noodles.