Closed andypexai closed 5 months ago
It sounds like your files are gzip-compressed, not bgzip-compressed. Can you share the first 18 bytes of an input, e.g., head -c 18 <src> | xxd
?
00000000: 1f8b 0800 0000 0000 04ff b4bd 498e 2b49 ............I.+I
00000010: 1260 .`
After bgzip -d
and bgzip
00000000: 1f8b 0804 0000 0000 00ff 0600 4243 0200 ............BC..
00000010: 5700 W.
I think it'd be strange if they're regular gzipped. In the C++ predecessor program we have, using hts_set_threads() had a dramatic benefit for us. I think they're are a sorta botched Illumina bgzipping that still is functional. I wish I had more details about what bcl-convert 4.0.3 is attempting to do. I think the previous bcl2fastq also had issues. I haven't seen the bgzip people discuss this particular issue but it's not making errors there, just here.
Update: Maybe you're right, I'm not as confident anymore that it's an attempted bgzipping. Maybe it is plain-old gzip. If that's the case, can you offer advice how you'd change my reader setup? Is there a Gzip-style AsyncReader somewhere I can slot in there in place of the bgzf one?
let mut reader = File::open(input_file)
.await
.map(BufReader::new)
.map(bgzf::AsyncReader::new)
.map(fastq::AsyncReader::new)?;
let mut records = reader.records();
// loop through records
while let Some(mut record) = records.try_next().await? {
// ...
}
You can run file
or htsfile
from HTSlib on your file.
# bgzipped files have the "extra field" section compared with most normal gzipped files.
$ file test.fastq.gz
test.fastq.gz: gzip compressed data, extra field, original size 23124
$ htsfile test.fastq.gz
test.fastq.gz: FASTQ BGZF-compressed sequence data
OK that's pretty definitive then. It just says "gzipped-compressed". I suppose the C++ program I was talking about the gains were for compression-only and I had this confused. Thanks.
Thanks for the output, @andypexai. It does show that your inputs are gzip-compressed, not bgzip-compressed. The fourth byte FLG.FEXTRA
is not set, meaning the frame metadata is missing.
I also confirmed that bcl-convert 4.0.3 does not write bgzip-compressed output.[^1]
$ bcl-convert --version
bcl-convert Version 00.000.000.4.0.3
Copyright (c) 2014-2018 Illumina, Inc.
$ htsfile results/*.fastq.gz
results/Undetermined_S0_L001_R1_001.fastq.gz: FASTQ gzip-compressed sequence data
results/Undetermined_S0_L001_R2_001.fastq.gz: FASTQ gzip-compressed sequence data
results/iseq-DI_S1_L001_R1_001.fastq.gz: FASTQ gzip-compressed sequence data
results/iseq-DI_S1_L001_R2_001.fastq.gz: FASTQ gzip-compressed sequence data
Instead of using noodles-bgzf, use a general-purpose decoder like async-compression.
// cargo add async-compression flate2 --features async-compression/gzip,flate2/zlib-ng
use tokio::io::{AsyncBufRead, BufReader};
use async_compression::futures::bufread::GzipDecoder;
let mut reader = File::open(input_file)
.await
.map(BufReader::new)
.map(GzipDecoder::new)
.map(BufReader::new)
.map(fastq::AsyncReader::new)?;
Note that async-compression uses flate2. I would recommend linking to zlib-ng for better performance,
[^1]: "Generating FASTQs with BCL Convert (Illumina Software)". Accessed 2024-02-28.
@zaeleus That's actually exactly what I tried, but now it's a
Error: Kind(UnexpectedEof)
and now I'm hunting down something else that probably isn't related to noodles.
My noodles program processing some
.fastq.gz
files would work really well if I could bypass the error if this isn't valid. Could it be possible to make it a warning or an optionally-bypassed requirement? My files are generated with Illumina's "bcl-convert" and have the right gzip magic numbers and the CM value is also 8, but the other numbers are off. I get a hard error here, but HTSlib reads them just fine. Evenbgzip --test my.fastq.gz
runs without a warning or anyting.