Open d-cameron opened 2 years ago
The short answer is no; this is currently not supported.
Sorry for the delayed response. I originally thought this was doable with the current API and wanted to give an example. Unfortunately, while you can do this for reading and parsing now, there is no way to pre-serialize a record for writing.
Thank you for bringing up this use case. I'll investigate it further.
What's the current state of this? Does the API now support multithreading? Even just multithreaded reading would be immensely useful.
Thanks
For BGZF-compressed formats, you can compose a multithreaded decoder with a format reader, e.g.,
```rust
use std::{fs::File, num::NonZeroUsize, thread};
use noodles::{bam, bgzf};

let worker_count = thread::available_parallelism().unwrap_or(NonZeroUsize::MIN);
let file = File::open(src)?;
let decoder = bgzf::MultithreadedReader::with_worker_count(worker_count, file);
let mut reader = bam::io::Reader::from(decoder);
```
This can also be done with a BGZF encoder and a format writer. I think this covers half of what you originally requested, and it does greatly help compression/decompression performance. Note, though, that the multithreaded reader does not support random access.
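For reference, a minimal sketch of the writer-side composition, assuming your version of noodles provides `bgzf::MultithreadedWriter` and a `From<W>` constructor on `bam::io::Writer`; the output path and the empty header are placeholders:

```rust
use std::{fs::File, num::NonZeroUsize, thread};

use noodles::{bam, bgzf, sam};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let worker_count = thread::available_parallelism().unwrap_or(NonZeroUsize::MIN);

    // compress BGZF blocks on worker threads while records are serialized on this thread
    let file = File::create("out.bam")?;
    let encoder = bgzf::MultithreadedWriter::with_worker_count(worker_count, file);
    let mut writer = bam::io::Writer::from(encoder);

    let header = sam::Header::default();
    writer.write_header(&header)?;

    // ... write records with writer.write_record(&header, &record) ...

    Ok(())
}
```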
But there are still no multithreaded format readers/writers. In a passthrough context, the SAM/BAM readers no longer eagerly decode record fields, so parallel serialization is feasible with the current API, and I wrote an example to demonstrate it. It's a proof of concept and nontrivial, but it shows that something is possible.
main.rs
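Not that attached example, but a rough sketch of the fan-out pattern it relies on: because record fields are decoded lazily, raw records can be read on one thread and handed over a channel to a worker for processing. The file name, channel capacity, and per-record work below are placeholders.

```rust
use std::{fs::File, num::NonZeroUsize, sync::mpsc, thread};

use noodles::{bam, bgzf};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let worker_count = thread::available_parallelism().unwrap_or(NonZeroUsize::MIN);

    let file = File::open("sample.bam")?;
    let decoder = bgzf::MultithreadedReader::with_worker_count(worker_count, file);
    let mut reader = bam::io::Reader::from(decoder);
    let _header = reader.read_header()?;

    // hand raw records to a worker thread; field decoding stays lazy
    let (tx, rx) = mpsc::sync_channel::<bam::Record>(4096);

    let worker = thread::spawn(move || {
        let mut mapped = 0u64;

        for record in rx {
            // placeholder per-record work
            if !record.flags().is_unmapped() {
                mapped += 1;
            }
        }

        mapped
    });

    for result in reader.records() {
        if tx.send(result?).is_err() {
            break; // worker hung up
        }
    }

    drop(tx); // close the channel so the worker can finish

    println!("{} mapped records", worker.join().expect("worker panicked"));

    Ok(())
}
```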
Is it possible to set the compression level when using `bgzf::MultithreadedWriter`?
I'm attempting to write a small utility that processes BAM records, and I can't figure out the async API. I've been adapting the noodles-bam/examples/bam_reheader_async.rs example and running into problems isolating the async code.
A common design pattern for bioinformatics tools is to iterate over one or more files in genomic coordinate order, process the records, then write (typically a subset of) the records to new files. For many of these programs the cost of the processing is small, and the bottleneck is I/O and record parsing. For these sorts of programs, HTSJDK/htslib expose an API that allows offloading the I/O and record parsing to background threads.
Does noodles support a synchronous BAM read/write API in which the compression/decompression and serialisation/parsing are off-loaded to background threads? Something along the lines of `builder().set_worker_threads(8).set_buffer_size(8 * 65536).build();`?
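There is no such builder today, but the composition shown earlier in the thread can be wrapped in a small user-side helper to get part of the way there. A minimal sketch, with a hypothetical helper name; it only offloads BGZF decompression, not record parsing:

```rust
use std::{fs::File, io, num::NonZeroUsize, path::Path};

use noodles::{bam, bgzf};

/// Hypothetical helper: opens a BAM reader whose BGZF decompression runs on
/// `worker_count` background threads. Record parsing still happens on the
/// calling thread.
fn open_bam_multithreaded<P>(
    src: P,
    worker_count: NonZeroUsize,
) -> io::Result<bam::io::Reader<bgzf::MultithreadedReader<File>>>
where
    P: AsRef<Path>,
{
    let file = File::open(src)?;
    let decoder = bgzf::MultithreadedReader::with_worker_count(worker_count, file);
    Ok(bam::io::Reader::from(decoder))
}
```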