wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0
297 stars 30 forks source link

multi-threading CoverM coverage estimation via data parallelism library rayon #106

Open jianshu93 opened 2 years ago

jianshu93 commented 2 years ago

Dear Ben,

I saw that in genome.rs line 147:

per_genome_coverage_estimators[genome_index].iter_mut()

we can use rayon::preclude::*; in Cargo.toml and replace .iter_mut() with par_inter_mut(), everything else should be exactly the same including map et.al. and we will have parallel version of CoverM. Since rayon takes care everything for us, the number of threads that should be spawned depends on the data size we have. It would be a nice improvement and could accelerate coverage calculation by a lot. I notice that with a 6.8 G sorted bam file and 200 reference genomes, it will take several minutes. With rayon parallelism, I believe it should be done within a minute. You may need to check wherever you use iter_mut() and replace it with par_iter_mut(). I would be happy to test if you feel like it worth the efforts.

Many,

Thanks,

Jianshu

jianshu93 commented 2 years ago

Or at lease the bam file read and iterating bam files can be much faster. I noticed that there are some iterator implemented specifically for coverage taker.

Thanks,

Jianshu

wwood commented 2 years ago

Hi @jianshu93 right, I agree. I had considered this, but I had thought that the opening of the BAM file could not be done multiple times? This is especially true if the BAM is streamed in, but also think there's an internal pointer in HTSLIB that cannot be shared between threads.

Do you have an implementation that shows me to be wrong?

jianshu93 commented 2 years ago

Hello Ben,

I check the rust-htslib and found that yes, reading bam file is sequential and that internal point cannot be share between threads because of the C language it self. I think if we have a rust rewritten of htslib, we could do that. The bam file reading is the limiting step for now right? The coverage calculation is actually already very fast right.

Another thing related to gene coverage on a contig. Is that possible to add an option to also calculate each gene coverage by providing a gff file of contigs. Mapping reads to genes and then calculate coverage is dangerous because of the edge effects generated by filtering by reads alignment ratio or reads mappers (bowtie2 end to end will miss lot of reads mapped to both ends of genes).

Thanks,

Jianshu

rhysnewell commented 2 years ago

Hi there,

Yep, I've run into this same issue with the rust-htslib. Luckily someone has already written a BAM parser completely in Rust! https://docs.rs/bam/latest/bam/ However, using additional multithreading might still be non-trivial with that crate. It might be worth experimenting with in future

Gene coverage would also be pretty neat, I think we've talked about it in the past but currently we're pretty swamped. Ben does have a previous tool that does do exactly what you ask though, dirseq (https://github.com/wwood/dirseq) it works quite well once installed.

Cheers, Rhys

wwood commented 2 years ago

Thanks both, yeh gene-wise abundance is tracked in https://github.com/wwood/CoverM/issues/16 but there hasn't been much movement there for a while, it seems.

@rhysnewell do you have some sense of how fast the pure Rust library is compared to rust-htslib, multithreading aside?

rhysnewell commented 2 years ago

I've no idea, I've asked the creator though so will let you know what their repsonse is.

jianshu93 commented 2 years ago

Hi All,

Thanks for letting me know dirseq @rhysnewell, I have being using it for a while for checking DNA contamination for RNA seq data (and Compare it with our own implementation, quite consistent). I was asking whether CoverM can do this because it is already a pipeline from mapping to coverage calculation. It could be even more convenient if there is a -gff option in both coverm genome and contig and output for gene coverage.

I checked the Bam and it seems you could use more threads?:

extern crate bam;

fn main() { // Read "in.bam" using 4 additional threads (5 total). let reader = bam::BamReader::from_path("in.bam", 4).unwrap(); for record in reader { let record = record.unwrap(); // Do something. } }

I do not understand why 5 in total...

Thanks,

Jianshu