wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0

[Enhancement] Improving Multi-sample Parallelisation Strategy #193

Open · erfanshekarriz opened this issue 6 months ago

erfanshekarriz commented 6 months ago

Hi!

I wanted to point out that running `coverm contig --coupled <sample1_R1.fastq.gz> <sample1_R2.fastq.gz> <sample2_R1.fastq.gz> <sample2_R2.fastq.gz> ...` with 40+ samples doesn't scale well compared to running each sample individually and merging the output files afterwards, even though the per-sample results are the same.

With my specific dataset it takes 14+ hours to run all samples in one go, versus only ~1 hour if I run each sample separately and then use a custom Python script to merge the results together. I'm not sure whether the same holds for `coverm genome`, since I haven't tested it myself.
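For what it's worth, the merge step is simple. This isn't my exact script, just a minimal sketch assuming each per-sample run was saved with `-o <sample>.tsv` and that the contig name is in the first column:

```python
# merge_coverm.py -- illustrative sketch only, not the script referred to above.
# Assumes each per-sample `coverm contig` output is a TSV whose first column
# is the contig name; the remaining columns are whatever methods were requested.
import sys
from functools import reduce

import pandas as pd


def merge_tables(paths):
    """Outer-join per-sample CoverM tables on the contig column."""
    tables = []
    for path in paths:
        df = pd.read_csv(path, sep="\t")
        df = df.rename(columns={df.columns[0]: "Contig"})
        tables.append(df)
    return reduce(lambda a, b: a.merge(b, on="Contig", how="outer"), tables)


if __name__ == "__main__":
    merge_tables(sys.argv[1:]).to_csv(sys.stdout, sep="\t", index=False)
```

Called as something like `python merge_coverm.py sample*.tsv > all_samples.tsv`.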

It would be worth looking into making the parallelisation as fast as possible!

For reference, I am running on a 64-core machine with 72 GB of memory.

Hope this helps,

Erfan

wwood commented 5 months ago

Hi,

Yes, totally agree.

We (particularly @rhysnewell) have taken some stabs at this in the past, but haven't quite gotten it to a usable point yet. It is also annoying from a UI perspective, because there are then two parameters: how many mappings to run at once, and how many threads to give each one. Running more mappings simultaneously means higher RAM usage (and tmp disk usage), but a faster overall run.
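In the meantime that trade-off can be driven from outside CoverM. A rough sketch of the idea (this is not what CoverM does internally; the sample names, the reference path and the two knob values are placeholders, and the flags should be checked against `coverm contig --help` for your version):

```python
# Illustrative driver: run several single-sample mappings concurrently.
# The two knobs discussed above are N_CONCURRENT (mappings in flight) and
# THREADS_EACH (threads per coverm invocation); their product should not
# exceed the available cores, and RAM/tmp usage grows with N_CONCURRENT.
import subprocess
from concurrent.futures import ThreadPoolExecutor

N_CONCURRENT = 4
THREADS_EACH = 16

SAMPLES = [
    ("sample1", "sample1_R1.fastq.gz", "sample1_R2.fastq.gz"),
    ("sample2", "sample2_R1.fastq.gz", "sample2_R2.fastq.gz"),
    # ...
]


def run_one(sample, r1, r2):
    subprocess.run(
        [
            "coverm", "contig",
            "--coupled", r1, r2,
            "--reference", "assembly.fna",   # placeholder reference
            "--threads", str(THREADS_EACH),
            "--output-file", f"{sample}.tsv",
        ],
        check=True,
    )


with ThreadPoolExecutor(max_workers=N_CONCURRENT) as pool:
    for future in [pool.submit(run_one, *s) for s in SAMPLES]:
        future.result()  # re-raise any mapping failure
```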

For clarity, the main bottleneck you are running into is the mapping step, correct? Maybe updating to 0.7.0 would help, since the faster strobealign mapper is used there.

erfanshekarriz commented 5 months ago

Yes. For the time being I am simply running each sample in parallel on a cluster using Snakemake, so that the memory allocation is adjusted with a formula based on the number of threads and the input size. I then merge all the results together with a Python script.

Works just fine.
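In case it is useful to anyone else, the kind of formula I mean is just a small Python callable handed to the rule's `resources:` in the Snakefile. The sketch below uses invented coefficients (not my actual numbers) and assumes a recent Snakemake that lets resource callables receive `wildcards`, `input`, `threads` and `attempt`:

```python
# Sketch of a per-job memory formula -- coefficients are made up for
# illustration and need tuning per dataset. Intended to be referenced as
# `resources: mem_mb=coverm_mem_mb` inside the coverm mapping rule.
import os


def coverm_mem_mb(wildcards, input, threads, attempt):
    """Estimate memory (MB) from input size and thread count, growing on retries."""
    input_gb = sum(os.path.getsize(f) for f in input) / 1e9
    base_mb = 4000          # fixed overhead per job
    per_thread_mb = 250     # head-room per mapping thread
    per_gb_mb = 1000        # scales with (gzipped) read volume
    return int(attempt * (base_mb + threads * per_thread_mb + input_gb * per_gb_mb))
```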

Would be worth looking into for the future!