yukiteruono / pbsim3

PBSIM3: a simulator for all types of PacBio and ONT long reads
GNU General Public License v2.0

Benchmarking running time for simulation #3

Closed casparbein closed 1 year ago

casparbein commented 1 year ago

Hi there,

I am running pbsim3 mainly because it offers the option to simulate CCS reads. In a recent use case, I wanted to simulate 10-pass CCS reads at an overall 30x coverage of a genome of around 3.5 GB. As pbsim has no option to allocate memory, I was not sure how to optimize the runtime, and in the end the job took 48 hours on a computing cluster, using 15 cores with around 5 GB RAM each. Is there a way to speed this up? I plan to use pbsim more often in the future, but these runtimes prevent it from being scaled up much. My command looks as follows:

pbsim --prefix hifi_sim --strategy wgs --genome genome.fa --depth 30 --method errhmm --errhmm /pbsim3/data/ERRHMM-SEQUEL.model --length-mean 10000 --pass-num 10

Thanks in advance and thanks for developing this tool. Cheers, Bernhard

yukiteruono commented 1 year ago

Thank you for using PBSIM. PBSIM cannot do parallel processing, and the runtime stays the same even if you allocate more memory. An easy way to shorten the runtime is to run the command 10 times in parallel with --depth changed to 3. Don't forget to change the --seed value for each run; if you don't, 10 identical read sets will be created! The runtime can then be reduced to about 1/10 of the original. The runtime is roughly proportional to the coverage depth and the size of the reference genome.
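
The split suggested above could be scripted roughly as follows. This is only a sketch, not an official PBSIM workflow: the function name `run_pbsim_chunks`, the `DRY_RUN` switch, the `hifi_sim_partN` prefixes, and the choice of seeds 1..N are all assumptions made for illustration. The pbsim options themselves are copied from the command in the original question, plus the --seed option mentioned in the reply.

```shell
# Sketch: replace one high-depth pbsim run with N lower-depth runs in parallel.
# Hypothetical helper; names, prefixes and seed values are illustrative only.
# Usage: run_pbsim_chunks <genome.fa> <total_depth> <chunks>
# With DRY_RUN=1 the commands are printed instead of being executed.
run_pbsim_chunks() (
    # The function body is a subshell, so these variables do not leak out.
    genome=$1
    total_depth=$2
    chunks=$3
    # Integer division: pick a chunk count that divides the depth evenly
    # (30 / 10 = 3), otherwise the combined coverage falls short.
    depth=$(( total_depth / chunks ))

    i=1
    while [ "$i" -le "$chunks" ]; do
        # Every run needs its own seed and prefix; identical seeds would
        # produce identical read sets.
        set -- pbsim --prefix "hifi_sim_part${i}" --strategy wgs \
            --genome "$genome" --depth "$depth" \
            --method errhmm --errhmm /pbsim3/data/ERRHMM-SEQUEL.model \
            --length-mean 10000 --pass-num 10 --seed "$i"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$*"
        else
            "$@" &    # launch in the background
        fi
        i=$(( i + 1 ))
    done
    wait    # block until all background runs have finished
)
```

Calling `DRY_RUN=1 run_pbsim_chunks genome.fa 30 10` prints the ten commands for inspection, and dropping `DRY_RUN=1` launches them. On a cluster, each printed command could instead be submitted as its own single-core job.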

casparbein commented 1 year ago

Thanks for your reply. This suggestion works well. Another idea I had is to run pbsim separately on each chromosome of the focal genome, so that each run takes only a fraction of the time. Either way, both approaches reduce the running time to about 2-3 hours.
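
For the per-chromosome idea, the reference first has to be split into one FASTA file per sequence. The following is a minimal sketch of one way to do that with awk. The helper name `split_fasta` and the output naming are assumptions, and it expects a well-formed multi-FASTA whose header IDs are unique and contain no `/`.

```shell
# Sketch: write each sequence of a multi-FASTA file to its own file.
# Hypothetical helper, not part of pbsim.
# Usage: split_fasta <genome.fa> <output_dir>
split_fasta() (
    fasta=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^>/ {
            if (out) close(out)      # keep only one output file open at a time
            name = substr($1, 2)     # sequence ID: header up to first whitespace, minus ">"
            out = dir "/" name ".fa"
        }
        { print > out }              # header and sequence lines go to the current file
    ' "$fasta"
)
```

Each resulting file could then be passed to --genome in a separate pbsim run with its own --prefix and --seed. Presumably --depth would stay at 30 in that case, since each run covers only its own chromosome.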