yukiteruono / pbsim3

PBSIM3: a simulator for all types of PacBio and ONT long reads
GNU General Public License v2.0

HiFi Read Generation is Prohibitively Slow #26

Closed jwalewski closed 5 months ago

jwalewski commented 5 months ago

Hello again,

I am asking about HiFi read generation again because generating reads from the D. rerio genome (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000002035.6/) takes days, if it works at all (sometimes my system crashes, and unfortunately it is hard to capture the error even when redirecting shell output to log files).

The parameters I am using are:

$PBSIM3_PATH --strategy wgs --genome $Refrence_Path$Refrence_Name --depth $Cov_min --errhmm /mnt/c/ultimate/Other_Software/pbsim3-master/pbsim3-master/data/ERRHMM-SEQUEL.model --method errhmm --prefix $INTERMEDIATE_DIR$OUTPUTNAME --pass-num 20 --length-mean 15000 --length-sd 5000

The coverage varies from 5X to 25X and the number of passes is 20 (this is pretty standard, right?)

If so, why is this taking so long? I can dedicate up to 16 cores @ 4.5 GHz and 128 GB of RAM to read simulation, all for a genome of around 1 Gbp.

Any help would be greatly appreciated!

yukiteruono commented 5 months ago

Thank you for using PBSIM3. There is no problem with your PBSIM3 parameter settings. Your computer resources are sufficient for running PBSIM3, but may not be sufficient for running ccs.

Regarding calculation time, run each chromosome in parallel if possible. If you simulate from D. rerio with number of passes = 20 and coverage = 25, the SAM format data will be about 5 TB. The standard number of passes for real data is 20 or more, but for PBSIM3 simulation, number of passes = 10 is sufficient.

Although it has nothing to do with calculation time, please remove unknown bases (N) from the reference genome. Real HiFi reads do not include N, but reads generated by PBSIM3 do. N can have a negative impact on downstream analyses.
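The two suggestions above (run each chromosome separately, and remove N bases) could be sketched in Python roughly as follows. This is a minimal sketch, not part of PBSIM3: the function names `split_fasta_without_n`, `build_pbsim_command`, and `run_in_parallel` are hypothetical, "removing N" is assumed to mean deleting the characters outright (which joins the flanking sequence), and the command reuses only the options from the original post, with the number of passes defaulting to 10.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_fasta_without_n(genome_path, out_dir):
    """Write one FASTA file per record, dropping N/n bases from each sequence.

    Returns the per-record FASTA paths in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    with open(genome_path) as src:
        for line in src:
            if line.startswith(">"):
                if handle is not None:
                    handle.close()
                # Use the first word of the header as a file-safe record name.
                fields = line[1:].split()
                raw_name = fields[0] if fields else f"record{len(paths) + 1}"
                name = re.sub(r"[^A-Za-z0-9_.-]", "_", raw_name)
                path = out_dir / f"{name}.fasta"
                handle = open(path, "w")
                handle.write(line)
                paths.append(path)
            elif handle is not None:
                # Assumption: "remove N" means deleting the characters outright.
                seq = line.strip().replace("N", "").replace("n", "")
                if seq:
                    handle.write(seq + "\n")
    if handle is not None:
        handle.close()
    return paths


def build_pbsim_command(pbsim_path, fasta_path, prefix, depth, errhmm_model,
                        pass_num=10):
    """Build an argument list using only the options from the original post."""
    return [
        str(pbsim_path),
        "--strategy", "wgs",
        "--genome", str(fasta_path),
        "--depth", str(depth),
        "--method", "errhmm",
        "--errhmm", str(errhmm_model),
        "--prefix", str(prefix),
        "--pass-num", str(pass_num),
        "--length-mean", "15000",
        "--length-sd", "5000",
    ]


def run_in_parallel(commands, workers=4):
    """Run each command as its own process, at most `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd, check=True),
                             commands))
```

A caller would split the reference once, build one command per resulting file with a distinct prefix, and pass the list to `run_in_parallel`. The worker count of 4 is arbitrary and would need to be matched to the available memory and disk space.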

jwalewski commented 5 months ago

Thank you for your quick feedback.

That makes sense: PBSIM3 does an excellent job with the nanopore reads, so ccs could be the culprit. For the number of passes, is 10 truly equivalent to 20? Would the simulated error rate be (effectively) the same, and would there be no impact on downstream analysis? Also, that's very interesting about there being no "N" nucleotides in HiFi reads; I didn't know that! I will remove them.

yukiteruono commented 5 months ago

The ccs documentation states that at number of passes = 10, the Phred quality score reaches 30 (accuracy = 99.9%) (https://ccs.how/faq/accuracy-vs-passes.html). In the PBSIM3 simulation, accuracy = 99.7% at number of passes = 10 (Table S8 of the PBSIM3 paper). Accuracy is higher at number of passes = 20, but 10 is sufficient.
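To compare the two figures above on the same scale, the standard Phred relation Q = -10 * log10(1 - accuracy) can be applied in both directions. The helper names below are hypothetical and the snippet is only a small illustration of that conversion; it does not reproduce anything from ccs or PBSIM3.

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred quality score to per-base accuracy (0..1)."""
    return 1.0 - 10 ** (-q / 10.0)


def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (0..1, exclusive of 1) to a Phred score."""
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    print(f"Q30 -> accuracy {phred_to_accuracy(30):.4f}")
    print(f"99.7% accuracy -> Q{accuracy_to_phred(0.997):.1f}")
```

Under this relation, Q30 corresponds to 99.9% accuracy, and the 99.7% reported for PBSIM3 at 10 passes corresponds to roughly Q25.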

jwalewski commented 5 months ago

Understood, thank you! I had seen that graph before but did not want to make any assumptions about PBSIM3's error rate at a different number of passes than the default/suggested value.

Since I think this concludes the part of the discussion that concerns your program, feel free to close the issue. Thanks once again for your help!