yjx1217 / simuG

simuG: a general-purpose genome simulator
MIT License
87 stars 12 forks source link

snp/indel simulation is remarkably slow #18

Closed pdimens closed 1 month ago

pdimens commented 1 month ago

Compared to the other variant types, SNP/indel variant calling is extremely slow. Is there a reason for this bottleneck?

yjx1217 commented 1 month ago

Hello, thanks for trying out simuG.

Yes, your observation is correct, which is mainly due to the cross check of every newly simulated SNP/INDEL variant against all previous ones that got introduced to avoid them being introduced into overlapped genomic regions.

Best, Jia-Xing

pdimens commented 1 month ago

It's somewhat prohibitively slow to simulate 8M SNPs onto a genome. The current runtime is 1 day and 17 hours and it's still going.

yjx1217 commented 1 month ago

hmm, I don't expect simuG to be that slow ... Could you provide the parameters that you have specified for your run? Also, if possible, a testing dataset with the full command lines that you used when running simuG will be great for me to looking into the possible cause and workaround if possible. Thanks in advance!

Best, Jia-Xing

pdimens commented 1 month ago

The command is

perl simuG -refseq geno.fasta -prefix prefix  -snp_count 1000000 > simuG.log
yjx1217 commented 1 month ago

Hi @pdimens

See the my simulation for introducing 1 million SNV to the human reference genome (GRCh38) with the following command:

refseq="Homo_sapiens.GRCh38.tidy.fa"
prefix="TEST"

time perl ./../simuG.pl -refseq $refseq -prefix $prefix  -snp_count 1000000 > simuG.log

This only cost ~2.5 minutes.

real    2m33.296s
user    2m19.688s
sys     0m10.401s

log file:

[Tue Sep 24 21:41:39 2024]
Starting simuG ..

[Tue Sep 24 21:41:39 2024]
Check specified options ..
Running simuG for SNP/INDEL simulation >>
Ignore all options for CNV/inversion/translocation simulation.

This simulation use the random seed: 1836364910

The option snp_count has been specified: snp_count = 1000000
The option titv_ratio has been specified: titv_ratio = 0.5

[Tue Sep 24 21:42:22 2024] Introducing random SNPs based on the following parameters:
 > snp_count = 1000000
 > titv_ratio = 0.5

[Tue Sep 24 21:43:49 2024]
Simulation completed! :) 

[Tue Sep 24 21:43:49 2024]
Generating output files .. 

Generating the correspondence map for genomic variants introduced during simulation:
TEST.refseq2simseq.map.txt

Generating reference-based vcf file for genomic variants introduced during simulation:
TEST.refseq2simseq.SNP.vcf
[Tue Sep 24 21:44:07 2024]
Done! :) 

What is the size of your reference genome? Could it be the case that your reference genome size is very similar to 8M? In that case, simuG will gradually get struggled to find available site for incorporating new variants as the simulation progresses.

Best, Jia-Xing

pdimens commented 1 month ago

My reference genomes are usually 400Mb or greater, up to 4.6Gb

yjx1217 commented 1 month ago

In that case, it is truly strange. Maybe double check if they are truly fasta files? And if there are space separated strings in the header lines begin with ">"?

Best, Jia-Xing

pdimens commented 1 month ago

They are true fasta files and the headers don't have a space after >, but may have a space after the sequence ID, per the standard fasta format. For example, the header may be:

>contig1
ATACATA....

or

>contig1 additional comments here
ATACCACAAGA...
yjx1217 commented 1 month ago

By space-separated strings, I mean this case that you mentioned: >contig1 additional comments here

But even in this case, simuG should still work.

Could you provide a sample genome (that simuG took extremely long to run on your side) for me to test?

Best, Jia-Xing

pdimens commented 1 month ago

I used this genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_018492685.1/ Also this: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001215.4/

yjx1217 commented 1 month ago

Hi @pdimens ,

Here is my testing results with GCF_018492685.1, which took ~50sec to finish.

refseq="GCF_018492685.1_fAloSap1.pri_genomic.fna.gz"
prefix="GCF_018492685.1_TEST"
time perl ./simuG/simuG.pl -refseq $refseq -prefix $prefix  -snp_count 1000000 > $prefix.simuG.log
real    0m49.605s
user    0m51.517s
sys     0m3.077s
[Wed Sep 25 09:46:30 2024]
Starting simuG ..

[Wed Sep 25 09:46:30 2024]
Check specified options ..
Running simuG for SNP/INDEL simulation >>
Ignore all options for CNV/inversion/translocation simulation.

This simulation use the random seed: 637517621

The option snp_count has been specified: snp_count = 1000000
The option titv_ratio has been specified: titv_ratio = 0.5

[Wed Sep 25 09:46:38 2024] Introducing random SNPs based on the following parameters:
> snp_count = 1000000
> titv_ratio = 0.5

[Wed Sep 25 09:47:05 2024]
Simulation completed! :) 

[Wed Sep 25 09:47:05 2024]
Generating output files .. 

Generating the correspondance map for genomic variants introduced during simulation:
GCF_018492685.1_TEST.refseq2simseq.map.txt

Generating reference-based vcf file for genomic variants introduced during simulation:
GCF_018492685.1_TEST.refseq2simseq.SNP.vcf
[Wed Sep 25 09:47:16 2024]
Done! :) 

Interestingly, my simulation with the much smaller Drosophila genome went slower (~9min), but still not as slow as you have encountered:

refseq="GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz"
prefix="GCF_000001215.4_TEST"
time perl ./simuG/simuG.pl -refseq $refseq -prefix $prefix  -snp_count 1000000 > $prefix.simuG.log
real    9m11.650s
user    9m9.918s
sys     0m2.436s
[Wed Sep 25 09:52:14 2024]
Starting simuG ..

[Wed Sep 25 09:52:14 2024]
Check specified options ..
Running simuG for SNP/INDEL simulation >>
Ignore all options for CNV/inversion/translocation simulation.

This simulation use the random seed: 1538934389

The option snp_count has been specified: snp_count = 1000000
The option titv_ratio has been specified: titv_ratio = 0.5

[Wed Sep 25 09:52:15 2024] Introducing random SNPs based on the following parameters:
> snp_count = 1000000
> titv_ratio = 0.5

[Wed Sep 25 10:01:11 2024]
Simulation completed! :) 

[Wed Sep 25 10:01:11 2024]
Generating output files .. 

Generating the correspondance map for genomic variants introduced during simulation:
GCF_000001215.4_TEST.refseq2simseq.map.txt

Generating reference-based vcf file for genomic variants introduced during simulation:
GCF_000001215.4_TEST.refseq2simseq.SNP.vcf
[Wed Sep 25 10:01:22 2024]
Done! :) 

My testing environment is a computing server equipped with CPU: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz RAM: 256 MB

simuG took very minimal of computing resources though.

Maybe have a try with another computing environment to see if things will change?

Best, Jia-Xing

pdimens commented 1 month ago

I did some further testing and confirmed that if you have a lot of variants relative to the genome size, simuG siezes up. For example, adding 1mil or 500k snps to a 148mb genome is near impossible. I had the program running over the last two weeks and it never finished that.

yjx1217 commented 1 month ago

Hi @pdimens ,

Yes, the number of variants relative to genome size does correlate with running time but again I would never expect it could be that slow. My guess is the running time issue that you encountered is related to the specific computing environment or how Perl was configured on top of it. What is the CPU, RAM, and OS configuration for your computing environment?

Best, Jia-Xing

pdimens commented 1 month ago

It was run on a cluster computing environment. I don't have the specs on hand right now, but it's CentOS, 42 Intel Xeon cores, 512GB RAM.

yjx1217 commented 1 month ago

Maybe have a try with your own laptop or desktop? simuG can run comfortably on laptops. The substantial running time on your cluster is really abnormal.

Best, Jia-Xing