Closed pdimens closed 1 month ago
Hello, thanks for trying out simuG.
Yes, your observation is correct, which is mainly due to the cross check of every newly simulated SNP/INDEL variant against all previous ones that got introduced to avoid them being introduced into overlapped genomic regions.
Best, Jia-Xing
It's somewhat prohibitively slow to simulate 8M SNPs onto a genome. The current runtime is 1 day and 17 hours and it's still going.
hmm, I don't expect simuG to be that slow ... Could you provide the parameters that you have specified for your run? Also, if possible, a testing dataset with the full command lines that you used when running simuG will be great for me to looking into the possible cause and workaround if possible. Thanks in advance!
Best, Jia-Xing
The command is
perl simuG -refseq geno.fasta -prefix prefix -snp_count 1000000 > simuG.log
Hi @pdimens
See the my simulation for introducing 1 million SNV to the human reference genome (GRCh38) with the following command:
refseq="Homo_sapiens.GRCh38.tidy.fa"
prefix="TEST"
time perl ./../simuG.pl -refseq $refseq -prefix $prefix -snp_count 1000000 > simuG.log
This only cost ~2.5 minutes.
real 2m33.296s
user 2m19.688s
sys 0m10.401s
log file:
[Tue Sep 24 21:41:39 2024]
Starting simuG ..
[Tue Sep 24 21:41:39 2024]
Check specified options ..
Running simuG for SNP/INDEL simulation >>
Ignore all options for CNV/inversion/translocation simulation.
This simulation use the random seed: 1836364910
The option snp_count has been specified: snp_count = 1000000
The option titv_ratio has been specified: titv_ratio = 0.5
[Tue Sep 24 21:42:22 2024] Introducing random SNPs based on the following parameters:
> snp_count = 1000000
> titv_ratio = 0.5
[Tue Sep 24 21:43:49 2024]
Simulation completed! :)
[Tue Sep 24 21:43:49 2024]
Generating output files ..
Generating the correspondence map for genomic variants introduced during simulation:
TEST.refseq2simseq.map.txt
Generating reference-based vcf file for genomic variants introduced during simulation:
TEST.refseq2simseq.SNP.vcf
[Tue Sep 24 21:44:07 2024]
Done! :)
What is the size of your reference genome? Could it be the case that your reference genome size is very similar to 8M? In that case, simuG will gradually get struggled to find available site for incorporating new variants as the simulation progresses.
Best, Jia-Xing
My reference genomes are usually 400Mb or greater, up to 4.6Gb
In that case, it is truly strange. Maybe double check if they are truly fasta files? And if there are space separated strings in the header lines begin with ">"?
Best, Jia-Xing
They are true fasta files and the headers don't have a space after >
, but may have a space after the sequence ID, per the standard fasta format. For example, the header may be:
>contig1
ATACATA....
or
>contig1 additional comments here
ATACCACAAGA...
By space-separated strings, I mean this case that you mentioned:
>contig1 additional comments here
But even in this case, simuG should still work.
Could you provide a sample genome (that simuG took extremely long to run on your side) for me to test?
Best, Jia-Xing
I used this genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_018492685.1/ Also this: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001215.4/
Hi @pdimens ,
Here is my testing results with GCF_018492685.1, which took ~50sec to finish.
refseq="GCF_018492685.1_fAloSap1.pri_genomic.fna.gz"
prefix="GCF_018492685.1_TEST"
time perl ./simuG/simuG.pl -refseq $refseq -prefix $prefix -snp_count 1000000 > $prefix.simuG.log
real 0m49.605s
user 0m51.517s
sys 0m3.077s
[Wed Sep 25 09:46:30 2024]
Starting simuG ..
[Wed Sep 25 09:46:30 2024]
Check specified options ..
Running simuG for SNP/INDEL simulation >>
Ignore all options for CNV/inversion/translocation simulation.
This simulation use the random seed: 637517621
The option snp_count has been specified: snp_count = 1000000
The option titv_ratio has been specified: titv_ratio = 0.5
[Wed Sep 25 09:46:38 2024] Introducing random SNPs based on the following parameters:
> snp_count = 1000000
> titv_ratio = 0.5
[Wed Sep 25 09:47:05 2024]
Simulation completed! :)
[Wed Sep 25 09:47:05 2024]
Generating output files ..
Generating the correspondance map for genomic variants introduced during simulation:
GCF_018492685.1_TEST.refseq2simseq.map.txt
Generating reference-based vcf file for genomic variants introduced during simulation:
GCF_018492685.1_TEST.refseq2simseq.SNP.vcf
[Wed Sep 25 09:47:16 2024]
Done! :)
Interestingly, my simulation with the much smaller Drosophila genome went slower (~9min), but still not as slow as you have encountered:
refseq="GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz"
prefix="GCF_000001215.4_TEST"
time perl ./simuG/simuG.pl -refseq $refseq -prefix $prefix -snp_count 1000000 > $prefix.simuG.log
real 9m11.650s
user 9m9.918s
sys 0m2.436s
[Wed Sep 25 09:52:14 2024]
Starting simuG ..
[Wed Sep 25 09:52:14 2024]
Check specified options ..
Running simuG for SNP/INDEL simulation >>
Ignore all options for CNV/inversion/translocation simulation.
This simulation use the random seed: 1538934389
The option snp_count has been specified: snp_count = 1000000
The option titv_ratio has been specified: titv_ratio = 0.5
[Wed Sep 25 09:52:15 2024] Introducing random SNPs based on the following parameters:
> snp_count = 1000000
> titv_ratio = 0.5
[Wed Sep 25 10:01:11 2024]
Simulation completed! :)
[Wed Sep 25 10:01:11 2024]
Generating output files ..
Generating the correspondance map for genomic variants introduced during simulation:
GCF_000001215.4_TEST.refseq2simseq.map.txt
Generating reference-based vcf file for genomic variants introduced during simulation:
GCF_000001215.4_TEST.refseq2simseq.SNP.vcf
[Wed Sep 25 10:01:22 2024]
Done! :)
My testing environment is a computing server equipped with CPU: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz RAM: 256 MB
simuG took very minimal of computing resources though.
Maybe have a try with another computing environment to see if things will change?
Best, Jia-Xing
I did some further testing and confirmed that if you have a lot of variants relative to the genome size, simuG
siezes up. For example, adding 1mil or 500k snps to a 148mb genome is near impossible. I had the program running over the last two weeks and it never finished that.
Hi @pdimens ,
Yes, the number of variants relative to genome size does correlate with running time but again I would never expect it could be that slow. My guess is the running time issue that you encountered is related to the specific computing environment or how Perl was configured on top of it. What is the CPU, RAM, and OS configuration for your computing environment?
Best, Jia-Xing
It was run on a cluster computing environment. I don't have the specs on hand right now, but it's CentOS, 42 Intel Xeon cores, 512GB RAM.
Maybe have a try with your own laptop or desktop? simuG can run comfortably on laptops. The substantial running time on your cluster is really abnormal.
Best, Jia-Xing
Compared to the other variant types, SNP/indel variant calling is extremely slow. Is there a reason for this bottleneck?