yjx1217 / simuG

simuG: a general-purpose genome simulator
MIT License

slow CNV creation #5

Closed RichardCorbett closed 1 year ago

RichardCorbett commented 2 years ago

Hi there,

I am giving your tool a try as it looks very simple to run and it seems to do exactly what I want for simulating CNV changes in germline nanopore reads.

I am running with this command:

simuG.pl -r hg38_no_alt.fa -cnv_count 50 -cnv_min_size 500 -cnv_max_size 300000000

But this has been running for 8 days. Do you have any tips to make this run faster?

yjx1217 commented 2 years ago

Hi Richard,

Thanks for trying out simuG. And yes, 8 days is way too long! I think the problem is most likely that -cnv_max_size is set too large (300 Mb in your current setting). simuG samples CNV sizes from a uniform distribution, so it is highly likely that a sampled CNV will be too large to fit into any human chromosome. When that happens, simuG keeps trying to find a chromosomal location for the sampled CNV but never succeeds. To prevent this, I should probably implement an internal bound check in simuG in the future. For now, please reduce -cnv_max_size to a more realistic value (e.g. 1-10 Mb), and make sure it is smaller than the largest chromosome in your input genome (~249 Mb for the human genome). Let me know how it works.
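The effect of the uniform sampling described above can be estimated with a few lines of Python. This is only a back-of-the-envelope sketch, not simuG's own code: the function name prob_unplaceable is hypothetical, the draws are assumed to be independent, and the only outside fact used is the length of chr1 in GRCh38 (248,956,422 bp).

```python
def prob_unplaceable(min_size, max_size, largest_chrom):
    """Probability that one size drawn uniformly from [min_size, max_size]
    is longer than the largest chromosome (hypothetical helper)."""
    if max_size <= largest_chrom:
        return 0.0
    # Number of integer sizes in the range that exceed the largest chromosome.
    too_big = max_size - max(largest_chrom, min_size - 1)
    return too_big / (max_size - min_size + 1)


LARGEST_CHROM = 248_956_422  # chr1 in GRCh38; check your own .fai for other genomes

# Settings from the original command: -cnv_min_size 500 -cnv_max_size 300000000
p_one = prob_unplaceable(500, 300_000_000, LARGEST_CHROM)

# Chance that at least one of 50 independent draws cannot fit anywhere.
p_any = 1 - (1 - p_one) ** 50

print(f"one draw too large: {p_one:.3f}")
print(f"at least one of 50 too large: {p_any:.5f}")
```

With the original settings, roughly one draw in six is larger than chr1, so a run asking for 50 CNVs is almost certain to hit at least one size that can never be placed.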

Best, Jia-Xing

RichardCorbett commented 2 years ago

Thank you.

I tried this and it worked in a few seconds:

simuG.pl -r hg38_no_alt.fa -cnv_count 50 -cnv_min_size 50 -cnv_max_size 50000000

as did this:

simuG.pl -r hg38_no_alt.fa -cnv_count 50 -cnv_min_size 500 -cnv_max_size 50000000

This command, however, runs for a week and doesn't complete:

simuG.pl -r hg38_no_alt.fa -cnv_count 50 -cnv_min_size 500 -cnv_max_size 100000000

yjx1217 commented 2 years ago

Hi Richard,

Thanks for the testing and the confirmation!

Your last run, with -cnv_count 50 and -cnv_max_size set to 100 Mb, cannot complete for the same reason: there is not enough genomic space to place all of the simulated events. See here for the sizes of the human chromosomes:

[image: sizes of the human chromosomes]

As you can see, only 16 chromosomes can hold a CNV longer than 100 Mb. Since simuG doesn't allow overlapping events by design, it is likely that simuG cannot find enough genomic space to place all of the CNV events when you ask for 50 CNVs in total. So if you really want to simulate very large CNVs (e.g. >50 Mb), I would recommend reducing the value of -cnv_count as a workaround.
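The same reasoning can be checked with simple arithmetic before launching a run. The Python sketch below is illustrative only and does not reproduce simuG's placement logic: the helper names chroms_that_fit and expected_total are hypothetical, and the chromosome lengths are the GRCh38 primary assembly values for chr1-22, X and Y (hg38_no_alt.fa may also contain chrM and unplaced contigs, so check against your own .fai).

```python
# GRCh38 primary chromosome lengths in bp (chr1-22, X, Y only).
GRCH38 = {
    "chr1": 248_956_422, "chr2": 242_193_529, "chr3": 198_295_559,
    "chr4": 190_214_555, "chr5": 181_538_259, "chr6": 170_805_979,
    "chr7": 159_345_973, "chr8": 145_138_636, "chr9": 138_394_717,
    "chr10": 133_797_422, "chr11": 135_086_622, "chr12": 133_275_309,
    "chr13": 114_364_328, "chr14": 107_043_718, "chr15": 101_991_189,
    "chr16": 90_338_345, "chr17": 83_257_441, "chr18": 80_373_285,
    "chr19": 58_617_616, "chr20": 64_444_167, "chr21": 46_709_983,
    "chr22": 50_818_468, "chrX": 156_040_895, "chrY": 57_227_415,
}


def chroms_that_fit(size, chrom_sizes):
    """Names of chromosomes longer than `size` (hypothetical helper)."""
    return [name for name, length in chrom_sizes.items() if length > size]


def expected_total(count, min_size, max_size):
    """Expected summed length of `count` sizes drawn uniformly from the range."""
    return count * (min_size + max_size) / 2


genome_size = sum(GRCH38.values())

# Settings from the run that did not finish: 50 CNVs, 500 bp to 100 Mb.
big_enough = chroms_that_fit(100_000_000, GRCH38)
fraction_100mb = expected_total(50, 500, 100_000_000) / genome_size

# Settings from the run that finished in seconds: 50 CNVs, 500 bp to 50 Mb.
fraction_50mb = expected_total(50, 500, 50_000_000) / genome_size

print(f"chromosomes longer than 100 Mb: {len(big_enough)}")
print(f"expected CNV span, 100 Mb max: {fraction_100mb:.0%} of the genome")
print(f"expected CNV span, 50 Mb max: {fraction_50mb:.0%} of the genome")
```

Under these assumptions, 50 non-overlapping CNVs with a 100 Mb maximum would on average need to cover most of the genome, while the 50 Mb setting needs about half as much, which is consistent with one run stalling and the other finishing quickly.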

Best, Jia-Xing