yjx1217 / simuG

simuG: a general-purpose genome simulator
MIT License

How to determine the quantity of various kinds of random mutations in a genomic simulation? #15

Open DayTimeMouse opened 3 months ago

DayTimeMouse commented 3 months ago

Hi,

I am quite puzzled about how to appropriately set the number of different types of mutations when simulating a human cancer genome.

Could you provide me with some guidance on this matter?

Thanks a lot!

yjx1217 commented 3 months ago

Hi there, this is a tricky question that depends on a lot of things.

1) simuG is designed for simulating germline variants in general, which means it will not introduce variants with different somatic allele frequencies during its simulation. You can of course let simuG simulate a number of differently mutated genomes and then simulate reads from each of these genomes to create a mosaic read dataset that resembles a cancer genome. That being said, there should be other, more specialized cancer genome simulators that can give you a more realistic simulation of somatic mutations. You can take a look at those.

2) Both the number and the spectrum of somatic mutations can vary a lot across cancer types. So my suggestion is to extract the mutation burden and spectrum information from real cancer genome data to guide your simulation.

Best, Jia-Xing
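
The mosaic approach described above needs a read depth for each simulated genome. The following is a minimal Python sketch of that arithmetic, assuming each genome contributes reads in proportion to its cell fraction. The clone names, fractions, and the 60x target are hypothetical illustration values, and `split_coverage` is not part of simuG.

```python
def split_coverage(total_coverage, clone_fractions):
    """Split a bulk target coverage across genomes in proportion to cell fraction."""
    total = sum(clone_fractions.values())
    if total <= 0:
        raise ValueError("clone fractions must sum to a positive number")
    # Normalizing by the total lets the fractions be given as any positive weights.
    return {name: total_coverage * frac / total
            for name, frac in clone_fractions.items()}


# Hypothetical mixture: 30% normal cells plus two tumor subclones.
fractions = {"normal": 0.3, "clone_A": 0.5, "clone_B": 0.2}
per_clone = split_coverage(60.0, fractions)

for name, coverage in per_clone.items():
    print(f"{name}: simulate reads at {coverage:.1f}x")
```

Each per-genome depth would then be passed to whichever read simulator is used, and the resulting read sets pooled into one dataset.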

DayTimeMouse commented 3 months ago

First and foremost, I greatly appreciate your reply.

My research goal is to simulate a cancer genome derived from the human reference genome by individually simulating the paternal and maternal genomes, subsequently simulating reads based on these genomes, and finally combining the reads from both parental genomes into a single final fastq file.

I am wondering whether it is viable to use simuG for this purpose, considering that simuG includes features for simulating a variety of mutation types such as SNPs, INDELs, CNVs, INVERSIONS (INV), and TRANSLOCATIONS (TRA), all of which are integral aspects of simulating cancer genomes.

Thank you once again for your help.
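
The last step of the workflow above, combining the reads from both parental genomes into one FASTQ file, can be sketched in Python as follows. This assumes uncompressed, four-line-per-record FASTQ files. The file names and reads are made up for the demo, and `merge_fastq` is a hypothetical helper rather than anything provided by simuG.

```python
import tempfile
from pathlib import Path


def merge_fastq(inputs, output):
    """Concatenate uncompressed FASTQ files; return the number of reads written."""
    total_lines = 0
    with open(output, "w") as out:
        for path in inputs:
            n_lines = 0
            with open(path) as handle:
                for line in handle:
                    # Guard against a missing newline at the end of a file.
                    out.write(line if line.endswith("\n") else line + "\n")
                    n_lines += 1
            if n_lines % 4 != 0:
                raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
            total_lines += n_lines
    return total_lines // 4


# Tiny demo with made-up reads standing in for the two haplotype read sets.
with tempfile.TemporaryDirectory() as tmp:
    paternal = Path(tmp) / "paternal.fq"
    maternal = Path(tmp) / "maternal.fq"
    paternal.write_text("@pat_read1\nACGT\n+\nIIII\n")
    maternal.write_text("@mat_read1\nTTGA\n+\nIIII\n@mat_read2\nGGCC\n+\nIIII\n")
    merged = Path(tmp) / "tumor.fq"
    n_reads = merge_fastq([paternal, maternal], merged)
    merged_text = merged.read_text()

print(f"merged {n_reads} reads")
```

For paired-end data, the R1 and R2 files would need to be merged separately with the inputs in the same order, so that mates stay at matching positions.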

yjx1217 commented 3 months ago

Hi @DayTimeMouse ,

I see. Yes, simuG can definitely do what you need. Just find a cancer genome sequencing paper for the specific cancer type that you want to cover and plug in its estimated per-sample SNV, CNV, and SV counts. We have a paper on NKTCL that reports these numbers coming out soon. If you are interested, I will post a link to the paper when it comes out in early to mid April.

One thing to keep in mind is that there can be significant intratumor genomic heterogeneity across cancer cells, and what we usually sequence for real cancer genomes is a bulk sample of such a heterogeneous cell population. I don't know the bigger scientific question behind your study, but you might want to introduce, or at least discuss, the noise generated by this intercellular genomic heterogeneity when you compare your simulated data with real cancer sequencing data.

Best, Jia-Xing
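
To get a feel for how this heterogeneity shows up in bulk data, the expected allele fraction of a variant can be estimated from the cell fractions of the clones that carry it. The sketch below is a simplification under stated assumptions: every cell is diploid at the locus, the variant is heterozygous (one mutated copy), and reads are sampled uniformly. The clone names and fractions are hypothetical.

```python
def expected_vaf(clone_fractions, carriers, copies_mutated=1, ploidy=2):
    """Expected variant allele fraction in a bulk mix of clones.

    Assumes the same ploidy in every cell and uniform read sampling.
    """
    total = sum(clone_fractions.values())
    carrier_fraction = sum(clone_fractions[name] for name in carriers) / total
    return carrier_fraction * copies_mutated / ploidy


# Hypothetical mixture: 30% normal cells plus two tumor subclones.
fractions = {"normal": 0.3, "clone_A": 0.5, "clone_B": 0.2}

# A variant shared by both tumor clones versus one private to clone_B.
shared_vaf = expected_vaf(fractions, ["clone_A", "clone_B"])
private_vaf = expected_vaf(fractions, ["clone_B"])
print(f"shared: {shared_vaf:.2f}, private: {private_vaf:.2f}")
```

Variants private to a small subclone end up at low allele fractions, which is one source of the noise to account for when comparing simulated and real bulk data.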

DayTimeMouse commented 3 months ago

Yes, I am very interested in your related work and look forward to you posting the link. I will follow it.

Finally, I wish you all the best!

yjx1217 commented 3 months ago

Hi @DayTimeMouse ,

Here is the link to our paper with per-sample variant number estimated for SNVs, CNVs, and SVs: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-024-01324-5

Best, Jia-Xing

DayTimeMouse commented 3 months ago

Hi @yjx1217,

Thank you so much, I learned a lot from this paper.

Warm Regards.

DayTimeMouse commented 1 week ago

Hi @yjx1217,

I used pbsv (https://github.com/PacificBiosciences/pbsv) to call SVs, and it reports DUPLICATION calls.

I want to ask how to set up duplication variants in simuG. Is there any difference between the CNV and DUPLICATION setup?

The description of DUPLICATION is shown below:

image

yjx1217 commented 1 week ago

Hi @DayTimeMouse,

Thanks for the email. You can consider CNV as a consequence of segmental/tandem insertion + deletion + duplication + contraction.

So you can use simuG to introduce CNVs in general, which will include some duplications. If you only want to simulate duplications, you can still use simuG's CNV simulation function with specialized settings for the following parameters:


    -cnv_gain_loss_ratio
            Specify the relative ratio of DNA gain over DNA loss. Default =
            1.0. Example: -cnv_gain_loss_ratio 1.0. For copy number gain
            only, set '-cnv_gain_loss_ratio Inf'. For copy number loss only,
            set '-cnv_gain_loss_ratio 0'.

    -cnv_max_copy_number
            Specify the maximal copy number for CNV. Default = 10. Example:
            -cnv_max_copy_number 10.

    -cnv_min_size
            Specify the minimal size (in basepair) for CNV variants. Default
            = 100. Example: -cnv_min_size 100.

    -cnv_max_size
            Specify the maximal size (in basepair) for CNV variants. Default
            = 100000. Example: -cnv_max_size 100.

    -duplication_tandem_dispersed_ratio
            Specify the expected frequency ratio between tandem duplication
            and dispersed duplication for CNV variants. Default = 1.
            Example: -duplication_tandem_dispersed_ratio 1. For simulating
            tandem duplication only, set
            '-duplication_tandem_dispersed_ratio Inf'. For simulating
            dispersed duplication only, set
            '-duplication_tandem_dispersed_ratio 0'.

Best, Jia-Xing
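
Putting the parameters above together, a duplication-only run could be assembled as in the Python sketch below. The five `-cnv_*` and `-duplication_*` options and their `Inf` settings come from the help text quoted above. The script name `simuG.pl`, the `-refseq`, `-cnv_count`, and `-prefix` options, and the file names are assumptions here and should be checked against simuG's own help output before use. The sketch only builds and prints the command, it does not run it.

```python
import shlex


def duplication_only_args(refseq, prefix, cnv_count, tandem_only=False,
                          min_size=100, max_size=100000, max_copy_number=10):
    """Build an argument list for a copy-number-gain-only CNV simulation."""
    args = [
        "perl", "simuG.pl",               # assumed script name
        "-refseq", refseq,                # assumed option
        "-cnv_count", str(cnv_count),     # assumed option
        "-prefix", prefix,                # assumed option
        "-cnv_gain_loss_ratio", "Inf",    # gain only, per the help text above
        "-cnv_min_size", str(min_size),
        "-cnv_max_size", str(max_size),
        "-cnv_max_copy_number", str(max_copy_number),
    ]
    if tandem_only:
        # Restrict to tandem duplications, per the help text above.
        args += ["-duplication_tandem_dispersed_ratio", "Inf"]
    return args


# Hypothetical inputs: reference file name, output prefix, and CNV count.
args = duplication_only_args("ref.fa", "dup_only", cnv_count=20, tandem_only=True)
print(shlex.join(args))
```

Setting `-cnv_max_copy_number` lower (for example 2) would presumably keep the simulated events closer to simple duplications, but that is a guess about how the option behaves and worth confirming on a small test run.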