yukiteruono / pbsim2

PBSIM2: a simulator for long read sequencers with a novel generative model of quality scores
GNU General Public License v2.0

--difference-ratio with different accuracy #8

Closed lutfia95 closed 2 years ago

lutfia95 commented 3 years ago

Hi,

I am not sure how to use the --difference-ratio option correctly with different mutations.

E.g. pbsim genom_.fasta --length-min 360 --length-max 500 --accuracy-min 1.00 --accuracy-mean 1.00 --hmm_model R94.model --difference-ratio 0:0:0 gives 100% accuracy.

Suppose I want to try a mean accuracy of, e.g., 99%, 95%, or even 85%. How should I change --difference-ratio as the accuracy changes?

Best, Ahmad

yukiteruono commented 3 years ago

Thank you for using PBSIM2. If you know which --difference-ratio you want for each accuracy, specify those values. If not, I recommend using the same value (such as 23:31:46 for Nanopore) for all accuracies.

lutfia95 commented 3 years ago

Thank you for your answer. So if I set --difference-ratio 0:0:0, then I am simulating error-free reads.

Is it enough to set --accuracy-mean, or should I also set --accuracy-min and --accuracy-max? E.g. I want to simulate reads with an error rate of 1%, so do I set --accuracy-mean 0.99 and also --accuracy-min 0.99 and --accuracy-max 0.99?

lutfia95 commented 3 years ago

What do these values mean, e.g. 23:31:46 (substitution:insertion:deletion)? Does that mean I then have 23 substitutions per read? Or what exactly is the meaning of the values?

Best, Ahmad

yukiteruono commented 3 years ago

Is it enough to set --accuracy-mean, or should I also set --accuracy-min and --accuracy-max? E.g. I want to simulate reads with an error rate of 1%, so do I set --accuracy-mean 0.99 and also --accuracy-min 0.99 and --accuracy-max 0.99?

PBSIM2 simulates reads stochastically, so it cannot generate reads with exactly the specified accuracy (except for error-free reads); the accuracy of the generated reads follows an exponential distribution. Also, PBSIM2 is designed to simulate error-prone reads, so it cannot generate data with an average accuracy of 99%. About 96% should be the limit.

What do these values mean, e.g. 23:31:46 (substitution:insertion:deletion)? Does that mean I then have 23 substitutions per read? Or what exactly is the meaning of the values?

--difference-ratio specifies the ratio of error types introduced into the reads. For example, if you specify 1:2:3, the numbers of substitutions, insertions, and deletions will be in a ratio of approximately 1:2:3. Specifying 2:4:6 or 10:20:30 is equivalent to specifying 1:2:3.
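
To make the ratio semantics concrete, here is a small Python sketch. It is not PBSIM2 code: the function names parse_difference_ratio and expected_errors are hypothetical, and the estimate "total errors ≈ (1 - accuracy) × read length" is a simplifying assumption used only for illustration, not a statement of how PBSIM2 computes errors internally.

```python
from fractions import Fraction


def parse_difference_ratio(ratio: str) -> dict:
    """Convert 'sub:ins:del' into the share of errors of each type.

    Hypothetical helper for illustration; not part of PBSIM2.
    """
    parts = [int(p) for p in ratio.split(":")]
    if len(parts) != 3 or any(p < 0 for p in parts):
        raise ValueError("expected three non-negative integers, e.g. 23:31:46")
    total = sum(parts)
    if total == 0:
        # 0:0:0 leaves no error types to split; not handled in this sketch.
        raise ValueError("0:0:0 has no error types to split")
    names = ("substitution", "insertion", "deletion")
    # Exact fractions make equivalent ratios compare equal.
    return {name: Fraction(p, total) for name, p in zip(names, parts)}


def expected_errors(ratio: str, read_length: int, accuracy: float) -> dict:
    """Rough expected error counts per type for a single read.

    Assumption (not from PBSIM2): total errors ~ (1 - accuracy) * read_length.
    """
    total_errors = (1 - accuracy) * read_length
    shares = parse_difference_ratio(ratio)
    return {name: float(share) * total_errors for name, share in shares.items()}


# Only the proportions matter, so these ratios are equivalent.
print(parse_difference_ratio("1:2:3") == parse_difference_ratio("10:20:30"))

# Rough split of errors for a 1000-base read at 90% accuracy.
print(expected_errors("23:31:46", read_length=1000, accuracy=0.90))
```

Under that assumption, 23:31:46 does not mean 23 substitutions per read. It means that roughly 23% of the errors are substitutions, 31% are insertions, and 46% are deletions, whatever the total number of errors turns out to be.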

lutfia95 commented 3 years ago

Thank you for the answers!