yukiteruono / pbsim3

PBSIM3: a simulator for all types of PacBio and ONT long reads
GNU General Public License v2.0

Quality score VS Error models #15

Open pabloangulo7 opened 4 months ago

pabloangulo7 commented 4 months ago

Hello,

The paper does not make it clear to me when the error model is preferable to the quality score model. If I understood correctly, the error model simulates the non-uniformity of errors and the different error types better than the quality score model does? Also, since reads produced with the error model have no real quality scores (only ! appears in the quality line), which model would be better for a second step in which the reads are aligned with minimap2: the error model or the quality score model?

Thanks in advance,

Pablo

yukiteruono commented 4 months ago

Thank you for your interest in PBSIM3. The strengths and weaknesses of each model, along with the evaluation results, are described in the paper: https://academic.oup.com/nargab/article/4/4/lqac092/6855700 . As you say, it is not clear whether the error model or the quality score model is better overall. What is certain is that the error model is slightly better at simulating the nonuniformity of errors. The quality score model has two advantages: it generates quality scores, and it lets you choose any error ratio (substitution:insertion:deletion). The latter in particular expands the range of simulation tests that can be performed.

For alignment with minimap2, use the quality score model if quality scores are required; otherwise you can use either model, since minimap2 accepts input in FASTA format.

For Nanopore simulations with an accuracy of 95% or higher, we strongly recommend the ERRHMM-ONT-HQ.model and QSHMM-ONT-HQ.model added in v3.0.1. In the paper, PBSIM3 simulated a slightly higher deletion rate for HiFi reads, but with either the error model or the quality score model you can simulate the desired HiFi reads by adjusting the CLR error rate, the error ratio, and the number of passes.
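As a rough illustration of two points from this reply, here is a minimal Python sketch. It is not part of PBSIM3, and all function names are hypothetical. It assumes a simple four-line-per-record FASTQ file, and it assumes that a substitution:insertion:deletion ratio divides the overall error rate proportionally. The ratio 1:2:1 used in the usage note is an arbitrary example value, not a PBSIM3 default.

```python
import io


def split_error_rate(total_error, ratio):
    """Split an overall error rate by a 'sub:ins:del' ratio string.

    Assumption: the ratio is applied proportionally to the total error rate.
    """
    parts = [float(x) for x in ratio.split(":")]
    if len(parts) != 3 or sum(parts) <= 0:
        raise ValueError("ratio must look like 'sub:ins:del'")
    total = sum(parts)
    return tuple(total_error * p / total for p in parts)


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def has_real_qualities(records):
    """Return True if any quality character differs from the '!' placeholder."""
    return any(set(qual) - {"!"} for _, _, qual in records)


def to_fasta(records):
    """Drop the quality lines and format the reads as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in records)
```

For example, `split_error_rate(0.12, "1:2:1")` gives roughly 3% substitutions, 6% insertions, and 3% deletions. If `has_real_qualities` returns False for a read set, nothing is lost by passing the `to_fasta` output to an aligner that accepts FASTA input.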