yukiteruono / pbsim2

PBSIM2: a simulator for long read sequencers with a novel generative model of quality scores
GNU General Public License v2.0

Accuracy mean floating point #10

Closed esteinig closed 2 years ago

esteinig commented 2 years ago

Hey, love the tool. I ran into an issue using the ONT HMM models with --accuracy-mean: --accuracy-mean 0.99 (Q20) and --accuracy-mean 0.999 (Q30) produce identical results, and in both cases they are the Q20 results. Is it possible to add support for finer-scale floats for this option?
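
For reference, the Q values quoted above follow from the standard Phred relation Q = -10 * log10(1 - accuracy). The sketch below is only an illustration of that arithmetic; the function names are hypothetical and are not part of PBSIM2.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (0 <= accuracy < 1) to a Phred quality score."""
    error = 1.0 - accuracy
    if error <= 0.0:
        # An accuracy of exactly 1.0 has no finite Phred score.
        raise ValueError("accuracy must be below 1.0 for a finite Phred score")
    return -10.0 * math.log10(error)


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


# The two values passed to --accuracy-mean in this issue.
for acc in (0.99, 0.999):
    print(f"accuracy {acc} -> Q{accuracy_to_phred(acc):.1f}")
```

Under this relation the two settings differ by a full 10 Phred units, so identical output for both suggests the finer value is not taking effect.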

Essentially, this command is meant to simulate Kit12 chemistry (mean ~Q20) on R9.4.1 pores, but it produces ~20% perfect reads (Q93):

pbsim --difference-ratio 23:31:46 --length-mean 15000 --length-sd 9000 --accuracy-mean 0.99 --hmm_model pbsim2/data/R94.model --depth 1 CHM13.fasta

Q score distribution:

> 5   203502        100.0%
> 7   201494        99.0%
> 10  183349        90.1%
> 12  158723        78.0%
> 15  119542        58.7%
> 20  60246         29.6%
> 25  40513         19.9% 
> 30  40513         19.9% 

Top ranking read qualities (Q)

1. 93.0
2. 93.0
3. 93.0
4. 93.0
5. 93.0

The command above produces the same output as:

pbsim --difference-ratio 23:31:46 --length-mean 15000 --length-sd 9000 --accuracy-mean 0.999 --hmm_model pbsim2/data/R94.model --depth 1 CHM13.fasta

Q score distribution:

> 5   203502        100.0%
> 7   201494        99.0%
> 10  183349        90.1%
> 12  158723        78.0%
> 15  119542        58.7%
> 20  60246         29.6%
> 25  40513         19.9% 
> 30  40513         19.9% 

Top ranking read qualities (Q)

1. 93.0
2. 93.0
3. 93.0
4. 93.0
5. 93.0

When I filter out reads with an average read quality above Q90, the perfect reads are removed, leaving Q20 as the highest read quality. This seems inconsistent with --accuracy-mean 0.99 and the default --accuracy-max of 1.0: given these settings, I would expect the simulated dataset to contain reads between Q20 (1% error) and Q93 (effectively 0% error).

Read quality thresholds (Q)

> 5   162989        100.0%
> 7   160981        98.8%
> 10  142836        87.6%
> 12  118210        72.5%
> 15  79029         48.5%
> 20  19733         12.1%
> 25  0             00.0%
> 30  0             00.0%

Top ranking read qualities (Q)

1. 20.0
2. 20.0
3. 20.0
4. 20.0
5. 20.0
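
The issue does not say which tool computed the per-read Q values. As a rough sketch, assuming Phred+33 FASTQ encoding and a mean taken over per-base error probabilities (one common convention, not necessarily the one used here), a read reaches Q93 only when every quality character is '~', the highest printable ASCII character. The function name is hypothetical.

```python
import math

PHRED_OFFSET = 33  # assumption: Sanger / Phred+33 FASTQ encoding


def mean_read_quality(quality_string: str) -> float:
    """Mean read quality, averaging per-base error probabilities (not raw Q values)."""
    if not quality_string:
        raise ValueError("quality string must not be empty")
    error_probs = [10.0 ** (-(ord(c) - PHRED_OFFSET) / 10.0) for c in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    return -10.0 * math.log10(mean_error)


perfect_read = "~" * 100  # '~' is ASCII 126, i.e. Q93 per base
q20_read = "5" * 100      # '5' is ASCII 53, i.e. Q20 per base

print(round(mean_read_quality(perfect_read), 1))
print(round(mean_read_quality(q20_read), 1))
```

With this convention even a few low-quality bases pull the read mean well below Q93, so a cluster of reads at exactly Q93 would point to quality strings made up entirely of the maximum character.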

yukiteruono commented 2 years ago

Thank you for using PBSIM2. PBSIM2 is intended for simulating error-prone long reads, so it cannot handle accuracies above 99%, such as those of PacBio HiFi reads and ONT Kit 12 chemistry. The next version of PBSIM, currently under development, implements simulation of PacBio HiFi reads. For ONT Kit 12 chemistry, we are investigating its characteristics in order to develop a simulation method.

esteinig commented 2 years ago

Thanks! It would be amazing to have this finer-scale control over read accuracy, as your tool is the fastest one I could find :)