weird EP distribution in QSHMM-ONT-HQ.model

yukiteruono / pbsim3

PBSIM3: a simulator for all types of PacBio and ONT long reads

GNU General Public License v2.0

weird EP distribution in QSHMM-ONT-HQ.model #21

Closed by ocxtal 3 months ago

ocxtal commented 4 months ago

When I used the latest pbsim3 (3.0.2) with the --qshmm data/QSHMM-ONT-HQ.model option, I found that some of the generated reads have strange quality strings: both the sequence line and the quality line look like nucleotide sequences:

[Screenshot 2024-02-25 at 13 36 52]
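
A minimal sketch of how such reads could be flagged automatically. It is not part of pbsim3; the function names are hypothetical, and it assumes strict 4-line FASTQ records. A genuine quality string could in principle contain only the characters A, C, G, and T, but for long reads that is very unlikely.

```python
NUCLEOTIDES = set("ACGT")


def looks_like_sequence(qual: str) -> bool:
    """Return True if a non-empty quality string contains only A/C/G/T."""
    return bool(qual) and set(qual) <= NUCLEOTIDES


def suspicious_reads(fastq_lines):
    """Yield names of reads whose quality line looks like nucleotide sequence.

    ASSUMPTION: every record is exactly 4 lines (name, sequence, '+', quality).
    """
    lines = [ln.rstrip("\n") for ln in fastq_lines]
    for i in range(0, len(lines) - 3, 4):
        name, qual = lines[i], lines[i + 3]
        if looks_like_sequence(qual):
            yield name[1:]  # drop the leading '@'
```

Passing an open FASTQ file handle to `suspicious_reads` would list the affected read names.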

I investigated the code and the model file, and found that the columns corresponding to A, C, G, and T in the 97 EP lines of QSHMM-ONT-HQ.model have much higher values than the other columns of those lines. Is this due to a bug in the training script?
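
As a rough illustration of this check: in Phred+33 encoding the characters A, C, G, and T stand for quality scores 32, 34, 38, and 51. The sketch below measures how much of an EP line's probability mass falls in those four columns. The line layout is an assumption, not taken from the pbsim3 documentation: whitespace-separated fields, three header fields (`EP`, accuracy, state), then one value per quality score starting at Q0. Adjust `n_header` if the real file differs.

```python
def phred33(ch: str) -> int:
    """Quality score encoded by a Phred+33 character."""
    return ord(ch) - 33


# Quality scores whose characters are nucleotide letters.
NUC_QUALS = {ch: phred33(ch) for ch in "ACGT"}


def nucleotide_mass(ep_line: str, n_header: int = 3) -> float:
    """Fraction of an EP line's total mass in the A/C/G/T columns.

    ASSUMPTION: the first n_header fields are labels and the remaining
    fields are per-quality-score values indexed from Q0.
    """
    probs = [float(x) for x in ep_line.split()[n_header:]]
    total = sum(probs)
    if total == 0:
        return 0.0
    nuc = sum(probs[q] for q in NUC_QUALS.values() if q < len(probs))
    return nuc / total
```

A value close to 1 for the 97 EP lines, compared with a small value for neighbouring accuracies, would be consistent with the observation above.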

yukiteruono commented 4 months ago

Thank you for reporting this problem. Due to a bug, most of the training data (quality scores) for accuracy=97 was replaced with nucleotide sequences. It will take about a week to correct the model. In the meantime, we have removed QSHMM-ONT-HQ.model and ERRHMM-ONT-HQ.model.

yukiteruono commented 4 months ago

We have confirmed that there are no problems with models other than QSHMM-ONT-HQ.model.

ocxtal commented 4 months ago

Thank you for the investigation. For my own evaluation purposes, I've replaced all of the 97 EP values with the 96 EP ones, and the generated sequences look fine.
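
A minimal sketch of this kind of stopgap, not the script actually used. It assumes, hypothetically, that each EP line has the form `EP <accuracy> <state> <values...>` with whitespace-separated fields; the function name `copy_ep_rows` is made up for illustration.

```python
def copy_ep_rows(lines, src_acc="96", dst_acc="97"):
    """Return model lines with dst_acc EP values replaced by src_acc ones.

    ASSUMPTION: EP lines look like 'EP <accuracy> <state> <values...>'.
    All other lines are passed through unchanged.
    """
    # First pass: collect the source values, keyed by state.
    src = {}
    for ln in lines:
        f = ln.split()
        if len(f) > 3 and f[0] == "EP" and f[1] == src_acc:
            src[f[2]] = f[3:]

    # Second pass: rewrite matching destination lines.
    out = []
    for ln in lines:
        f = ln.split()
        if len(f) > 3 and f[0] == "EP" and f[1] == dst_acc and f[2] in src:
            out.append(" ".join(f[:3] + src[f[2]]))
        else:
            out.append(ln.rstrip("\n"))
    return out
```

The result can be written to a new file and passed to --qshmm in place of the original model.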

yukiteruono commented 3 months ago

Please use v3.0.4.

ocxtal commented 3 months ago

Thanks, I'll try it.

ocxtal commented 3 months ago

The generated quality strings looked good. Thank you for the fix!