wdecoster / NanoPlot

Plotting scripts for long read sequencing data
http://nanoplot.bioinf.be
MIT License

Nanoplot quality thresholds are poorly tuned for current datasets #334

Closed by nhartwic 6 months ago

nhartwic commented 1 year ago

NanoPlot (strictly speaking, nanomath) reports the percentage of reads whose average quality is above each of a set of thresholds. Those thresholds are currently 5, 7, 10, 12, and 15. They made a lot of sense when these packages were written, but ONT basecalling has improved dramatically since then, and I think the tool would benefit from revised values. I'm thinking of something like 10, 12, 15, 17, 20, or maybe 9, 13, 17, 21, 25. This should be as simple as updating the array defined at...

https://github.com/wdecoster/nanomath/blob/40aa42a11bd056c268ed10a5bc25a3f99a538317/nanomath/nanomath.py#LL51C18-L51C30

...I can put together the relevant pull requests if you would like. I wanted to post this issue first so that we could discuss which values make the most sense, or whether other changes would be better, before I make the PR.
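
To make the effect of the proposal concrete, here is a minimal sketch of the calculation described above: for each threshold, the percentage of reads whose mean quality exceeds it. This is not nanomath's actual code. The function name `percent_above_thresholds`, the strict `>` comparison, and the example read qualities are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Sequence

# Threshold sets taken from the discussion above.
CURRENT_THRESHOLDS = (5, 7, 10, 12, 15)
PROPOSED_THRESHOLDS = (9, 13, 17, 21, 25)


def percent_above_thresholds(
    mean_quals: Sequence[float], thresholds: Iterable[int]
) -> Dict[int, float]:
    """Map each threshold to the percentage of reads with mean quality above it.

    Hypothetical helper: the strict '>' comparison is an assumption,
    not a statement about how nanomath compares values.
    """
    total = len(mean_quals)
    if total == 0:
        # No reads: report 0% for every threshold instead of dividing by zero.
        return {q: 0.0 for q in thresholds}
    return {
        q: 100.0 * sum(1 for m in mean_quals if m > q) / total
        for q in thresholds
    }


# Made-up per-read mean qualities, roughly in the range of recent basecalls.
example_reads = [8.2, 11.5, 14.0, 18.3, 19.1, 22.4, 24.8, 26.0]

for label, thresholds in (
    ("current", CURRENT_THRESHOLDS),
    ("proposed", PROPOSED_THRESHOLDS),
):
    stats = percent_above_thresholds(example_reads, thresholds)
    print(label, {f">Q{q}": f"{pct:.1f}%" for q, pct in stats.items()})
```

With data like this, the two lowest current thresholds both report 100% and so carry no information, while the proposed set spreads the reads across all five bins.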

wdecoster commented 1 year ago

Hi,

Yes, this makes a lot of sense to me. We may have to span a broader range to accommodate 'legacy', simplex and duplex data. I would include Q25, so the proposed 9, 13, 17, 21, and 25 seem appropriate.

Thanks! Wouter

chucknordy commented 6 months ago

I agree it would be nice to update the quality thresholds for the reported stats.

Or perhaps even better: add a new command-line option to specify an arbitrary list of threshold Q values (in which case, the current list of [5, 7, 10, 12, 15] could be left unchanged, as the default).
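
A rough sketch of what such an option could look like with Python's standard `argparse` module. The flag name `--qthresholds`, the helper `parse_thresholds`, and the program name are hypothetical; nothing here is an existing NanoPlot option.

```python
import argparse
from typing import List

# The list that would remain the default under this suggestion.
DEFAULT_THRESHOLDS = [5, 7, 10, 12, 15]


def parse_thresholds(text: str) -> List[int]:
    """Parse a comma-separated string such as '10,15,20' into sorted unique ints."""
    try:
        values = [int(token) for token in text.split(",")]
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected comma-separated integers, got {text!r}"
        )
    if any(v < 0 for v in values):
        raise argparse.ArgumentTypeError("quality thresholds must be non-negative")
    return sorted(set(values))


def build_parser() -> argparse.ArgumentParser:
    # 'nanoplot-sketch' is a placeholder program name for this illustration.
    parser = argparse.ArgumentParser(prog="nanoplot-sketch")
    parser.add_argument(
        "--qthresholds",  # hypothetical flag name
        type=parse_thresholds,
        default=DEFAULT_THRESHOLDS,
        help="comma-separated Q thresholds for the summary statistics",
    )
    return parser


# Without the flag, the default list is used unchanged.
print(build_parser().parse_args([]).qthresholds)
# With the flag, the user-supplied list replaces it.
print(build_parser().parse_args(["--qthresholds", "20,10,15"]).qthresholds)
```

Taking a single comma-separated string keeps this to one flag and lets the custom type function validate the values before any statistics are computed.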

wdecoster commented 6 months ago

Quality thresholds have been updated to [10, 15, 20, 25, 30] in nanomath v1.4.0. I understand that adding new command-line options would increase flexibility, but at the end of the day everything would become a command-line option, and that would be a mess.