Quality Scores - Githubissues

wdecoster / NanoPlot

Plotting scripts for long read sequencing data

http://nanoplot.bioinf.be

MIT License


Quality Scores #355

Closed BerrocalRubio333 closed 3 months ago

BerrocalRubio333 commented 7 months ago

Hi Wouter and all,

I am trying to understand the NanoPlot outputs in detail. Could you please tell me:

- In the stats summary, where you show the quality cut-offs, are these calculated from the median or the mean quality of a read?

- Which approach is used to calculate the average mean quality and the average read quality Q scores shown in the KDE plots?

Best.

Miguel

wdecoster commented 7 months ago

Hi,

Thanks for your question. For every read, the average quality is calculated by converting phred-scale scores into error probabilities, taking the average, and converting it back to phred-scale. This is different from how other tools do it, but correct, in my opinion. For those per-read averages, you get the overall median or the mean in the NanoStat report. And for the quality cut-offs and the kde plot the per-read averages are used.
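A minimal sketch of the averaging described above, for illustration only. It is not NanoPlot's actual code: the function names are hypothetical, and the strict `>` comparison for the cut-off is an assumption.

```python
import math


def mean_read_quality(quals):
    """Average Phred quality of one read, computed via error probabilities.

    `quals` is a list of per-base Phred scores (hypothetical input format).
    """
    if not quals:
        raise ValueError("read has no quality scores")
    # Phred -> error probability: p = 10^(-Q/10)
    mean_error = sum(10 ** (-q / 10) for q in quals) / len(quals)
    # Mean error probability -> Phred: Q = -10 * log10(p)
    return -10 * math.log10(mean_error)


def reads_above_cutoff(reads, cutoff):
    """Count reads whose per-read average quality exceeds `cutoff`.

    Whether the real tool uses > or >= is an assumption here.
    """
    return sum(1 for quals in reads if mean_read_quality(quals) > cutoff)


if __name__ == "__main__":
    read = [10, 30]
    print(round(mean_read_quality(read), 2))  # probability-based average
    print(sum(read) / len(read))              # naive arithmetic mean of Phred scores
```

For a read with two bases at Q10 and Q30, the naive arithmetic mean is 20, while the probability-based average is roughly 12.97, because the low-quality base dominates the expected error rate. That is the difference from other tools mentioned above.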

Hope that helps!

Wouter

katievigil commented 6 months ago

Hi @wdecoster, is there a way to get the range of reads min-max in the NanoStats output? I only see these stats:

General summary:
Mean read length: 282.7
Mean read quality: 9.9
Median read length: 249.0
Median read quality: 10.6
Number of reads: 1,518,294.0
Read length N50: 254.0
STDEV read length: 137.7
Total bases: 429,214,286.0

Number, percentage and megabases of reads above quality cutoffs
Q5: 1518292 (100.0%) 429.2Mb
Q7: 1516705 (99.9%) 428.8Mb
Q10: 994193 (65.5%) 277.6Mb
Q12: 294118 (19.4%) 80.5Mb
Q15: 12671 (0.8%) 3.7Mb

Top 5 highest mean basecall quality scores and their read lengths
1: 38.0 (1)
2: 36.0 (1)
3: 34.0 (1)
4: 34.0 (1)
5: 33.0 (1)

Top 5 longest reads and their mean basecall quality score
1: 18248 (11.4)
2: 14380 (10.9)
3: 7936 (9.1)
4: 7587 (9.1)
5: 7557 (9.5)

wdecoster commented 6 months ago

This doesn't seem related to the initial question here. Opening a separate issue would be more appropriate if that is the case. I also don't understand what you are missing from the summary - do you mean you want to know the shortest and longest read in the experiment?