Read quality 20 with low percentage of identity

Liukvr commented 1 year ago

Dear NanoPlot developer, I'm using NanoPlot to assess the percent read identity of an ONT plant sequencing run generated on a P2 instrument. Looking at the TSV output from NanoPlot, we noticed some reads with an average read quality greater than 20 (for which a read identity of around 99% would be expected) whose percent identity is far below 99%.

Here are some examples:

239509bc-8a13-411f-8c1d-9b0448e98a58    21.20879        21.478092       91889   73248   1       68.85691
3ab4c17e-cf80-4439-827b-b77c3d952e01    21.44239        21.838873       21769   8510    1       67.87681
1f2ce2c5-632b-4784-8b81-db7a5de56dbd    28.54513        28.556702       56165   56156   20      62.638786
35ba71d8-2952-4924-a4a3-3451909edd27    20.533953       20.673409       30288   30245   47      66.25387
c9cac6d1-d9b2-472f-b0de-6e938f1e1e19    34.925575       35.084343       15144   15018   60      68.29852

Resulting in the following plot:

[Image: photo_2023-03-15_12-12-56]

Have you already faced a situation like this? If so, how did you explain it? Thanks in advance, Luca

wdecoster commented 1 year ago

Hi Luca,

That Q-score is just something the basecaller made up or calculated. It doesn't know the true accuracy of the read. It just thinks, "well, this signal looks pretty decent, so I'll give it a high quality". Based on what you show here, it is not well-calibrated.
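For reference, the accuracy a Phred score claims follows from Q = -10 * log10(P_error). A minimal sketch of that conversion (the function name is only illustrative and is not part of NanoPlot):

```python
def phred_to_accuracy(q: float) -> float:
    """Percent accuracy implied by a Phred quality score Q.

    Uses the standard Phred definition Q = -10 * log10(P_error),
    so P_error = 10 ** (-Q / 10).
    """
    error_probability = 10.0 ** (-q / 10.0)
    return 100.0 * (1.0 - error_probability)


if __name__ == "__main__":
    # Q10, Q20 and Q30 correspond to 90%, 99% and 99.9% expected accuracy.
    for q in (10, 20, 30):
        print(f"Q{q}: {phred_to_accuracy(q):.2f}%")
```

This is only what the basecaller's score implies; whether a read actually reaches that accuracy can only be checked against an alignment.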

Wouter

Liukvr commented 1 year ago

Hi Wouter, thanks for the explanation. The plot was generated from ONT reads basecalled with Guppy v6.3.8. From a naive point of view, I did not expect so many reads with such a poor correlation between quality and identity. From your experience, is this a typical identity plot for ONT reads? Have you already faced a situation where the Q value turned out to be overestimated by the basecaller? Thanks in advance, Luca

wdecoster commented 1 year ago

It seems most of your reads are at the expected accuracies, looking at the top histogram.

It would presumably be more informative to convert those empirical percent identities to the Phred scale, and plot the accuracy "according to the basecaller" vs "according to the aligner". Note that structural variants may also affect the reference identity, which is an argument for using a gap-compressed reference identity (https://lh3.github.io/2018/11/25/on-the-definition-of-sequence-identity).
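A rough sketch of both ideas, under some assumptions: the function names are hypothetical, the cap of Q60 for perfect alignments is an arbitrary choice, and `nm` is assumed to be the edit distance from the SAM `NM` tag. The gap-compressed identity counts every insertion or deletion as a single difference regardless of its length. This is not how NanoPlot computes its values.

```python
import math
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def identity_to_phred(percent_identity: float, cap: float = 60.0) -> float:
    """Convert an empirical percent identity to the Phred scale.

    The cap (an arbitrary assumption) avoids log10(0) for perfect reads.
    """
    error = 1.0 - percent_identity / 100.0
    if error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(error))


def identities_from_cigar(cigar: str, nm: int) -> tuple[float, float]:
    """Return (BLAST-like identity, gap-compressed identity) in percent.

    `nm` is assumed to count mismatches plus inserted and deleted bases.
    """
    aligned = inserted = deleted = gap_opens = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":      # columns where read and reference are aligned
            aligned += n
        elif op == "I":      # insertion relative to the reference
            inserted += n
            gap_opens += 1
        elif op == "D":      # deletion relative to the reference
            deleted += n
            gap_opens += 1
    mismatches = nm - inserted - deleted
    matches = aligned - mismatches
    blast = matches / (aligned + inserted + deleted)
    gap_compressed = matches / (aligned + gap_opens)
    return 100.0 * blast, 100.0 * gap_compressed


if __name__ == "__main__":
    # Hypothetical alignment: 200 aligned bases, 4 mismatches, one 50 bp deletion.
    blast, gap_compressed = identities_from_cigar("100M50D100M", nm=54)
    print(f"BLAST identity: {blast:.1f}% (Q{identity_to_phred(blast):.1f})")
    print(f"Gap-compressed: {gap_compressed:.1f}% (Q{identity_to_phred(gap_compressed):.1f})")
```

In this made-up example, a single 50 bp deletion pulls the BLAST-like identity down to 78.4%, while the gap-compressed identity stays near 97.5%. A long indel or structural variant can therefore make an otherwise accurate read look poor.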

wdecoster / NanoPlot

Read quality 20 with low percentage of identity #320