wdecoster / NanoPlot

Plotting scripts for long read sequencing data
http://nanoplot.bioinf.be
MIT License
403 stars 48 forks source link

Mean read quality calculation #376

Open alvanuffelen opened 2 weeks ago

alvanuffelen commented 2 weeks ago

The mean read quality is calculated by:

  1. Converting the mean phred quality score of each read to the base-calling error probability
  2. Calculating the mean of all the mean error probabilities from step 1
  3. Converting the total mean error probability from step 2 back to a phred quality score

In Nanomath, for step 1, the mean phred score of each read is interpreted as an int instead of a float.

Does this conversion not introduce an error in the mean read quality? If I have a read with a mean quality of 16.8, the script converts it to 16 and the probability becomes 0.02512. But according to the formula $P = 10^{-\frac{10}{Q}}$, it should be 0.02089.

Using this test file (columns are read id, length and mean quality) and without converting to int:

import pandas as pd
import numpy as np
from math import log 

df = pd.read_table("test.txt",
                   header=None,
                   names=['id', 'length', 'quality'])

convert_to_probs = lambda q: 10 ** (-q/10)
vfunc = np.vectorize(convert_to_probs)
probs = vfunc(df['quality'])
-10 * log(probs.sum() / len(probs), 10)

The mean read quality 13.22437. If I convert the scores to int:

probs = vfunc(df['quality'].astype(int))
-10 * log(probs.sum() / len(probs), 10)

The mean read quality is 12.71993, the same as Nanoplot reports (see attached file). NanoStats.txt

Is there a reason to convert the mean score of each read to int before calculating the probabilities?

wdecoster commented 2 weeks ago

I think you are right and this is an error! I'll do my best to fix this soon, or you are welcome to open a pull request :)