The mean read quality is calculated by converting the phred score of each read to an error probability, averaging those probabilities, and converting the average back to a phred score. In Nanomath, for step 1, the mean phred score of each read is interpreted as an `int` instead of a `float`.

Does this conversion not introduce an error in the mean read quality? If I have a read with a mean quality of 16.8, the script converts it to 16 and the probability becomes 0.02512. But according to the formula $P = 10^{-\frac{Q}{10}}$, it should be 0.02089.
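The single-read numbers can be checked directly (a minimal sketch, using only the formula above):

```python
# Phred score Q -> error probability: P = 10^(-Q/10)
phred_to_prob = lambda q: 10 ** (-q / 10)

print(round(phred_to_prob(16.8), 5))       # 0.02089, using the float score
print(round(phred_to_prob(int(16.8)), 5))  # 0.02512, after truncating to 16
```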
Using this test file (columns are read id, length and mean quality) and without converting to `int`:
```python
import pandas as pd
import numpy as np
from math import log

# Columns: read id, read length, mean phred quality
df = pd.read_table("test.txt",
                   header=None,
                   names=['id', 'length', 'quality'])

# Step 1: convert each phred score Q to an error probability P = 10^(-Q/10)
convert_to_probs = lambda q: 10 ** (-q / 10)
vfunc = np.vectorize(convert_to_probs)
probs = vfunc(df['quality'])

# Steps 2-3: average the probabilities and convert back to a phred score
-10 * log(probs.sum() / len(probs), 10)
```
The mean read quality is 13.22437.

If I convert the scores to `int`:
The mean read quality is 12.71993, the same as NanoPlot reports (see the attached NanoStats.txt).

Is there a reason to convert the mean score of each read to `int` before calculating the probabilities?
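For what it's worth, the direction of the bias can be shown without the test file. A self-contained sketch with made-up quality scores (not the attached data):

```python
import numpy as np

# Hypothetical per-read mean phred scores, for illustration only
qualities = np.array([16.8, 12.3, 9.9, 21.4, 14.6])

def mean_quality(scores):
    # phred -> probability, average the probabilities, convert back to phred
    probs = 10 ** (-scores / 10)
    return -10 * np.log10(probs.mean())

float_q = mean_quality(qualities)
int_q = mean_quality(np.floor(qualities))  # floor == int() truncation for positive scores

# Truncating lowers every score, raises every error probability,
# and therefore always lowers the reported mean quality.
print(float_q, int_q)
```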