Closed: GeertvanGeest closed this issue 4 years ago
No, that doesn't really make sense to me. There can be a bit of a difference, but this is rather large. For example, you can average quality scores in two ways (see blog post). What are the numbers in the NanoStats.txt file?
Thanks for the quick reply. Below is the content of the txt file, and they seem to correspond to the length-vs-quality plot. It's public data, so I can share the fastq if necessary.
General summary:
Mean read length: 6,003.3
Mean read quality: 7.3
Median read length: 6,454.0
Median read quality: 7.3
Number of reads: 3,735.0
Read length N50: 6,473.0
Total bases: 22,422,295.0
Number, percentage and megabases of reads above quality cutoffs
>Q5: 3657 (97.9%) 22.0Mb
>Q7: 2312 (61.9%) 14.1Mb
>Q10: 0 (0.0%) 0.0Mb
>Q12: 0 (0.0%) 0.0Mb
>Q15: 0 (0.0%) 0.0Mb
Top 5 highest mean basecall quality scores and their read lengths
1: 9.9 (4993)
2: 9.7 (6604)
3: 9.7 (6445)
4: 9.7 (6586)
5: 9.7 (6548)
Top 5 longest reads and their mean basecall quality score
1: 13102 (5.7)
2: 10779 (8.2)
3: 9564 (7.0)
4: 9532 (7.7)
5: 9436 (5.9)
Yes please, I'd like to take a look at those reads.
Here you go: cerebellum-5238-batch2.fastq.gz
Hi,
Thanks for your time. I just averaged the quality scores in that file in two ways.
Python code:
import gzip
import math
from Bio import SeqIO
import matplotlib.pyplot as plt

def ave_qual_good(quals):
    '''
    Receive the integer quality scores of a read and return the average quality for that read.
    First convert the Phred scores to error probabilities, average the probabilities,
    then convert the average back to the Phred scale.
    '''
    return -10 * math.log(sum([10**(q / -10) for q in quals]) / len(quals), 10)

def ave_qual_bad(quals):
    '''
    Receive the integer quality scores of a read and return the average quality for that read
    by taking the plain arithmetic mean of the Phred scores.
    '''
    return sum(quals) / len(quals)

badQ = [ave_qual_bad(read.letter_annotations["phred_quality"])
        for read in SeqIO.parse(gzip.open("cerebellum-5238-batch2.fastq.gz", 'rt'), "fastq")]
goodQ = [ave_qual_good(read.letter_annotations["phred_quality"])
         for read in SeqIO.parse(gzip.open("cerebellum-5238-batch2.fastq.gz", 'rt'), "fastq")]

plt.scatter(badQ, goodQ)
plt.xlabel("How not to calculate average quality scores")
plt.ylabel("Good method to average quality scores")
plt.show()
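To get a concrete sense of the gap, here is a quick self-contained illustration with toy Phred values (hypothetical numbers, not taken from this dataset):

```python
import math

def ave_qual_good(quals):
    # Convert Phred scores to error probabilities, average them,
    # then convert the average back to the Phred scale.
    return -10 * math.log(sum(10**(q / -10) for q in quals) / len(quals), 10)

quals = [5, 15]
arithmetic = sum(quals) / len(quals)  # 10.0
prob_based = ave_qual_good(quals)     # ~7.6, dominated by the worst score
print(arithmetic, prob_based)
```

Because -10*log10(p) is a convex function of the error probability p, the probability-based average can never exceed the arithmetic mean of the Phred scores, and a single low-quality stretch drags it down sharply.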
This results in the scatter plot below. Correlation between the two calculations is okay, but the "badly calculated" scores (seemingly what FastQC reports) are always much higher than the "correctly calculated" scores (what NanoPlot shows). I didn't expect the difference to be that large, but this explains the discrepancy you see.
Cheers, Wouter
Thanks a lot for your clear answer, highly appreciated!
Happy to help, please let me know if there is anything else.
Hi, I've run NanoPlot and FastQC on the same fastq file, and they seem to give me different read quality scores.
Here's the log file: cerebellum-5238-batch2_NanoPlot_20200722_1305.log
Does this make sense? Thanks in advance.