wdecoster / NanoPlot

Plotting scripts for long read sequencing data
http://nanoplot.bioinf.be
MIT License

Different quality scores between NanoPlot and FastQC #191

Closed GeertvanGeest closed 4 years ago

GeertvanGeest commented 4 years ago

Hi, I've run NanoPlot and FastQC on the same fastq file, and they seem to give me different read quality scores.

[screenshots: NanoPlot and FastQC read quality summaries for the same fastq file]

Here's the log file: cerebellum-5238-batch2_NanoPlot_20200722_1305.log

Does this make sense? Thanks in advance.

wdecoster commented 4 years ago

No, that doesn't really make sense to me. There can be a bit of a difference, but this is rather large. For example, you can average quality scores in two ways (see blog post). What are the numbers in the NanoStats.txt file?
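For concreteness, a minimal sketch of the two approaches, using a made-up list of Phred scores rather than values from this dataset; the script run on the actual reads appears later in the thread.

import math

# Toy per-base Phred scores for one hypothetical read
quals = [30, 30, 30, 3]

# Naive way: arithmetic mean of the Phred scores themselves
naive = sum(quals) / len(quals)  # 23.25

# Probability-based way: convert to error probabilities, average, convert back to Phred
prob_based = -10 * math.log(sum(10**(q / -10) for q in quals) / len(quals), 10)  # ~9.0

print(naive, prob_based)

A single low-quality base pulls the probability-based average down sharply, while the naive mean barely notices it.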

GeertvanGeest commented 4 years ago

Thanks for the quick reply. Below is the content of the txt file; the values seem to correspond to the length-vs-quality plot. It's public data, so I can share the fastq if necessary.

General summary:        
Mean read length:              6,003.3
Mean read quality:                 7.3
Median read length:            6,454.0
Median read quality:               7.3
Number of reads:               3,735.0
Read length N50:               6,473.0
Total bases:              22,422,295.0
Number, percentage and megabases of reads above quality cutoffs
>Q5:    3657 (97.9%) 22.0Mb
>Q7:    2312 (61.9%) 14.1Mb
>Q10:   0 (0.0%) 0.0Mb
>Q12:   0 (0.0%) 0.0Mb
>Q15:   0 (0.0%) 0.0Mb
Top 5 highest mean basecall quality scores and their read lengths
1:  9.9 (4993)
2:  9.7 (6604)
3:  9.7 (6445)
4:  9.7 (6586)
5:  9.7 (6548)
Top 5 longest reads and their mean basecall quality score
1:  13102 (5.7)
2:  10779 (8.2)
3:  9564 (7.0)
4:  9532 (7.7)
5:  9436 (5.9)
wdecoster commented 4 years ago

Yes please, I'd like to take a look at those reads.

GeertvanGeest commented 4 years ago

Here you go: cerebellum-5238-batch2.fastq.gz

wdecoster commented 4 years ago

Hi,

Thanks for your time. I just averaged the quality scores in that file in two ways.

Python code:

import gzip
import math
from Bio import SeqIO
import matplotlib.pyplot as plt

def ave_qual_good(quals):
    '''
    Receive the integer quality scores of a read and return the average quality for that read
    First convert Phred scores to probabilities, calculate average error probability and convert average back to Phred scale
    '''
    return -10*math.log(sum([10**(q/-10) for q in quals]) / len(quals), 10)

def ave_qual_bad(quals):
    '''
    Receive the integer quality scores of a read and return the plain arithmetic
    mean of the Phred scores (the naive way of averaging)
    '''
    return sum(quals) / len(quals)

badQ = [ave_qual_bad(read.letter_annotations["phred_quality"])
        for read in SeqIO.parse(gzip.open("cerebellum-5238-batch2.fastq.bgz", 'rt'), "fastq")]
goodQ = [ave_qual_good(read.letter_annotations["phred_quality"])
         for read in SeqIO.parse(gzip.open("cerebellum-5238-batch2.fastq.bgz", 'rt'), "fastq")]
plt.scatter(badQ, goodQ)
plt.xlabel("How not to calculate average quality scores")
plt.ylabel("Good method to average quality scores")
plt.show()

This results in the scatter plot below. The correlation between the two calculations is reasonable, but the "badly calculated" scores (which appear to match what FastQC reports) are consistently much higher than the "correctly calculated" scores (which NanoPlot reports). I didn't expect the difference to be that large, but this explains the discrepancy you see.
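As an aside on why the naive number is always the higher of the two: averaging on the Phred (log) scale amounts to taking a geometric mean of the per-base error probabilities, and a geometric mean never exceeds the arithmetic mean, so converting the true mean error probability back to Phred always yields a lower (or equal) score. A quick sanity check of that inequality on randomly generated quality strings (a sketch, not part of the script above):

import math
import random

for _ in range(1000):
    quals = [random.randint(1, 40) for _ in range(100)]
    # Probability-based average: mean error probability converted back to Phred
    good = -10 * math.log(sum(10**(q / -10) for q in quals) / len(quals), 10)
    # Naive average: arithmetic mean of the Phred scores
    bad = sum(quals) / len(quals)
    assert good <= bad + 1e-9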

Cheers, Wouter

[scatter plot: naive mean Phred score per read (x) vs. probability-based average quality per read (y)]

GeertvanGeest commented 4 years ago

Thanks a lot for your clear answer, highly appreciated!

wdecoster commented 4 years ago

Happy to help, please let me know if there is anything else.