wdecoster / NanoPlot

Plotting scripts for long read sequencing data
http://nanoplot.bioinf.be
MIT License

Datas difference with fastq files or sequencing summary #190

Closed by pgtb 4 years ago

pgtb commented 4 years ago

Hi Wouter,

We are using NanoPlot 1.27.0.

I had never checked before, but I've just noticed that the reported numbers differ depending on whether the input is the sequencing summary or the fastq file. The differences can be substantial, especially for the mean and median Q scores.

Example attached (from crappy two-year-old fast5 files, basecalled with the new Guppy 4.0.11; I see the same or bigger differences with modern fast5 files):

NanoPlot_Comparison_SeqSum_vs_FastQ.xlsx

Thanx

best

wdecoster commented 4 years ago

Yes, quality scores calculated from the fastq are not exactly the same as what Guppy writes in the sequencing_summary file. A long time ago I corrected how I calculate the average quality score, after communication with @fbrennen (see this blog post: https://gigabaseorgigabyte.wordpress.com/2017/06/26/averaging-basecall-quality-scores-the-right-way/). He might have some insight into why there is still a residual difference between the Guppy sequencing summary and the average quality calculated from the fastq...
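To illustrate the averaging point: Phred scores are logarithmic, so a mean is usually taken over the per-base error probabilities rather than over the scores themselves. The following is a minimal sketch of that idea, not NanoPlot's actual code; the function names and the Phred+33 offset are assumptions made for the example.

```python
import math


def naive_mean_qscore(quality_string: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores (overestimates read quality)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def mean_qscore(quality_string: str, offset: int = 33) -> float:
    """Mean quality computed in error-probability space.

    Each Phred score Q is converted to an error probability 10^(-Q/10),
    the probabilities are averaged, and the mean is converted back to Phred.
    """
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


if __name__ == "__main__":
    # '5' encodes Q20 and '+' encodes Q10 under the assumed Phred+33 offset.
    print(naive_mean_qscore("5+"))
    print(mean_qscore("5+"))
```

For a two-base read with Q20 and Q10, the naive mean is 15, while the probability-based mean is about 12.6, because the low-quality base dominates the average error.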

Cheers, Wouter

pgtb commented 4 years ago

OK

Thank you Wouter

Have a nice summer !

--
Christophe Boury
Assistant Ingénieur (Lab Geek)
Plateforme Genome Transcriptome de Bordeaux
INRAE - UMR 1202 BIOGECO
69 route d'Arcachon, 33612 Cestas - Pierroton, FRANCE

05.35.38.53.36
christophe.boury@inrae.fr - pgtb@inrae.fr

https://pgtb.cgfb.u-bordeaux.fr/ https://twitter.com/pgt_bordeaux


phpeters commented 3 years ago

Hej Wouter,

Thanks a lot for your work! Did you get any closer to solving the mystery of why the quality scores differ? And which output is "more true": the one from the fastq or the one from the summary file?

Thanks and best! Philipp

wdecoster commented 3 years ago

No, but I'll try tagging @fbrennen again :-) Both are a proxy for the true accuracy of these nucleotides; you can get a better idea of that by aligning reads to a reference genome and looking at the percent identity. I don't think the difference is large enough for it to matter whether you trust the fastq or the summary file.
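As a rough illustration of the percent-identity idea, here is a minimal sketch that computes identity for one aligned read from its CIGAR string and its NM (edit distance) value. It assumes one particular definition, (aligned columns - NM) / aligned columns, where other definitions exist; the function name is hypothetical and is not part of NanoPlot.

```python
import re

# CIGAR operations that occupy a column in the alignment.
ALIGNED_OPS = {"M", "=", "X", "I", "D"}
CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")


def percent_identity(cigar: str, edit_distance: int) -> float:
    """Percent identity of one alignment, counting gaps as differences.

    `edit_distance` is the value of the SAM NM tag: mismatches plus
    inserted and deleted bases.
    """
    columns = sum(
        int(length)
        for length, op in CIGAR_PATTERN.findall(cigar)
        if op in ALIGNED_OPS
    )
    if columns == 0:
        raise ValueError("CIGAR string contains no aligned columns")
    return 100.0 * (columns - edit_distance) / columns


if __name__ == "__main__":
    # 5 matches, 1 inserted base, 4 matches; NM = 1 (the insertion).
    print(percent_identity("5M1I4M", 1))
```

Soft-clipped and hard-clipped bases are ignored here, so the value describes only the aligned part of the read.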

phpeters commented 3 years ago

Haha, thanks! Indeed, the difference is not that big; in my test sets it was close to 1 Q-score point for the mean read quality. I will probably switch to using the summary then: it is quicker and gives a slightly less optimistic view of the data. Thanks again!

fbrennen commented 3 years ago

Hi @wdecoster @phpeters -- the mean_qscore_template value you see in the sequencing summary file is taken directly from error estimates produced by the basecaller, and is the most accurate representation we have. The qstrings in your fastq files have rounding applied to convert raw errors to integer qscores (as you can't have a base in a fastq file with a qscore of, say, 7.5), and the accumulation of those rounding errors is what produces the disparity you're seeing.

We could make the two numbers match better but we would be throwing away information if we did that, which is why we haven't. If this is something important to you then drop by the Community and make a feature request -- it would not be too hard to do.
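The rounding effect described above can be sketched as follows. This is only an illustration under stated assumptions: the raw per-base Q-scores are made-up values, rounding is assumed to be round-to-nearest (the basecaller's actual rounding is not specified in this thread), and the helper names are hypothetical.

```python
import math


def q_to_prob(q: float) -> float:
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-q / 10)


def mean_q_from_probs(probs: list[float]) -> float:
    """Average error probabilities and convert the mean back to Phred."""
    return -10 * math.log10(sum(probs) / len(probs))


def compare_rounding(raw_qscores: list[float]) -> tuple[float, float]:
    """Return (mean Q from raw values, mean Q from integer-rounded values)."""
    raw_probs = [q_to_prob(q) for q in raw_qscores]
    # A fastq quality string can only store integer scores per base.
    rounded_probs = [q_to_prob(round(q)) for q in raw_qscores]
    return mean_q_from_probs(raw_probs), mean_q_from_probs(rounded_probs)


if __name__ == "__main__":
    # Made-up raw per-base qualities, for illustration only.
    from_raw, from_rounded = compare_rounding([7.4, 12.6, 20.2, 3.3])
    print(f"from raw errors:     {from_raw:.3f}")
    print(f"from rounded scores: {from_rounded:.3f}")
```

With round-to-nearest, each base moves by at most 0.5 Q, so in this toy model the two means can differ by at most 0.5 Q; the model is not meant to reproduce the exact size of the gap reported in this thread.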

wdecoster commented 3 years ago

Thanks for the clarification!

phpeters commented 3 years ago

Thanks for the clarification, @fbrennen! So, using the summary file for the NanoPlot plots would be the most accurate here. Thanks, you two! Philipp