Closed schorlton closed 1 year ago
Hi,
I think this is more a comment than a question, as you have already demonstrated that it is only calculated adequately within reads, and not across reads. Is this then a suggestion to change things?
Cheers, Wouter
If possible? I recognize you're potentially working on cramino as a successor but as far as I can tell, that is not compatible with FASTQs, and this is a moderately confusing bug given the previous issues and blog post on how to correctly average Phred scores. Thank you for all of your tools and effort!
It has taken a while, but nanomath v1.3.0 now has the requested change.
Cheers, Wouter
Thank you!! Really appreciate it.
Out of curiosity, how do you push updates to PyPI/Bioconda? I don't see it tagged in GitHub, so do you have to trigger a manual push to PyPI? I assume Bioconda tracks the PyPI version but don't see it pushed yet.
Thanks again!
Ah, pushing to PyPI is on the command line, using twine (https://gigabaseorgigabyte.wordpress.com/2017/08/03/a-bash-alias-for-easier-uploading-a-project-to-pypi/). Usually, the biocondabot picks up that PyPI is updated, and then opens a PR on bioconda with the change. Let's give it some time :) I will do it manually if necessary, and feel free to remind me if it looks like I'm forgetting this.
This has made its way into bioconda today. Thank you again for the fix - closing!
Hi @wdecoster , not sure if this is actually fixed? I'm looking at the median. My thinking is median read quality should be a median of medians. Does this make sense? See python logic.
docker run --rm -it -v $(pwd):$(pwd) -w $(pwd) mambaorg/micromamba
micromamba install -y nanostat 'nanomath>=1.3.0' -c conda-forge -c bioconda -c defaults
cat test.fastq
@03c90146-89b4-4e0f-aacb-29f6434c78a9
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
FDFGFEGHGHFDB@@AA72@/IHCBBBBEDEE??RJFGC@@;;;=GHDED<829--89HHIKKFGFHIKIIFFFFHN00000CFFJMJ@;JDEAAABCCBBBDFJHIHGEEEGLN{PLDDCCDHHDEEEOKEB752/-,+++++,./.----..%%&&&'&&&''(--544IHMGAAA@AEGGGPNHHGF@6=/,,,,2/))&%%%%%&)'('''(-78EG???=;=:=><8==;?><96/,)(&&'*'%$%'&%$$&%'&''++*****))(.-.022346709:;<1)((,1155?B@@@:-,++****+++++,9556;=210*&&&&&('(()*0///0+))),=?=:;443207-4555:732222:;;:?@:57<<A?>=,,,,1**'''()+*,--0:8763-,,//8AAA?A=(''''''('''''9A977*)))+77767:;>AA??B7666685-././;;753-+,-0111.--,(*56730*)+((((,*06-76443532...+),+++,6978,+(('''''2005.34481.***//.**()*&%%'+:=8543324.29IHB@?<32234=11-)$##$&'0=>=A852200000ACMRROQ?89('((()(&'200.--*)))('''')'&'''%&&%%&&%&(')**))('&&''&
@391f2d96-ea70-4665-8c4a-21b185c41525
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
3===RONT{{QKLIFGHEH=::::CAAA@BBA@CCC....00002::;IBCHKJIIHHHEGGGHHGIJGFFFFMKKPILMLGGGGGIMNMEB@CBCDEKNMQQSJJMT{UKOKLLEEDEELGKIJKPEBA==>??@@@HD@?@AH?>?>?IIKIPFFEEFFKIEDDCEHFFHIHIKMLPOHQOH
@7c333852-7b92-44d0-a532-f241c478ed97
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
"#%$$$$$$')++**'%%%%''&%$$$$&+++-.54542/-)&&&$$$$%&%'0-)&'''(&%%&&,,..,+--))+***+.*+)('&$$%&'(&$$$$%%$'))746?CEFDB=<;BDCCDCB@BAACCCDDEFHHJJHGGHJR{KIHHBBBBEGHLLEQHFCSLNQNOPCBAAAGFEFEFGEGIFLLJJLLQKLJIJOIIDBAAAAHHHEGQM<7640/-.?@EJ?;;<;IKKHF<9763)(&')****++BAA>>:7/4.+)&&'''(,*%%%%%57AC;AABB33333<<=C-,,,,IHGIFFFFELIKHIGHC@A@ADDDDEGHJHGGIIPEMMHGHFR{NJNHFGFEFFHIGHIJHIKHIKPQKLIRKOPOLNQU{IIGGFFGFFEFDEFDCEDEEEGFGFFDBIJD>====C000030/2('''1DEFMJGCDDEEDDFEKPPOKIIK<;;;;GHGGKQ{MF@6455?EDDFHJLIIGGHH32226BCDIJNLDJA=<<=?DFLI@=;@DJHHGGDB42*((),)**,,,,,;>CDGBCKRKMGEBA<<3,+.:C><<<<DKFHEEDDDEG;321113;NPTOLNLOMILJIKLIIJIHIJHJJIHGGIIJOKLLROFDEEDHHDCDDDHGDGEGHHCC@@960,,+&'&''(:;?GHHCILMKKHGHECCDDEJJIHEDBB,+++**&$%%+78=BBCCDDIIHJJKJGFGDEEDEEFIIKQLRL@@@@>;999:IHLMFDDCDD@G{KJGFFEDFHFEEDEFACBEFEHOIAAFGFDCCCECEEEFFEEC:956?E
@f52f5b20-8ca0-4d70-b034-7fdc9e4a4781
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
===>?HIGJMXOKLIGFFIMDDDCDIHH76666666JKKJKKJKIKLRLIKILLQBN{KJIHKLO??>;;;<=AAAA@@HJGGBAAABIEKKJGHHGKOIKMJKOMIKLKHIHFJJKJLJKGIHKKIGFEEE***=@@A{KMNNQQNKIQ{KQKQNLIIJKNOKJKHIIHHGFJHGIGFDDDAB@DDECDCCB<:1--2223CDGFHHLNNKLJJKLLNJHJJNGGECDDEJKIIIFEJOM{8881****75:OIGSJP{{NLP{UMJL{TQ{SMJHIJIDEDCEOIMOR{H:::::7,(''(.//222;=CLIKKHGGHJIIHFEFGEGIHGEDDDFHIKH
@e8b38fb1-f33e-4f85-9eb6-6f2544ac1a88
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
%$&&&&%%(((*4++))'%$$$(.+*('(&%%%''(',******%%%%&&'%%$###$##$$$%%$$%%$$$$'))%%%$$$&%%(((/((((((%$$$%%%%%$%)..,++(((%&(&'))++)%%%%'&&%%)'&%%%%%&$%&&'',*%%%$$)'((%%&&%'&%'()(%*('&&'%(&&)&&$##$$#$$##&%'&&(''''''&'&%%%%$%'*'&&%%&%&&&&$$&&&))()'''(&%$###%&'&&$%&&(,,86556779;?>AJGECDDHLLPLKHGEFEGIGKBAAABHHGHINII===:)('$$$&(('''''.48??DBA:5666683333?BDDCBDCBBBCEBBABB)(77:29=<<=A:54421444HGFEEDFECCBCCCCCGJGGJ?B0///4899BBAABCBBD3221000()*??=AEG????AF<;:99=<@@@CDFGHGC@??>ABFHK;:;;:::;666CCDCFGFEEFHHRMIFGFDBA@>>>??BEEEGIMC????FFHGDCCCBB=;;;:<<=OPKJGCDDEKHYLC>>==?BCEDBBA?ADFEADDHH>;;;;@BABCEDB@??=<<89<?@?==<<<:9850568?FFHFEFEHGFDDDDDCAA@=0+--/0=<;;;<KLMIKDDINQQKHDCDABCDLJIFECDDC7JICBBBBHFHEED****+0.&&'+../1122225BBDDCDCCDDBBD,++*)(()?;76:>44225;<CAA@@AD32222221600334=DFGFFGFAI9;;;:997;;:=)(((1/-----/DCCCEEEDEIGIKHGFFIL{GJNNLOKJE?989?<ABHCAAAAFGHGFJHIEGGGFCFHDIHKGFFEHG733@@?/......----99:;AADQQLJKJHJJKJIKNLOMNGNNLL;;;;;NJI+)))*.,+,-110+-0.0001BCCBCKPJKKP{JKH---,,++129;;;FGLJGIED<;..*(''',-=ACBGKIJUVWNOKNLNKJKJJJIHIM5544<;977767;KLNIFEGGBBCBCKID:99844>CLJMKLNL{LONQPKTMLKOLRMLLKJLNJLMNN9766:CEKHIGHJKMIJD97,,,,)&&%(.()../@DEDFEH>A::;HDC...**+,-8BKJmRILKNSHIRQSRQMKIJLIQOMQHLGIJNA7,,GJIKHCQUPNOGMJOJKIIL?@@@MUNMIVKIIO{{QKOORL{POMMQMMGCBCCEKJJQKNWN{QMQKNMMOJNMAAB@?2110000/5557<;;433*+AFJQMKJHGFDD@;;3.....'$$$$&####$)()(''',=ACCAA?899:IJMOMPJIAAA@A<65311666A=56ECDBA@><<<<>A?@GFHFB
NanoStat --fastq test.fastq
General summary:
Mean read length: 677.4
Mean read quality: 12.6
Median read length: 674.0
Median read quality: 12.7
Number of reads: 5.0
Read length N50: 805.0
STDEV read length: 416.8
Total bases: 3,387.0
Number, percentage and megabases of reads above quality cutoffs
>Q5: 5 (100.0%) 0.0Mb
>Q7: 5 (100.0%) 0.0Mb
>Q10: 5 (100.0%) 0.0Mb
>Q12: 3 (60.0%) 0.0Mb
>Q15: 2 (40.0%) 0.0Mb
Top 5 highest mean basecall quality scores and their read lengths
1: 26.3 (184)
2: 21.4 (342)
3: 12.7 (805)
4: 11.0 (1382)
5: 10.8 (674)
Top 5 longest reads and their mean basecall quality score
1: 1382 (11.0)
2: 805 (12.7)
3: 674 (10.8)
4: 342 (21.4)
5: 184 (26.3
#! /usr/bin/env python
from statistics import median
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import math
def decode(c):
return ord(c) - 33
def phred_to_probability(phred_score):
return 10**(-phred_score/10)
def probability_to_phred(error_probability):
return -10 * math.log10(error_probability)
def raw_q_to_probability(q):
return phred_to_probability(decode(q))
def calc_stats(fastq):
median_read_errors = []
median_read_quals = []
with open(fastq) as in_handle:
for _title, _seq, qual in FastqGeneralIterator(in_handle):
error_probabilities = list(map(raw_q_to_probability, qual))
median_read_error = median(error_probabilities)
median_read_errors.append(median_read_error)
print(f"Median of median read errors: {probability_to_phred(median(median_read_errors))}")
calc_stats("test.fastq")
python calc_stats.py
Median of median read errors: 35.0
Hmmm, I expect it to do what you expect, but I have to confirm. Median Q of 35 seems unreasonably high, though.
You are correct. Taking median of the base error probabilities does not reflect the overall error probability for the read; average is more appropriate. I believe NanoStat performs this correctly and closing with an example:
Thanks for your help!
As per title.
Manual implementation:
NanoStat:
Thank you!!