Closed by rohank63 3 years ago
Could you please post the numbers? We'd probably need a few more organisms like worms (Caenorhabditis elegans) and rats (Rattus norvegicus) and possibly some primates. But would be good to see the numbers for all organisms.
Homo sapiens 2118481
Macaca mulatta 25179
Pan troglodytes 15700
Macaca fascicularis 6561
Mus musculus 904243
Rattus norvegicus 44925
Salmonella enterica 325039
Escherichia coli 174248
Streptococcus pneumoniae 85209
Mycobacterium tuberculosis 78434
Staphylococcus aureus 72084
Campylobacter jejuni 49654
Listeria monocytogenes 41560
Streptococcus pyogenes 30877
Hordeum vulgare 121866
Zea mays 86502
Arabidopsis thaliana 71759
Oryza sativa 69719
Lolium perenne 45598
Plasmodium falciparum 121323
Saccharomyces cerevisiae 107829
Schizosaccharomyces pombe 10001
Danio rerio 85877
Sus scrofa 55010
Bos taurus 46958
Gallus gallus 24341
Drosophila melanogaster 66176
Caenorhabditis elegans 29076
Pristionchus pacificus 1182
Hi @rohank63, thanks a lot! Since the results are not quite as close to what we had hoped for, I have redone the analysis a bit more rigorously. Here is what I did:
# download SRA SQLite database
wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz
# start SQLite shell
sqlite3 SRAmetadb.sqlite
# in SQLite shell
# get organism for each run
.mode csv
.output sra_orgs.csv
SELECT DISTINCT run_accession, taxon_id, platform, instrument_model, library_strategy FROM sra;
.quit
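# remove runs that have more than one entry
cut -f1 -d"," sra_orgs.csv | sort | uniq -c | sort -k1,1rn | awk '$1 > 1 {print $2}' > sra_orgs_duplicates
# -F/-w: match accessions as fixed whole words, so that e.g. SRR12 does not also remove SRR123
grep -vwFf sra_orgs_duplicates sra_orgs.csv > sra_orgs_no_dupes.csv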
# select only Illumina samples
awk 'BEGIN {FS=","} $3 == "ILLUMINA"' sra_orgs_no_dupes.csv > sra_orgs_illumina.csv
# select only relevant protocols
egrep "RNA-Seq|RIP-Seq|miRNA-Seq|ncRNA-Seq" sra_orgs_illumina.csv > sra_orgs_rna_seq.csv
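# count runs per taxon ID and rearrange results (tab-separated taxon ID and count below a header line; layout assumed from the downstream steps)
cut -f2 -d"," sra_orgs_rna_seq.csv | sort | uniq -c | sort -k1,1rn | awk 'BEGIN {OFS="\t"; print "taxon_id", "counts"} {print $2, $1}' > sra_orgs_count_per_org.tab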
# get list of taxon IDs
tail -n +2 sra_orgs_count_per_org.tab | cut -f1 > sra_orgs_taxon_ids.csv
# convert taxon IDs to organisms: https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi
# save to file "tax_report.txt"
# rearrange taxon ID conversion output
awk '{print $1=$2=$4=$5=$6=""; print $0}' tax_report.txt | sed -e '/^$/d' -e 's/^[ \t]*//;s/[ \t]*$//' -e 's/ \{1,\}/\t/' | tail -n +3 > sra_orgs_taxon_ids_org_names.tab
# start R shell
R
# compute cumulative fraction
tax2org <- read.delim("sra_orgs_taxon_ids_org_names.tab", header=FALSE, row.names=NULL, col.names=c("taxon_id", "org_name"), skip=1)
cts <- read.delim("sra_orgs_count_per_org.tab", header=FALSE, row.names=NULL, col.names=c("taxon_id", "counts"), skip=1)
cts_name <- merge(cts, tax2org, by="taxon_id")
cts_sorted <- cts_name[rev(order(cts_name$counts)), ]
# filter out metagenomes
cts_sorted <- cts_sorted[!grepl("metagenome", cts_sorted$org_name), ]
# TODO: manually filter out unclassified/unidentified/uncultivated and merge duplicates
cts_sorted$fract <- cts_sorted$counts / sum(cts_sorted$counts)
cts_sorted$fract_cum <- cumsum(cts_sorted$fract)
write.table(cts_sorted, "sra_orgs_results.tab", row.names=FALSE, col.names=TRUE, quote=FALSE, sep="\t")
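For reference, the cumulative-fraction step can also be sketched outside of R. The following is a minimal sketch in plain Python, assuming the per-taxon counts have already been loaded into a dictionary; the function names `cumulative_fractions` and `taxa_covering` are made up for illustration and are not part of the pipeline above.

```python
from typing import Dict, List, Tuple


def cumulative_fractions(counts: Dict[str, int]) -> List[Tuple[str, int, float, float]]:
    """Return (taxon_id, count, fraction, cumulative fraction) rows, largest count first."""
    total = sum(counts.values())
    if total == 0:
        return []
    rows = []
    running = 0.0
    for taxon_id, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        fract = count / total
        running += fract
        rows.append((taxon_id, count, fract, running))
    return rows


def taxa_covering(counts: Dict[str, int], threshold: float = 0.9) -> List[str]:
    """Return the top-ranked taxon IDs whose cumulative fraction first reaches the threshold."""
    selected = []
    for taxon_id, _, _, fract_cum in cumulative_fractions(counts):
        selected.append(taxon_id)
        if fract_cum >= threshold:
            break
    return selected
```

Calling `taxa_covering(counts, 0.9)` on the real counts would give the list of taxa that together make up at least 90% of the samples, which is the cut-off used for the listing in this comment.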
As you can see, there's still some manual filtering of unclassified and duplicate organisms to do. Without that filtering, the results look like this (only listing the first 179 entries as, together, they make up >90% of samples):
taxon counts org_name fract fract_cum
9606 511138 Homo sapiens 0.316063566658422 0.316063566658422
10090 189192 Mus musculus 0.116987385604749 0.433050952263171
112509 74210 Hordeum vulgare subsp. vulgare 0.0458879544892407 0.478938906752412
1313 44422 Streptococcus pneumoniae 0.0274684640118724 0.506407370764284
7955 42508 Danio rerio 0.0262849369280237 0.532692307692308
4530 39594 Oryza sativa 0.0244830571357903 0.557175364828098
1280 36741 Staphylococcus aureus 0.0227188968587682 0.579894261686866
562 28533 Escherichia coli 0.0176434578283453 0.597537719515212
32630 28064 synthetic construct 0.0173534504081128 0.614891169923324
4932 24798 Saccharomyces cerevisiae 0.0153339104625278 0.630225080385852
5833 23056 Plasmodium falciparum 0.0142567400445214 0.644481820430373
7165 21685 Anopheles gambiae 0.0134089784813257 0.657890798911699
3702 19959 Arabidopsis thaliana 0.0123417017066535 0.670232500618353
1773 18019 Mycobacterium tuberculosis 0.0111420974523868 0.68137459807074
7227 16666 Drosophila melanogaster 0.0103054662379421 0.691680064308682
487 16027 Neisseria meningitidis 0.00991033885728419 0.701590403165966
6239 13654 Caenorhabditis elegans 0.00844298788028692 0.710033391046253
573 11587 Klebsiella pneumoniae 0.0071648528320554 0.717198243878308
1314 10384 Streptococcus pyogenes 0.00642097452386841 0.723619218402177
485 8330 Neisseria gonorrhoeae 0.00515087806084591 0.728770096463022
205 8070 Campylobacter sp. 0.00499010635666584 0.733760202819688
32644 7680 unidentified 0.00474894880039575 0.738509151620084
197 6855 Campylobacter jejuni 0.00423880781597823 0.742747959436062
1496 6204 Clostridioides difficile 0.00383626020281969 0.746584219638882
28901 6121 Salmonella enterica 0.00378493692802374 0.750369156566906
10116 5888 Rattus norvegicus 0.00364086074697007 0.754010017313876
90370 5827 Salmonella enterica subsp. enterica serovar Typhi 0.0036031412317586 0.757613158545634
9913 5743 Bos taurus 0.00355119960425427 0.761164358149889
90371 5735 Salmonella enterica subsp. enterica serovar Typhimurium 0.00354625278258719 0.764710610932476
4577 5697 Zea mays 0.00352275537966856 0.768233366312144
1352 5651 Enterococcus faecium 0.00349431115508286 0.771727677467227
1311 5384 Streptococcus agalactiae 0.0033292109819441 0.775056888449171
9940 5258 Ovis aries 0.00325129854068761 0.778308186989859
4565 5023 Triticum aestivum 0.00310598565421717 0.781414172644076
9031 4768 Gallus gallus 0.00294830571357903 0.784362478357655
9823 4538 Sus scrofa 0.00280608459065051 0.787168562948306
4896 4481 Schizosaccharomyces pombe 0.00277083848627257 0.789939401434578
9615 4459 Canis lupus familiaris 0.0027572347266881 0.792696636161266
1869227 4411 bacterium 0.00272755379668563 0.795424189957952
77133 4239 uncultured bacterium 0.00262119713084343 0.798045387088795
10092 4097 Mus musculus domesticus 0.00253339104625278 0.800578778135048
287 3780 Pseudomonas aeruginosa 0.00233737323769478 0.802916151372743
1494075 3539 Mycobacterium tuberculosis complex sp. 0.00218835023497403 0.805104501607717
8030 3182 Salmo salar 0.00196759831808063 0.807072099925798
9796 3053 Equus caballus 0.00188783081869899 0.808959930744497
666 3045 Vibrio cholerae 0.00188288399703191 0.810842814741529
381124 2945 Zea mays subsp. mays 0.00182104872619342 0.812663863467722
624 2931 Shigella sonnei 0.00181239178827603 0.814476255255998
559292 2748 Saccharomyces cerevisiae S288C 0.0016992332426416 0.81617548849864
470 2709 Acinetobacter baumannii 0.00167511748701459 0.817850605985654
1639 2688 Listeria monocytogenes 0.00166213208013851 0.819512738065793
28450 2633 Burkholderia pseudomallei 0.00162812268117734 0.82114086074697
89462 2489 Bubalus bubalis 0.00153907989116992 0.82267994063814
5855 2440 Plasmodium vivax 0.00150878060845906 0.824188721246599
7460 2407 Apis mellifera 0.00148837496908236 0.825677096215681
6289 2358 Haemonchus contortus 0.00145807568637151 0.827135171902053
4558 2329 Sorghum bicolor 0.00144014345782835 0.828575315359881
149539 2242 Salmonella enterica subsp. enterica serovar Enteritidis 0.00138634677219886 0.82996166213208
69293 2233 Gasterosteus aculeatus 0.0013807815978234 0.831342443729904
1307 2206 Streptococcus suis 0.00136408607469701 0.8327065298046
49928 2202 unclassified Bacteria 0.00136161266386347 0.834068142468464
100272 2090 uncultured eukaryote 0.00129235716052436 0.835360499628988
39947 2029 Oryza sativa Japonica Group 0.00125463764531289 0.836615137274301
175245 1920 uncultured fungus 0.00118723720009894 0.8378023744744
12908 1871 unclassified sequences 0.00115693791738808 0.838959312391788
623 1867 Shigella flexneri 0.00115446450655454 0.840113776898343
727 1762 Haemophilus influenzae 0.00108953747217413 0.841203314370517
59201 1742 Salmonella enterica subsp. enterica 0.00107717041800643 0.842280484788523
511145 1717 Escherichia coli str. K-12 substr. MG1655 0.00106171160029681 0.84334219638882
6183 1666 Schistosoma mansoni 0.00103017561216918 0.844372372000989
8187 1649 Lates calcarifer 0.00101966361612664 0.845392035617116
9544 1639 Macaca mulatta 0.00101348008904279 0.846405515706159
4081 1579 Solanum lycopersicum 0.000976378926539698 0.847381894632698
480 1531 Moraxella catarrhalis 0.000946697996537225 0.848328592629236
11103 1462 Hepacivirus C 0.000904031659658669 0.849232624288894
446 1458 Legionella pneumophila 0.00090155824882513 0.85013418253772
7173 1454 Anopheles arabiensis 0.00089908483799159 0.851033267375711
738 1435 Glaesserella parasuis 0.000887336136532278 0.851920603512243
4513 1428 Hordeum vulgare 0.000883007667573584 0.852803611179817
3847 1412 Glycine max 0.000873114024239426 0.853676725204056
3329 1405 Picea abies 0.000868785555280732 0.854545510759337
9598 1398 Pan troglodytes 0.000864457086322038 0.855409967845659
1765 1344 Mycobacterium tuberculosis variant bovis 0.000831066040069256 0.856241033885728
72658 1317 Boechera stricta 0.000814370516942864 0.857055404402671
11676 1299 Human immunodeficiency virus 1 0.000803240168191937 0.857858644570863
29159 1292 Crassostrea gigas 0.000798911699233243 0.858657556270096
63677 1273 Arabidopsis halleri subsp. gemmifera 0.00078716299777393 0.85944471926787
198806 1247 Calidris pugnax 0.000771085827355924 0.860215805095226
113636 1203 Populus tremula 0.00074387830818699 0.860959683403413
3708 1169 Brassica napus 0.000722854316101905 0.861682537719515
4528 1151 Oryza longistaminata 0.000711723967350977 0.862394261686866
1351 1134 Enterococcus faecalis 0.000701211971308434 0.863095473658175
1185650 1117 Mycobacteroides abscessus subsp. abscessus 0.000690699975265892 0.86378617363344
599 1105 Salmonella sp. 0.000683279742765273 0.864469453376206
8364 1059 Xenopus tropicalis 0.00065483551817957 0.865124288894385
486 1058 Neisseria lactamica 0.000654217165471185 0.865778506059857
3711 1027 Brassica rapa 0.000635048231511254 0.866413554291368
4555 1022 Setaria italica 0.00063195646796933 0.867045510759337
40324 1008 Stenotrophomonas maltophilia 0.000623299530051942 0.867668810289389
813 993 Chlamydia trachomatis 0.000614024239426169 0.868282834528815
550 970 Enterobacter cloacae 0.000599802127133317 0.868882636655949
10091 963 Mus musculus castaneus 0.000595473658174623 0.869478110314123
8090 959 Oryzias latipes 0.000593000247341083 0.870071110561464
194 956 Campylobacter 0.000591145189215929 0.87066225575068
195 942 Campylobacter coli 0.000582488251298541 0.871244744001979
54388 908 Salmonella enterica subsp. enterica serovar Paratyphi A 0.000561464259213455 0.871806208261192
4120 881 Ipomoea batatas 0.000544768736087064 0.872350976997279
54126 875 Pristionchus pacificus 0.000541058619836755 0.872892035617116
39946 874 Oryza sativa Indica Group 0.00054044026712837 0.873432475884244
471473 872 Chlamydia trachomatis L2b/UCH-1/proctitis 0.0005392035617116 0.873971679445956
8049 861 Gadus morhua 0.000532401681919367 0.874504081127875
5661 845 Leishmania donovani 0.000522508038585209 0.87502658916646
62337 839 Miscanthus sinensis 0.0005187979223349 0.875545387088795
43150 838 Hirundo rustica 0.000518179569626515 0.876063566658422
9925 838 Capra hircus 0.000518179569626515 0.876581746228048
8022 807 Oncorhynchus mykiss 0.000499010635666584 0.877080756863715
174621 805 Wyeomyia smithii 0.000497773930249815 0.877578530793965
315576 798 Chironomus riparius 0.00049344546129112 0.878071976255256
1427524 791 mixed sample 0.000489116992332426 0.878561093247588
36809 785 Mycobacteroides abscessus 0.000485406876082117 0.87904650012367
319705 780 Mycobacteroides abscessus subsp. bolletii 0.000482315112540193 0.879528815236211
1392002 760 Mycobacterium avium 05-4293 0.000469948058372496 0.879998763294583
10376 757 Human gammaherpesvirus 4 0.000468093000247341 0.880466856294831
42229 751 Prunus avium 0.000464382883997032 0.880931239178828
37296 748 Human gammaherpesvirus 8 0.000462527825871877 0.8813937670047
4182 748 Sesamum indicum 0.000462527825871877 0.881856294830571
5476 736 Candida albicans 0.000455107593371259 0.882311402423943
55363 717 Diospyros lotus 0.000443358891911947 0.882754761315855
29760 715 Vitis vinifera 0.000442122186495177 0.88319688350235
113334 696 Melitaea cinxia 0.000430373485035864 0.883627256987386
31033 696 Takifugu rubripes 0.000430373485035864 0.884057630472421
663951 693 Staphylococcus aureus subsp. aureus TW20 0.00042851842691071 0.884486148899332
178876 686 Cryptococcus neoformans var. grubii 0.000424189957952016 0.884910338857284
6359 668 Platynereis dumerilii 0.000413059609201088 0.885323398466485
632 664 Yersinia pestis 0.000410586198367549 0.885733984664853
3880 659 Medicago truncatula 0.000407494434825625 0.886141479099678
520 649 Bordetella pertussis 0.000401310907741776 0.88654279000742
3055 637 Chlamydomonas reinhardtii 0.000393890675241158 0.886936680682661
3818 635 Arachis hypogaea 0.000392653969824388 0.887329334652486
1114792 631 Equus ferus 0.000390180558990848 0.887719515211477
67825 629 Citrobacter rodentium 0.000388943853574079 0.888108459065051
1512 623 [Clostridium] symbiosum 0.000385233737323769 0.888493692802374
580240 615 Saccharomyces cerevisiae W303 0.000380286915656691 0.888873979718031
151458 606 HIV-1 vector pNL4-3 0.000374721741281227 0.889248701459312
36329 605 Plasmodium falciparum 3D7 0.000374103388572842 0.889622804847885
34305 605 Lotus japonicus 0.000374103388572842 0.889996908236458
198431 592 uncultured prokaryote 0.000366064803363839 0.890362973039822
715 587 Actinobacillus pleuropneumoniae 0.000362973039821914 0.890725946079644
220341 581 Salmonella enterica subsp. enterica serovar Typhi str. CT18 0.000359262923571605 0.891085209003215
1336 580 Streptococcus equi 0.00035864457086322 0.891443853574079
5825 573 Plasmodium chabaudi 0.000354316101904526 0.891798169675983
1809 567 Mycobacterium ulcerans 0.000350605985654217 0.892148775661637
35525 551 Daphnia magna 0.000340712342320059 0.892489488003957
11320 549 Influenza A virus 0.00033947563690329 0.892828963640861
633 549 Yersinia pseudotuberculosis 0.00033947563690329 0.893168439277764
1045010 548 Escherichia coli O157 0.000338857284194905 0.893507296561959
272634 545 Mycoplasma pneumoniae M129 0.00033700222606975 0.893844298788029
9915 538 Bos indicus 0.000332673757111056 0.89417697254514
9825 528 Sus scrofa domesticus 0.000326490230027208 0.894503462775167
4113 527 Solanum tuberosum 0.000325871877318823 0.894829334652486
615 527 Serratia marcescens 0.000325871877318823 0.895155206529805
13616 517 Monodelphis domestica 0.000319688350234974 0.89547489488004
539665 513 Parnassius szechenyii incognitus 0.000317214939401435 0.895792109819441
63221 512 Homo sapiens neanderthalensis 0.00031659658669305 0.896108706406134
4529 505 Oryza rufipogon 0.000312268117734356 0.896420974523868
5821 499 Plasmodium berghei 0.000308558001484047 0.896729532525352
1282 493 Staphylococcus epidermidis 0.000304847885233737 0.897034380410586
1826778 492 bacterium 0.000304229532525352 0.897338609943112
1879010 490 Firmicutes bacterium 0.000302992827108583 0.89764160277022
208964 488 Pseudomonas aeruginosa PAO1 0.000301756121691813 0.897943358891912
8839 477 Anas platyrhynchos 0.00029495424189958 0.898238313133812
1349 473 Streptococcus uberis 0.00029248083106604 0.898530793964878
9595 463 Gorilla gorilla gorilla 0.000286297303982191 0.89881709126886
5691 463 Trypanosoma brucei 0.000286297303982191 0.899103388572842
83333 461 Escherichia coli K-12 0.000285060598565422 0.899388449171407
12814 461 Respiratory syncytial virus 0.000285060598565422 0.899673509769973
37657 458 Silene latifolia 0.000283205540440267 0.899956715310413
72407 457 Klebsiella pneumoniae subsp. pneumoniae 0.000282587187731882 0.900239302498145
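Part of the remaining filtering could probably be scripted before going through the list by hand. Below is a rough sketch, not a replacement for manual review: it drops entries whose name contains one of a few assumed keywords and merges subspecies, serovars and strains into their species by keeping only the leading Latin binomial. The keyword list, the binomial heuristic and the names `species_key` and `filter_and_merge` are all assumptions made for illustration; a more careful approach would resolve each taxon ID to its species via the NCBI taxonomy lineage.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed keywords marking entries that are not a single identifiable organism
EXCLUDE = re.compile(r"unclassified|unidentified|uncultured|metagenome", re.IGNORECASE)
# A Latin binomial at the start of the name, e.g. "Mus musculus"
BINOMIAL = re.compile(r"^([A-Z][a-z]+ [a-z]+)\b")


def species_key(org_name: str) -> str:
    """Collapse a name to its binomial; leave viruses and 'sp.' entries untouched."""
    if "virus" in org_name.lower() or org_name.endswith(" sp."):
        return org_name
    match = BINOMIAL.match(org_name)
    return match.group(1) if match else org_name


def filter_and_merge(rows: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Drop excluded entries and sum the counts of names sharing a species key."""
    merged: Dict[str, int] = defaultdict(int)
    for org_name, count in rows:
        if EXCLUDE.search(org_name):
            continue
        merged[species_key(org_name)] += count
    return dict(merged)
```

With this heuristic, "Mus musculus domesticus" is folded into "Mus musculus" and the Salmonella enterica serovars into "Salmonella enterica", while names such as "synthetic construct", "mixed sample" or "bacterium" pass through unchanged and still need a manual decision.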
Could you please do the manual filtering and also do it with and without selecting for the sequencing protocols (RNA-Seq, RIP-Seq, miRNA-Seq, ncRNA-Seq)?
For the latter, you just need to skip the corresponding filtering (egrep) command and adjust the input file name for the following step.
For the former, you need to
Sure, I will do the manual filtering with and without the protocol selection and update you once it's done. I just want to add that the numbers I posted for SRA experiments were obtained with the "file type: fastq" filter enabled; since we are dealing with FASTQ files, I left it on. Maybe that's why the numbers appear low to you. If I remove all filters and just post the numbers as listed in the SRA experiments archive, they may grow significantly, which I think would make more sense.
No no, your numbers weren't low. And it makes sense to keep that filter in.
After merging all the transcripts, I ran Kallisto and tested it on sample FASTQ files:
# Shell
# Index command : builds an index from a FASTA formatted file of target sequences
> kallisto index -i transcripts.idx -k 31 transcripts.fasta
# Quant command: runs the quantification algorithm
# For single end fastQ file
# Finding fragment_length and standard deviation for single end fastQ
> awk 'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;printf("total %d avg=%f stddev=%f\n",n,m,sqrt(sq/n-m*m));}' sra_data.fastq
total 1334218 avg=49.377981 stddev=4.109351
> kallisto quant -i transcripts.idx -o output -s 4.109351 -l 49.377981 --single sra_data.fastq
# For paired end fastQ file
> kallisto quant -i transcripts.idx -o output pairA_1.fastq pairA_2.fastq
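The awk one-liner above can be hard to read, so here is a minimal Python sketch of the same calculation: it takes the second line of every four-line FASTQ record and computes the number of reads, the mean read length and the population standard deviation. The function name `read_length_stats` is made up for illustration, and the sketch assumes an uncompressed FASTQ file with exactly four lines per record.

```python
import math
from typing import Iterable, Tuple


def read_length_stats(fastq_lines: Iterable[str]) -> Tuple[int, float, float]:
    """Return (number of reads, mean read length, population std dev)."""
    n = 0
    total = 0
    total_sq = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # sequence line of each 4-line FASTQ record
            length = len(line.rstrip("\r\n"))
            n += 1
            total += length
            total_sq += length * length
    if n == 0:
        raise ValueError("no reads found")
    mean = total / n
    # Guard against tiny negative values caused by floating-point rounding
    variance = max(total_sq / n - mean * mean, 0.0)
    return n, mean, math.sqrt(variance)
```

It can be fed an open file handle, for example the `sra_data.fastq` file used above. Note that these are read-length statistics; kallisto's `-l` and `-s` options are documented as the estimated mean and standard deviation of the fragment length, so using read lengths for them is an approximation.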
# Python
"""Infer organism information for sequencing library."""
import os
from typing import Dict, Tuple

import pandas as pd


def count_info(file_name: str) -> Dict[Tuple[str, str], float]:
    """Infers organisms count.

    Args:
        file_name: Kallisto abundance file (abundance.tsv) to process.

    Returns:
        Dictionary with count percentage for every organism.
    """
    organism_count: Dict[Tuple[str, str], float] = {}
    # abundance.tsv is tab-separated: column 0 is the target ID, column 3 the estimated counts
    df = pd.read_csv(file_name, sep="\t")
    total_count = 0.0
    for i in range(df.shape[0]):
        # Target IDs are pipe-delimited; fields 3 and 4 hold organism name and taxon ID
        contents = str(df.iloc[i, 0]).split("|")
        organism = (contents[3], contents[4])
        count = float(df.iloc[i, 3])
        # Update organism count
        organism_count[organism] = organism_count.get(organism, 0.0) + count
        total_count += count
    # Sort by descending count and convert counts to percentages
    # (assumes at least one non-zero count)
    return {
        k: v / total_count * 100
        for k, v in sorted(organism_count.items(), key=lambda item: -item[1])
    }


path = os.path.dirname(os.path.abspath(__file__))
file_path = os.path.join(path, "output", "abundance.tsv")
> print(count_info(file_path))
{('panubis', '9555'): 36.04259720716562, ('ptroglodytes', '9598'): 15.189547301341014, ('ppaniscus', '9597'): 15.188078685853707, ('malbus', '43700'): 9.70738657188484, ('lchalumnae', '7897'): 4.690350075283531, ('mnemestrina', '9545'): 3.6380770042418007, ('nleucogenys', '61853'): 2.17412727133832, ('pabelii', '9601'): 2.134916210287037, ('lbergylta', '56723'): 1.9347405963295994, ('anancymaae', '37293'): 1.5466814683494206, ('mmulatta', '9544'): 1.1547486852025006, ('mfascicularis', '9541'): 1.1463986859973792, ('csabaeus', '60711'): 1.0259372461970249, ('catys', '9531'): 0.8432172391193068, ('ogarnettii', '30611'): 0.5692197103339778, ('sbboliviensis', '39432'): 0.5486835252578721, ('ptephrosceles', '591936'): 0.4059442474611024, ('cporcellus', '10141'): 0.25191468694359026, ('capalliatus', '336983'): 0.24130625904668185, ('rbieti', '61621'): 0.22822133658912208, ('lcalcarifer', '8187'): 0.18316937610124437, ('tgelada', '9565'): 0.17354099478143795, ('psinensis', '13735'): 0.11206718660092017, ('atestudineus', '64144'): 0.10330776155916598, ('sldorsalis', '1841481'): 0.08807212181700995, ('ipunctatus', '7998'): 0.08493157635340838, ('mzebra', '106582'): 0.06766320700432917, ('nvison', '452646'): 0.058843096523792965, ('lafricana', '9785'): 0.05752577846242743, ('eaasinus', '83772'): 0.05074740525324686, ('mleucophaeus', '9568'): 0.03863000326113315, ('cporosus', '8502'): 0.038060553939935154, ('fheteroclitus', '8078'): 0.02960265306439401, ('munguiculatus', '10047'): 0.024642115345641313, ('smaximus', '52904'): 0.02279658022984605, ('enaucrates', '173247'): 0.017306176165992893, ('pcinereus', '38626'): 0.01691580175108229, ('sscrofa', '9823'): 0.01691580175108229, ('mmurinus', '30608'): 0.010678818776945117, ('ptaltaica', '74533'): 0.010452527639020014, ('vvulpes', '9627'): 0.008684213158218438, ('ecaballus', '9796'): 0.008605744982845604, ('amelanoleuca', '9646'): 0.008516006654556112, ('rferrumequinum', '59479'): 0.008469192173209994, ('tbelangeri', '37347'): 
0.006630063917327947, ('pnyererei', '303518'): 0.006428998468764146, ('csemilaevis', '244447'): 0.006346703093245131, ('ccanadensis', '51338'): 0.006344969223565645, ('mochrogaster', '79684'): 0.006343425656655858, ('mspicilegus', '10103'): 0.006343425656655858, ('bgrunniens', '30521'): 0.0060126900152939165, ('rroxellana', '61622'): 0.00600830247921473, ('psimus', '1328070'): 0.004374891517378035, ('mmmarmota', '9994'): 0.004268258532089651, ('mpfuro', '9669'): 0.004234511507596241, ('aplatyrhynchos', '8839'): 0.004228950437770573, ('falbicollis', '59894'): 0.004228950437770573, ('ttruncatus', '9739'): 0.00223569808318398, ('caperea', '37548'): 0.0021776980279299566, ('cjacchus', '9483'): 0.002137015604247188, ('clanigera', '34839'): 0.002123398304308982, ('umaritimus', '29073'): 0.002117710365970181, ('mauratus', '10036'): 0.002117583497457048, ('itridecemlineatus', '43179'): 0.0021172028919176486, ('dnovemcinctus', '9361'): 0.002115934206786317, ('mmurdjan', '586833'): 0.002114475380172593, ('cldingo', '286419'): 0.0021144752188852864, ('eburgeri', '7764'): 0.0021144752188852864, ('mcaroli', '10089'): 0.0021144752188852864, ('btaurus', '9913'): 0.00018350473187095958, ('bbbison', '43346'): 0.00018350473187095958, ('acalliptera', '8154'): 0.0}
To infer organisms from the FASTQ samples, the most over-represented organisms in SRA are selected based on the following parameters: