zavolanlab / htsinfer

Infer metadata for your downstream analysis straight from your RNA-seq data
Apache License 2.0

Infer Organism from FastQ Sample: Selecting Organisms #10

Closed rohank63 closed 3 years ago

rohank63 commented 4 years ago

To infer organisms from the FastQ samples, the most over-represented organisms in SRA are being selected based on the following parameters:

uniqueg commented 4 years ago

Could you please post the numbers? We'd probably need a few more organisms like worms (Caenorhabditis elegans) and rats (Rattus norvegicus) and possibly some primates. But would be good to see the numbers for all organisms.

rohank63 commented 4 years ago

Organisms by number of SRA experiments:

Primates:

Homo sapiens                  2118481
Macaca mulatta                  25179
Pan troglodytes                 15700
Macaca fascicularis              6561

Rodent:

Mus musculus                   904243
Rattus norvegicus               44925

Bacteria:

Salmonella enterica            325039
Escherichia coli               174248
Streptococcus pneumoniae        85209
Mycobacterium tuberculosis      78434
Staphylococcus aureus           72084
Campylobacter jejuni            49654
Listeria monocytogenes          41560
Streptococcus pyogenes          30877

Plant:

Hordeum vulgare                121866
Zea mays                        86502
Arabidopsis thaliana            71759
Oryza sativa                    69719
Lolium perenne                  45598

Plasmodium:

Plasmodium falciparum          121323

Fungi:

Saccharomyces cerevisiae       107829
Schizosaccharomyces pombe       10001

Mixed:

Danio rerio                     85877
Sus scrofa                      55010
Bos taurus                      46958
Gallus gallus                   24341

Arthropod:

Drosophila melanogaster         66176

Nematode:

Caenorhabditis elegans          29076
Pristionchus pacificus           1182

uniqueg commented 4 years ago

Hi @rohank63, thanks a lot! Since the results are not quite as close to what we had hoped for, I have redone the analysis a bit more rigorously. Here is what I did:

# download SRA SQLite database
wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz
# start SQLite shell
sqlite3 SRAmetadb.sqlite
# in SQLite shell
# get organism for each run
.mode csv
.output sra_orgs.csv
SELECT DISTINCT run_accession, taxon_id, platform, instrument_model, library_strategy FROM sra;
.quit
# remove duplicate entries
cut -f1 -d"," sra_orgs.csv | sort | uniq -c | sort -k1,1rn | awk '$1 > 1 {print $2}' > sra_orgs_duplicates
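# -F: fixed strings, -w: whole words only, so that one accession cannot match as a substring of another
grep -vwFf sra_orgs_duplicates sra_orgs.csv > sra_orgs_no_dupes.csv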
# select only Illumina samples
awk 'BEGIN {FS=","} $3 == "ILLUMINA"' sra_orgs_no_dupes.csv > sra_orgs_illumina.csv
# select only relevant protocols
grep -E "RNA-Seq|RIP-Seq|miRNA-Seq|ncRNA-Seq" sra_orgs_illumina.csv > sra_orgs_rna_seq.csv
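# count runs per taxon ID (reconstructed step; assumes the header line and the tab-separated taxon ID/count layout that the commands below expect)
(printf 'taxon_id\tcounts\n'; cut -f2 -d"," sra_orgs_rna_seq.csv | sort | uniq -c | sort -k1,1rn | awk 'BEGIN {OFS="\t"} {print $2, $1}') > sra_orgs_count_per_org.tab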
# get list of taxon IDs
tail -n +2 sra_orgs_count_per_org.tab | cut -f1 > sra_orgs_taxon_ids.csv
# convert taxon IDs to organisms: https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi
# save to file "tax_report.txt"
# rearrange taxon ID conversion output
awk '{print $1=$2=$4=$5=$6=""; print $0}' tax_report.txt | sed -e '/^$/d' -e 's/^[ \t]*//;s/[ \t]*$//' -e 's/ \{1,\}/\t/' | tail -n +3 > sra_orgs_taxon_ids_org_names.tab
# start R shell
R
# compute cumulative fraction
tax2org <- read.delim("sra_orgs_taxon_ids_org_names.tab", header=FALSE, row.names=NULL, col.names=c("taxon_id", "org_name"), skip=1)
cts <- read.delim("sra_orgs_count_per_org.tab", header=FALSE, row.names=NULL, col.names=c("taxon_id", "counts"), skip=1)
cts_name <- merge(cts, tax2org, by="taxon_id")
cts_sorted <- cts_name[rev(order(cts_name$counts)), ]
# filter out metagenomes
# (grepl keeps all rows if nothing matches, whereas -grep would drop every row)
cts_sorted <- cts_sorted[!grepl("metagenome", cts_sorted$org_name), ]
# TODO: manually filter out unclassified/unidentified/uncultivated and merge duplicates
cts_sorted$fract <- cts_sorted$counts / sum(cts_sorted$counts)
cts_sorted$fract_cum <- cumsum(cts_sorted$fract)
write.table(cts_sorted, "sra_orgs_results.tab", row.names=FALSE, col.names=TRUE, quote=FALSE, sep="\t")
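
The cumulative-fraction step of the R session above can also be sketched in Python. This is a minimal, standard-library-only illustration; the function names `cumulative_fractions` and `taxa_needed` and the toy counts are assumptions made for the example, not part of the original analysis.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def cumulative_fractions(counts: Sequence[int]) -> List[Tuple[int, float, float]]:
    """Return (count, fraction, cumulative fraction), sorted by descending count."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("counts must sum to a positive number")
    fractions = [count / total for count in ordered]
    return list(zip(ordered, fractions, accumulate(fractions)))


def taxa_needed(counts: Sequence[int], target: float = 0.9) -> int:
    """Return how many of the largest taxa are needed to reach the target fraction."""
    for rank, (_, _, cumulative) in enumerate(cumulative_fractions(counts), start=1):
        if cumulative >= target:
            return rank
    # Floating-point rounding may leave the final sum marginally below the target
    return len(counts)


toy_counts = [50, 30, 15, 5]  # made-up values, not SRA data
print(taxa_needed(toy_counts, target=0.9))
```

With the real per-taxon counts, `taxa_needed(counts, 0.9)` would give the number of entries that together cover 90% of samples.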

As you see, there's still some manual filtering of unclassified and duplicate organisms to do. Without that filtering, the results look like this (only listing the first 179 entries as they, together, make up >90% of samples):

taxon   counts  org_name    fract   fract_cum
9606    511138  Homo sapiens    0.316063566658422   0.316063566658422
10090   189192  Mus musculus    0.116987385604749   0.433050952263171
112509  74210   Hordeum vulgare subsp. vulgare  0.0458879544892407  0.478938906752412
1313    44422   Streptococcus pneumoniae    0.0274684640118724  0.506407370764284
7955    42508   Danio rerio 0.0262849369280237  0.532692307692308
4530    39594   Oryza sativa    0.0244830571357903  0.557175364828098
1280    36741   Staphylococcus aureus   0.0227188968587682  0.579894261686866
562 28533   Escherichia coli    0.0176434578283453  0.597537719515212
32630   28064   synthetic construct 0.0173534504081128  0.614891169923324
4932    24798   Saccharomyces cerevisiae    0.0153339104625278  0.630225080385852
5833    23056   Plasmodium falciparum   0.0142567400445214  0.644481820430373
7165    21685   Anopheles gambiae   0.0134089784813257  0.657890798911699
3702    19959   Arabidopsis thaliana    0.0123417017066535  0.670232500618353
1773    18019   Mycobacterium tuberculosis  0.0111420974523868  0.68137459807074
7227    16666   Drosophila melanogaster 0.0103054662379421  0.691680064308682
487 16027   Neisseria meningitidis  0.00991033885728419 0.701590403165966
6239    13654   Caenorhabditis elegans  0.00844298788028692 0.710033391046253
573 11587   Klebsiella pneumoniae   0.0071648528320554  0.717198243878308
1314    10384   Streptococcus pyogenes  0.00642097452386841 0.723619218402177
485 8330    Neisseria gonorrhoeae   0.00515087806084591 0.728770096463022
205 8070    Campylobacter sp.   0.00499010635666584 0.733760202819688
32644   7680    unidentified    0.00474894880039575 0.738509151620084
197 6855    Campylobacter jejuni    0.00423880781597823 0.742747959436062
1496    6204    Clostridioides difficile    0.00383626020281969 0.746584219638882
28901   6121    Salmonella enterica 0.00378493692802374 0.750369156566906
10116   5888    Rattus norvegicus   0.00364086074697007 0.754010017313876
90370   5827    Salmonella enterica subsp. enterica serovar Typhi   0.0036031412317586  0.757613158545634
9913    5743    Bos taurus  0.00355119960425427 0.761164358149889
90371   5735    Salmonella enterica subsp. enterica serovar Typhimurium 0.00354625278258719 0.764710610932476
4577    5697    Zea mays    0.00352275537966856 0.768233366312144
1352    5651    Enterococcus faecium    0.00349431115508286 0.771727677467227
1311    5384    Streptococcus agalactiae    0.0033292109819441  0.775056888449171
9940    5258    Ovis aries  0.00325129854068761 0.778308186989859
4565    5023    Triticum aestivum   0.00310598565421717 0.781414172644076
9031    4768    Gallus gallus   0.00294830571357903 0.784362478357655
9823    4538    Sus scrofa  0.00280608459065051 0.787168562948306
4896    4481    Schizosaccharomyces pombe   0.00277083848627257 0.789939401434578
9615    4459    Canis lupus familiaris  0.0027572347266881  0.792696636161266
1869227 4411    bacterium   0.00272755379668563 0.795424189957952
77133   4239    uncultured bacterium    0.00262119713084343 0.798045387088795
10092   4097    Mus musculus domesticus 0.00253339104625278 0.800578778135048
287 3780    Pseudomonas aeruginosa  0.00233737323769478 0.802916151372743
1494075 3539    Mycobacterium tuberculosis complex sp.  0.00218835023497403 0.805104501607717
8030    3182    Salmo salar 0.00196759831808063 0.807072099925798
9796    3053    Equus caballus  0.00188783081869899 0.808959930744497
666 3045    Vibrio cholerae 0.00188288399703191 0.810842814741529
381124  2945    Zea mays subsp. mays    0.00182104872619342 0.812663863467722
624 2931    Shigella sonnei 0.00181239178827603 0.814476255255998
559292  2748    Saccharomyces cerevisiae S288C  0.0016992332426416  0.81617548849864
470 2709    Acinetobacter baumannii 0.00167511748701459 0.817850605985654
1639    2688    Listeria monocytogenes  0.00166213208013851 0.819512738065793
28450   2633    Burkholderia pseudomallei   0.00162812268117734 0.82114086074697
89462   2489    Bubalus bubalis 0.00153907989116992 0.82267994063814
5855    2440    Plasmodium vivax    0.00150878060845906 0.824188721246599
7460    2407    Apis mellifera  0.00148837496908236 0.825677096215681
6289    2358    Haemonchus contortus    0.00145807568637151 0.827135171902053
4558    2329    Sorghum bicolor 0.00144014345782835 0.828575315359881
149539  2242    Salmonella enterica subsp. enterica serovar Enteritidis 0.00138634677219886 0.82996166213208
69293   2233    Gasterosteus aculeatus  0.0013807815978234  0.831342443729904
1307    2206    Streptococcus suis  0.00136408607469701 0.8327065298046
49928   2202    unclassified Bacteria   0.00136161266386347 0.834068142468464
100272  2090    uncultured eukaryote    0.00129235716052436 0.835360499628988
39947   2029    Oryza sativa Japonica Group 0.00125463764531289 0.836615137274301
175245  1920    uncultured fungus   0.00118723720009894 0.8378023744744
12908   1871    unclassified sequences  0.00115693791738808 0.838959312391788
623 1867    Shigella flexneri   0.00115446450655454 0.840113776898343
727 1762    Haemophilus influenzae  0.00108953747217413 0.841203314370517
59201   1742    Salmonella enterica subsp. enterica 0.00107717041800643 0.842280484788523
511145  1717    Escherichia coli str. K-12 substr. MG1655   0.00106171160029681 0.84334219638882
6183    1666    Schistosoma mansoni 0.00103017561216918 0.844372372000989
8187    1649    Lates calcarifer    0.00101966361612664 0.845392035617116
9544    1639    Macaca mulatta  0.00101348008904279 0.846405515706159
4081    1579    Solanum lycopersicum    0.000976378926539698    0.847381894632698
480 1531    Moraxella catarrhalis   0.000946697996537225    0.848328592629236
11103   1462    Hepacivirus C   0.000904031659658669    0.849232624288894
446 1458    Legionella pneumophila  0.00090155824882513 0.85013418253772
7173    1454    Anopheles arabiensis    0.00089908483799159 0.851033267375711
738 1435    Glaesserella parasuis   0.000887336136532278    0.851920603512243
4513    1428    Hordeum vulgare 0.000883007667573584    0.852803611179817
3847    1412    Glycine max 0.000873114024239426    0.853676725204056
3329    1405    Picea abies 0.000868785555280732    0.854545510759337
9598    1398    Pan troglodytes 0.000864457086322038    0.855409967845659
1765    1344    Mycobacterium tuberculosis variant bovis    0.000831066040069256    0.856241033885728
72658   1317    Boechera stricta    0.000814370516942864    0.857055404402671
11676   1299    Human immunodeficiency virus 1  0.000803240168191937    0.857858644570863
29159   1292    Crassostrea gigas   0.000798911699233243    0.858657556270096
63677   1273    Arabidopsis halleri subsp. gemmifera    0.00078716299777393 0.85944471926787
198806  1247    Calidris pugnax 0.000771085827355924    0.860215805095226
113636  1203    Populus tremula 0.00074387830818699 0.860959683403413
3708    1169    Brassica napus  0.000722854316101905    0.861682537719515
4528    1151    Oryza longistaminata    0.000711723967350977    0.862394261686866
1351    1134    Enterococcus faecalis   0.000701211971308434    0.863095473658175
1185650 1117    Mycobacteroides abscessus subsp. abscessus  0.000690699975265892    0.86378617363344
599 1105    Salmonella sp.  0.000683279742765273    0.864469453376206
8364    1059    Xenopus tropicalis  0.00065483551817957 0.865124288894385
486 1058    Neisseria lactamica 0.000654217165471185    0.865778506059857
3711    1027    Brassica rapa   0.000635048231511254    0.866413554291368
4555    1022    Setaria italica 0.00063195646796933 0.867045510759337
40324   1008    Stenotrophomonas maltophilia    0.000623299530051942    0.867668810289389
813 993 Chlamydia trachomatis   0.000614024239426169    0.868282834528815
550 970 Enterobacter cloacae    0.000599802127133317    0.868882636655949
10091   963 Mus musculus castaneus  0.000595473658174623    0.869478110314123
8090    959 Oryzias latipes 0.000593000247341083    0.870071110561464
194 956 Campylobacter   0.000591145189215929    0.87066225575068
195 942 Campylobacter coli  0.000582488251298541    0.871244744001979
54388   908 Salmonella enterica subsp. enterica serovar Paratyphi A 0.000561464259213455    0.871806208261192
4120    881 Ipomoea batatas 0.000544768736087064    0.872350976997279
54126   875 Pristionchus pacificus  0.000541058619836755    0.872892035617116
39946   874 Oryza sativa Indica Group   0.00054044026712837 0.873432475884244
471473  872 Chlamydia trachomatis L2b/UCH-1/proctitis   0.0005392035617116  0.873971679445956
8049    861 Gadus morhua    0.000532401681919367    0.874504081127875
5661    845 Leishmania donovani 0.000522508038585209    0.87502658916646
62337   839 Miscanthus sinensis 0.0005187979223349  0.875545387088795
43150   838 Hirundo rustica 0.000518179569626515    0.876063566658422
9925    838 Capra hircus    0.000518179569626515    0.876581746228048
8022    807 Oncorhynchus mykiss 0.000499010635666584    0.877080756863715
174621  805 Wyeomyia smithii    0.000497773930249815    0.877578530793965
315576  798 Chironomus riparius 0.00049344546129112 0.878071976255256
1427524 791 mixed sample    0.000489116992332426    0.878561093247588
36809   785 Mycobacteroides abscessus   0.000485406876082117    0.87904650012367
319705  780 Mycobacteroides abscessus subsp. bolletii   0.000482315112540193    0.879528815236211
1392002 760 Mycobacterium avium 05-4293 0.000469948058372496    0.879998763294583
10376   757 Human gammaherpesvirus 4    0.000468093000247341    0.880466856294831
42229   751 Prunus avium    0.000464382883997032    0.880931239178828
37296   748 Human gammaherpesvirus 8    0.000462527825871877    0.8813937670047
4182    748 Sesamum indicum 0.000462527825871877    0.881856294830571
5476    736 Candida albicans    0.000455107593371259    0.882311402423943
55363   717 Diospyros lotus 0.000443358891911947    0.882754761315855
29760   715 Vitis vinifera  0.000442122186495177    0.88319688350235
113334  696 Melitaea cinxia 0.000430373485035864    0.883627256987386
31033   696 Takifugu rubripes   0.000430373485035864    0.884057630472421
663951  693 Staphylococcus aureus subsp. aureus TW20    0.00042851842691071 0.884486148899332
178876  686 Cryptococcus neoformans var. grubii 0.000424189957952016    0.884910338857284
6359    668 Platynereis dumerilii   0.000413059609201088    0.885323398466485
632 664 Yersinia pestis 0.000410586198367549    0.885733984664853
3880    659 Medicago truncatula 0.000407494434825625    0.886141479099678
520 649 Bordetella pertussis    0.000401310907741776    0.88654279000742
3055    637 Chlamydomonas reinhardtii   0.000393890675241158    0.886936680682661
3818    635 Arachis hypogaea    0.000392653969824388    0.887329334652486
1114792 631 Equus ferus 0.000390180558990848    0.887719515211477
67825   629 Citrobacter rodentium   0.000388943853574079    0.888108459065051
1512    623 [Clostridium] symbiosum 0.000385233737323769    0.888493692802374
580240  615 Saccharomyces cerevisiae W303   0.000380286915656691    0.888873979718031
151458  606 HIV-1 vector pNL4-3 0.000374721741281227    0.889248701459312
36329   605 Plasmodium falciparum 3D7   0.000374103388572842    0.889622804847885
34305   605 Lotus japonicus 0.000374103388572842    0.889996908236458
198431  592 uncultured prokaryote   0.000366064803363839    0.890362973039822
715 587 Actinobacillus pleuropneumoniae 0.000362973039821914    0.890725946079644
220341  581 Salmonella enterica subsp. enterica serovar Typhi str. CT18 0.000359262923571605    0.891085209003215
1336    580 Streptococcus equi  0.00035864457086322 0.891443853574079
5825    573 Plasmodium chabaudi 0.000354316101904526    0.891798169675983
1809    567 Mycobacterium ulcerans  0.000350605985654217    0.892148775661637
35525   551 Daphnia magna   0.000340712342320059    0.892489488003957
11320   549 Influenza A virus   0.00033947563690329 0.892828963640861
633 549 Yersinia pseudotuberculosis 0.00033947563690329 0.893168439277764
1045010 548 Escherichia coli O157   0.000338857284194905    0.893507296561959
272634  545 Mycoplasma pneumoniae M129  0.00033700222606975 0.893844298788029
9915    538 Bos indicus 0.000332673757111056    0.89417697254514
9825    528 Sus scrofa domesticus   0.000326490230027208    0.894503462775167
4113    527 Solanum tuberosum   0.000325871877318823    0.894829334652486
615 527 Serratia marcescens 0.000325871877318823    0.895155206529805
13616   517 Monodelphis domestica   0.000319688350234974    0.89547489488004
539665  513 Parnassius szechenyii incognitus    0.000317214939401435    0.895792109819441
63221   512 Homo sapiens neanderthalensis   0.00031659658669305 0.896108706406134
4529    505 Oryza rufipogon 0.000312268117734356    0.896420974523868
5821    499 Plasmodium berghei  0.000308558001484047    0.896729532525352
1282    493 Staphylococcus epidermidis  0.000304847885233737    0.897034380410586
1826778 492 bacterium   0.000304229532525352    0.897338609943112
1879010 490 Firmicutes bacterium    0.000302992827108583    0.89764160277022
208964  488 Pseudomonas aeruginosa PAO1 0.000301756121691813    0.897943358891912
8839    477 Anas platyrhynchos  0.00029495424189958 0.898238313133812
1349    473 Streptococcus uberis    0.00029248083106604 0.898530793964878
9595    463 Gorilla gorilla gorilla 0.000286297303982191    0.89881709126886
5691    463 Trypanosoma brucei  0.000286297303982191    0.899103388572842
83333   461 Escherichia coli K-12   0.000285060598565422    0.899388449171407
12814   461 Respiratory syncytial virus 0.000285060598565422    0.899673509769973
37657   458 Silene latifolia    0.000283205540440267    0.899956715310413
72407   457 Klebsiella pneumoniae subsp. pneumoniae 0.000282587187731882    0.900239302498145
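
Part of the manual filtering mentioned above could be automated. The following is a rough sketch under stated assumptions: the keyword list and the prefix-based merging rule are heuristics chosen for illustration, and the names `EXCLUDE_KEYWORDS`, `species_key` and `filter_and_merge` are hypothetical. A more reliable approach would probably resolve each taxon ID to its species via the NCBI taxonomy lineage.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Heuristic keyword list (an assumption); extend as needed
EXCLUDE_KEYWORDS = (
    "metagenome", "unclassified", "unidentified",
    "uncultured", "mixed sample", "synthetic construct",
)


def is_excluded(org_name: str) -> bool:
    """True if the name looks like a non-organism or unresolved entry."""
    lowered = org_name.lower()
    return any(keyword in lowered for keyword in EXCLUDE_KEYWORDS)


def species_key(org_name: str, known_names: Set[str]) -> str:
    """Map a subspecies/strain name to a shorter name that is also in the data."""
    words = org_name.split()
    # Try the shortest prefix of at least two words first
    for n_words in range(2, len(words)):
        prefix = " ".join(words[:n_words])
        if prefix in known_names:
            return prefix
    return org_name


def filter_and_merge(rows: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Drop excluded entries, merge counts per species, sort by descending count."""
    kept = [(name, count) for name, count in rows if not is_excluded(name)]
    known = {name for name, _ in kept}
    merged: Dict[str, int] = defaultdict(int)
    for name, count in kept:
        merged[species_key(name, known)] += count
    return dict(sorted(merged.items(), key=lambda item: -item[1]))


# A few rows taken from the table above
example = [
    ("Homo sapiens", 511138),
    ("Mus musculus", 189192),
    ("Hordeum vulgare subsp. vulgare", 74210),
    ("synthetic construct", 28064),
    ("uncultured bacterium", 4239),
    ("Mus musculus domesticus", 4097),
    ("Hordeum vulgare", 1428),
    ("Mus musculus castaneus", 963),
]
print(filter_and_merge(example))
```

The prefix rule only merges into names that are themselves present in the data, so entries such as virus names without a matching shorter entry are left untouched. It will still make questionable calls in some cases, so the output should be reviewed by hand.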

uniqueg commented 4 years ago

Could you please do the manual filtering and also do it with and without selecting for the sequencing protocols (RNA-Seq, RIP-Seq, miRNA-Seq, ncRNA-Seq)?

For the latter, you just need to skip the corresponding filtering (egrep) command and adjust the input file name for the following step.
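
As an illustration of running the selection with and without the protocol filter, here is a minimal Python sketch. It assumes the five-column layout of the `SELECT` statement above (run accession, taxon ID, platform, instrument model, library strategy); the function name `select_runs` and the example rows are made up. Unlike the `egrep` step, it compares the library strategy column exactly rather than matching substrings anywhere in the line.

```python
import csv
from typing import Dict, FrozenSet, Iterable, List, Optional

FIELDS = ("run_accession", "taxon_id", "platform",
          "instrument_model", "library_strategy")
RNA_PROTOCOLS = frozenset({"RNA-Seq", "RIP-Seq", "miRNA-Seq", "ncRNA-Seq"})


def select_runs(
    csv_lines: Iterable[str],
    protocols: Optional[FrozenSet[str]] = None,
) -> List[Dict[str, str]]:
    """Keep Illumina runs; if protocols is given, also filter by library strategy."""
    selected = []
    for row in csv.DictReader(csv_lines, fieldnames=FIELDS):
        if row["platform"] != "ILLUMINA":
            continue
        if protocols is not None and row["library_strategy"] not in protocols:
            continue
        selected.append(row)
    return selected


# Made-up rows for illustration only
lines = [
    "SRR0000001,9606,ILLUMINA,Illumina HiSeq 2000,RNA-Seq",
    "SRR0000002,10090,ILLUMINA,Illumina HiSeq 2000,WGS",
    "SRR0000003,9606,LS454,454 GS FLX,RNA-Seq",
]
print(len(select_runs(lines, RNA_PROTOCOLS)))  # with protocol selection
print(len(select_runs(lines)))                 # without protocol selection
```

Passing `protocols=None` corresponds to skipping the filtering command, so both variants can be produced from the same input.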

For the former, you need to

rohank63 commented 4 years ago

Sure, I will do the manual filtering both with and without the protocol selection and update you once it's done. I just want to add that the numbers I posted for SRA experiments were obtained with the "file type: fastQ" filter enabled; since we are dealing with FastQ files, I left it on. Maybe that's why the numbers appear low to you. If I remove all filters and just post the numbers as listed in the SRA experiments archive, they will grow significantly, which I think would make more sense.

uniqueg commented 4 years ago

No no, your numbers weren't low. And it makes sense to keep that filter in.

rohank63 commented 3 years ago

After merging all the transcripts, I ran Kallisto and tested it on sample FastQ files:

# Shell
# Index command : builds an index from a FASTA formatted file of target sequences
> kallisto index -i transcripts.idx -k 31 transcripts.fasta
# Quant command: runs the quantification algorithm

# For single end fastQ file
# Finding fragment_length and standard deviation for single end fastQ
> awk 'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;printf("total %d avg=%f stddev=%f\n",n,m,sqrt(sq/n-m*m));}'  sra_data.fastq

total 1334218 avg=49.377981 stddev=4.109351

> kallisto quant -i transcripts.idx -o output -s 4.109351 -l 49.377981 --single sra_data.fastq
# For paired end fastQ file
> kallisto quant -i transcripts.idx -o output pairA_1.fastq pairA_2.fastq
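
The `awk` one-liner above can be sketched in Python as follows. The function name `read_length_stats` is an assumption for illustration. Like the `awk` version, it computes the population standard deviation of the read lengths. Note that, per the Kallisto manual, `-l` and `-s` refer to the estimated mean and standard deviation of the fragment length, so read-length statistics are at best a rough stand-in when the true fragment size distribution is unknown.

```python
import io
import math
from typing import Iterable, Tuple


def read_length_stats(fastq_lines: Iterable[str]) -> Tuple[int, float, float]:
    """Return (number of reads, mean read length, population standard deviation).

    Assumes four-line FASTQ records, i.e. no wrapped sequence lines.
    """
    n_reads = 0
    total = 0
    total_sq = 0
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:  # second line of each record holds the sequence
            length = len(line.rstrip("\r\n"))
            n_reads += 1
            total += length
            total_sq += length * length
    if n_reads == 0:
        raise ValueError("no reads found")
    mean = total / n_reads
    # Clamp at zero to guard against tiny negative values from rounding
    variance = max(total_sq / n_reads - mean * mean, 0.0)
    return n_reads, mean, math.sqrt(variance)


# Tiny in-memory example; for a real file, pass an open file handle instead
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
print(read_length_stats(example))
```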
# Python

"""Infer organism information for sequencing library."""

import os
from typing import Dict, Tuple

import pandas as pd


def count_info(file_name: str) -> Dict[Tuple[str, str], float]:
    """Infers organism count percentages.

    Args:
        file_name: Kallisto abundance file (abundance.tsv) to process.

    Returns:
        Dictionary mapping (organism name, taxon ID) to the percentage of
        estimated counts, sorted from highest to lowest.
    """
    organism_count: Dict[Tuple[str, str], float] = {}
    # abundance.tsv is tab-separated: target_id, length, eff_length, est_counts, tpm
    df = pd.read_csv(file_name, sep="\t")
    total_count = 0.0
    for target_id, est_counts in zip(df["target_id"], df["est_counts"]):
        # Target IDs are pipe-separated; fields 4 and 5 hold organism name and taxon ID
        contents = str(target_id).split("|")
        key = (contents[3], contents[4])
        # Update organism count
        organism_count[key] = organism_count.get(key, 0.0) + float(est_counts)
        total_count += float(est_counts)
    # Sort organisms by count, highest first
    sorted_organism_count = dict(
        sorted(organism_count.items(), key=lambda item: -item[1])
    )
    # Convert counts to percentages (guarding against a zero total)
    for key in sorted_organism_count:
        sorted_organism_count[key] = (
            sorted_organism_count[key] / total_count * 100 if total_count else 0.0
        )
    return sorted_organism_count


path = os.path.dirname(os.path.abspath(__file__))
file_path = os.path.join(path, "output", "abundance.tsv")
> print(count_info(file_path))
{('panubis', '9555'): 36.04259720716562, ('ptroglodytes', '9598'): 15.189547301341014, ('ppaniscus', '9597'): 15.188078685853707, ('malbus', '43700'): 9.70738657188484, ('lchalumnae', '7897'): 4.690350075283531, ('mnemestrina', '9545'): 3.6380770042418007, ('nleucogenys', '61853'): 2.17412727133832, ('pabelii', '9601'): 2.134916210287037, ('lbergylta', '56723'): 1.9347405963295994, ('anancymaae', '37293'): 1.5466814683494206, ('mmulatta', '9544'): 1.1547486852025006, ('mfascicularis', '9541'): 1.1463986859973792, ('csabaeus', '60711'): 1.0259372461970249, ('catys', '9531'): 0.8432172391193068, ('ogarnettii', '30611'): 0.5692197103339778, ('sbboliviensis', '39432'): 0.5486835252578721, ('ptephrosceles', '591936'): 0.4059442474611024, ('cporcellus', '10141'): 0.25191468694359026, ('capalliatus', '336983'): 0.24130625904668185, ('rbieti', '61621'): 0.22822133658912208, ('lcalcarifer', '8187'): 0.18316937610124437, ('tgelada', '9565'): 0.17354099478143795, ('psinensis', '13735'): 0.11206718660092017, ('atestudineus', '64144'): 0.10330776155916598, ('sldorsalis', '1841481'): 0.08807212181700995, ('ipunctatus', '7998'): 0.08493157635340838, ('mzebra', '106582'): 0.06766320700432917, ('nvison', '452646'): 0.058843096523792965, ('lafricana', '9785'): 0.05752577846242743, ('eaasinus', '83772'): 0.05074740525324686, ('mleucophaeus', '9568'): 0.03863000326113315, ('cporosus', '8502'): 0.038060553939935154, ('fheteroclitus', '8078'): 0.02960265306439401, ('munguiculatus', '10047'): 0.024642115345641313, ('smaximus', '52904'): 0.02279658022984605, ('enaucrates', '173247'): 0.017306176165992893, ('pcinereus', '38626'): 0.01691580175108229, ('sscrofa', '9823'): 0.01691580175108229, ('mmurinus', '30608'): 0.010678818776945117, ('ptaltaica', '74533'): 0.010452527639020014, ('vvulpes', '9627'): 0.008684213158218438, ('ecaballus', '9796'): 0.008605744982845604, ('amelanoleuca', '9646'): 0.008516006654556112, ('rferrumequinum', '59479'): 0.008469192173209994, ('tbelangeri', '37347'): 
0.006630063917327947, ('pnyererei', '303518'): 0.006428998468764146, ('csemilaevis', '244447'): 0.006346703093245131, ('ccanadensis', '51338'): 0.006344969223565645, ('mochrogaster', '79684'): 0.006343425656655858, ('mspicilegus', '10103'): 0.006343425656655858, ('bgrunniens', '30521'): 0.0060126900152939165, ('rroxellana', '61622'): 0.00600830247921473, ('psimus', '1328070'): 0.004374891517378035, ('mmmarmota', '9994'): 0.004268258532089651, ('mpfuro', '9669'): 0.004234511507596241, ('aplatyrhynchos', '8839'): 0.004228950437770573, ('falbicollis', '59894'): 0.004228950437770573, ('ttruncatus', '9739'): 0.00223569808318398, ('caperea', '37548'): 0.0021776980279299566, ('cjacchus', '9483'): 0.002137015604247188, ('clanigera', '34839'): 0.002123398304308982, ('umaritimus', '29073'): 0.002117710365970181, ('mauratus', '10036'): 0.002117583497457048, ('itridecemlineatus', '43179'): 0.0021172028919176486, ('dnovemcinctus', '9361'): 0.002115934206786317, ('mmurdjan', '586833'): 0.002114475380172593, ('cldingo', '286419'): 0.0021144752188852864, ('eburgeri', '7764'): 0.0021144752188852864, ('mcaroli', '10089'): 0.0021144752188852864, ('btaurus', '9913'): 0.00018350473187095958, ('bbbison', '43346'): 0.00018350473187095958, ('acalliptera', '8154'): 0.0}
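
One possible way to turn such a percentage dictionary into a single call is sketched below. The function name `best_organism` and the `min_share` threshold are hypothetical and would need to be tuned on samples of known origin. With the output above, no organism reaches a 50% share, so a strict threshold would report no confident call.

```python
from typing import Dict, Optional, Tuple

Organism = Tuple[str, str]  # (organism name, taxon ID), as returned by count_info


def best_organism(
    percentages: Dict[Organism, float],
    min_share: float = 50.0,  # hypothetical cut-off in percent
) -> Optional[Organism]:
    """Return the organism with the highest share, or None if it is below min_share."""
    if not percentages:
        return None
    organism, share = max(percentages.items(), key=lambda item: item[1])
    return organism if share >= min_share else None


# Top three entries from the output above
top_hits = {
    ("panubis", "9555"): 36.04259720716562,
    ("ptroglodytes", "9598"): 15.189547301341014,
    ("ppaniscus", "9597"): 15.188078685853707,
}
print(best_organism(top_hits))
print(best_organism(top_hits, min_share=30.0))
```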