xiaoli-dong / metaerg

40 stars 18 forks source link

issues with KEGG annotations and KO mapping #26

Open camillaln opened 4 years ago

camillaln commented 4 years ago

Hi,

I have used metaerg to annotate metagenome bins, and have found some issues with the KEGG annotations.

I tested out a Methanogen MAG. It has the McrA gene and its correctly annotated in the cds.faa file. However, this does not show up in the KEGG map.

metaerg.pl|00402 Methyl-coenzyme M reductase II subunit alpha OS=s__Methanothermobacter_sp000828575 len=553 cid=Contig_2,9,4,3_consensus_sequence

We then tried to use the cds.gene2ko.mapping.txt The McrA gene comes up here, but this file appears to be over annotated and adds in functions that are not present in the genome - e.g. DsrA, K11180. The gene with this KO added is metaerg.pl|01003 - which is correctly annotated as >metaerg.pl|01003 Coenzyme F420-dependent sulfite reductase OS=s__Methanothermobacter_thermautotrophicus len=689 cid=Contig_1_consensus_sequence in the cds.faa file (KO K21816 with blastkoala)

Have anyone else noted discrepancies like this? I have attached the data folder from the analysis.

data.zip

Camilla

Finesim97 commented 4 years ago

Hi, For a recent paper and my bachelor thesis we used MetaErg and other annotation tools to annotate anaerobic digestion metagenomes from biogas plants and I have to say that a similar problem often occurs for other pathways as well. Disulfid reductase and other key anaerobic digestion genes appear to be not well represented in annotation databases. Sadly it seems like a human check will always be necessary.

MetaErg performs its KO annotation by a BLAST search against a UniProt database. While this is a valid approach, only reference genes also present in KEGG have KO terms annotated.

I like KofamKOALA. It is also not perfect, but they tried to reduce false positives and can be run offline. Sometimes even the reference genomes miss a annotation, making certain pathways not complete. https://academic.oup.com/bioinformatics/article/36/7/2251/5631907

eggNOGmapper can also provide you with EC numbers and KO terms, but many are automatically generated and appear to be quite questionable sometimes (especially KO terms), but the orthologs can be compared among different genomes and help to identify certain genes https://github.com/eggnogdb/eggnog-mapper

InterProScan pathway annotations also contain EC numbers and those often helped me to complete certain pathways. https://github.com/ebi-pf-team/interproscan

Hopefully this helps in one way or another, Lukas Jansen

Finesim97 commented 4 years ago

MetaErg also adds the FOAM and TIGRFAM KO hits, wasn't sure, but just found it in the source code.

camillaln commented 4 years ago

Hi, Thanks for your comments Lukas. What I find the most puzzling (and problematic) with metaerg is the difference in what it draws in the maps provided and what is present in the KO list in the cds.gene2ko.mapping.txt file since I assume they should be identified by the same BLAST search against a UniProt database ? We are also using Metasanity for annotation which seems to arrive at better results for functional annotation. (I really like Metaergs taxonomy of CDSs though, so it would be great if the KEGG part was as useful). Camilla

Finesim97 commented 4 years ago

Hi, Thanks for sharing Metasanity, will have a look. Regarding the maps, maybe MetaErg only uses a subset of annotated KOs? Will search for that in the source.

Finesim97 commented 4 years ago

Found the issue. The report uses the KOs after they have been selected by MinPath, while the cds.gene2ko.mapping.txtand cds.gene2ko.tab.txt contain all entries:

Using your data:

Sizes
report  837
gene2ko.mapping 1476
gene2ko.tab     1476
gene2ko_minpath 837

Similarity (%):
report  gene2ko.mapping 56.71
report  gene2ko.tab     56.71
report  gene2ko_minpath 100.0
gene2ko.mapping gene2ko.tab     100.0
gene2ko.mapping gene2ko_minpath 56.71
gene2ko.tab     gene2ko_minpath 56.71
#!/usr/bin/env python3

import json
from io import StringIO
import re
import itertools

# Table used for the fam_found entries:
reporttable = 'data/data/kegg.cds.profile.datatable.json'
kolist = 'data/data/cds.gene2ko.mapping.txt'
tab = 'data/data/cds.gene2ko.tab.txt'
minpath = 'data/data/cds.gene2ko.minpath.details'

def kosfromtable(filepath):
    with open(filepath, 'rt') as jsonin:
        # Memory inefficient, but ok for now
        jsonstr = jsonin.read()
        jsonstr = re.findall(r'data = (\[.+\])',jsonstr, re.DOTALL)[0]
        jsonstr = re.sub(r',\s*}','}',jsonstr)
        jsonstr = json.load(StringIO(jsonstr))
        kos = {ko for path in jsonstr for ko in path['kos'].split('+')}
        return kos

def kosfromgene2ko(filepath,i=1,skip=False):
    with open(filepath, 'rt') as jsonin:
        if skip:
            next(jsonin)
        kos = {line.split('\t')[i].strip() for line in jsonin }
        return kos

kohit = re.compile(r'^   (K\d{5}) .+')
def kosfromminpath(filepath):
    with open(filepath, 'rt') as jsonin:
        kos = {line.group(1) for line in map(kohit.match, jsonin) if line is not None}
        return kos

def jaccardsim(A: set, B: set):
    return len(A.intersection(B))/len(A.union(B))*100

def main():
    kos = {
    'report': kosfromtable(reporttable),
    'gene2ko.mapping': kosfromgene2ko(kolist),
    'gene2ko.tab': kosfromgene2ko(tab,8,True),
    'gene2ko_minpath': kosfromminpath(minpath)
    }
    print('Sizes:')
    for koset in kos.keys():
        print(f'{koset}\t{len(kos[koset])}')
    print('\nSimilarity:')
    for A, B in itertools.combinations(kos.keys(),2):
        print('\t'.join((A,B,str(round(jaccardsim(kos[A],kos[B]),2)))))

if __name__ == '__main__':
    main()
camillaln commented 4 years ago

Thanks - that makes sense! Thanks you so much for sorting this out for me.