xiaoli-dong / metaerg

42 stars 18 forks source link

master.tsv.txt has no entries for kegg and metacyc pathway #11

Open johanneswerner opened 4 years ago

johanneswerner commented 4 years ago

This issue is possibly the same as in #4 but since it has not received any attention, I will open this issue.

After fixing my workflow as described in #10, both with and without docker the workflow runs without errors.

However, the master.tsv.txt file has no entry in the columns kegg_pathway or metacyc_pathway, but those are annotated in master.gff.txt (example for one entry below)

gnl|Prokka|BS_1 Prodigal_v2.6.3 CDS     3283    4338    .       -       0       ID=metaerg.pl|00004;allec_ids=1.18.1.2;allgo_ids=GO:0016491,GO:0055114,GO:0004324,GO:0050660,GO:0050661;allko_ids=K00382,K00384,K00266;genomedb_OC=d__Bacteria%3Bp__Bacteroidota%3Bc__Bacteroidia%3Bo__Flavobacteriales%3Bf__Flavobacteriaceae%3Bg__MED-G13%3Bs__MED-G13 sp002697255;genomedb_acc=GCA_002697255.1;kegg_pathway_id=00240,00260,00251,00910,00010,00252,00620,00280,00020;kegg_pathway_name=Pyrimidine metabolism,Glycine%2C serine and threonine metabolism,Glutamate metabolism,Nitrogen metabolism,Glycolysis / Gluconeogenesis,Alanine and aspartate metabolism,Pyruvate metabolism,Valine%2C leucine and isoleucine degradation,Citrate cycle (TCA cycle);metacyc_pathway_id=PWY-101,PHOTOALL-PWY;metacyc_pathway_name=photosynthesis light reactions%3B,oxygenic photosynthesis%3B;metacyc_pathway_type=Electron-Transfer%3B Photosynthesis%3B,Photosynthesis%3B Super-Pathways%3B;pfam_acc=PF03486,PF13450,PF00070,PF07992,PF13738;pfam_desc=HI0933-like protein,NAD(P)-binding Rossmann-like domain,Pyridine nucleotide-disulphide oxidoreductase,Pyridine nucleotide-disulphide oxidoreductase,Pyridine nucleotide-disulphide oxidoreductase;pfam_id=HI0933_like,NAD_binding_8,Pyr_redox,Pyr_redox_2,Pyr_redox_3;sprot_desc=Ferredoxin--NADP reductase;sprot_id=sp|A5FJT9|FENR_FLAJ1

and in master.tbl.txt

4338    3283    CDS
                        ID      metaerg.pl|00004
                        allec_ids       1.18.1.2;
                        allgo_ids       GO:0016491;     GO:0055114;     GO:0004324;     GO:0050660;     GO:0050661;
                        allko_ids       K00382; K00384; K00266;
                        genomedb_OC     d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Flavobacteriales;f__Flavobacteriaceae;g__MED-G13;s__MED-G13 sp002697255;
                        genomedb_acc    GCA_002697255.1;
                        genomedb_target db:genomedb|GCA_002697255.1|MAP99322.1 1 350 evalue:3.5e-155 qcov:99.70 identity:75.70;
                        kegg_pathway_id 00240;  00260;  00251;  00910;  00010;  00252;  00620;  00280;  00020;
                        kegg_pathway_name       Pyrimidine metabolism;  Glycine, serine and threonine metabolism;       Glutamate metabolism;   Nitrogen metabolism;    Glycolysis / Gluconeogenesis;   Alanine and aspartate metabolism;       Pyruvate metabolism;    Valine, leucine and isoleucine degradation;     Citrate cycle (TCA cycle);
                        metacyc_pathway_id      PWY-101;        PHOTOALL-PWY;
                        metacyc_pathway_name    photosynthesis light reactions;;        oxygenic photosynthesis;;
                        metacyc_pathway_type    Electron-Transfer; Photosynthesis;;     Photosynthesis; Super-Pathways;;
                        pfam_acc        PF03486;        PF13450;        PF00070;        PF07992;        PF13738;
                        pfam_desc       HI0933-like protein;    NAD(P)-binding Rossmann-like domain;    Pyridine nucleotide-disulphide oxidoreductase;  Pyridine nucleotide-disulphide oxidoreductase;  Pyridine nucleotide-disulphide oxidoreductase;
                        pfam_id HI0933_like;    NAD_binding_8;  Pyr_redox;      Pyr_redox_2;    Pyr_redox_3;
                        pfam_target     db:Pfam-A.hmm|PF03486.14 evalue:2.4e-06 score:31.2 best_domain_score:19.2 name:HI0933_like;     db:Pfam-A.hmm|PF13450.6 evalue:0.0014 score:23.4 best_domain_score:21.5 name:NAD_binding_8;     db:Pfam-A.hmm|PF00070.27 evalue:2.8e-11 score:48.4 best_domain_score:45.1 name:Pyr_redox;       db:Pfam-A.hmm|PF07992.14 evalue:7.1e-29 score:105.7 best_domain_score:105.3 name:Pyr_redox_2;   db:Pfam-A.hmm|PF13738.6 evalue:6.1e-23 score:86.2 best_domain_score:74.4 name:Pyr_redox_3;
                        sprot_desc      Ferredoxin--NADP reductase;
                        sprot_id        sp|A5FJT9|FENR_FLAJ1;
                        sprot_target    db:uniprot_sprot|sp|A5FJT9|FENR_FLAJ1 1 350 evalue:1.5e-144 qcov:99.70 identity:70.00;

Any idea why this is the case and how this can be fixed?

Thank you very much.

johanneswerner commented 4 years ago

Hello, I have posted this issue seven days ago, no activity so far. Could you tell me if this is going to be fixed any time soon? Any response would be highly appreciated. Thank you.

Finesim97 commented 4 years ago

Hi, I don't know what will happen with this project in the future, but I will provide you with a fix. If I can't get it working within metaerg, I will give you an R script to extract this columns from the gff/tbl file. Best wishes, Lukas Jansen

Finesim97 commented 4 years ago

Ok, I found the problem. The gene2pathways hash(map) is always empty, the predictKegg/Metacyc functions have a local variable of the same name. Only the master.tsv file uses this hash. The pathways are added to the feature hash (f). This hash is used for the other files.

Replace this in bin/output_reports.pl: https://github.com/xiaoli-dong/metaerg/blob/7d6f785dab2b776db0209f35eb6d974a67df1290/bin/output_reports.pl#L149-L156 With https://github.com/xiaoli-dong/metaerg/blob/0f0d46995b62d12417c63ef4937d1b10ef12ec1b/bin/output_reports.pl#L143-L144

johanneswerner commented 4 years ago

oh cool, thank you very much for your help, testing it now. :-)

johanneswerner commented 4 years ago

almost solved: this is the current output (only showing the columns 2, 8 and 9):

feature_id      kegg_pathways   metacyc_pathways
BS|00001
BS|00002        00260
BS|00003        00650;00640;00280;00071;00632;03320;00930;01040;00410;00380;00310;00592
BS|00004        00240;00910;00252;00280;00020;00260;00251;00620;00010   PHOTOALL-PWY;PWY-101

How can the IDs be translated?

Finesim97 commented 4 years ago

Good morning, The MetaCyc pathways don't really have names, at least easily available. Two ideas: 1) The hack: Instead of kegg_pathway_id use kegg_pathway_name in output_reports.pl 2) R with minpath data

#!/usr/bin/env Rscript

readMaster = function(infile) {
  require(data.table)
  master = fread(master_tsv_path, fill = T, sep = "\t")
  # Weird col V17
  if ("V17" %in% colnames(master) && master[, all(is.na(V17))])
    master[, V17 := NULL]
  master
}

translateIds = function(tolookup, lookuptable, sep = ";") {
  require(data.table)
  unnested = data.table(elements = strsplit(tolookup, sep),
                        row = seq_along(tolookup))
  unnested = unnested[, .(elements = unlist(elements)), by = row]
  res = merge(
    unnested,
    lookuptable,
    by.x = "elements",
    by.y = colnames(lookuptable)[[1]],
    all.x = T
  )
  res = res[, .(collapsed = paste0(get(colnames(lookuptable)[[2]]), collapse = sep)), by =
              row]
  setorder(res, "row")
  res$collapsed
}

readMinPath = function(infile) {
  require(data.table)
  minpath = fread(infile,
                  header = F,
                  fill = T,
                  sep = "")
  # https://omics.informatics.indiana.edu/MinPath/output-readme.txt
  minpath_regex = "path (\\d+) (.+)  naive (0|1)  minpath (0|1)  fam0  (\\d+)  fam-found  (\\d+)  name  (.+)"
  minpath_colnames = c(
    "pathway_number",
    "reconstruction",
    "naive_recon",
    "minpath_recon",
    "total_fam_count",
    "hit_fam_count",
    "pathway_name"
  )
  minpath[, V1 := sub(minpath_regex, "\\1;\\2;\\3;\\4;\\5;\\6;\\7", V1)]
  minpath[, (minpath_colnames) := tstrsplit(V1, ";", fixed = T)]
  minpath[, V1 := NULL]
  minpath
}

master_tsv_path = "master.tsv.txt"
result_path = "translated.tsv"
minpath_path = "cds.gene2ko.minpath.txt"

minpath = readMinPath(minpath_path)
lookuptable = minpath[, .(pathway_number, pathway_name)]
master = readMaster(master_tsv_path)
master[, kegg_pathways := translateIds(kegg_pathways, lookuptable)]
fwrite(master, result_path)
johanneswerner commented 4 years ago

thank you for your help. I used the first hack and can confirm that it worked for me. I changed line 143 as follows:

 my @kegg_pathways = $f->get_tag_values("kegg_pathway_name") if $f->has_tag("kegg_pathway_name");