will-rowe / groot

A resistome profiler for Graphing Resistance Out Of meTagenomes
MIT License

groot-db output issues #40

Open LeonardosMageiros opened 4 years ago

LeonardosMageiros commented 4 years ago

Hi!

I am exploring the option of using groot-db, as it combines three widely used AMR databases. I think this is a very good idea, but I am facing some issues that I believe are worth addressing.

Here is a typical output that I have from my results:


C_RESFINDERerm(F)_3_M17808 211 801 762M39D
groot-db_CARD__gb|GQ342996|+|797-1793|ARO:3003097|CfxA6 346 997 38D948M11D
groot-db_ARGANNOT(Bla)cfxA6:GQ342996:798-1793:966 346 996 38D948M10D
groot-db_RESFINDER__tet(Q)_4_Z21523 194 1926 12D1850M64D
groot-db_ARGANNOT__(Tet)TetQ:Z21523:362-2287:1926 197 1974 1910M64D

It is clear that entries 2-3 and 4-5 are duplicates: the same gene (maybe a different allele?) is presented twice in the report. This makes parsing and summarizing the results quite tricky. Can you see any way to tackle that?
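As a stopgap on the parsing side, duplicates like these could be collapsed by grouping entries on a shared accession. This is only a minimal sketch and not part of groot: the meaning of the four columns is assumed, and the `ACCESSION` pattern and `group_by_accession` helper are hypothetical names that only cover simple GenBank-style accessions such as GQ342996 or Z21523.

```python
import re
from collections import defaultdict

# Entries 2-5 copied from the report above. The column meanings
# (name, read count, gene length, coverage string) are an assumption.
REPORT = """\
groot-db_CARD__gb|GQ342996|+|797-1793|ARO:3003097|CfxA6 346 997 38D948M11D
groot-db_ARGANNOT(Bla)cfxA6:GQ342996:798-1793:966 346 996 38D948M10D
groot-db_RESFINDER__tet(Q)_4_Z21523 194 1926 12D1850M64D
groot-db_ARGANNOT__(Tet)TetQ:Z21523:362-2287:1926 197 1974 1910M64D
"""

# Hypothetical pattern: 1-2 capital letters followed by 5-6 digits,
# not embedded in a longer alphanumeric token.
ACCESSION = re.compile(r"(?<![A-Za-z0-9])([A-Z]{1,2}\d{5,6})(?![A-Za-z0-9])")


def group_by_accession(report):
    """Group report lines by the accession found in the first column."""
    groups = defaultdict(list)
    for line in report.splitlines():
        if not line.strip():
            continue
        # Split from the right so the name keeps any stray whitespace.
        name, reads, length, coverage = line.rsplit(None, 3)
        match = ACCESSION.search(name)
        # Fall back to the full name when no accession is recognised.
        key = match.group(1) if match else name
        groups[key].append((name, int(reads), int(length), coverage))
    return dict(groups)


groups = group_by_accession(REPORT)
for accession, entries in groups.items():
    print(accession, [name for name, *_ in entries])
```

Entries that share an accession still need a decision on which one to report (for example the one with the most reads); the sketch only makes the duplicates visible.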

Also, the format of each entry depends on the database of origin, so the first column is different for CARD, ARGANNOT and RESFINDER. This is also a bit confusing and difficult to handle. Do you think you could homogenize that? If not, could you give a description of the format for each database in the report files?
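To illustrate what a homogenized first column might look like, here is a minimal sketch that maps the three naming schemes onto one record. The field layouts are inferred only from the example entries in this report, not from any groot documentation, and `Entry` and `parse_name` are hypothetical names.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    database: str
    gene: str
    accession: str


PREFIX = "groot-db_"


def parse_name(name: str) -> Entry:
    """Parse a first-column name into (database, gene, accession)."""
    if not name.startswith(PREFIX):
        raise ValueError(f"unexpected prefix: {name!r}")
    database, sep, rest = name[len(PREFIX):].partition("__")
    if not sep:
        raise ValueError(f"missing '__' separator: {name!r}")
    if database == "CARD":
        # Assumed layout: gb|accession|strand|start-end|ARO:id|gene
        fields = rest.split("|")
        return Entry(database, fields[5], fields[1])
    if database == "ARGANNOT":
        # Assumed layout: (class)gene:accession:start-end:length
        fields = rest.split(":")
        gene = fields[0].split(")", 1)[-1]
        return Entry(database, gene, fields[1])
    if database == "RESFINDER":
        # Assumed layout: gene_allele_accession; this breaks if the
        # accession itself contains an underscore.
        gene, _allele, accession = rest.rsplit("_", 2)
        return Entry(database, gene, accession)
    raise ValueError(f"unknown database: {database!r}")


card = parse_name("groot-db_CARD__gb|GQ342996|+|797-1793|ARO:3003097|CfxA6")
argannot = parse_name("groot-db_ARGANNOT__(Tet)TetQ:Z21523:362-2287:1926")
resfinder = parse_name("groot-db_RESFINDER__tet(Q)_4_Z21523")
print(card, argannot, resfinder, sep="\n")
```

Gene names are not spelled consistently across the databases (CfxA6 versus cfxA6, tet(Q) versus TetQ), so the accession is probably the safer field to match on.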

Please let me know what you think. Thank you in advance, Leonardos

LeonardosMageiros commented 4 years ago

One additional thing I noticed: in one of my output files I have the following entry:

groot-db_CARD__gb|NC_000913.3|-|484425-485619|ARO:3004043|Escherichia 26 1194 460M9D719M6D

which in CARD corresponds to this gene: https://card.mcmaster.ca/ontology/41090

Notice that at the end of the first column only the word Escherichia appears, instead of Escherichia coli acrA.

Maybe that is something else that needs to be fixed?
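One possible explanation, which is only a guess and not confirmed anywhere in this thread, is that the reference name was cut at the first whitespace when the database was built. The sketch below uses a hypothetical header string, assembled from the entry in this comment plus the CARD gene name, to show the effect and one way to avoid it.

```python
# Hypothetical header: the reported entry name with the full CARD
# gene name ("Escherichia coli acrA") at the end.
header = "gb|NC_000913.3|-|484425-485619|ARO:3004043|Escherichia coli acrA"

# Keeping only the first whitespace-delimited token drops " coli acrA".
truncated = header.split()[0]

# Joining the tokens with underscores keeps the whole name in one token.
preserved = "_".join(header.split())

print(truncated)
print(preserved)
```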

Best, Leonardos

will-rowe commented 4 years ago

Hi Leonardos

This is great feedback - thank you! As you have spotted, I did no curation when I merged those databases. This would definitely be something to re-visit. The databases in general could do with a bit of TLC.

I will endeavour to get around to this asap, but things are pretty busy for me at the moment, so no promises on when I'll be able to do it!

By the way, in case you didn't see it, a new version of groot is in conda as of yesterday - this version is much more efficient than previous versions, so I recommend updating to it if you haven't already.