zellerlab / GECCO

GEne Cluster prediction with COnditional random fields.
https://gecco.embl.de
GNU General Public License v3.0

manual on the tool output #2

Open smb20200615 opened 3 years ago

smb20200615 commented 3 years ago

Hello,

Is there a manual that would explain the output files? I am interested in seeing what BGCs are shared by a range of genomes. The command to run the tool seems very simple, but I am having trouble interpreting the output.

Many thanks!

althonos commented 3 years ago

Hi Sara! There is a small disclaimer on the website (gecco.embl.de) but nothing in detail. So I'm gonna explain it right here :smiley:

Formats

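When you run GECCO, you get three types of files: a features table (features.tsv), a clusters table (clusters.tsv), and GenBank files for the detected BGCs.

You may also want to run GECCO in verbose mode (gecco -v run ...) to get a better feel for what's going on. Now, about the files: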

features.tsv

So, the features.tsv is really about domain annotation, it's probably not really interesting for your use case. With this table, you get one line per protein domain. The columns are:

Since each row corresponds to a domain, and a protein can have several domains, you can see a lot of lines with these values in common. The ones that change after that are:

clusters.tsv

Here, you get a row for each BGC that was detected in your input:

GenBank files

Each GenBank file created contains only the sequence for one BGC; genes found by Prodigal are marked as CDS features, and domains found by HMMER are marked as misc_feature features.
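If you just want a quick sanity check of how many genes and domains ended up in a given GenBank file, something like the sketch below should do. It only relies on the standard GenBank layout (feature keys start after exactly five spaces in the FEATURES table) and uses the standard library; the record in `EXAMPLE` is made up for illustration and is not real GECCO output. For anything serious, a proper parser such as Biopython is a better choice.

```python
import re
from collections import Counter

# In the GenBank feature table, a feature key line starts with exactly five
# spaces, followed by the key (e.g. CDS) and then the location. Qualifier
# lines are indented further, so they do not match this pattern.
FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+\S")


def count_features(genbank_text):
    """Count feature keys (CDS, misc_feature, ...) in a GenBank record."""
    counts = Counter()
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
        elif line.startswith(("ORIGIN", "CONTIG", "//")):
            in_features = False
        elif in_features:
            match = FEATURE_KEY.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts


# Made-up record for illustration only (NOT real GECCO output).
EXAMPLE = """\
LOCUS       example_cluster_1        120 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     CDS             1..60
                     /locus_tag="example_1"
     misc_feature    4..30
                     /note="hypothetical domain"
     CDS             complement(70..120)
                     /locus_tag="example_2"
ORIGIN
        1 atgc
//
"""

counts = count_features(EXAMPLE)
print(counts["CDS"], "genes,", counts["misc_feature"], "domains")
```

On a real file you would pass the content of the `.gbk` file (e.g. `open(path).read()`) instead of `EXAMPLE`.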

Checking similar BGCs

For this, I'd recommend having a look at MMseqs2, in particular the linclust command. Once you are done finding BGCs with GECCO, you can just use that to check if the ones you found cluster together, and then map that back to the genomes they originate from. Notably, if you use --cov-mode=1 you should be able to detect fragmented BGCs at the nucleotide level. Note that it only helps to detect BGCs with the same synteny (because you stay at the nucleotide level).

Otherwise there are more dedicated tools like BiG-SLiCE to explore similar BGCs, but you may need to convert the inputs to the proper format to make it work.
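For the MMseqs2 route, the "map that back to the genomes" step could look roughly like the sketch below. It assumes you have the two-column cluster TSV that MMseqs2 can write (representative id, then member id, tab-separated). The id scheme in `genome_of` (`<genome>_cluster_<n>`) is a hypothetical naming convention for the example, so adapt it to however your BGC sequences are actually named.

```python
import csv
import io
from collections import defaultdict


def load_clusters(tsv_handle):
    """Group member ids by representative id from a two-column cluster TSV."""
    clusters = defaultdict(set)
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].add(member)
    return clusters


def genome_of(bgc_id):
    # ASSUMPTION: ids look like "<genome>_cluster_<n>"; change this to match
    # your own naming (or replace it with a lookup table).
    return bgc_id.rsplit("_cluster_", 1)[0]


def shared_clusters(clusters):
    """Keep only clusters whose members come from at least two genomes."""
    shared = {}
    for representative, members in clusters.items():
        genomes = {genome_of(member) for member in members}
        if len(genomes) >= 2:
            shared[representative] = sorted(genomes)
    return shared


# Made-up clustering result for illustration only.
EXAMPLE_TSV = (
    "genomeA_cluster_1\tgenomeA_cluster_1\n"
    "genomeA_cluster_1\tgenomeB_cluster_3\n"
    "genomeC_cluster_2\tgenomeC_cluster_2\n"
)

shared = shared_clusters(load_clusters(io.StringIO(EXAMPLE_TSV)))
for representative, genomes in shared.items():
    print(representative, "is shared by", ", ".join(genomes))
```

With a real result you would pass `open("your_cluster_file.tsv")` instead of the `io.StringIO` example.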

smb20200615 commented 3 years ago

Thank you so much for your reply. Just to clarify, I see that BiG-SLiCE can work with the output of DeepBGC and antiSMASH. Is it also possible to run it with the output of your tool? I have the gbk file but also need a csv file describing all region coordinates. https://github.com/medema-group/bigslice/blob/master/misc/generate_antismash_gbk/generate_antismash_gbk.py

althonos commented 3 years ago

@smb20200615 : you can find the region coordinates in the {something}.clusters.tsv file among the GECCO outputs. You'll just need to adapt the script you linked to load from that :+1:
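As a starting point for that adaptation, here is a rough sketch of pulling the coordinates out of the clusters table with the standard library. The column names (`sequence_id`, `start`, `end`) are assumptions made for the example, so check the header of your own file and pass the right names if they differ; the table content in `EXAMPLE` is made up.

```python
import csv
import io


def read_regions(handle, seq_col="sequence_id", start_col="start", end_col="end"):
    """Return (sequence, start, end) tuples from a tab-separated clusters table.

    ASSUMPTION: the default column names are illustrative; check the header
    line of your own clusters.tsv and override them if needed.
    """
    regions = []
    for row in csv.DictReader(handle, delimiter="\t"):
        regions.append((row[seq_col], int(row[start_col]), int(row[end_col])))
    return regions


# Made-up table for illustration only (NOT real GECCO output).
EXAMPLE = (
    "sequence_id\tstart\tend\n"
    "contig_1\t100\t2500\n"
    "contig_2\t40\t900\n"
)

for sequence, start, end in read_regions(io.StringIO(EXAMPLE)):
    print(sequence, start, end)
```

From there you can write the tuples out in whatever layout the conversion script expects.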

althonos commented 3 years ago

Hi @smb20200615 ,

in v0.7.0 I added a dedicated subcommand to help using GECCO results with BiG-SLiCE without having to write the conversion script yourself. Have a look at the new documentation page for BiG-SLiCE integration!

smb20200615 commented 3 years ago

Thank you! Is this version downloadable via bioconda?

althonos commented 3 years ago

@smb20200615 it will be soon, I need to address #3 first.

smb20200615 commented 3 years ago

@althonos, thank you so much for your help. I just tried the region gbk files output by gecco convert, and they are still not parsed correctly by tools such as BiG-SCAPE. Is there any way to generate them so they more closely resemble the antiSMASH gbks? I am not sure what the difference between the two is.

althonos commented 3 years ago

Ah, I haven't tried with BiG-SCAPE, I'll see if there is a way to make the GenBank files compatible. IIRC, the issue is that BiG-SCAPE expects the GenBank files to label genes by kind (e.g. biosynthetic, transport, regulatory) but GECCO is not doing that, and there is no simple way to get that without doing an extra round of annotation with HMMER and the antiSMASH smCOGs.

Another issue is that antiSMASH GenBank files 1. have non-standard features and qualifiers that often make sense only in the antiSMASH context and 2. have different type predictions compared to GECCO (and MIBiG or DeepBGC).

smb20200615 commented 3 years ago

Makes sense. Are there any other methods for clustering with known BGCs? I am not fully sure what you used in your paper.