manual on the tool output

smb20200615 commented 3 years ago

Hello,

Is there a manual that explains the output files? I am interested in seeing which BGCs are shared by a range of genomes. The command to run the tool seems very simple, but I am having trouble interpreting the output.

Many thanks!

althonos commented 3 years ago

Hi Sara ! There is a small disclaimer on the website (gecco.embl.de) but nothing in detail. So I'm gonna explain it right here :smiley:

Formats

When you run GECCO, you get three types of files:

the XXX.features.tsv, which contains the genes and protein domains found in your input sequence, in tab-separated-values format.
the XXX.clusters.tsv, which is created if BGCs were detected in your input, and contains one line per BGC, in tab-separated-values format.
the XXX_cluster_N.gbk files, where one GenBank file is created for each cluster

You may also want to run GECCO in verbose mode (gecco -v run ...) to get a better feel for what's going on. Now, about the files:

`features.tsv`

So, the features.tsv is really about domain annotation, it's probably not really interesting for your use case. With this table, you get one line per protein domain. The columns are:

sequence_id: the identifier of the sequence this domain belongs to
protein_id: the identifier of the protein this domain belongs to (named after sequence_id, since GECCO handles the gene finding)
start, end and strand: the coordinates of the gene within the sequence, and the strand (as + or -)

Since each row corresponds to a domain, and a protein can have several domains, you can see a lot of lines with these values in common. The ones that change after that are:

domain: The accession of the domain
hmm: The HMM library the domain comes from (either Pfam or Tigrfam)
i_evalue: The independent e-value that was given to this domain by hmmsearch
domain_start and domain_end: the coordinates of the domain in the protein sequence
bgc_probability: the probability assigned by GECCO to whether or not this gene belongs to a BGC.
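
To make the "one line per domain" layout concrete, here is a minimal sketch that collapses the rows back to one entry per protein. It assumes the column names listed above; the sample rows are made up for illustration, and a real file may contain more columns.

```python
import csv
import io
from collections import defaultdict

# Made-up sample using the column names described above.
COLUMNS = ["sequence_id", "protein_id", "start", "end", "strand", "domain",
           "hmm", "i_evalue", "domain_start", "domain_end", "bgc_probability"]
ROWS = [
    ["seq1", "seq1_1", "100", "900", "+", "PF00109", "Pfam", "1e-20", "5", "120", "0.91"],
    ["seq1", "seq1_1", "100", "900", "+", "PF02801", "Pfam", "3e-15", "130", "240", "0.91"],
    ["seq1", "seq1_2", "950", "1500", "-", "PF00005", "Pfam", "2e-08", "10", "150", "0.12"],
]
SAMPLE = "\n".join("\t".join(row) for row in [COLUMNS] + ROWS) + "\n"


def domains_by_protein(handle):
    """Group domain rows by protein_id, keeping the gene-level probability."""
    proteins = defaultdict(lambda: {"bgc_probability": None, "domains": []})
    for row in csv.DictReader(handle, delimiter="\t"):
        entry = proteins[row["protein_id"]]
        # Gene-level values repeat on every domain row of the same protein.
        entry["bgc_probability"] = float(row["bgc_probability"])
        entry["domains"].append(row["domain"])
    return dict(proteins)


proteins = domains_by_protein(io.StringIO(SAMPLE))
for protein_id, info in proteins.items():
    print(protein_id, info["bgc_probability"], ";".join(info["domains"]))
```

With a real file you would pass `open("XXX.features.tsv", newline="")` instead of the in-memory sample.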

`clusters.tsv`

Here, you get a row for each BGC that was detected in your input:

sequence_id: the identifier of the sequence this BGC was found in
bgc_id: the identifier given to the BGC by GECCO
start and end: the coordinates of the BGC within the sequence
average_p: the average probability for all the genes to be in a BGC (sum of the probability for each gene / number of genes)
max_p: the probability of the gene with the highest probability to belong to a BGC
type: the predicted BGC type / biosynthetic class
alkaloid_probability, polyketide_probability, ripp_probability, saccharide_probability, terpene_probability, nrp_probability, other_probability: the probability that the BGC is of the given type (you can compare these with the type column to check how confident the type assignment is)
proteins: a semicolon-separated list of the proteins belonging to the BGC
domains: a semicolon-separated list of the protein domains within the BGC
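
As a rough illustration of how this table can be read, the sketch below picks the type column with the highest probability for each BGC and splits the semicolon-separated protein list. It assumes the column names listed above; the sample row and its values are invented.

```python
import csv
import io

TYPE_COLUMNS = ["alkaloid_probability", "polyketide_probability", "ripp_probability",
                "saccharide_probability", "terpene_probability", "nrp_probability",
                "other_probability"]
COLUMNS = (["sequence_id", "bgc_id", "start", "end", "average_p", "max_p", "type"]
           + TYPE_COLUMNS + ["proteins", "domains"])
# Made-up row for illustration only.
ROW = ["seq1", "seq1_cluster_1", "1200", "25400", "0.87", "0.99", "Polyketide",
       "0.01", "0.93", "0.02", "0.01", "0.03", "0.10", "0.05",
       "seq1_1;seq1_2;seq1_3", "PF00109;PF02801"]
SAMPLE = "\t".join(COLUMNS) + "\n" + "\t".join(ROW) + "\n"


def summarize_clusters(handle):
    """Return one small summary dict per BGC row."""
    summaries = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Strip the "_probability" suffix to get a short type name.
        probabilities = {name[: -len("_probability")]: float(row[name])
                         for name in TYPE_COLUMNS}
        top_type = max(probabilities, key=probabilities.get)
        summaries.append({
            "bgc_id": row["bgc_id"],
            "reported_type": row["type"],
            "top_type": top_type,
            "top_probability": probabilities[top_type],
            "proteins": row["proteins"].split(";"),
        })
    return summaries


summaries = summarize_clusters(io.StringIO(SAMPLE))
for summary in summaries:
    print(summary["bgc_id"], summary["reported_type"],
          summary["top_type"], summary["top_probability"])
```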

GenBank files

Each GenBank file created contains only the sequence for the BGC; genes found by Prodigal are annotated as CDS features, and domains found by HMMER are annotated as misc_feature features.
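
If you just want a quick sanity check of what a file contains, the sketch below counts the feature keys (such as CDS and misc_feature) in a GenBank feature table using only the standard library. It relies on the standard GenBank flat-file layout, where feature keys start in column 6; the record shown is a made-up stub, and a proper parser such as Biopython would be more robust for real work.

```python
from collections import Counter

# Made-up stub of a GenBank record, for illustration only.
SAMPLE = "\n".join([
    "LOCUS       seq1_cluster_1          1200 bp    DNA     linear   UNK",
    "FEATURES             Location/Qualifiers",
    "     CDS             1..300",
    '                     /locus_tag="seq1_1"',
    "     misc_feature    10..120",
    '                     /note="made-up domain"',
    "     CDS             400..900",
    "ORIGIN",
    "//",
])


def count_feature_keys(genbank_text):
    """Count feature keys in the FEATURES section of one GenBank record."""
    counts = Counter()
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line.startswith(" "):
            break  # next top-level section, e.g. ORIGIN
        # Feature key lines have 5 leading spaces; qualifier lines have 21.
        if line.startswith("     ") and len(line) > 5 and line[5] != " ":
            counts[line[5:].split()[0]] += 1
    return counts


counts = count_feature_keys(SAMPLE)
print(dict(counts))
```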

Checking similar BGCs

For this, I'd recommend having a look at MMseqs2, in particular the linclust command. Once you are done finding BGCs with GECCO, you can just use that to check if the ones you found cluster together, and then map that back to the genomes they originate from. Notably, if you use --cov-mode=1 you should be able to detect fragmented BGCs at the nucleotide level. Note that it only helps to detect BGCs with the same synteny (because you stay at the nucleotide level).
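
The "map that back to the genomes" step could look roughly like the sketch below. It assumes the MMseqs2 clustering has been exported as a two-column TSV (representative, member), and that you have your own lookup from BGC identifier to genome; the `GENOME_OF` dictionary and all identifiers here are hypothetical, and in practice the lookup could be built from the sequence_id and bgc_id columns of each clusters.tsv.

```python
import csv
import io
from collections import defaultdict

# Made-up two-column cluster table: representative <TAB> member.
CLUSTER_TSV = (
    "genomeA_contig1_cluster_1\tgenomeA_contig1_cluster_1\n"
    "genomeA_contig1_cluster_1\tgenomeB_contig4_cluster_2\n"
    "genomeC_contig2_cluster_1\tgenomeC_contig2_cluster_1\n"
)

# Hypothetical lookup from BGC identifier to genome of origin.
GENOME_OF = {
    "genomeA_contig1_cluster_1": "genomeA",
    "genomeB_contig4_cluster_2": "genomeB",
    "genomeC_contig2_cluster_1": "genomeC",
}


def genomes_per_cluster(handle, genome_of):
    """Map each cluster representative to the sorted genomes of its members."""
    genomes = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # tolerate blank lines
        representative, member = row
        genomes[representative].add(genome_of[member])
    return {rep: sorted(found) for rep, found in genomes.items()}


def shared_clusters(handle, genome_of):
    """Keep only the clusters whose members come from more than one genome."""
    return {rep: found
            for rep, found in genomes_per_cluster(handle, genome_of).items()
            if len(found) > 1}


shared = shared_clusters(io.StringIO(CLUSTER_TSV), GENOME_OF)
print(shared)
```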

Otherwise there are more dedicated tools like BiG-SLiCE to explore similar BGCs, but you may need to convert the inputs to the right format to make it work.

smb20200615 commented 3 years ago

Thank you so much for your reply. Just to clarify, I see that BiG-SLiCE can work with the output of DeepBGC and antiSMASH. Is it also possible to run it with the output of your tool? I have the gbk file but also need a csv file describing all region coordinates. https://github.com/medema-group/bigslice/blob/master/misc/generate_antismash_gbk/generate_antismash_gbk.py

althonos commented 3 years ago

@smb20200615 : you can find the region coordinates in the {something}.clusters.tsv file among the GECCO outputs. You'll just need to adapt the script you linked to load from that :+1:
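
As a starting point, a minimal sketch for pulling the coordinates out of the clusters table into a CSV might look like this. The output column names simply reuse the ones from clusters.tsv and are an assumption; the linked script may expect a different layout, so adjust accordingly. The sample row is made up.

```python
import csv
import io

# Made-up excerpt of a clusters table (only the columns used here).
SAMPLE = (
    "sequence_id\tbgc_id\tstart\tend\ttype\n"
    "seq1\tseq1_cluster_1\t1200\t25400\tPolyketide\n"
)


def region_rows(handle):
    """Yield [sequence_id, bgc_id, start, end] for each BGC row."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield [row["sequence_id"], row["bgc_id"], int(row["start"]), int(row["end"])]


def write_regions(clusters_handle, out_handle):
    """Write the region coordinates as comma-separated values."""
    writer = csv.writer(out_handle, lineterminator="\n")
    writer.writerow(["sequence_id", "bgc_id", "start", "end"])
    writer.writerows(region_rows(clusters_handle))


out = io.StringIO()
write_regions(io.StringIO(SAMPLE), out)
print(out.getvalue(), end="")
```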

althonos commented 3 years ago

Hi @smb20200615 ,

in v0.7.0 I added a dedicated subcommand to help using GECCO results with BiG-SLiCE without having to write the conversion script yourself. Have a look at the new documentation page for BiG-SLiCE integration!

smb20200615 commented 3 years ago

Thank you! Is this version downloadable via bioconda?

althonos commented 3 years ago

@smb20200615 it will be soon, I need to address #3 first.

smb20200615 commented 3 years ago

@althonos, thank you so much for your help. I just tried the region gbk files output by gecco convert and they are still not parsed correctly by tools such as BiG-SCAPE. Is there any way to generate them so that they more closely resemble the antiSMASH gbks? I am not sure what the difference between the two is.

althonos commented 3 years ago

Ah, I haven't tried with BiG-SCAPE, I'll see if there is a way to make the GenBank files compatible. IIRC, the issue is that BiG-SCAPE expects the GenBank files to label genes by kind (e.g. biosynthetic, transport, regulatory) but GECCO is not doing that, and there is no simple way to get that without doing an extra round of annotation with HMMER and AntiSMASH smCOGS.

Another issue is that AntiSMASH GenBank files 1. have non-standard features and qualifiers that often make sense only in AntiSMASH context and 2. have different type predictions compared to GECCO (and MIBIG or DeepBGC).

smb20200615 commented 3 years ago

Makes sense. Are there any other methods for clustering with known BGCs? I am not fully sure what you used in your paper.
