Open smb20200615 opened 3 years ago
Hi Sara ! There is a small disclaimer on the website (gecco.embl.de) but nothing in detail. So I'm gonna explain it right here :smiley:
When you run GECCO, you get three types of files:
- `XXX.features.tsv`, which contains the genes and protein domains found in your input sequence, in tab-separated values format.
- `XXX.clusters.tsv`, which is created if BGCs were detected in your input, and contains one line per BGC, in tab-separated values format.
- `XXX_cluster_N.gbk` files, where one GenBank file is created for each cluster.

You may also want to run GECCO in verbose mode (`gecco -v run ...`) to get a better feel for what's going on. Now, about the files:
features.tsv
So, the `features.tsv` file is really about domain annotation, so it's probably not that interesting for your use case. With this table, you get one line per protein domain. The columns are:
- `sequence_id`: the identifier of the sequence this domain belongs to
- `protein_id`: the identifier of the protein this domain belongs to (named after `sequence_id`, since GECCO handles the gene finding)
- `start`, `end` and `strand`: the coordinates of the gene within the sequence, and the strand (as `+` or `-`)

Since each row corresponds to a domain, and a protein can have several domains, you can see a lot of lines with these values in common. The ones that change after that are:
- `domain`: the accession of the domain
- `hmm`: the HMM library the domain comes from (either Pfam or TIGRFAM)
- `i_evalue`: the independent e-value that was given to this domain by `hmmsearch`
- `domain_start` and `domain_end`: the coordinates of the domain in the protein sequence
- `bgc_probability`: the probability assigned by GECCO that this gene belongs to a BGC

clusters.tsv
Here, you get a row for each BGC that was detected in your input:
- `sequence_id`: the identifier of the sequence this BGC was found in
- `bgc_id`: the identifier given to the BGC by GECCO
- `start` and `end`: the coordinates of the BGC within the sequence
- `average_p`: the average probability of the genes to be in a BGC (sum of the probabilities of each gene / number of genes)
- `max_p`: the highest probability of belonging to a BGC among the genes of the cluster
- `type`: the predicted BGC type / biosynthetic class
- `alkaloid_probability`, `polyketide_probability`, `ripp_probability`, `saccharide_probability`, `terpene_probability`, `nrp_probability`, `other_probability`: the probability that the BGC is of a given type (you can use these to inspect what the `type` column is reporting and check the confidence of the type assignment)
- `proteins`: a semicolon-separated list of the proteins belonging to the BGC
- `domains`: a semicolon-separated list of the protein domains within the BGC

Each GenBank file created contains only the sequence for the BGC; genes found by Prodigal are annotated as `CDS` features, and domains found by HMMER are annotated as `misc_feature` features.
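To make the relationship between the two tables more concrete, here is a minimal Python sketch that loads both and recomputes the per-cluster summary from the per-gene probabilities. The sample rows and identifiers are made up and only use a subset of the columns described above, and the helper names (`read_tsv`, `gene_probabilities`, `type_probabilities`) are just for illustration; this is not GECCO's own code, so the recomputed values may not match the reported ones exactly.

```python
import csv
import io

# Made-up sample rows with a subset of the columns described above;
# real GECCO tables have more columns and real identifiers.
FEATURES_TSV = (
    "sequence_id\tprotein_id\tdomain\tbgc_probability\n"
    "seq1\tseq1_1\tPF00001\t0.90\n"
    "seq1\tseq1_1\tPF00002\t0.90\n"
    "seq1\tseq1_2\tPF00003\t0.70\n"
)
CLUSTERS_TSV = (
    "sequence_id\tbgc_id\tstart\tend\taverage_p\tmax_p\ttype\t"
    "polyketide_probability\tterpene_probability\tproteins\n"
    "seq1\tseq1_cluster_1\t100\t5000\t0.80\t0.90\tPolyketide\t"
    "0.85\t0.05\tseq1_1;seq1_2\n"
)


def read_tsv(handle):
    """Read a tab-separated table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def gene_probabilities(feature_rows):
    """Collapse the one-row-per-domain table to one probability per protein."""
    probs = {}
    for row in feature_rows:
        probs[row["protein_id"]] = float(row["bgc_probability"])
    return probs


def type_probabilities(cluster_row):
    """Collect every *_probability column of a cluster row, keyed by type."""
    suffix = "_probability"
    return {
        key[: -len(suffix)]: float(value)
        for key, value in cluster_row.items()
        if key.endswith(suffix)
    }


# With real files, replace io.StringIO(...) with open("XXX.features.tsv") etc.
features = read_tsv(io.StringIO(FEATURES_TSV))
clusters = read_tsv(io.StringIO(CLUSTERS_TSV))
probs = gene_probabilities(features)

summary = []
for row in clusters:
    proteins = row["proteins"].split(";")
    values = [probs[p] for p in proteins]
    summary.append({
        "bgc_id": row["bgc_id"],
        "n_proteins": len(proteins),
        "recomputed_average_p": sum(values) / len(values),
        "recomputed_max_p": max(values),
        "type_probabilities": type_probabilities(row),
    })

for item in summary:
    print(item)
```

Comparing the entries of `type_probabilities` is one way to see how confident the `type` call is: if two classes have similar probabilities, the assignment is less clear-cut.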
For this, I'd recommend having a look at MMseqs2, in particular the `linclust` command. Once you are done finding BGCs with GECCO, you can use it to check whether the ones you found cluster together, and then map that back to the genomes they originate from. Notably, if you use `--cov-mode=1` you should be able to detect fragmented BGCs at the nucleotide level. Note that this only helps to detect BGCs with the same synteny (because you stay at the nucleotide level).
Otherwise there are more dedicated tools like BiG-SLiCE to explore similar BGCs, but you may need to convert the inputs to the proper format to make them work.
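The "map that back to the genomes" step could look roughly like the Python sketch below. It assumes the clustering result is available as a two-column, tab-separated table of representative and member identifiers (the layout of the `_cluster.tsv` file written by MMseqs2 `easy-linclust`), and that the sequence identifiers are the GECCO `bgc_id` values. The sample data, the `BGC_TO_GENOME` lookup and the function name are hypothetical; you would build the lookup yourself from your own file naming.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-in for a "representative<TAB>member" clustering table.
LINCLUST_TSV = (
    "genomeA_cluster_1\tgenomeA_cluster_1\n"
    "genomeA_cluster_1\tgenomeB_cluster_3\n"
    "genomeC_cluster_2\tgenomeC_cluster_2\n"
)

# Hypothetical lookup from BGC identifier to the genome it was found in.
BGC_TO_GENOME = {
    "genomeA_cluster_1": "genomeA",
    "genomeB_cluster_3": "genomeB",
    "genomeC_cluster_2": "genomeC",
}


def genomes_per_family(handle, bgc_to_genome):
    """Group member BGCs by representative and collect their source genomes."""
    families = defaultdict(set)
    for representative, member in csv.reader(handle, delimiter="\t"):
        families[representative].add(bgc_to_genome[member])
    return dict(families)


families = genomes_per_family(io.StringIO(LINCLUST_TSV), BGC_TO_GENOME)

# Keep only the families that contain BGCs from more than one genome.
shared = {rep: sorted(genomes) for rep, genomes in families.items() if len(genomes) > 1}
print(shared)
```

Families left in `shared` are candidates for BGCs present in several genomes; singletons are BGCs that did not cluster with anything else at the chosen thresholds.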
Thank you so much for your reply. Just to clarify, I see that BiG-SLiCE can work with the output of DeepBGC and antiSMASH. Is it also possible to run it with the output of your tool? I have the gbk file but also need a csv file describing all region coordinates: https://github.com/medema-group/bigslice/blob/master/misc/generate_antismash_gbk/generate_antismash_gbk.py
@smb20200615 : you can find the region coordinates in the `{something}.clusters.tsv` file among the GECCO outputs. You'll just need to adapt the script you linked to load from that :+1:
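As a rough starting point for that adaptation, the Python sketch below pulls the region coordinates out of a `clusters.tsv` table and writes them as CSV. The sample rows are made up, and the output column names in `OUTPUT_COLUMNS` are placeholders: check the linked script for the layout it actually expects and rename or reorder them accordingly.

```python
import csv
import io

# Made-up sample with a subset of the clusters.tsv columns.
CLUSTERS_TSV = (
    "sequence_id\tbgc_id\tstart\tend\ttype\n"
    "seq1\tseq1_cluster_1\t100\t5000\tPolyketide\n"
    "seq2\tseq2_cluster_1\t42\t9001\tTerpene\n"
)

# Placeholder output column -> GECCO column. The output names are NOT taken
# from the BiG-SLiCE script; adjust them to whatever it expects.
OUTPUT_COLUMNS = {
    "contig": "sequence_id",
    "region": "bgc_id",
    "start": "start",
    "end": "end",
}


def clusters_to_region_csv(tsv_handle, csv_handle):
    """Copy the coordinate columns of clusters.tsv into a CSV table."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    writer = csv.DictWriter(
        csv_handle, fieldnames=list(OUTPUT_COLUMNS), lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow({out: row[src] for out, src in OUTPUT_COLUMNS.items()})


# With a real file, pass open("{something}.clusters.tsv") instead.
out = io.StringIO()
clusters_to_region_csv(io.StringIO(CLUSTERS_TSV), out)
region_csv = out.getvalue()
print(region_csv)
```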
Hi @smb20200615,
in `v0.7.0` I added a dedicated subcommand to make it easier to use GECCO results with BiG-SLiCE without having to write the conversion script yourself. Have a look at the new documentation page for BiG-SLiCE integration!
Thank you! Is this version downloadable via bioconda?
@smb20200615 it will be soon, I need to address #3 first.
@althonos, thank you so much for your help. I just tried the region gbk files output by `gecco convert`, and they are still not parsed correctly by tools such as BiG-SCAPE. Is there any way to generate them so that they more closely resemble the antiSMASH gbks? I am not sure what the difference between the two is.
Ah, I haven't tried with BiG-SCAPE, I'll see if there is a way to make the GenBank files compatible. IIRC, the issue is that BiG-SCAPE expects the GenBank files to label genes by kind (e.g. biosynthetic, transport, regulatory) but GECCO is not doing that, and there is no simple way to get that without doing an extra round of annotation with HMMER and the antiSMASH smCOGs.
Another issue is that antiSMASH GenBank files 1. have non-standard features and qualifiers that often make sense only in the antiSMASH context and 2. have different type predictions compared to GECCO (and MIBiG or DeepBGC).
Makes sense. Are there any other methods for clustering with known BGCs? I am not fully sure what you used in your paper.
Hello,
Is there a manual that would explain the output files? I am interested in seeing what BGCs are shared by a range of genomes. The command to run the tool seems very simple but I am having trouble interpreting the output.
Many thanks!