[Question] Documentation - Gecco use cases for 'annotation', downstream 'antismash'

tamuanand commented 3 years ago

Hi @althonos

I have some questions pertaining to documentation. I know you mention some documentation here and also have a disclaimer.

Before I ask my questions, I think there is a bug or something wrong in the help text for -vvv (verbose debugging). I do not think that -vvv is working. Does it stand for "very very verbose"?

When I invoke it, it causes the program to exit:

gecco -vvv run --genome GENOME.fasta -o gecco_GENOME >& verbose_GENOME_gecco.txt &

However, the same command works if I change -vvv to -vv.

Here is the relevant gecco --help text - it states vvv shows debug information

gecco --help

Parameters:
    -h, --help                 show the message for ``gecco`` or
                               for a given subcommand.
    -q, --quiet                silence any output other than errors
                               (-qq silences everything).
    -v, --verbose              increase verbosity (-v is minimal,
                               -vv is verbose, and -vvv shows
                               debug information).
    -V, --version              show the program version and exit.

I have some questions/feature requests:

When do you use the gecco annotate command and what is the purpose of it?
In what scenarios does one use gecco for downstream post-processing with antismash? I could not understand the use case for it from the preprint.
I am assuming you would have done a downstream BiG-SLiCE process with your datasets. As a feature request or enhancement, it would be nice to have gecco outputs (or scripts) in a compatible way for BiG-SLiCE.
- I do also note that you mention here to write our own scripts to make it compatible for BiG-SLiCE

Parameters - Cluster Detection:
    -c, --cds <N>                 the minimum number of coding sequences a
                                  valid cluster must contain. [default: 3]
    -m <m>, --threshold <m>       the probability threshold for cluster
                                  detection. Default depends on the
                                  post-processing method (0.4 for gecco,
                                  0.6 for antismash).
    --postproc <method>           the method to use for cluster validation
                                  (antismash or gecco). [default: gecco]

althonos commented 3 years ago

Hi @tamuanand

I do not think that the -vvv is working.

Yes, this is an old option and it doesn't work anymore; I just forgot to remove the old help text. There are just three verbosity levels now (nothing, -v and -vv). I've fixed the help message, but we have yet to publish the next release with that fix.

When do you use the gecco annotate command and what is the purpose of it

I added this command to make it easier to create training data: it creates the feature tables that are then used with gecco embed and gecco train. It basically runs the ORF detection and HMM annotation stages. If you don't plan to re-train GECCO yourself, this command won't be of much interest to you.

In what scenarios does one use gecco for downstream post-processing with antismash

Well, none really. You'd probably want to use them as complements to one another, as they will give you different putative clusters (AntiSMASH being very good at finding clusters close to known things, GECCO being better at identifying novel architectures).

If you are confused about the --postproc option, it's not actually for post-processing AntiSMASH results with GECCO or anything: it controls how we filter candidate cluster regions identified by the CRF (the antismash criterion being harsher, and requiring some domains AntiSMASH considers "biosynthetic" to be present in the candidate BGC).
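To make the filtering idea concrete, here is a minimal sketch of how a probability threshold and a minimum CDS count (the -m/--threshold and -c/--cds options in the help text) could be combined to select candidate regions. This is not GECCO's actual implementation: the function name `candidate_clusters`, the use of `>=` for the comparison, and the input format (one probability per gene, in genomic order) are all assumptions made for illustration. The extra domain requirement of the antismash criterion is not modelled.

```python
from typing import List, Tuple


def candidate_clusters(
    probabilities: List[float],
    threshold: float = 0.4,  # mirrors the documented default for --postproc gecco
    min_cds: int = 3,        # mirrors the documented default for --cds
) -> List[Tuple[int, int]]:
    """Return (start, end) gene index pairs, end exclusive.

    Hypothetical helper: keeps runs of consecutive genes whose probability
    is at or above `threshold`, and drops runs shorter than `min_cds` genes.
    """
    clusters = []
    start = None  # index where the current run began, if any
    for i, p in enumerate(probabilities):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run just ended at gene i - 1; keep it if long enough.
            if i - start >= min_cds:
                clusters.append((start, i))
            start = None
    # Handle a run that extends to the last gene.
    if start is not None and len(probabilities) - start >= min_cds:
        clusters.append((start, len(probabilities)))
    return clusters
```

Under these assumptions, raising the threshold from 0.4 to 0.6 shortens or removes runs, which is one way to picture why the antismash setting is described as the harsher one.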

I am assuming you would have done a downstream BiG-SLiCE process with your datasets

We actually didn't, as we didn't find BiG-SLiCE scalable enough for our dataset: it doesn't support heavily distributed computations and requires annotating the entirety of the BGCs with hmmscan (which couldn't be done on our HPC cluster).

I do also note that you mention here to write our own scripts to make it compatible for BiG-SLiCE

I am currently writing a dedicated command to help get results into BiG-SLiCE, but everything is already there in the GenBank "structured comments" of the output.
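For anyone writing their own conversion script in the meantime, the sketch below shows one way to pull structured comments out of a GenBank file using only the Python standard library. It relies on the general GenBank layout (a COMMENT section containing `##Prefix-START##`, `key :: value` lines, and `##Prefix-END##`). The function name `structured_comments` is hypothetical, and the sample record, its prefix, and its keys are invented for illustration; the thread does not state which prefix or keys GECCO writes, so inspect a real output file first.

```python
import re
from typing import Dict

START = re.compile(r"^##(.+)-START##$")
END = re.compile(r"^##(.+)-END##$")


def structured_comments(genbank_text: str) -> Dict[str, Dict[str, str]]:
    """Collect structured comment blocks as {prefix: {key: value}}."""
    blocks: Dict[str, Dict[str, str]] = {}
    in_comment = False
    current = None  # dictionary of the block being read, if any
    for line in genbank_text.splitlines():
        if line.startswith("COMMENT"):
            in_comment = True
        elif line[:1] not in (" ", ""):
            # Any other keyword in column 0 ends the COMMENT section.
            in_comment = False
        if not in_comment:
            continue
        content = line[12:].strip()  # skip the 12-column keyword field
        match = START.match(content)
        if match:
            current = blocks.setdefault(match.group(1), {})
        elif END.match(content):
            current = None
        elif current is not None and "::" in content:
            key, value = content.split("::", 1)
            current[key.strip()] = value.strip()
    return blocks


# Made-up record for demonstration only; NOT real GECCO output.
SAMPLE = """\
LOCUS       EXAMPLE_cluster_1       1000 bp    DNA     linear   UNK
COMMENT     ##Example-Data-START##
            tool_version      :: 0.0.0
            cluster_type      :: Unknown
            ##Example-Data-END##
FEATURES             Location/Qualifiers
"""

if __name__ == "__main__":
    print(structured_comments(SAMPLE))
```

The resulting dictionary can then be mapped onto whatever fields the downstream tool expects.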

smb20200615 commented 3 years ago

Hi @althonos, I am not able to get the datasets.tsv file and the taxonomy folders. Are those supposed to be generated via the convert command?

althonos commented 3 years ago

I am not able to get the datasets.tsv file and the taxonomy folders. Are those supposed to be generated via the convert command?

BiG-SLiCE requires these files because of the input structure it expects; GECCO cannot generate them for you.

tamuanand commented 3 years ago

Hi @althonos

Thanks for responding to my queries.

I have a follow up query: You suggest to use gecco as a complement to antiSMASH

gecco being better at identifying novel architectures and antiSMASH at finding known things.

My question: I am assuming gecco will also be able to find clusters close to known things, correct? Based on Fig 3a of the preprint, is my understanding below correct for just the gecco vs antiSMASH comparison?

gecco alone - 374,849
gecco and antiSMASH intersection - 301,201 plus 75,048
antiSMASH alone - 524,420

Were the above done with antiSMASH 5.1 or 5.2?

The reason I ask is that the preprint mentions antiSMASH 4.2 in one place. Is there any specific reason for using 4.2 when 5.1 or 5.2 was already available?

The command-line implementation of antiSMASH v4.2.0 was then used to identify the coordinates of known BGCs in all selected contigs (using default settings), and ORFs/domains that overlapped with the resulting known BGC regions were removed from the feature table, yielding a final BGC-negative feature table for each prokaryotic contig (Supplementary Figure S2).

tamuanand commented 3 years ago

Hi @althonos

I was wondering if you could elaborate on the above.

Thanks

althonos commented 3 years ago

@tamuanand: Figure 3a was done with antiSMASH 5.2.

We used antiSMASH 4.2 to mask the biosynthetic regions from our training data, because we prepared the sequences at a time when antiSMASH 5 was not available. We are in the process of improving our training set, which includes rebuilding our set of contigs, and for this we will use antiSMASH 5.2 as well.

tamuanand commented 3 years ago

Hi @althonos

AntiSMASH 6 is now available - if you are planning to use antiSMASH I would recommend using antiSMASH 6.0
