weberlab-hhu / Helixer

Using Deep Learning to predict gene annotations
GNU General Public License v3.0

[Question] Can this be used broadly for all microeukaryotes (e.g., fungi, protists, etc)? #100

Open jolespin opened 1 year ago

jolespin commented 1 year ago

I'm thinking about trying this out for a backend for my VEBA eukaryotic binning module (https://github.com/jolespin/veba) as an alternative to MetaEuk.

Looking at the examples here:

Helixer.py --lineage land_plant --fasta-path Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa  \
  --species Arabidopsis_lyrata --gff-output-path Arabidopsis_lyrata_chromosome8_helixer.gff3

I won't have the lineage or species at this point in the pipeline.

BjoernUsadel commented 1 year ago

Hi, the lineage selects the trained model, currently one of

Do you want to align it to the BUSCO lineages? This could be done with a simple lookup. However, because the gene-detection parameters change depending on these broad lineages, the lineage is needed. Cheers, b
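For illustration only, such a lookup might be sketched as below. The only `--lineage` value confirmed in this thread is `land_plant`; the other values, the choice of BUSCO dataset names, and the helper `helixer_lineage` are assumptions made for the sketch, not an official table.

```python
from typing import Optional

# Hypothetical mapping from BUSCO lineage names to Helixer --lineage values.
# Only "land_plant" appears in this thread; every other pairing is an assumption.
BUSCO_TO_HELIXER = {
    "embryophyta": "land_plant",
    "viridiplantae": "land_plant",
    "fungi": "fungi",
    "vertebrata": "vertebrate",
    "arthropoda": "invertebrate",
}


def helixer_lineage(busco_lineage: str) -> Optional[str]:
    """Return a Helixer lineage for a BUSCO lineage name, or None if unmapped."""
    # Accept names with or without a dataset suffix such as "_odb10".
    name = busco_lineage.lower().split("_odb")[0]
    return BUSCO_TO_HELIXER.get(name)


mapped = helixer_lineage("fungi_odb10")
unmapped = helixer_lineage("alveolata_odb10")
```

A lineage with no entry returns `None`, which is the case most protist bins would presumably hit until a matching model exists.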

jolespin commented 1 year ago

I meant: if Helixer had a model for each eukaryotic BUSCO lineage, it could auto-detect in the backend which model to use.

I guess I'm asking if there are plans to build out the models available.

alisandra commented 11 months ago

Hi Jolespin,

Thanks for your interest!

This is a cool idea on auto-detection. One would have to test it, but I'd hazard a guess that auto detection might work off of the confidence of Helixer's raw predictions.
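As a rough, untested sketch of that guess: assume each candidate model's raw predictions can be reduced to per-base class probabilities, then pick the model whose most likely class is, on average, the most confident. The function names and the toy numbers are hypothetical, and nothing here uses Helixer's actual output format.

```python
from statistics import mean


def mean_confidence(probs):
    """Average, over all positions, of the highest class probability."""
    return mean(max(position) for position in probs)


def pick_model(predictions):
    """Return the name of the model with the highest mean confidence."""
    return max(predictions, key=lambda name: mean_confidence(predictions[name]))


# Toy per-base class probabilities for two models on the same two positions.
predictions = {
    "land_plant": [[0.9, 0.05, 0.03, 0.02], [0.8, 0.1, 0.05, 0.05]],
    "fungi": [[0.5, 0.3, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]],
}
best = pick_model(predictions)
```

Whether mean confidence actually tracks annotation quality across lineages is exactly the part that would need testing.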

That said, I don't currently have much capacity for larger schemes, so in summary

That said on the no's, if anyone wants to see this enough to give it a shot, I'm happy to do what I can at a high level; drop a line.

jolespin commented 11 months ago

Understood the bandwidth issue. I'm in the same boat right now w/ my VEBA package.

I would be using this for microeukaryotic organisms (e.g., protists). Do you have any instructions on training a custom model? If so, what is required for this? Genomic and CDS sequences or could this be done directly from protein sequences?

alisandra commented 11 months ago

Yes, I'm currently working on the latest instructions here: https://github.com/weberlab-hhu/Helixer/blob/cleanup/docs/training.md (I will merge them to main soon).

You will need FASTA and GFF3 files for the training species (it's supervised training). I am not on top of the availability of good references in protists; I can imagine it might be a challenge.

Drop questions whenever!

jolespin commented 11 months ago

For the training data, are genes with alternative start codons used or discarded?

alisandra commented 11 months ago

In the current default implementation, genes with non-ATG start codons are used, but the upstream region is masked. So the network will learn there's a gene there, but not receive feedback on where exactly it started.

This behavior is of course not designed for alternative start codons, but is a side effect of the partial-gene-model detection and masking, which assumes the standard genetic code.

In general, supporting genetic code variants is on the "think about it" list; not doing so yet has been a rare, but still noticeable, issue in fungi already.
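To make the masking behavior concrete, here is a simplified sketch, not the actual implementation: for a plus-strand gene whose annotated start is not ATG, the training weights of a window upstream of the start are set to zero, so the gene body still contributes to the loss but the start position does not. The function name, the fixed window size, and the plus-strand-only handling are all assumptions.

```python
def start_codon_mask(seq, cds_start, upstream=5):
    """Per-base training weights (1 = used, 0 = masked) for a plus-strand gene.

    If the codon at cds_start is not ATG, the `upstream` bases before it are
    masked; the window size is a hypothetical parameter for this sketch.
    """
    weights = [1] * len(seq)
    if seq[cds_start:cds_start + 3].upper() != "ATG":
        for i in range(max(0, cds_start - upstream), cds_start):
            weights[i] = 0
    return weights


# Toy gene starting with CTG at index 6: the 4 bases upstream are masked.
seq = "GGGGGGCTGAAATAG"
weights = start_codon_mask(seq, cds_start=6, upstream=4)

# Toy gene starting with ATG at index 3: nothing is masked.
unmasked = start_codon_mask("GGGATGAAATAG", cds_start=3, upstream=4)
```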

jolespin commented 2 weeks ago

Any updates on whether or not this should be able to handle protists soon?

I've compiled this huge protein set for protists and fungi (https://zenodo.org/records/10139451), though unfortunately the genomes aren't available in this dataset.