weberlab-hhu / Helixer

Using Deep Learning to predict gene annotations
GNU General Public License v3.0

animalia training file #81

Closed wyim-pgl closed 1 year ago

wyim-pgl commented 2 years ago

Dear Helixer, I am looking for an animal, specifically arthropod, training file. The results are mentioned in your paper, but I couldn't find the training file. Can you please tell me where it is? Thanks.

alisandra commented 2 years ago

Dear wyim-pgl,

Thanks for your interest! The models from the paper, which work well for vertebrates or land plants, are available here: https://zenodo.org/record/3974409, and the instructions for using these models can still be found in the v0.2.0 tag (https://github.com/weberlab-hhu/Helixer/tree/v0.2.0). Unfortunately, these models 1. were never validated for arthropods, and are really only expected to work well for vertebrates, and 2. are not as usable as the current models (they couldn't yet produce a gff3 file), so I kinda doubt they will help you much.

What I can say is that when I get a chance to work on this again, getting some current fungi and broadly applicable animal models up is basically top priority. Still, I don't know when this will be, so I cannot recommend waiting on it. I will reply here again when they are available.

Cheers, Alisandra

wyim-pgl commented 2 years ago

Hi, we are trying to train Helixer on multiple reference genomes. We couldn't find any way to merge multiple GFFs into one sqlite3 database. Do you have any recommendations for this?

alisandra commented 2 years ago

Dear wyim-pgl,

Some quick tips on training

E.g., an excerpt from a recent training run of mine on fungi:

data_directory
├── training_data.Alternaria_rosae.h5
├── training_data.Aspergillus_aculeatus.h5
├── training_data.Aspergillus_chevalieri.h5
├── training_data.Aspergillus_nomiae.h5
...
├── validation_data.Bipolaris_sorokiniana.h5
├── validation_data.Bipolaris_victoriae.h5
├── validation_data.Blastomyces_gilchristii.h5
├── validation_data.Candida_albicans.h5
...
└── validation_data.Zasmidium_cellare.h5

Note that since initial publication, and perhaps not surprisingly, we have gotten substantially better generalization by using separate species for validation rather than a subset of the training genomes. To save time, we generally still down-sample the validation genomes via https://github.com/weberlab-hhu/helixer_scratch/blob/master/data_scripts/sample-single-genomes.py
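
For anyone setting up a similar directory, here is a rough sketch of the layout above: one h5 file per species, with whole species assigned to either training or validation. It is not part of Helixer. The function names `plan_species_split` and `link_into_data_directory` are hypothetical, the random choice of validation species is an assumption (the thread does not say how they were picked), and it assumes the per-species h5 files already exist and that the `training_data.` / `validation_data.` prefixes are what matters, as the listing suggests.

```python
import random
from pathlib import Path


def plan_species_split(species, n_validation, seed=0):
    """Assign whole species to either training or validation (hypothetical helper).

    Returns a dict mapping species name -> file name in the data directory,
    following the naming seen in the listing above.
    """
    if not 0 < n_validation < len(species):
        raise ValueError("need at least one species on each side of the split")
    shuffled = sorted(species)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    validation = set(shuffled[:n_validation])
    return {
        name: f"{'validation' if name in validation else 'training'}_data.{name}.h5"
        for name in sorted(species)
    }


def link_into_data_directory(per_species_h5, data_directory, plan):
    """Symlink each species' existing h5 file into data_directory under its planned name."""
    data_directory = Path(data_directory)
    data_directory.mkdir(parents=True, exist_ok=True)
    for name, target_name in plan.items():
        link = data_directory / target_name
        # Skip names that are already present (including dangling symlinks).
        if not (link.is_symlink() or link.exists()):
            link.symlink_to(Path(per_species_h5[name]).resolve())
```

Calling `plan_species_split(["Alternaria_rosae", "Candida_albicans", "Zasmidium_cellare"], n_validation=1)` gives one `validation_data.*.h5` name and two `training_data.*.h5` names. Splitting by species rather than by sequence keeps every validation genome unseen during training, which is the point made above.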

alisandra commented 1 year ago

Dear wyim-pgl,

At long last we've released invertebrate models. The current best one is invertebrate_v0.3_m_0100, which can be found here: https://uni-duesseldorf.sciebo.de/s/lQTB7HYISW71Wi0 or obtained by running fetch_helixer_models.py with the latest version (v0.3.0) installed.

These invertebrate models remain more experimental than those for other phylogenetic groups, but nevertheless appear to be better than competing de novo annotation tools, so we're releasing them.

Cheers, Alisandra