Open srobb1 opened 5 months ago
Hi Sofia,
Thanks for the question.
`--min-coding-length` is indeed a cutoff for length, but the default is 100 nucleotides in the CDS, so much shorter than 200 aa, and it's unlikely this threshold is responsible for your missing genes.
A few other parameters may be worth playing with, namely:
- `--edge-threshold` (default 0.1) may reduce fragmentation of genes (but will increase run time, and in rare cases lead to concatenated gene models)
- `--peak_threshold` (default 0.8) may increase recall (but will reduce precision)

However, it's likely that the neural net simply didn't learn a good representation for this class of genes, and you're right that retraining may help. Certainly 3,000 gene copies from a single family should be enough to drastically improve performance on that family. While I haven't tried to boost performance by gene family, I could potentially speculate on how I'd try.
Before I do that, a question: are you interested in only that gene family, or whole genome annotations that specifically perform better on that gene family?
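For reference, the length arithmetic behind the cutoff discussed above can be sketched in a few lines of Python. This is only an illustration, not Helixer code: the helper names are hypothetical, and both the inclusion of a stop codon in the CDS length and the inclusive (`>=`) comparison are assumptions made for the example.

```python
CODON_NT = 3  # nucleotides per codon


def cds_length_nt(protein_length_aa: int, include_stop: bool = True) -> int:
    """Approximate CDS length in nucleotides for a protein of the given length.

    Assumption: the stop codon counts toward the CDS length by default.
    """
    codons = protein_length_aa + (1 if include_stop else 0)
    return codons * CODON_NT


def passes_min_coding_length(protein_length_aa: int,
                             min_coding_length_nt: int = 100) -> bool:
    """True if the CDS is at least as long as the cutoff.

    Assumption: the cutoff is inclusive; 100 is the default quoted in this thread.
    """
    return cds_length_nt(protein_length_aa) >= min_coding_length_nt


if __name__ == "__main__":
    # A ~200 aa protein corresponds to roughly 600 nt of CDS,
    # well above a 100 nt cutoff.
    print(cds_length_nt(200))
    print(passes_min_coding_length(200))
```

Under these assumptions, only proteins shorter than about 33 aa would fall below the default cutoff, which is why the threshold is unlikely to explain missing genes of around 200 aa.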
Hi Alisandra, Thank you for the reply. Looking at the parameters that you listed, my guess is that I would still likely miss my genes. It would probably be best to create a new model using the 3k genes in this family.
On this project (alfalfa plus other plants), I would only need to improve gene models for this gene family. There are published gene models that are good for the whole genome, but they are missing this family, so we tried Helixer specifically to see if we could find those genes, and we didn't.
I have other organisms in totally different projects for which I would like to improve the whole genome annotations (first in line are a couple of sea anemones and corals). I will try to follow the documentation on how to build models for new organisms. I get a wide variety of species, especially invertebrates, that come to me for structural and functional annotation. Helixer seems like a great option that has been shown to provide good models in a short time for some other species (vertebrates) that I have helped with.
Sofia
Hi.
Is there a way to alter the filtering to gain more small genes? We are missing most of a whole class of genes in a plant. These genes are about 200 amino acids long. Will changing --min-coding-length help? What is the default for this parameter?
If altering the filtering doesn't work, are 3,000 of these genes spread across a wide range of plant species enough to build a model?
Thank you, Sofia