Question: How to find more smaller genes

srobb1 commented 5 months ago

Hi.

Is there a way to alter the filtering to gain more small genes? We are missing most of a whole class of genes in a plant. These genes are about 200 amino acids long. Will changing --min-coding-length help? What is the default for this parameter?

If altering the filtering doesn't work, are 3,000 of these genes, spread across a wide range of plant species, enough to build a model?

Thank you, Sofia

alisandra commented 4 months ago

Hi Sofia,

Thanks for the question.

The --min-coding-length parameter is indeed a length cutoff, but the default is 100 nucleotides of CDS. A 200 aa protein corresponds to roughly 600 nucleotides of CDS, well above that cutoff, so it's unlikely this threshold is responsible for your missing genes.
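To make the arithmetic concrete, here is a minimal sketch of the comparison. The 100-nucleotide default is the value quoted above; the function names are hypothetical, and whether the cutoff counts the stop codon or is inclusive is an assumption here, not something confirmed for Helixer.

```python
# Default --min-coding-length as quoted in this thread (nucleotides of CDS).
DEFAULT_MIN_CODING_LENGTH_NT = 100


def cds_length_nt(protein_length_aa: int, include_stop: bool = True) -> int:
    """Convert a protein length in amino acids to CDS length in nucleotides.

    Each amino acid is encoded by one codon (3 nt). Counting the stop codon
    is an assumption; set include_stop=False to leave it out.
    """
    codons = protein_length_aa + (1 if include_stop else 0)
    return 3 * codons


def passes_min_coding_length(
    protein_length_aa: int,
    min_coding_length_nt: int = DEFAULT_MIN_CODING_LENGTH_NT,
) -> bool:
    """Return True if the CDS would survive the length cutoff.

    The inclusive comparison (>=) is an assumption for illustration.
    """
    return cds_length_nt(protein_length_aa) >= min_coding_length_nt


if __name__ == "__main__":
    print(cds_length_nt(200))             # CDS length for a 200 aa protein
    print(passes_min_coding_length(200))  # does it clear the default cutoff?
```

With these assumptions, only proteins shorter than about 33 aa would fall under the default cutoff, which is why the 200 aa family should not be affected by it.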

A few other parameters may be worth playing with, namely

reducing --edge-threshold (default 0.1) may reduce fragmentation of genes (but will increase run time and, in rare cases, lead to concatenated gene models)
reducing --peak-threshold (default 0.8) may increase recall (but will reduce precision)
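As a rough sketch of what trying lower thresholds could look like, the snippet below only assembles a command line; it does not run anything. The helper name is hypothetical, the values 0.05 and 0.6 are purely illustrative, and the entry point and flag spellings (Helixer.py, --fasta-path, --lineage, --gff-output-path, and the hyphenated --peak-threshold) are assumptions that should be checked against the output of Helixer.py --help for your installed version.

```python
import shlex


def build_helixer_command(
    fasta_path: str,
    gff_output_path: str,
    lineage: str = "land_plant",   # assumed lineage name, verify locally
    edge_threshold: float = 0.1,   # default quoted in this thread
    peak_threshold: float = 0.8,   # default quoted in this thread
    min_coding_length: int = 100,  # default quoted in this thread
) -> list[str]:
    """Assemble an argument list for a Helixer run (hypothetical helper).

    All flag names are assumptions; nothing is executed here.
    """
    return [
        "Helixer.py",
        "--fasta-path", fasta_path,
        "--lineage", lineage,
        "--gff-output-path", gff_output_path,
        "--edge-threshold", str(edge_threshold),
        "--peak-threshold", str(peak_threshold),
        "--min-coding-length", str(min_coding_length),
    ]


if __name__ == "__main__":
    # Illustrative values only: lower both thresholds to favour recall.
    cmd = build_helixer_command(
        "genome.fa", "genome_relaxed.gff3",
        edge_threshold=0.05, peak_threshold=0.6,
    )
    # Print a shell-safe command line to inspect before running it by hand.
    print(shlex.join(cmd))
```

Writing each run to a separate output file makes it easy to compare how many members of the gene family are recovered at each setting against the loss in precision.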

However, it's likely that the neural net simply didn't learn a good representation for this class of genes, and you're right that retraining may help. Certainly 3,000 gene copies from a single family should be enough to drastically improve performance on that family. While I haven't tried to boost performance by gene family, I can speculate on how I'd try.

Before I do that, a question: are you interested in only that gene family, or whole genome annotations that specifically perform better on that gene family?

srobb1 commented 4 months ago

Hi Alisandra, Thank you for the reply. Looking at the parameters that you listed, my guess is that I would still likely miss my genes. It would probably be best to create a new model using the 3k genes in this family.

On this project (alfalfa plus other plants), I would only need to improve gene models for this gene family. There are published gene annotations that are good for the whole genome, but they are missing this family, so we tried Helixer specifically to see if we could find those genes, and we didn't.

I have other organisms in totally different projects for which I would like to improve the whole-genome annotations (first in line are a couple of sea anemones and corals). I will try to follow the documentation on how to build models for new organisms. I get a wide variety of species, especially invertebrates, that come to me for structural and functional annotation. Helixer seems like a great option that has been shown to provide good models in a short time for some other species (vertebrates) that I have helped with.

Sofia

weberlab-hhu / Helixer

Question: How to find more smaller genes #135