over-prediction small contigs

mhooykaas commented 3 weeks ago

Hello, in the supplemental information of the bioRxiv preprint (https://www.biorxiv.org/content/10.1101/2023.02.06.527280v2.supplementary-material) I read that over-prediction in small contigs (smaller than the subsequence length) is an issue due to the way the models were trained, and that updated models were announced. As the default model, at least for land_plants, still seems to be the same (the v0.3_a_0080 model), I wonder if updated models are still expected to become available? With the current Helixer models, should gene models predicted on small contigs be considered less reliable than those on chromosomes/contigs larger than subsequence_length?

I also wonder what the expected effect is (on gene prediction precision and recall) of, instead of annotating loose contigs, annotating a "chromosome 0", constructed by concatenating unanchored (variably sized) contigs with stretches of Ns in between. Will precision be similar, since in both cases artificial neighbouring sequence is added to the sequence ends (though I do not fully understand how Helixer performs the padding)? Or might it be less inflated, because the padding that the models accidentally learned to recognize is absent (there is still 'padding', but it consists of different sequence)?

Thank you in advance!

alisandra commented 2 weeks ago

Hi @mhooykaas ,

Good questions!

> I wonder if updated models are still expected to become available?

To keep a long story short, I'm more optimistic here than I've been in a while, but it also won't be tomorrow.

> With the current Helixer models, should gene models predicted on small contigs be considered less reliable than those on chromosomes/contigs larger than subsequence_length?

Yes, where 'small' means shorter than the --subsequence-length parameter (defaults are lineage-specific, see the README).

> annotating a "chromosome 0", constructed by concatenating unanchored (variably sized) contigs with stretches of Ns in between

This would be a great approach and would (theoretically, not tested) be a complete workaround for the bug in question. N's are treated as [0.25, 0.25, 0.25, 0.25], whereas the padding is [0, 0, 0, 0], so I would expect this to substantially improve precision on small contigs/chromosome 0. Other challenges, such as those caused when a contig is too small to contain a full gene, may of course remain.
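To make the difference between N's and padding concrete, here is a minimal sketch of such an encoding. It is not taken from Helixer's source: the table `BASE_TO_VEC`, the function `encode`, and the A/C/G/T column order are all assumptions for illustration. Only the two vectors quoted above ([0.25, 0.25, 0.25, 0.25] for N, [0, 0, 0, 0] for padding) come from this thread.

```python
# Hypothetical encoding table for illustration only; the column order
# (A, C, G, T) is an assumption and may differ from what Helixer uses.
BASE_TO_VEC = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0, 0.0),
    "G": (0.0, 0.0, 1.0, 0.0),
    "T": (0.0, 0.0, 0.0, 1.0),
    "N": (0.25, 0.25, 0.25, 0.25),  # unknown base: uniform over the four bases
}
ZERO_PAD = (0.0, 0.0, 0.0, 0.0)  # padding: no base information at all


def encode(seq, total_length):
    """Encode seq base by base, then zero-pad the result to total_length rows."""
    if len(seq) > total_length:
        raise ValueError("sequence is longer than total_length")
    rows = [list(BASE_TO_VEC[base]) for base in seq.upper()]
    rows += [list(ZERO_PAD) for _ in range(total_length - len(seq))]
    return rows


# Four real bases, two N's, then two positions of zero padding.
encoded = encode("ACGTNN", 8)
```

In this sketch every real or N position sums to 1 while padded positions sum to 0, so a model can tell the two apart, which is why replacing padding with N's would be expected to change the predictions.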

mhooykaas commented 2 weeks ago

Thanks for your answers!

Regarding the N's, I guess that in that case one would ideally add stretches of Ns as long as --subsequence-length(?), which would be quite long and therefore a bit impractical. But it is good to know that there would indeed be a difference vs. normal padding.

alisandra commented 2 weeks ago

Hmmm, one choice would be to handle the new chr0 in a different Helixer run than the rest of the genome, so that you could customize handling as follows:

in preparation: add Ns such that every small contig is padded to exactly --subsequence-length. So where len(contig) is C and --subsequence-length is S, you would add S - C N's.
when calling Helixer.py on chr0: turn off overlapping with --no-overlap

There would be no theoretical benefit to adding more Ns than that.

It may, however, be simpler, and potentially very similar results-wise, to simply choose an arbitrary, practical length.
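The preparation step described above (add S - C N's to each small contig, then concatenate) could be sketched roughly as follows. This is an untested illustration, not part of Helixer: `pad_contig` and `build_chr0` are hypothetical helper names, and rejecting contigs longer than S is my own assumption, made so that every contig starts on a multiple of S.

```python
def pad_contig(seq, subsequence_length):
    """Append S - C N's so a contig of length C reaches exactly S bases."""
    if len(seq) > subsequence_length:
        # Assumption: longer contigs are annotated in the regular run instead.
        raise ValueError("contig is longer than subsequence_length")
    return seq + "N" * (subsequence_length - len(seq))


def build_chr0(contigs, subsequence_length):
    """Concatenate padded contigs into one sequence.

    contigs: list of (name, sequence) pairs.
    Returns (chr0_sequence, offsets), where offsets maps each contig name to
    (0-based start on chr0, original contig length).
    """
    pieces = []
    offsets = {}
    position = 0
    for name, seq in contigs:
        padded = pad_contig(seq, subsequence_length)
        offsets[name] = (position, len(seq))
        pieces.append(padded)
        position += len(padded)
    return "".join(pieces), offsets


# Toy example with S = 10; real values of --subsequence-length are far larger.
chr0, offsets = build_chr0([("contig_a", "ACGT"), ("contig_b", "GG")], 10)
```

Because every padded contig is exactly S long, each one would occupy its own subsequence when overlapping is turned off, and the recorded offsets allow predictions on chr0 to be mapped back to the original contigs.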

mhooykaas commented 1 week ago

Would having the N-padding only at the front of the contigs affect the predictions at the contig ends? Otherwise I think a fixed length would indeed be more convenient. In that case it would perhaps be easiest to add to each contig a prefix of N's of length S (the subsequence length), and then to annotate the loose prefixed contigs with Helixer (or the whole genome including them). The GFF could then be reconstructed to match the original contigs without the prefix by subtracting the known length of the prefix.
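The coordinate correction described above could look roughly like this. It is only a sketch under the assumption that the output is standard tab-separated GFF3 with 1-based start and end in columns 4 and 5; `unshift_gff_line` is a hypothetical helper, and comment or header lines are passed through unchanged even if they mention sequence lengths.

```python
def unshift_gff_line(line, prefix_length):
    """Subtract prefix_length from the start and end columns of one GFF3 line."""
    if line.startswith("#") or not line.strip():
        return line  # comments, pragmas, and blank lines are left untouched
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return line  # not a feature line; leave as-is
    start = int(fields[3]) - prefix_length
    end = int(fields[4]) - prefix_length
    if start < 1:
        # A feature starting inside the N prefix cannot be mapped back.
        raise ValueError("feature overlaps the N prefix: " + line.strip())
    fields[3], fields[4] = str(start), str(end)
    return "\t".join(fields) + newline


# Toy example: a gene predicted at 1101..1500 on a contig with a 1000 bp prefix.
example = "contig1\tHelixer\tgene\t1101\t1500\t.\t+\t.\tID=g1\n"
shifted = unshift_gff_line(example, 1000)
```

Features that start inside the prefix raise an error here rather than being silently clipped, since how to treat them is a judgment call.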

The --no-overlap would be to speed things up, is that correct?

weberlab-hhu / Helixer

over-prediction small contigs #147