wilburnlab / collage

Codon Likelihoods Learned Against Genome Evolution (CoLLAGE): a deep learning framework for identifying naturally selected patterns of codon preference within a species
MIT License

GC 100 check applied too early #24

Closed alope107 closed 3 months ago

alope107 commented 3 months ago

When enabled, the GC 100 check is incorrectly applied to sequences shorter than 100 nucleotides. This causes issues in the beam generator.

For example, suppose an amino acid sequence begins with MA. M corresponds 1:1 with ATG, so the nucleotide sequence must begin with ATG. The GC 100 check passes at this point because ATG is 33.3% GC (below the threshold of 65%). When next predicting the codon for A, the beam generator should be able to explore any of ['GCT', 'GCC', 'GCA', 'GCG']. However, choosing GCC or GCG brings the six-nucleotide prefix to 66.7% GC (4 of 6), which triggers the GC 100 heuristic. In essence, it becomes a GC 6 check instead of a GC 100 check.
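The arithmetic in the MA example can be checked with a few lines of Python. This is only an illustrative sketch: `gc_fraction` and `GC_THRESHOLD` are hypothetical names chosen here, not identifiers from the CoLLAGE code base, and the check is applied to the whole prefix to mimic the behavior described in this issue.

```python
GC_THRESHOLD = 0.65  # threshold quoted in this issue (hypothetical constant name)


def gc_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


prefix = "ATG"  # the only codon for M
alanine_codons = ["GCT", "GCC", "GCA", "GCG"]

# Applying the check to the whole (short) prefix, as the buggy behavior does,
# rejects any extension whose GC fraction exceeds the threshold.
rejected = [
    codon for codon in alanine_codons
    if gc_fraction(prefix + codon) > GC_THRESHOLD
]

print(f"{gc_fraction(prefix):.3f}")  # GC fraction of ATG alone
print(rejected)                      # codons ruled out after only 6 nucleotides
```

Under these assumptions, two of the four alanine codons are discarded after only six nucleotides.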

This constrains the search space far beyond what the GC 100 heuristic intends. It can even cause the program to crash if all candidate sequences are ruled out by the check. For a degenerate example, imagine that the start codon is omitted and A is the first amino acid in the sequence. Every one of its codons has a GC ratio of more than 65% (66.7% for GCT and GCA, 100% for GCC and GCG), meaning that every possibility is immediately ruled out.
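One possible shape for a fix is to skip the check until at least 100 nucleotides are available. The sketch below is not the project's actual implementation: `passes_gc_check`, `GC_WINDOW`, and `GC_THRESHOLD` are hypothetical names, and it assumes the heuristic is meant to look at the most recent 100 nucleotides.

```python
GC_WINDOW = 100      # assumed window size implied by the name "GC 100"
GC_THRESHOLD = 0.65  # threshold quoted in this issue


def gc_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def passes_gc_check(seq: str,
                    window: int = GC_WINDOW,
                    threshold: float = GC_THRESHOLD) -> bool:
    """Hypothetical length-guarded GC check.

    Sequences shorter than the window always pass, so short prefixes
    cannot be pruned. Longer sequences are judged on their last
    `window` nucleotides (an assumption about the intended behavior).
    """
    if len(seq) < window:
        return True
    return gc_fraction(seq[-window:]) <= threshold


# Degenerate case from this issue: A is the first amino acid.
alanine_codons = ["GCT", "GCC", "GCA", "GCG"]
survivors = [codon for codon in alanine_codons if passes_gc_check(codon)]
print(survivors)  # all four codons remain available to the beam
```

With the length guard in place, the beam keeps every alanine codon in the degenerate case, and the threshold only starts to apply once a full window exists.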