wwood / OrfM

simple and not slow ORF caller
GNU Lesser General Public License v3.0

ORFs with embedded stop codons #10

Open CuriousTim opened 2 years ago

CuriousTim commented 2 years ago

It is my understanding that OrfM outputs contiguous stretches of codons with no stop codon in the middle, but I am getting output with what I believe are embedded stop codons. I tried predicting ORFs in the human genome (GCF_000001405.39_GRCh38.p13) and got many sequences like the ones below, with an asterisk in the middle. Does the asterisk not mean a stop codon?

>NC_000001.11_11335_1_13 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly
WWHAACWQLGTLQGPLAQGVVAARPPAGSWGHCRALLLQQYWRIIGKHPEHMLFGLSRLLNMGFLGLKVKNKYV*FVN
>NC_000001.11_11388_6_19 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly
ETANTHEQKEEVKEKADGKLTKRKMVNDTRCWQSRLNYMQEQQRKSGKFAQSFSTPAMQQNHQWKFKKIHMARPQPKSLIRISRASPVRLAKIQK*TLCGETGIPRHCWWDTEQYNSDGNQFTN*TYLFFTFKPRNPIFRSLLRPNSICSGCFPIIRQYCWSKRARQCPQLPAGGRAATTP
>NC_000001.11_12890_2_36 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly
SGSKAWQSLSQGKLQAANSLHGSSPSLPAQSPGQGPPRKALVENLCMKAVNQSIGKPGCLQLGRQTGAGEGEKRKVRLPALSPT*G*GRRRGCTVGEAAVTQSLSLCSHEGRAIRHQRDSASIVLLDQ

Thanks

wwood commented 2 years ago

Hi,

Thanks for your interest, and thanks for the reproducible bug report.

This issue appears to be caused by mixed-case input: orfm expects sequence that is either all upper-case or all lower-case. You can work around it by converting the input to upper case:

tr acgtn ACGTN < input.fna | ./orfm

I'll try to implement a proper fix soon. Thanks, ben

CuriousTim commented 2 years ago

Thanks for the workaround. That solved the issue. I would recommend something like this

awk -F'>' 'NF > 1; NF == 1 {print toupper($0)}' input.fna | ./orfm

to avoid modifying any definition lines.
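
For anyone landing here later, the same idea can be written by matching the leading `>` directly instead of counting fields, which also keeps blank lines. This is only a sketch: the function names `upper_fasta` and `count_embedded_stops` are made up for illustration and are not part of OrfM, and the check assumes one protein sequence per line in the output, as in the examples above.

```shell
# Upper-case sequence lines only; definition lines (starting with '>')
# pass through unchanged. Reads the named files, or stdin if none given.
upper_fasta() {
    awk '/^>/ { print; next } { print toupper($0) }' "$@"
}

# Count sequence lines that contain a '*' anywhere.
# Assumption: each record's sequence sits on a single line.
count_embedded_stops() {
    awk '!/^>/ && index($0, "*") > 0 { n++ } END { print n + 0 }' "$@"
}
```

Something like `upper_fasta input.fna | ./orfm | count_embedded_stops` should then give a quick indication of whether any translated sequence still contains an asterisk.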