widdowquinn / ncfp

Program and package that retrieves nucleotide coding sequences from NCBI that correspond to a set of input protein sequences.
https://widdowquinn.github.io/ncfp/
MIT License

Drop terminal stop codons #23

Open HobnobMancer opened 2 years ago

HobnobMancer commented 2 years ago

Summary:

The complete nucleotide sequence retrieved from NCBI is written to the output, including any terminal stop codons. These sequences often cannot be used for backthreading onto aligned protein sequences, because the terminal stop codons in the CDS have no counterpart in the protein sequence, so the two sequences no longer match codon-for-residue.
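The mismatch can be illustrated with a quick length check. This is a minimal sketch, not ncfp code: the function name `codon_count_matches` and the example sequences are made up for illustration.

```python
def codon_count_matches(protein: str, cds: str) -> bool:
    """Return True if the CDS has exactly one codon per protein residue.

    Hypothetical helper: backthreading needs a one-to-one mapping between
    residues and codons, so a CDS with a trailing stop codon fails this check.
    """
    return len(cds) == 3 * len(protein)


protein = "MA"                   # two residues: Met, Ala
cds_with_stop = "ATGGCCTAA"      # ATG GCC TAA -> three codons, the last is a stop
cds_without_stop = "ATGGCC"      # ATG GCC     -> two codons

print(codon_count_matches(protein, cds_with_stop))     # False
print(codon_count_matches(protein, cds_without_stop))  # True
```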

Description:

A --drop_stop_codons flag could be added. When it is used, all terminal stop codons are removed from each CDS, so that the retrieved CDS matches the protein sequence codon-for-residue for backthreading. Without it, the ncfp output needs additional parsing before the nucleotide sequences can be backthreaded onto aligned protein sequences.
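A rough sketch of the trimming step, to make the request concrete. This is not how ncfp implements anything. The name `drop_terminal_stop_codons` is hypothetical, and the stop codon set assumes the standard genetic code (NCBI translation table 1); other tables use different stop codons.

```python
# Assumption: standard genetic code (NCBI translation table 1).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def drop_terminal_stop_codons(cds: str) -> str:
    """Return the CDS with all trailing in-frame stop codons removed.

    Hypothetical helper. A trailing triplet is only removed while the
    sequence length is a multiple of three, so partial CDS that are not
    in frame are returned unchanged rather than guessed at.
    """
    seq = cds
    while len(seq) >= 3 and len(seq) % 3 == 0 and seq[-3:].upper() in STOP_CODONS:
        seq = seq[:-3]
    return seq


print(drop_terminal_stop_codons("ATGGCCTAA"))     # ATGGCC
print(drop_terminal_stop_codons("ATGGCCTAATGA"))  # ATGGCC
print(drop_terminal_stop_codons("ATGGCC"))        # ATGGCC
```

Leaving out-of-frame sequences untouched is a conservative choice in this sketch; a real implementation might warn about them instead.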

Current Output:

The only output is the complete nucleotide sequence.

Expected Output:

When the --drop_stop_codons flag is used, terminal stop codons are removed from the end of each CDS.
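For illustration, this is roughly how the flag could change the written output. It is only a sketch under assumptions: `build_parser` and `format_fasta` are hypothetical names and do not reflect ncfp's actual CLI or writer code, and the stop codons assume the standard genetic code.

```python
import argparse

# Assumption: standard genetic code (NCBI translation table 1).
STOP_CODONS = ("TAA", "TAG", "TGA")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser with only the proposed flag, not ncfp's real CLI."""
    parser = argparse.ArgumentParser(description="Sketch of the proposed flag")
    parser.add_argument(
        "--drop_stop_codons",
        action="store_true",
        help="remove terminal stop codons from each retrieved CDS",
    )
    return parser


def format_fasta(seq_id: str, cds: str, drop_stop_codons: bool) -> str:
    """Format one FASTA record, optionally trimming trailing stop codons."""
    if drop_stop_codons:
        # str.endswith accepts a tuple, so this checks all three stop codons.
        while cds.upper().endswith(STOP_CODONS):
            cds = cds[:-3]
    return f">{seq_id}\n{cds}\n"


args = build_parser().parse_args(["--drop_stop_codons"])
print(format_fasta("seq1", "ATGGCCTAA", args.drop_stop_codons), end="")
```

With the flag, the record for `seq1` is written as `ATGGCC`; without it, the sequence stays `ATGGCCTAA`.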

ncfp Version:

v0.2.0

HobnobMancer commented 2 years ago

@widdowquinn I can't add labels to the issue

Is it OK with you for this feature to be added? If so, do you want me to add it via a new branch and PR, or add it to the HobnobMancer:issue_21_protein_ids branch and its associated PR?

widdowquinn commented 2 years ago

> do you want me to add it via a new branch and PR

Yes, please!

HobnobMancer commented 2 years ago

> > do you want me to add it via a new branch and PR
>
> Yes, please!

Cool, I'll get on that! It should be submitted as a PR, hopefully by this weekend.