widdowquinn / ncfp

Program and package that retrieves nucleotide coding sequences from NCBI that correspond to a set of input protein sequences.
https://widdowquinn.github.io/ncfp/
MIT License

Incorrect protein sequences being retrieved for some accessions #31

Closed (widdowquinn closed 2 years ago)

widdowquinn commented 2 years ago

Summary:

Input protein sequences deriving from a known organism (e.g. human) are retrieving nucleotide sequences from a different organism (e.g. Bos taurus).

Description:

The input sequence

>CAD6020544.1/6-36 amtB [Escherichia coli] GN=CAD6020544.1
------DKADNAFMMICTALVLFMTIPGIALFYGGLI

does not give an output nucleotide sequence, because the wrong originating sequence is identified in the ELink step.
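One way to flag this kind of mismatch early might be to compare the organism named in the input header with the organism of the linked nucleotide record. The sketch below is illustrative only and is not ncfp's code: `parse_header` and `organisms_match` are hypothetical helpers, and it assumes the header follows the `accession/start-end description [organism]` layout shown above. Fetching the linked record's organism from NCBI is left out.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ParsedHeader(NamedTuple):
    """Fields recovered from a FASTA header (hypothetical helper type)."""

    accession: str
    region: Optional[Tuple[int, int]]
    organism: Optional[str]


# Assumed header layout: >ACCESSION[/START-END] description [Organism] ...
HEADER_RE = re.compile(
    r"^>?(?P<acc>[^\s/]+)"                   # accession, up to '/' or whitespace
    r"(?:/(?P<start>\d+)-(?P<end>\d+))?"     # optional /start-end region suffix
    r"(?:.*?\[(?P<org>[^\]]+)\])?"           # optional first [Organism] field
)


def parse_header(header: str) -> ParsedHeader:
    """Split a FASTA header into accession, region and organism."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"Could not parse FASTA header: {header!r}")
    region = None
    if match.group("start") is not None:
        region = (int(match.group("start")), int(match.group("end")))
    return ParsedHeader(match.group("acc"), region, match.group("org"))


def organisms_match(expected: str, found: str) -> bool:
    """Case-insensitive check that `found` starts with `expected`.

    A prefix match is used so that a strain-level name on the record
    still matches a species-level name in the header.
    """
    return found.strip().lower().startswith(expected.strip().lower())
```

With the header from this report, `parse_header` separates `CAD6020544.1` from the `/6-36` suffix and recovers `Escherichia coli`, which could then be compared against the organism annotated on whichever record the link step returns.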

Reproducible Steps:

With the above sequence as input, run ncfp as normal:

Current Output:

[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Sequence CAD6020544.1/6-36 matches GenBank entry X60065.1
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Searching for CDS: CAD6020544.1
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Could not identify CDS feature for CAD6020544.1/6-36

Expected Output:

A nucleotide coding sequence corresponding to the input protein, in the output directory.

ncfp Version:

commit 694d806

Python Version:

3.9

Operating System:

macOS

widdowquinn commented 2 years ago

This issue was brought to my attention by @tharis.

widdowquinn commented 2 years ago

First inspection indicates that: