widdowquinn / ncfp

Program and package that retrieves nucleotide coding sequences from NCBI that correspond to a set of input protein sequences.
https://widdowquinn.github.io/ncfp/
MIT License

`ncfp` not recovering all coding sequences from NCBI #20

Closed by widdowquinn 2 years ago

widdowquinn commented 2 years ago

Summary:

ncfp fails to recover some coding sequences from NCBI, even when a coding sequence is available for the input protein.

Description:

The UniProt sequence below

>tr|F5NV06|F5NV06_SHIFL MliC domain-containing protein OS=Shigella flexneri K-227 OX=766147 GN=SFK227_1958 PE=4 SV=1
MKKLLIIILPVLLSGCSAFNQLVERMQTDTLEYQCDEKPLTVKLNNPCQEVSFVYDNQLL
HLKQGLSASGARYSDGIYVFWSKGEEATVYKRDRIVLNNCQLQNPQR

corresponds to the NCBI record

https://www.ncbi.nlm.nih.gov/protein/333018885

whose coding sequence is in the nucleotide accession

https://www.ncbi.nlm.nih.gov/nuccore/AFGY01000021.1

but in debug mode ncfp reports:

[DEBUG] [ncbi_cds_from_protein.sequences]: Guessing sequence type for tr|F5NV06|F5NV06_SHIFL...
[DEBUG] [ncbi_cds_from_protein.sequences]: ...guessed UniProt
[DEBUG] [ncbi_cds_from_protein.sequences]: Uniprot record has GN field: SFK227_1958
[DEBUG] [ncbi_cds_from_protein.sequences]: Recovered EMBL database record: AFGY01000021
[DEBUG] [ncbi_cds_from_protein.sequences]: Adding record tr|F5NV06|F5NV06_SHIFL to cache with query AFGY01000021
Process input sequences: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.12it/s]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: 1 sequences taken forward with query
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Identifying nucleotide accessions...
Search NT IDs:   0%|                                                                                                                    | 0/1 [00:00<?, ?it/s][DEBUG] [ncbi_cds_from_protein.entrez]: Entry has nt query, using direct ESearch
[DEBUG] [ncbi_cds_from_protein.entrez]: ESearch query: ('AFGY01000021',)
Search NT IDs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.81it/s]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Added 1 new UIDs to cache
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Collecting GenBank accessions...
Fetch UID accessions: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.24s/it]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Updated GenBank accessions for 1 UIDs
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Fetching GenBank headers...
[DEBUG] [ncbi_cds_from_protein.entrez]: Found 1 UIDs with no GenBank headers
[DEBUG] [ncbi_cds_from_protein.entrez]: Checking EPost histories, batch size is 1
[DEBUG] [ncbi_cds_from_protein.entrez]: Found 1 EPost histories, fetching headers
[...]
DEBUG:ncbi_cds_from_protein.entrez:Parsed 1 records
Fetching GenBank headers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.22s/it]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Fetched GenBank headers for 0 UIDs
INFO:ncbi_cds_from_protein.scripts.ncfp:Fetched GenBank headers for 0 UIDs
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No GenBank header downloads were required! (in cache?)
WARNING:ncbi_cds_from_protein.scripts.ncfp:No GenBank header downloads were required! (in cache?)
[...]
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input tr|F5NV06|F5NV06_SHIFL
WARNING:ncbi_cds_from_protein.scripts.ncfp:No record found for sequence input tr|F5NV06|F5NV06_SHIFL
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Matched 0/1 records
INFO:ncbi_cds_from_protein.scripts.ncfp:Matched 0/1 records

and the ncfp*.fasta output files are empty.
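The debug log shows the header being classified as UniProt and its GN field being read before the EMBL lookup. For readers unfamiliar with that header layout, here is a minimal sketch of how such a header could be split into its parts. This is not ncfp's own code: the regular expressions and the `parse_uniprot_header` function are hypothetical and only illustrate the fields visible in the log.

```python
import re

# Hypothetical patterns (not taken from ncfp) for UniProt FASTA headers like
# ">tr|ACCESSION|ENTRY_NAME description OS=... OX=... GN=... PE=... SV=..."
UNIPROT_HEADER = re.compile(
    r"^>?(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry>\S+)"
)
GENE_NAME = re.compile(r"\bGN=(?P<gene>\S+)")


def parse_uniprot_header(header):
    """Return db, accession, entry name and GN value from a header line.

    Returns None if the line does not look like a UniProt header; the
    "gene" value is None when the header has no GN field.
    """
    match = UNIPROT_HEADER.match(header.strip())
    if match is None:
        return None
    fields = match.groupdict()
    gene = GENE_NAME.search(header)
    fields["gene"] = gene.group("gene") if gene else None
    return fields


header = (
    ">tr|F5NV06|F5NV06_SHIFL MliC domain-containing protein "
    "OS=Shigella flexneri K-227 OX=766147 GN=SFK227_1958 PE=4 SV=1"
)
print(parse_uniprot_header(header))
```

For the header in this report, the sketch yields accession `F5NV06` and GN value `SFK227_1958`, matching what the debug log reports.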

Reproducible Steps:

  1. Create an input file containing only the sequence above.
  2. Call ncfp on that input file, e.g. with `ncfp --debug -l test.log -b 1 --keepcache test.fasta test_ncfp me@my.email`
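
The two steps above can be scripted. The following is a rough sketch that writes the input file and assembles the same command line as in step 2; the helper names `write_input` and `build_command` are made up for illustration, and the script only prints the command rather than running it.

```python
import shlex
from pathlib import Path

# The single UniProt record from this report.
FASTA = (
    ">tr|F5NV06|F5NV06_SHIFL MliC domain-containing protein "
    "OS=Shigella flexneri K-227 OX=766147 GN=SFK227_1958 PE=4 SV=1\n"
    "MKKLLIIILPVLLSGCSAFNQLVERMQTDTLEYQCDEKPLTVKLNNPCQEVSFVYDNQLL\n"
    "HLKQGLSASGARYSDGIYVFWSKGEEATVYKRDRIVLNNCQLQNPQR\n"
)


def write_input(path):
    """Step 1: write an input file containing only the sequence above."""
    path = Path(path)
    path.write_text(FASTA)
    return path


def build_command(infile, outdir, email, logfile="test.log"):
    """Step 2: assemble the ncfp call using the flags from this report."""
    return [
        "ncfp", "--debug", "-l", logfile, "-b", "1", "--keepcache",
        str(infile), str(outdir), email,
    ]


# Print the command; pass the list to subprocess.run() to execute it.
print(shlex.join(build_command("test.fasta", "test_ncfp", "me@my.email")))
```

Call `write_input("test.fasta")` first, then run the printed command in an environment where ncfp at the affected commit is installed.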

ncfp Version:

Commit 0f70697

Python Version:

Python 3.8

Operating System:

macOS

widdowquinn commented 2 years ago

It may be relevant that, locally, the tests fail with warnings like:

[...]
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[...]
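
When the same warning repeats many times, it can help to summarise which inputs went unmatched. Below is a small sketch, assuming only the "No record found for sequence input" message text seen in the logs above; the `unmatched_inputs` helper is hypothetical and not part of ncfp.

```python
import re
from collections import Counter

# Matches the message text regardless of the log prefix format.
NO_RECORD = re.compile(r"No record found for sequence input (?P<seqid>\S+)")


def unmatched_inputs(log_lines):
    """Count 'No record found' warnings per input sequence ID."""
    counts = Counter()
    for line in log_lines:
        match = NO_RECORD.search(line)
        if match:
            counts[match.group("seqid")] += 1
    return counts


sample = [
    "[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: "
    "No record found for sequence input XP_004520832.1"
] * 5
print(unmatched_inputs(sample))
```

Because the pattern ignores the prefix, a log that emits each message in two formats (as in the first comment) will count each warning twice.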

widdowquinn commented 2 years ago

Issue closed with fix in 3a5eb88