zyxue / ncbitax2lin

🐞 Convert NCBI taxonomy dump into lineages
MIT License

Does this still work with current taxonomy dump? #13

Closed tgolubch closed 4 years ago

tgolubch commented 4 years ago

Hi,

Thanks for providing this very useful script! Unfortunately, I haven't been able to run ncbitax2lin on the latest NCBI taxonomy dump. Has NCBI changed the format? It looks like the number of columns isn't what the script expects (see the stack trace below). Could you please verify whether it works on the current nodes.dmp and names.dmp from NCBI? If it does, would you be able to save a current version of the lineage file and share it via GitLab? The latest one there is from 2019 and doesn't contain SARS-CoV-2. Many thanks!

Traceback (most recent call last):
  File "/users/fraser/golubchi/.local/bin/ncbitax2lin", line 10, in <module>
    sys.exit(main())
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/ncbitax2lin/ncbitax2lin.py", line 192, in main
    fire.Fire(taxonomy_to_lineages)
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/fire/core.py", line 468, in _Fire
    target=component.__name__)
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/ncbitax2lin/ncbitax2lin.py", line 171, in taxonomy_to_lineages
    df_data = data_io.read_names_and_nodes(names_file, nodes_file)
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/ncbitax2lin/data_io.py", line 77, in read_names_and_nodes
    nodes_df = load_nodes(nodes_file)
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/ncbitax2lin/utils.py", line 23, in timed_func
    result = func(*args, **kwargs)
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/ncbitax2lin/data_io.py", line 38, in load_nodes
    "comments",
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/users/fraser/golubchi/.local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 860, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 952, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1028, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1338, in pandas._libs.parsers.TextReader._get_column_name
IndexError: list index out of range
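
An IndexError raised while pandas assigns column names may point to rows that do not have the expected number of fields. The following is a minimal sketch for checking that directly, not part of ncbitax2lin. It assumes the classic taxdump layout, in which fields are separated by `\t|\t`, each row ends with `\t|`, and nodes.dmp has 13 fields ending in "comments". The helper names `split_dmp_row`, `count_field_widths`, and `report` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List

# Assumption: classic taxdump layout, where nodes.dmp has 13 fields per row
# and the last one is "comments" (the column named in the traceback above).
EXPECTED_NODE_FIELDS = 13


def split_dmp_row(line: str) -> List[str]:
    """Split one .dmp row into fields, dropping the trailing row terminator."""
    row = line.rstrip("\n")
    if row.endswith("\t|"):
        row = row[:-2]
    return row.split("\t|\t")


def count_field_widths(lines: Iterable[str]) -> Counter:
    """Map field count -> number of non-blank rows with that many fields."""
    return Counter(len(split_dmp_row(line)) for line in lines if line.strip())


def report(path: str) -> None:
    """Print how many rows have each field count, flagging unexpected widths."""
    with open(path, encoding="utf-8") as handle:
        widths = count_field_widths(handle)
    for width, n_rows in sorted(widths.items()):
        flag = "" if width == EXPECTED_NODE_FIELDS else "  <-- unexpected"
        print(f"{width} fields: {n_rows} rows{flag}")
```

Calling `report("nodes.dmp")` on an intact file should list a single field count; any additional count suggests truncated or damaged rows.
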
zyxue commented 4 years ago

What command did you use?

tgolubch commented 4 years ago

Hi,

I used:

ncbitax2lin nodes.dmp names.dmp

The .dmp files were downloaded from NCBI last week.

T

On 22 Jun 2020, at 15:43, Zhuyi Xue <notifications@github.com> wrote:

What command did you use?


zyxue commented 4 years ago

I just tried it, and it still works fine. Could your nodes.dmp be corrupted? Try re-downloading your .dmp files.

Are you looking for these two?

389166,Viruses,,,Nidovirales,Coronaviridae,Betacoronavirus,Severe acute respiratory syndrome-related coronavirus,,,,,,ssRNA viruses,"ssRNA positive-strand viruses, no DNA stage",,,,,,,,,,,SARS coronavirus,,,Bat SARS CoV Rf1/2004,Bat CoV 273/2005,,,,,,,,,,,Coronavirinae,,,,,,,,,,,,,
389167,Viruses,,,Nidovirales,Coronaviridae,Betacoronavirus,Severe acute respiratory syndrome-related coronavirus,,,,,,ssRNA viruses,"ssRNA positive-strand viruses, no DNA stage",,,,,,,,,,,SARS coronavirus,,,Bat SARS CoV Rm1/2004,Bat CoV 279/2005,,,,,,,,,,,Coronavirinae,,,,,,,,,,,,,
6

The gitlab repo is deprecated as I don't have time to keep it up to date.

tgolubch commented 4 years ago

Hi,

Hmm, my nodes.dmp looks fine… The md5 was correct as far as I can recall. Did you download the nodes file just now?
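
Rather than relying on memory, the checksum can be recomputed. This is a minimal sketch, assuming a companion `.md5` file sits next to the downloaded archive and uses the usual `md5sum` layout (hex digest, whitespace, file name). Note that such a checksum would cover the archive, not the extracted nodes.dmp. The function names `md5_of_file` and `matches_md5_line` are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_line(path: str, md5_line: str) -> bool:
    """Compare a file against one line in md5sum format: '<hex>  <filename>'."""
    expected = md5_line.split()[0].lower()
    return md5_of_file(path) == expected
```

For example, `matches_md5_line("taxdump.tar.gz", open("taxdump.tar.gz.md5").read())` returns `False` if the archive was truncated or altered in transit (both file names are assumptions here).
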

Are you looking for these two?

No, these two lines are bat viruses; the ones I need are from SARS-CoV-2. I manually edited the lineages file to add it, but there seem to be other taxa that come up in my kraken reports (in small numbers) that sit somewhere in this taxonomic clade but are not labelled SARS-CoV-2, so I don't capture all the reads. I was hoping a new lineages file would solve this.

T
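
One way to capture every taxon under a clade, rather than only the entries with a particular label, is to walk the parent/child links. The sketch below is illustrative only: it assumes (tax_id, parent tax_id) pairs such as those in the first two columns of nodes.dmp, and it uses a toy hierarchy with made-up IDs instead of real taxonomy data. The helper names `build_children` and `descendants` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_children(parent_pairs: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Map each parent tax_id to its direct children from (tax_id, parent_id) pairs."""
    children: Dict[int, List[int]] = defaultdict(list)
    for tax_id, parent_id in parent_pairs:
        if tax_id != parent_id:  # the root lists itself as its own parent
            children[parent_id].append(tax_id)
    return children


def descendants(root: int, children: Dict[int, List[int]]) -> Set[int]:
    """Return root plus every tax_id below it, using an iterative traversal."""
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Toy hierarchy with made-up IDs: 1 is the root, 10 is the clade of interest.
toy_pairs = [(1, 1), (10, 1), (11, 10), (12, 10), (13, 11), (20, 1)]
clade_ids = descendants(10, build_children(toy_pairs))
print(sorted(clade_ids))  # [10, 11, 12, 13]
```

The resulting set of IDs can then be used to select rows from a classifier report regardless of how each taxon is named.
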


zyxue commented 4 years ago

Yeah, I just downloaded a fresh copy of the dump, and it works for me.

Try re-downloading the dump and following the instructions in the README.md:

pip install -U ncbitax2lin

wget -N ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir -p taxdump && tar zxf taxdump.tar.gz -C ./taxdump

ncbitax2lin taxdump/nodes.dmp taxdump/names.dmp
tgolubch commented 4 years ago

Yes, that worked! Thanks for your help. I guess it was corrupted somehow. Pity NCBI downloads are so temperamental.

Thanks,
Tanya
