zyxue / ncbitax2lin

🐞 Convert NCBI taxonomy dump into lineages
MIT License

KeyError: 1 #15

Closed: bpil83 closed this issue 2 years ago

bpil83 commented 3 years ago

I'm running the scripts as instructed in Anaconda and I get this error. I'm not good enough at Python to figure out the problem. Can you help?

Console output:

(base) C:\Users\BPIL>ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp
2021-08-31 13:32:38,637|INFO|time spent on load_nodes: 0:00:04.046432
2021-08-31 13:32:45,796|INFO|time spent on load_names: 0:00:07.158943
2021-08-31 13:32:48,974|INFO|# of tax ids: 2,359,686
2021-08-31 13:32:49,420|INFO|df.info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2359686 entries, 0 to 2359685
Data columns (total 4 columns):
 #   Column         Dtype
---  ------         -----
 0   tax_id         int64
 1   parent_tax_id  int64
 2   rank           object
 3   rank_name      object
dtypes: int64(2), object(2)
memory usage: 367.0 MB

2021-08-31 13:32:49,421|INFO|Generating TAXONOMY_DICT ...
2021-08-31 13:33:00,737|INFO|found 12 cpus, and will use all of them to find lineages for all tax ids
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "c:\users\bpil\anaconda3\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "c:\users\bpil\anaconda3\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "c:\users\bpil\anaconda3\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 78, in find_lineage
    record = TAXONOMY_DICT[tax_id]
KeyError: 1
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\users\bpil\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\bpil\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\BPIL\Anaconda3\Scripts\ncbitax2lin.exe\__main__.py", line 7, in <module>
  File "c:\users\bpil\anaconda3\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 192, in main
    fire.Fire(taxonomy_to_lineages)
  File "c:\users\bpil\anaconda3\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "c:\users\bpil\anaconda3\lib\site-packages\fire\core.py", line 468, in _Fire
    target=component.__name__)
  File "c:\users\bpil\anaconda3\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "c:\users\bpil\anaconda3\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 179, in taxonomy_to_lineages
    lineages = find_all_lineages(df_data.tax_id)
  File "c:\users\bpil\anaconda3\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 101, in find_all_lineages
    return pool.map(find_lineage, tax_ids)
  File "c:\users\bpil\anaconda3\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "c:\users\bpil\anaconda3\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
KeyError: 1

zyxue commented 3 years ago

What do the first few lines of your nodes.dmp and names.dmp look like? Do they look like this?

% head -n3 nodes.dmp
1   |   1   |   no rank |       |   8   |   0   |   1   |   0   |   0   ||
2   |   131567  |   superkingdom    |       |   0   |   0   |   11  |   0   |   0|
6   |   335928  |   genus   |       |   0   |   1   |   11  |   1   |   0   ||
% head -n3 names.dmp 
1   |   all |       |   synonym |
1   |   root    |       |   scientific name |
2   |   Bacteria    |   Bacteria <bacteria> |   scientific name |
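
For anyone without `head` (for example on stock Windows), the same check can be done in Python. This is only a sketch: `parse_dmp_line` and `head_dmp` are hypothetical helper names, not part of ncbitax2lin, and the sketch assumes the usual taxdump layout in which fields are separated by a tab, a pipe, and a tab, and each row ends with a tab and a pipe.

```python
def parse_dmp_line(line):
    """Split one taxdump row into a list of stripped field strings."""
    line = line.rstrip("\r\n")
    # Each row ends with "\t|"; drop that terminator before splitting.
    if line.endswith("\t|"):
        line = line[:-2]
    return [field.strip() for field in line.split("\t|\t")]


def head_dmp(path, n=3):
    """Return the first n parsed rows of a .dmp file (like `head -n3`)."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        # zip stops after n rows without reading the rest of the file.
        for _, line in zip(range(n), handle):
            rows.append(parse_dmp_line(line))
    return rows
```

Calling `head_dmp("taxdump/nodes.dmp")` and `head_dmp("taxdump/names.dmp")` should show tax id 1 in the first row of each file if the dump is intact.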

josuebarrera commented 2 years ago

Dear zyxue, I'm getting the same error. The first lines of each file look like the ones you describe:

% head -n3 taxdump/nodes.dmp 
1   |   1   |   no rank |       |   8   |   0   |   1   |   0   |   0   |   0   |   0   |   0   |       |
2   |   131567  |   superkingdom    |       |   0   |   0   |   11  |   0   |   0   |   0   |   0   |   0   |       |
6   |   335928  |   genus   |       |   0   |   1   |   11  |   1   |   0   |   1   |   0   |   0   |       |
% head -n3 taxdump/names.dmp 
1   |   all |       |   synonym |
1   |   root    |       |   scientific name |
2   |   Bacteria    |   Bacteria <bacteria> |   scientific name |

zyxue commented 2 years ago

@josuebarrera, are you also on Windows?

josuebarrera commented 2 years ago

> @josuebarrera, are you also on Windows?

No, I'm using macOS Big Sur

josuebarrera commented 2 years ago

Never mind, I just installed ncbitax2lin on Manjaro Linux and it works perfectly.

AGrantUEA commented 2 years ago

I'm getting what looks to be the same error on a Windows system. The key is different (in my case 101004), but that taxon ID is present in both the nodes and the names files.

E:\NCBI>ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp
2021-10-27 16:28:50,579|INFO|time spent on load_nodes: 0:00:02.466410
2021-10-27 16:28:55,370|INFO|time spent on load_names: 0:00:04.790171
2021-10-27 16:28:57,390|INFO|# of tax ids: 2,372,129
2021-10-27 16:28:57,757|INFO|df.info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2372129 entries, 0 to 2372128
Data columns (total 4 columns):
 #   Column         Dtype
---  ------         -----
 0   tax_id         int64
 1   parent_tax_id  int64
 2   rank           object
 3   rank_name      object
dtypes: int64(2), object(2)
memory usage: 368.9 MB

2021-10-27 16:28:57,757|INFO|Generating TAXONOMY_DICT ...
2021-10-27 16:29:03,212|INFO|found 8 cpus, and will use all of them to find lineages for all tax ids
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 78, in find_lineage
    record = TAXONOMY_DICT[tax_id]
KeyError: 101004
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\Scripts\ncbitax2lin.exe\__main__.py", line 7, in <module>
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 192, in main
    fire.Fire(taxonomy_to_lineages)
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\fire\core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 179, in taxonomy_to_lineages
    lineages = find_all_lineages(df_data.tax_id)
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 101, in find_all_lineages
    return pool.map(find_lineage, tax_ids)
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
KeyError: 101004

zyxue commented 2 years ago

Sorry, I have no access to a Windows machine. Maybe the separator between columns is treated differently on Windows? If you figure out why, please feel free to send a PR.

liaochen1988 commented 2 years ago

The same problem here (KeyError: 1) on macOS Mojave.

boris-dimitrov commented 2 years ago

Same on macOS Monterey 12.1 with Python 3.8.

zyxue commented 2 years ago

I've reproduced the error and will take a look. It looks like the way the global variable TAXONOMY_DICT is updated among multiple processes has changed, or is inconsistent across operating systems.

zyxue commented 2 years ago

Could you please update to the new version (2.1.0) with pip install -U ncbitax2lin and try again?

bpil83 commented 2 years ago

I updated, but now I get a new error. It looks like it could be a local problem, but I'm not sure:

2022-01-27 10:41:03,366|INFO|Generating a dictionary of taxonomy: tax_id => tax_unit ...
2022-01-27 10:41:14,977|INFO|size of taxonomy_dict: ~80 MB
2022-01-27 10:41:15,031|INFO|Finding all lineages ...
2022-01-27 10:41:15,031|INFO|will use 6 processes to find lineages for all tax ids
2022-01-27 10:41:15,036|INFO|chunk_size = 393281
2022-01-27 10:41:15,055|INFO|chunked sizes: [393281, 393281, 393281, 393281, 393281, 393281]
2022-01-27 10:41:15,059|INFO|Starting 6 processes ...
working on tax_id: 50000
working on tax_id: 100000
working on tax_id: 150000
working on tax_id: 200000
working on tax_id: 250000
working on tax_id: 300000
working on tax_id: 350000
working on tax_id: 400000
working on tax_id: 450000
Process Process-1:
Traceback (most recent call last):
  File "c:\users\bpil\anaconda3\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
  File "c:\users\bpil\anaconda3\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "c:\users\bpil\anaconda3\lib\site-packages\ncbitax2lin\lineage.py", line 52, in _find_lineages
    with open(output, "wb") as opened:
OSError: [Errno 22] Invalid argument: '/C:\\Users\\BPIL\\AppData\\Local\\Temp\\tmpqdbfb9e4_ncbitax2lin/_lineages_0.pkl'

zyxue commented 2 years ago

Hmm. It seems this could be because Windows paths use \. Will send a patch.
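
The failing path above starts with a `/` in front of the drive letter and mixes `/` with `\`, which suggests it was assembled by string concatenation. A minimal sketch of a portable alternative using `pathlib` and `tempfile` follows. The helper names `write_chunk` and `read_chunk` are hypothetical, and this is not necessarily what the patch does.

```python
import pickle
import tempfile
from pathlib import Path


def write_chunk(lineages, index, workdir):
    """Pickle one chunk of lineages into workdir and return the file path."""
    # Path's / operator inserts the correct separator for the current OS.
    output = Path(workdir) / f"_lineages_{index}.pkl"
    with open(output, "wb") as opened:
        pickle.dump(lineages, opened)
    return output


def read_chunk(path):
    """Load one pickled chunk back."""
    with open(path, "rb") as opened:
        return pickle.load(opened)
```

A temporary working directory can be created with `tempfile.TemporaryDirectory(suffix="_ncbitax2lin")` and passed in as `workdir`, so no path is ever built from a hard-coded separator.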

zyxue commented 2 years ago

Could you update again and try version 2.2.0, please?

bpil83 commented 2 years ago

Runs perfectly now, thank you.