Closed bpil83 closed 2 years ago
What do the first few lines of your nodes.dmp and names.dmp look like? Do they look like this?
% head -n3 nodes.dmp
1 | 1 | no rank | | 8 | 0 | 1 | 0 | 0 ||
2 | 131567 | superkingdom | | 0 | 0 | 11 | 0 | 0|
6 | 335928 | genus | | 0 | 1 | 11 | 1 | 0 ||
% head -n3 names.dmp
1 | all | | synonym |
1 | root | | scientific name |
2 | Bacteria | Bacteria <bacteria> | scientific name |
Dear zyxue, I'm getting the same error. The first lines of each file look like the ones you describe:
% head -n3 taxdump/nodes.dmp
1 | 1 | no rank | | 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | |
2 | 131567 | superkingdom | | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | |
6 | 335928 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
% head -n3 taxdump/names.dmp
1 | all | | synonym |
1 | root | | scientific name |
2 | Bacteria | Bacteria <bacteria> | scientific name |
@josuebarrera, are you also on Windows?
No, I'm using macOS Big Sur
Never mind, I just installed ncbitax2lin on Manjaro Linux and it works perfectly.
I'm getting what looks like the same error on a Windows system. The key is different (101004 in my case), but that taxon ID is present in both the nodes file and the names file:
E:\NCBI>ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp
2021-10-27 16:28:50,579|INFO|time spent on load_nodes: 0:00:02.466410
2021-10-27 16:28:55,370|INFO|time spent on load_names: 0:00:04.790171
2021-10-27 16:28:57,390|INFO|# of tax ids: 2,372,129
2021-10-27 16:28:57,757|INFO|df.info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2372129 entries, 0 to 2372128
Data columns (total 4 columns):
# Column Dtype
--- ------ -----
0 tax_id int64
1 parent_tax_id int64
2 rank object
3 rank_name object
dtypes: int64(2), object(2)
memory usage: 368.9 MB
2021-10-27 16:28:57,757|INFO|Generating TAXONOMY_DICT ...
2021-10-27 16:29:03,212|INFO|found 8 cpus, and will use all of them to find lineages for all tax ids
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 48, in mapstar
return list(map(*args))
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 78, in find_lineage
record = TAXONOMY_DICT[tax_id]
KeyError: 101004
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\Scripts\ncbitax2lin.exe\__main__.py", line 7, in <module>
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 192, in main
fire.Fire(taxonomy_to_lineages)
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\fire\core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\fire\core.py", line 463, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 179, in taxonomy_to_lineages
lineages = find_all_lineages(df_data.tax_id)
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\site-packages\ncbitax2lin\ncbitax2lin.py", line 101, in find_all_lineages
return pool.map(find_lineage, tax_ids)
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\thegr\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 771, in get
raise self._value
KeyError: 101004
Sorry, I have no access to a Windows machine. Maybe the separator between columns is treated differently on Windows? If you figure out why, please feel free to send a PR.
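To rule out the separator theory, it can help to parse a few lines by hand. The NCBI taxdump files separate fields with a tab, a pipe, and a tab, and end each row with a tab and a pipe before the newline (the spaces shown in the listings above are just how the tabs render). Below is a minimal sketch of such a parser; parse_dmp_line is a hypothetical helper written for this check, not part of ncbitax2lin, and it also strips a Windows-style \r\n line ending.

```python
def parse_dmp_line(line):
    """Split one taxdump row into its fields.

    Hypothetical helper for manual checks; not ncbitax2lin's own parser.
    """
    # Drop the line ending, whether it is "\n" or Windows-style "\r\n".
    line = line.rstrip("\r\n")
    # Each row ends with a tab and a pipe; remove that terminator.
    if line.endswith("\t|"):
        line = line[:-2]
    # Fields are separated by tab, pipe, tab.
    return [field.strip() for field in line.split("\t|\t")]


# Sample rows modelled on the head output quoted earlier in the thread.
node_fields = parse_dmp_line("1\t|\t1\t|\tno rank\t|\t\t|\t8\t|\n")
name_fields = parse_dmp_line(
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\r\n"
)
print(node_fields)
print(name_fields)
```

If the first field of every parsed row converts cleanly with int(), the separator is probably not the culprit.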
The same problem here (KeyError: 1) on Mac Mojave.
Same on Mac OS X Monterey 12.1 with Python 3.8.
I've reproduced the error and will take a look. It looks like the way the global variable TAXONOMY_DICT is updated across multiple processes has changed, or is inconsistent across operating systems.
Could you please update to the new version (2.1.0) with pip install -U ncbitax2lin and try again?
I updated, but now I get a new error. It looks like it could be a local problem, but I'm not sure:
2022-01-27 10:41:03,366|INFO|Generating a dictionary of taxonomy: tax_id => tax_unit ...
2022-01-27 10:41:14,977|INFO|size of taxonomy_dict: ~80 MB
2022-01-27 10:41:15,031|INFO|Finding all lineages ...
2022-01-27 10:41:15,031|INFO|will use 6 processes to find lineages for all tax ids
2022-01-27 10:41:15,036|INFO|chunk_size = 393281
2022-01-27 10:41:15,055|INFO|chunked sizes: [393281, 393281, 393281, 393281, 393281, 393281]
2022-01-27 10:41:15,059|INFO|Starting 6 processes ...
working on tax_id: 50000
working on tax_id: 100000
working on tax_id: 150000
working on tax_id: 200000
working on tax_id: 250000
working on tax_id: 300000
working on tax_id: 350000
working on tax_id: 400000
working on tax_id: 450000
Process Process-1:
Traceback (most recent call last):
File "c:\users\bpil\anaconda3\lib\multiprocessing\process.py", line 297, in _bootstrap
self.run()
File "c:\users\bpil\anaconda3\lib\multiprocessing\process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "c:\users\bpil\anaconda3\lib\site-packages\ncbitax2lin\lineage.py", line 52, in _find_lineages
with open(output, "wb") as opened:
OSError: [Errno 22] Invalid argument: '/C:\\Users\\BPIL\\AppData\\Local\\Temp\\tmpqdbfb9e4_ncbitax2lin/_lineages_0.pkl'
Hmm. It seems this could be because Windows paths use \ as the separator. I will send a patch.
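For anyone hitting the same OSError: the leading / in front of C:\ in the message suggests the temporary path was assembled by joining strings with a hard-coded separator. The actual patch is not shown in this thread. As a minimal sketch of a portable alternative, the standard library's pathlib and tempfile can build the path with whatever separator the OS expects; write_chunk is a hypothetical helper, not the tool's own function.

```python
import pickle
import tempfile
from pathlib import Path


def write_chunk(tmp_dir, index, lineages):
    """Pickle one chunk of lineages to <tmp_dir>/_lineages_<index>.pkl.

    Hypothetical helper; the "/" operator on Path inserts the correct
    separator for the current OS instead of a hard-coded one.
    """
    output = Path(tmp_dir) / f"_lineages_{index}.pkl"
    with open(output, "wb") as opened:
        pickle.dump(lineages, opened)
    return output


# Create a temp directory the portable way and round-trip one toy chunk.
with tempfile.TemporaryDirectory(suffix="_ncbitax2lin") as tmp_dir:
    path = write_chunk(tmp_dir, 0, [{"tax_id": 1}])
    with open(path, "rb") as opened:
        restored = pickle.load(opened)

print(path.name, restored)
```

The same code runs unchanged on Windows, macOS and Linux because no separator is ever written by hand.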
Could you update again and try version 2.2.0, please?
Runs perfectly now, thank you.
I'm running the scripts as instructed in Anaconda and I get this error. I'm not good enough at Python to figure out the problem. Can you help?
Console feed: