Closed: eray-sahin closed this issue 1 year ago
It may be that your machine doesn't have enough memory. Would you be able to try it on a bigger node?
Hi, I'm facing the same issue. My job gets killed when running ncbitax2lin because it exhausts all available memory:
failed 47 : execd enforced h_rss limit
...
maxrss 209.618G
200 GB feels like it should be enough RAM, given that the nodes.dmp and names.dmp files are only 161 MB and 213 MB in size, respectively.
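(One reason in-memory usage can dwarf the on-disk size: every value held in a Python object, e.g. in a pandas `object` column, carries per-object overhead on top of its raw bytes. A rough illustration, not ncbitax2lin's own code:)

```python
# Illustrative only: compare a string's on-disk byte count with the size of
# the Python str object that holds it in memory.
import sys

name = "cellular organisms"          # a typical names.dmp value
on_disk = len(name.encode("utf-8"))  # bytes this string occupies in the file
in_memory = sys.getsizeof(name)      # bytes the Python str object occupies

print(on_disk, in_memory)  # the object is several times larger than the raw bytes
```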
In case it's relevant: 6 processes were used:
2022-10-10 05:28:44,480|INFO|will use 6 processes to find lineages for all 2,447,832 tax ids
2022-10-10 05:28:44,480|INFO|chunk_size = 407972
2022-10-10 05:28:44,496|INFO|chunked sizes: [407972, 407972, 407972, 407972, 407972, 407972]
2022-10-10 05:28:44,505|INFO|Starting 6 processes ...
2022-10-10 05:28:44,754|INFO|Joining 6 processes ...
What are the expected memory requirements of ncbitax2lin?
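(To pin down where the memory actually goes, it can help to record the process's own peak RSS. A minimal sketch using Python's standard `resource` module; note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS:)

```python
# Report the calling process's peak resident set size so far.
import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
print(f"peak RSS so far: {usage.ru_maxrss}")
```

On a cluster, wrapping the whole job with GNU time (`/usr/bin/time -v ncbitax2lin ...`) reports "Maximum resident set size" without any code changes.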
Hmm, 200 GB should be more than enough. I just tried it again on a machine with 32 GB (ncbitax2lin probably used less):
ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp
2022-10-10 10:18:41,713|INFO|time spent on load_nodes: 0:00:03.560350
2022-10-10 10:18:47,738|INFO|time spent on load_names: 0:00:06.024048
2022-10-10 10:18:49,831|INFO|# of tax ids: 2,447,831
2022-10-10 10:18:50,238|INFO|df.info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2447831 entries, 0 to 2447830
Data columns (total 4 columns):
# Column Dtype
--- ------ -----
0 tax_id int64
1 parent_tax_id int64
2 rank object
3 rank_name object
dtypes: int64(2), object(2)
memory usage: 380.5 MB
2022-10-10 10:18:50,238|INFO|Generating a dictionary of taxonomy: tax_id => tax_unit ...
2022-10-10 10:18:58,123|INFO|size of taxonomy_dict: ~80 MB
2022-10-10 10:18:58,193|INFO|Finding all lineages ...
2022-10-10 10:18:58,194|INFO|will use 6 processes to find lineages for all 2,447,831 tax ids
2022-10-10 10:18:58,194|INFO|chunk_size = 407972
2022-10-10 10:18:58,206|INFO|chunked sizes: [407972, 407972, 407972, 407972, 407972, 407971]
2022-10-10 10:18:58,211|INFO|Starting 6 processes ...
working on tax_id: 50000
...
working on tax_id: 2400000
working on tax_id: 2450000
2022-10-10 10:20:11,456|INFO|Joining 6 processes ...
working on tax_id: 2500000
...
2022-10-10 10:20:20,502|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_0.pkl ...
2022-10-10 10:20:23,449|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_1.pkl ...
2022-10-10 10:20:26,110|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_2.pkl ...
2022-10-10 10:20:28,776|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_3.pkl ...
2022-10-10 10:20:31,128|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_4.pkl ...
2022-10-10 10:20:34,371|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_5.pkl ...
2022-10-10 10:20:37,837|INFO|Preparings all lineages into a dataframe to be written to disk ...
2022-10-10 10:21:27,032|INFO|Writing lineages to ncbi_lineages_2022-10-10.csv.gz ...
And it works fine.
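(For reference, the "tax_id => tax_unit" dictionary and the lineage walk the log describes can be sketched roughly like this, using a toy taxonomy rather than ncbitax2lin's actual internals:)

```python
# Toy sketch of the lineage-finding step: map each tax_id to its parent and
# rank, then walk parent pointers up to the root. Names and structure are
# illustrative, not ncbitax2lin's real data structures.
taxonomy = {
    # tax_id: (parent_tax_id, rank, name)
    1:      (1, "no rank", "root"),
    131567: (1, "no rank", "cellular organisms"),
    2:      (131567, "superkingdom", "Bacteria"),
    1224:   (2, "phylum", "Pseudomonadota"),
}

def find_lineage(tax_id):
    """Return the lineage from the root down to tax_id as (rank, name) pairs."""
    lineage = []
    while True:
        parent, rank, name = taxonomy[tax_id]
        lineage.append((rank, name))
        if parent == tax_id:  # in nodes.dmp the root's parent is itself
            break
        tax_id = parent
    return list(reversed(lineage))

print(find_lineage(1224))
```

The dictionary itself is small (~80 MB per the log above); the walk is cheap per tax id, which is why splitting the 2.4M ids across processes is mainly a CPU-time optimization rather than a memory one.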
Please feel free to reopen if you have any further questions.
Hello,
When I run the command:
ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp ../ncbi.tax.10.txt
it does not produce any output. The log messages were:
Thank you