zyxue / ncbitax2lin

🐞 Convert NCBI taxonomy dump into lineages

Could not get any output #23

Closed eray-sahin closed 1 year ago

eray-sahin commented 2 years ago

Hello,

When I run the command: ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp ../ncbi.tax.10.txt

it does not produce any output. The log messages were:

2022-07-14 12:45:57,883|INFO|time spent on load_nodes: 0:00:04.126396
2022-07-14 12:46:05,872|INFO|time spent on load_names: 0:00:07.987480
2022-07-14 12:46:08,622|INFO|# of tax ids: 2,431,352
2022-07-14 12:46:09,087|INFO|df.info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2431352 entries, 0 to 2431351
Data columns (total 4 columns):
 #   Column         Dtype
---  ------         -----
 0   tax_id         int64
 1   parent_tax_id  int64
 2   rank           object
 3   rank_name      object
dtypes: int64(2), object(2)
memory usage: 378.0 MB

2022-07-14 12:46:09,087|INFO|Generating a dictionary of taxonomy: tax_id => tax_unit ...
2022-07-14 12:46:19,067|INFO|size of taxonomy_dict: ~80 MB
2022-07-14 12:46:19,126|INFO|Finding all lineages ...
2022-07-14 12:46:19,126|INFO|will use 6 processes to find lineages for all 2,431,352 tax ids
2022-07-14 12:46:19,139|INFO|chunk_size = 405226
2022-07-14 12:46:19,148|INFO|chunked sizes: [405226, 405226, 405226, 405226, 405226, 405222]
2022-07-14 12:46:19,156|INFO|Starting 6 processes ...
2022-07-14 12:46:19,715|INFO|Joining 6 processes ...
working on tax_id: 2500000
working on tax_id: 2000000
working on tax_id: 1550000
working on tax_id: 1150000
working on tax_id: 500000
working on tax_id: 50000
working on tax_id: 2050000
working on tax_id: 2550000
working on tax_id: 1600000
working on tax_id: 1200000
working on tax_id: 2100000
working on tax_id: 100000
working on tax_id: 2600000
working on tax_id: 1650000
working on tax_id: 1250000
working on tax_id: 2150000
working on tax_id: 150000
working on tax_id: 650000
working on tax_id: 1700000
working on tax_id: 1300000
working on tax_id: 2650000
working on tax_id: 700000
working on tax_id: 200000
working on tax_id: 2200000
working on tax_id: 1350000
working on tax_id: 1750000
working on tax_id: 750000
working on tax_id: 250000
working on tax_id: 1800000
working on tax_id: 1400000
working on tax_id: 2250000
working on tax_id: 2750000
working on tax_id: 850000
working on tax_id: 300000
working on tax_id: 2300000
working on tax_id: 1850000
working on tax_id: 900000
working on tax_id: 2800000
working on tax_id: 1500000
working on tax_id: 1900000
working on tax_id: 2850000
working on tax_id: 350000
working on tax_id: 2350000
working on tax_id: 400000
working on tax_id: 1000000
working on tax_id: 1950000
working on tax_id: 2900000
working on tax_id: 2400000
working on tax_id: 2950000
working on tax_id: 1050000
working on tax_id: 450000
working on tax_id: 2450000
2022-07-14 12:46:45,642|INFO|adding lineages from /tmp/tmpa3bjevds_ncbitax2lin/_lineages_0.pkl ...
2022-07-14 12:46:49,074|INFO|adding lineages from /tmp/tmpa3bjevds_ncbitax2lin/_lineages_1.pkl ...
2022-07-14 12:46:51,566|INFO|adding lineages from /tmp/tmpa3bjevds_ncbitax2lin/_lineages_2.pkl ...
2022-07-14 12:46:53,381|INFO|adding lineages from /tmp/tmpa3bjevds_ncbitax2lin/_lineages_3.pkl ...
2022-07-14 12:46:56,254|INFO|adding lineages from /tmp/tmpa3bjevds_ncbitax2lin/_lineages_4.pkl ...
2022-07-14 12:46:59,116|INFO|adding lineages from /tmp/tmpa3bjevds_ncbitax2lin/_lineages_5.pkl ...
2022-07-14 12:47:00,507|INFO|Preparings all lineages into a dataframe to be written to disk ...
Killed 

Thank you

zyxue commented 2 years ago

It may be that your machine doesn't have enough memory. Would you be able to try it on a bigger node?
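
Aside: the bare "Killed" at the end of the log above is typically what's left behind when the Linux OOM killer terminates a process. Below is a minimal sketch for measuring how much memory a run actually peaks at, assuming a Linux host; the wrapper script is illustrative and not part of ncbitax2lin:

import resource
import subprocess

# Run ncbitax2lin as a child process, with the same arguments as above.
subprocess.run(
    ["ncbitax2lin",
     "--nodes-file", "taxdump/nodes.dmp",
     "--names-file", "taxdump/names.dmp"],
    check=True,
)

# ru_maxrss is in kilobytes on Linux (bytes on macOS). For RUSAGE_CHILDREN
# it reports the largest single descendant's peak RSS, not the sum across
# ncbitax2lin's worker processes.
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak RSS among child processes: {peak_kb / 1024:.0f} MiB")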

rsmeral commented 2 years ago

Hi, I'm facing the same issue. My job gets killed when running ncbitax2lin, because it exhausts all available memory:

failed       47  : execd enforced h_rss limit
...
maxrss       209.618G

200 GB feels like it should be enough RAM, given that the nodes.dmp and names.dmp files are only 161 MB and 213 MB in size, respectively.
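
For what it's worth, the on-disk size of the dump files is a poor predictor of the in-memory footprint: the lineage table is much wider than nodes.dmp, and pandas object-dtype columns (see the df.info output above) hold one Python string object per cell, each carrying tens of bytes of overhead. A rough illustration with made-up data, not ncbitax2lin's actual code:

import pandas as pd

# ~2.4M short, distinct strings in a single object-dtype column:
n = 2_447_832
col = pd.Series([f"taxon {i}" for i in range(n)])

# Far larger than the ~30 MB the same strings occupy as raw text;
# the lineage table has dozens of such rank columns.
print(f"{col.memory_usage(deep=True) / 1e6:.0f} MB")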

In case it's relevant – in my case, 6 processes were used:

2022-10-10 05:28:44,480|INFO|will use 6 processes to find lineages for all 2,447,832 tax ids
2022-10-10 05:28:44,480|INFO|chunk_size = 407972
2022-10-10 05:28:44,496|INFO|chunked sizes: [407972, 407972, 407972, 407972, 407972, 407972]
2022-10-10 05:28:44,505|INFO|Starting 6 processes ...
2022-10-10 05:28:44,754|INFO|Joining 6 processes ...

What are the expected memory requirements of ncbitax2lin?

zyxue commented 2 years ago

Hmm, 200 GB should be more than enough. I just tried it again on a machine with 32 GB of RAM (ncbitax2lin probably used less):

ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp
2022-10-10 10:18:41,713|INFO|time spent on load_nodes: 0:00:03.560350
2022-10-10 10:18:47,738|INFO|time spent on load_names: 0:00:06.024048
2022-10-10 10:18:49,831|INFO|# of tax ids: 2,447,831
2022-10-10 10:18:50,238|INFO|df.info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2447831 entries, 0 to 2447830
Data columns (total 4 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   tax_id         int64 
 1   parent_tax_id  int64 
 2   rank           object
 3   rank_name      object
dtypes: int64(2), object(2)
memory usage: 380.5 MB

2022-10-10 10:18:50,238|INFO|Generating a dictionary of taxonomy: tax_id => tax_unit ...
2022-10-10 10:18:58,123|INFO|size of taxonomy_dict: ~80 MB
2022-10-10 10:18:58,193|INFO|Finding all lineages ...
2022-10-10 10:18:58,194|INFO|will use 6 processes to find lineages for all 2,447,831 tax ids
2022-10-10 10:18:58,194|INFO|chunk_size = 407972
2022-10-10 10:18:58,206|INFO|chunked sizes: [407972, 407972, 407972, 407972, 407972, 407971]
2022-10-10 10:18:58,211|INFO|Starting 6 processes ...
working on tax_id: 50000

...
working on tax_id: 2400000
working on tax_id: 2450000
2022-10-10 10:20:11,456|INFO|Joining 6 processes ...
working on tax_id: 2500000
...

2022-10-10 10:20:20,502|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_0.pkl ...
2022-10-10 10:20:23,449|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_1.pkl ...
2022-10-10 10:20:26,110|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_2.pkl ...
2022-10-10 10:20:28,776|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_3.pkl ...
2022-10-10 10:20:31,128|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_4.pkl ...
2022-10-10 10:20:34,371|INFO|adding lineages from /var/folders/x_/8jph6xyj3t13_w0j0q9fpqwm0000gp/T/tmp9b1a2prg_ncbitax2lin/_lineages_5.pkl ...
2022-10-10 10:20:37,837|INFO|Preparings all lineages into a dataframe to be written to disk ...
2022-10-10 10:21:27,032|INFO|Writing lineages to ncbi_lineages_2022-10-10.csv.gz ...

And it works fine.
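
In case it helps anyone landing here, the generated file can then be loaded directly with pandas (the filename matches the run above; adjust the date to your own run):

import pandas as pd

# read_csv decompresses .gz transparently based on the file extension.
lineages = pd.read_csv("ncbi_lineages_2022-10-10.csv.gz")
print(lineages.head())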

zyxue commented 1 year ago

Please feel free to reopen if you have any further questions.