xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.io
Apache License 2.0
1.06k stars 67 forks source link

BUG: service stopped when pivot a 1125138913x5 matrix into 4000 columns on a 160U-4096GBmem machine #741

Closed huboqiang closed 8 months ago

huboqiang commented 8 months ago

Describe the bug

Running pivot and service stopped:

image

To Reproduce

To help us to reproduce this bug, please provide information below:

image
  1. Your Python version 3.11.4
  2. The version of Xorbits you use 0.5.1
  3. Versions of crucial packages, such as numpy(1.25.2), scipy(1.11.2) and pandas(2.0.3)
  4. Full stack of the error.
……
……
……
/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/dataframe/base/pivot.py:72: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  input_data[dtype] = np.nan
/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/dataframe/base/pivot.py:72: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  input_data[dtype] = np.nan
/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/dataframe/base/pivot.py:72: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  input_data[dtype] = np.nan
2023-10-13 19:20:03,359 xorbits._mars.services.scheduling.worker.execution 337003 ERROR    Failed to run subtask n6Qo8tTl8A6Pn7uCwJHxGAEK on band numa-0
Traceback (most recent call last):
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 445, in _run_subtask_once
    return await asyncio.shield(aiotask)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/subtask/api.py", line 70, in run_subtask_in_slot
    return await ref.run_subtask.options(profiling_context=profiling_context).send(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 226, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
           ^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/core.py", line 84, in _listen
    raise ServerClosed(
xoscar.errors.ServerClosed: Remote server unixsocket:///361854215913472 closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 402, in internal_run_subtask
    subtask_info.result = await self._retry_run_subtask(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 513, in _retry_run_subtask
    return await _retry_run(subtask, subtask_info, _run_subtask_once)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 95, in _retry_run
    raise ex
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 73, in _retry_run
    return await target_async_func(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 491, in _run_subtask_once
    raise ex
xoscar.errors.ServerClosed: unexpectedly terminated process (Remote server unixsocket:///361854215913472 closed) with address 127.0.0.1:39737, which is highly suspected to be caused by an Out-of-Memory (OOM) problem
2023-10-13 19:20:03,454 xorbits._mars.services.task.execution.mars.stage 337003 ERROR    Subtask n6Qo8tTl8A6Pn7uCwJHxGAEK errored
Traceback (most recent call last):
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 445, in _run_subtask_once
    return await asyncio.shield(aiotask)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/subtask/api.py", line 70, in run_subtask_in_slot
    return await ref.run_subtask.options(profiling_context=profiling_context).send(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 226, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
           ^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/core.py", line 84, in _listen
    raise ServerClosed(
xoscar.errors.ServerClosed: Remote server unixsocket:///361854215913472 closed
……
……
……
  1. Minimized code to reproduce the error.
    df = xpd.read_parquet("./merged.pq")
    print(df.shape)   # (1125138913, 5)
    df_pvt = df.pivot(index=[0,1,2], columns=[4], values=[3])
    df_pvt

Expected behavior

A clear and concise description of what you expected to happen.

I think there were some bug for large scale data and the server closed.

Additional context

Add any other context about the problem here. No.

huboqiang commented 8 months ago

Using rechunk(chunk_size=10_000) can adjust partrition policy like spark repartrition.

image