xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.readthedocs.io
Apache License 2.0

BUG: Putting an Arrow-backed pandas DataFrame raises pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays #630

Open codingl2k1 opened 1 year ago

codingl2k1 commented 1 year ago

Describe the bug

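Putting an Arrow-backed pandas DataFrame into storage fails while the data is being pickled. According to the traceback below, pandas' `__getstate__` for Arrow-backed arrays calls `combine_chunks()` on the underlying chunked array, and pyarrow raises `ArrowInvalid: offset overflow while concatenating arrays`.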

  File "xoscar/core.pyx", line 378, in _handle_actor_result
  File "/Users/codingl2k1/Work/xorbits/python/xorbits/_mars/services/subtask/worker/processor.py", line 616, in run
    ) = await self._store_data(chunk_graph)
  File "/Users/codingl2k1/Work/xorbits/python/xorbits/_mars/services/subtask/worker/processor.py", line 346, in _store_data
    store_infos = await put_infos
  File "/Users/codingl2k1/Work/xoscar/python/xoscar/batch.py", line 147, in _async_batch
    return [await self._async_call(*args_list[0], **kwargs_list[0])]
  File "/Users/codingl2k1/Work/xoscar/python/xoscar/batch.py", line 96, in _async_call
    return await self.func(*args, **kwargs)
  File "/Users/codingl2k1/Work/xorbits/python/xorbits/_mars/services/storage/api/oscar.py", line 119, in put
    return await self._storage_handler_ref.put(
  File "xoscar/core.pyx", line 266, in __pyx_actor_method_wrapper
  File "xoscar/core.pyx", line 269, in xoscar.core.__pyx_actor_method_wrapper
  File "/Users/codingl2k1/Work/xoscar/python/xoscar/batch.py", line 96, in _async_call
    return await self.func(*args, **kwargs)
  File "/Users/codingl2k1/Work/xorbits/python/xorbits/_mars/services/storage/handler.py", line 221, in put
    object_info = await self._clients[level].put(obj)
  File "/Users/codingl2k1/Work/xorbits/python/xorbits/_mars/storage/shared_memory.py", line 173, in put
    buffers = await serializer.run()
  File "/Users/codingl2k1/Work/xoscar/python/xoscar/serialization/aio.py", line 71, in run
    return await self._get_buffers()
  File "/Users/codingl2k1/Work/xoscar/python/xoscar/serialization/aio.py", line 40, in _get_buffers
    headers, buffers = await serialize_with_spawn(
  File "xoscar/serialization/core.pyx", line 767, in serialize_with_spawn
  File "xoscar/serialization/core.pyx", line 607, in xoscar.serialization.core._serial_single
  File "xoscar/serialization/core.pyx", line 265, in xoscar.serialization.core.PickleSerializer.serial
  File "xoscar/serialization/core.pyx", line 230, in xoscar.serialization.core.pickle_buffers
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/arrays/arrow/array.py", line 455, in __getstate__
    state["_data"] = self._data.combine_chunks()
  File "pyarrow/table.pxi", line 731, in pyarrow.lib.ChunkedArray.combine_chunks
  File "pyarrow/array.pxi", line 3321, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: [address=127.0.0.1:60088, pid=63873] offset overflow while concatenating arrays
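
The last frames point at Arrow's offset width as the likely cause. Arrow's `string` type stores value offsets as 32-bit integers, so one contiguous array can address at most 2**31 - 1 bytes of character data, while a chunked array can hold more than that across its chunks. `combine_chunks()` has to merge all chunks into a single array, which is where the overflow would surface. The arithmetic can be sketched in plain Python; the helper name below is hypothetical and not part of pyarrow, and whether this is the exact trigger here is an assumption, since the issue gives no data sizes.

```python
# Largest offset representable with Arrow's 32-bit string offsets.
INT32_MAX = 2**31 - 1


def would_overflow_int32_offsets(chunk_byte_sizes):
    """Return True if the combined character data of all chunks
    cannot be addressed by int32 offsets in one contiguous array.

    chunk_byte_sizes: iterable of per-chunk payload sizes in bytes.
    (Hypothetical helper for illustration only.)
    """
    return sum(chunk_byte_sizes) > INT32_MAX


# Two chunks of 1 GiB each fit individually but not once concatenated.
one_gib = 2**30
fits_separately = all(size <= INT32_MAX for size in (one_gib, one_gib))
overflows_combined = would_overflow_int32_offsets([one_gib, one_gib])
```

Under this reading, each chunk is valid on its own, and only the concatenation step fails.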

To Reproduce

To help reproduce this bug, the environment is listed below:

  1. Python version: 3.11.4
  2. Xorbits version: not stated
  3. Versions of crucial packages: pandas 2.0.3, pyarrow 12.0.1
  4. Full stack of the error: see the traceback under "Describe the bug".
  5. Minimized code to reproduce the error: not provided.

Expected behavior

Putting an Arrow-backed pandas DataFrame into storage should succeed without raising `ArrowInvalid`.

Additional context

Related issue: https://github.com/apache/arrow/issues/33049