xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.readthedocs.io
Apache License 2.0

BUG: estimate_pandas_size on arrow based pandas dataframe raises pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays #629

Open codingl2k1 opened 1 year ago

Describe the bug

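Calling `estimate_pandas_size` on an Arrow-backed pandas DataFrame raises `pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays`. The traceback: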

  File "/Users/codingl2k1/Work/xorbits/python/xorbits/_mars/utils.py", line 496, in calc_data_size
    return estimate_pandas_size(dt)
  File "/Users/codingl2k1/Work/xorbits/python/xorbits/_mars/utils.py", line 571, in estimate_pandas_size
    sample_size = sys.getsizeof(iloc[indices])
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/indexing.py", line 1103, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/indexing.py", line 1647, in _getitem_axis
    return self._get_list_axis(key, axis=axis)
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/indexing.py", line 1618, in _get_list_axis
    return self.obj._take_with_is_copy(key, axis=axis)
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/generic.py", line 3948, in _take_with_is_copy
    result = self._take(indices=indices, axis=axis)
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/generic.py", line 3932, in _take
    new_data = self._mgr.take(
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 963, in take
    return self.reindex_indexer(
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 747, in reindex_indexer
    new_blocks = [
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 748, in <listcomp>
    blk.take_nd(
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/internals/blocks.py", line 945, in take_nd
    new_values = algos.take_nd(
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/array_algos/take.py", line 114, in take_nd
    return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/arrays/arrow/array.py", line 1035, in take
    return type(self)(self._data.take(indices))
  File "pyarrow/table.pxi", line 1029, in pyarrow.lib.ChunkedArray.take
  File "/Users/codingl2k1/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pyarrow/compute.py", line 482, in take
    return call_function('take', [data, indices], options, memory_pool)
  File "pyarrow/_compute.pyx", line 572, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 367, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: [address=127.0.0.1:59388, pid=61917] offset overflow while concatenating arrays

To Reproduce

Python 3.11.4, pandas 2.0.3, pyarrow 12.0.1

Expected behavior

`estimate_pandas_size` should return a size estimate for an Arrow-backed pandas DataFrame without raising.

Additional context

Similar issue: https://github.com/apache/arrow/issues/33049
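
A possible reading of the error, offered as an assumption and not confirmed in this report: Arrow's `string` type stores value offsets as 32-bit signed integers, so one contiguous array can address at most 2**31 - 1 bytes of character data. If `take` on a chunked column first merges the chunks into a single array, the merged data can exceed that limit even though every chunk fits on its own. The sketch below only illustrates that arithmetic; `concat_overflows_int32_offsets` and its `chunk_nbytes` argument are hypothetical names, not part of Xorbits, pandas, or pyarrow.

```python
# Largest byte offset a 32-bit signed integer can hold. Arrow's `string`
# and `binary` types store their value offsets as int32.
INT32_MAX = 2**31 - 1


def concat_overflows_int32_offsets(chunk_nbytes):
    """Return True if merging the chunks would exceed int32 offsets.

    `chunk_nbytes` is a hypothetical input: the number of bytes of string
    data held by each chunk of a column.
    """
    return sum(chunk_nbytes) > INT32_MAX


# Three chunks of 1 GiB each fit individually but not once merged.
GIB = 2**30
chunks = [GIB, GIB, GIB]
print(all(n <= INT32_MAX for n in chunks))     # per-chunk check
print(concat_overflows_int32_offsets(chunks))  # merged-array check
```

If this is the cause, columns typed as `large_string` (64-bit offsets) would not hit the same limit, and an estimator that avoids positional `take` on such columns might sidestep the error. Both are untested suggestions here.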