xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.readthedocs.io
Apache License 2.0

BUG: read_csv does not support gzip files #687

Open qinxuye opened 12 months ago

qinxuye commented 12 months ago

Describe the bug

read_csv does not support gzip-compressed files: reading a .csv.gz path raises a UnicodeDecodeError.

To Reproduce

To help us reproduce this bug, please provide the information below:

  1. Your Python version
  2. The version of Xorbits you use
  3. Versions of crucial packages, such as numpy, scipy and pandas
  4. Full stack trace of the error.
  5. Minimal code to reproduce the error.
import xorbits.pandas as pd
pd.read_csv('/Users/xuyeqin/Workspace/xorbits_sql/xorbits_sql/tests/tpc-h/lineitem.csv.gz', sep='|')
Traceback (most recent call last):
  File "/Users/xuyeqin/miniconda3/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-68341782af92>", line 1, in <module>
    pd.read_csv('/Users/xuyeqin/Workspace/xorbits_sql/xorbits_sql/tests/tpc-h/lineitem.csv.gz', sep='|')
  File "/Users/xuyeqin/Workspace/xorbits/python/xorbits/core/adapter.py", line 472, in wrapped
    return from_mars(c(*to_mars(args), **to_mars(kwargs)))
  File "/Users/xuyeqin/Workspace/xorbits/python/xorbits/_mars/dataframe/datasource/read_csv.py", line 684, in read_csv
    mini_df = pd.read_csv(
  File "/Users/xuyeqin/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/Users/xuyeqin/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/xuyeqin/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/Users/xuyeqin/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1679, in _make_engine
    return mapping[engine](f, **self.options)
  File "/Users/xuyeqin/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 548, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 637, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 848, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "pandas/_libs/parsers.pyx", line 2017, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
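
A note on the error message: 0x1f 0x8b is the gzip magic number, so a failure on byte 0x8b at position 1 suggests the compressed bytes were handed to the parser without being decompressed first. The snippet below is a minimal stdlib-only sketch of that reading, not the Xorbits code path. The helpers is_gzip and read_rows and the sample file are hypothetical names made up for illustration.

```python
import csv
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """Hypothetical helper: True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def read_rows(path, sep="|"):
    """Hypothetical helper: read delimited rows, decompressing when needed."""
    opener = gzip.open if is_gzip(path) else open
    with opener(path, "rt", encoding="utf-8", newline="") as f:
        return list(csv.reader(f, delimiter=sep))


# Build a tiny stand-in for lineitem.csv.gz (assumed content, not TPC-H data).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8", newline="") as f:
    f.write("1|foo\n2|bar\n")

# Decoding the raw compressed bytes as UTF-8 fails on the second magic byte.
with open(sample_path, "rb") as f:
    raw = f.read()
try:
    raw.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Sniffing the magic number and decompressing first reads the rows cleanly.
rows = read_rows(sample_path)
```

Here decode_error carries the same message as the traceback above, while read_rows returns the parsed rows. Plain pandas avoids the problem because its read_csv infers compression from the .gz extension by default. Whether the sampling call in xorbits/_mars/dataframe/datasource/read_csv.py receives the path or an already-opened raw file object is not visible from the traceback, so that part remains an assumption.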