xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.readthedocs.io
Apache License 2.0

BUG: Cannot read CSV file encoded in GBK #613

Closed: lx12633036 closed this issue 1 year ago

lx12633036 commented 1 year ago

Describe the bug

Xorbits cannot read GBK-encoded CSV files, whereas pandas can when given the 'encoding' parameter.
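Until this is fixed, one possible stopgap is to re-encode the file as UTF-8 before handing it to Xorbits. The sketch below uses only the standard library; `transcode_to_utf8` and the file names are hypothetical names chosen for illustration, not part of Xorbits or pandas.

```python
def transcode_to_utf8(src, dst, src_encoding="gbk"):
    """Copy a text file line by line, re-encoding it from src_encoding to UTF-8.

    Hypothetical helper for illustration; newline="" keeps the original
    line endings untouched.
    """
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
    return dst


# Intended usage (paths are placeholders, not taken from the report):
#   import xorbits.pandas as pd
#   df = pd.read_csv(transcode_to_utf8("A.csv", "A_utf8.csv"))
```

This costs an extra copy of the data on disk, so it is only a workaround, not a replacement for proper 'encoding' support.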

To Reproduce

To help us to reproduce this bug, please provide information below:

  1. Your Python version: 3.83.13
  2. The version of Xorbits you use: 0.4.4
  3. Versions of crucial packages, such as numpy, scipy and pandas: pandas-2.0.1
  4. Full stack of the error:
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[1], line 2
      1 import xorbits.pandas as pd
----> 2 metro_exit=pd.read_csv(r'path/A.csv',encoding='gbk')

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\xorbits\core\adapter.py:472, in wrap_mars_callable.<locals>.wrapped(*args, **kwargs)
    470 @functools.wraps(c)
    471 def wrapped(*args, **kwargs):
--> 472     return from_mars(c(*to_mars(args), **to_mars(kwargs)))

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\xorbits\_mars\dataframe\datasource\read_csv.py:714, in read_csv(path, names, sep, index_col, compression, header, dtype, usecols, nrows, skiprows, chunk_bytes, gpu, head_bytes, head_lines, incremental_index, use_arrow_dtype, storage_options, memory_scale, merge_small_files, merge_small_file_options, **kwargs)
    712     f.seek(head_start)
    713     b = f.read(head_end - head_start)
--> 714 mini_df = pd.read_csv(
    715     BytesIO(b),
    716     sep=sep,
    717     index_col=index_col,
    718     dtype=dtype,
    719     names=names,
    720     header=header,
    721     skiprows=skiprows,
    722 )
    723 if header == "infer" and names is not None:
    724     # ignore header as we always specify names
    725     header = None

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\io\parsers\readers.py:912, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    899 kwds_defaults = _refine_defaults_read(
    900     dialect,
    901     delimiter,
   (...)
    908     dtype_backend=dtype_backend,
    909 )
    910 kwds.update(kwds_defaults)
--> 912 return _read(filepath_or_buffer, kwds)

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\io\parsers\readers.py:577, in _read(filepath_or_buffer, kwds)
    574 _validate_names(kwds.get("names", None))
    576 # Create the parser.
--> 577 parser = TextFileReader(filepath_or_buffer, **kwds)
    579 if chunksize or iterator:
    580     return parser

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\io\parsers\readers.py:1407, in TextFileReader.__init__(self, f, engine, **kwds)
   1404     self.options["has_index_names"] = kwds["has_index_names"]
   1406 self.handles: IOHandles | None = None
-> 1407 self._engine = self._make_engine(f, self.engine)

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\io\parsers\readers.py:1679, in TextFileReader._make_engine(self, f, engine)
   1676     raise ValueError(msg)
   1678 try:
-> 1679     return mapping[engine](f, **self.options)
   1680 except Exception:
   1681     if self.handles is not None:

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:93, in CParserWrapper.__init__(self, src, **kwds)
     90 if kwds["dtype_backend"] == "pyarrow":
     91     # Fail here loudly instead of in cython after reading
     92     import_optional_dependency("pyarrow")
---> 93 self._reader = parsers.TextReader(src, **kwds)
     95 self.unnamed_cols = self._reader.unnamed_cols
     97 # error: Cannot determine type of 'names'

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\_libs\parsers.pyx:548, in pandas._libs.parsers.TextReader.__cinit__()

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\_libs\parsers.pyx:637, in pandas._libs.parsers.TextReader._get_header()

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\_libs\parsers.pyx:848, in pandas._libs.parsers.TextReader._tokenize_rows()

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\_libs\parsers.pyx:859, in pandas._libs.parsers.TextReader._check_tokenize_status()

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\_libs\parsers.pyx:2017, in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 53: invalid continuation byte
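Judging from the traceback, the head-sample call at read_csv.py:714 (`pd.read_csv(BytesIO(b), ...)`) does not appear to receive the 'encoding' argument, so pandas falls back to UTF-8 for that sample. The sketch below reproduces the same failure with plain pandas; the sample data and the `read_head` helper are made up for illustration and are not Xorbits code.

```python
from io import BytesIO

import pandas as pd

# Hypothetical sample: a small CSV with Chinese headers, encoded as GBK.
raw = "站点,人数\nA,1\nB,2\n".encode("gbk")


def read_head(data, **kwargs):
    """Parse an in-memory byte sample, forwarding any extra read_csv options."""
    return pd.read_csv(BytesIO(data), **kwargs)


# Without an encoding, pandas assumes UTF-8 and rejects the GBK bytes.
try:
    read_head(raw)
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Forwarding the encoding lets the same bytes parse cleanly.
df = read_head(raw, encoding="gbk")
print(decode_failed, list(df.columns))
```

If this reading of the traceback is right, forwarding the caller's 'encoding' to that inner call would likely avoid the error, though the actual fix in Xorbits may differ.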

aresnow1 commented 1 year ago

Thanks for your report, we will fix it ASAP.