Closed lx12633036 closed 1 year ago
xorbits.pandas.read_csv fails on GBK-encoded CSV files even when encoding='gbk' is passed, whereas pandas reads the same file correctly with the 'encoding' parameter.
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[1], line 2
      1 import xorbits.pandas as pd
----> 2 metro_exit=pd.read_csv(r'path/A.csv',encoding='gbk')

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\xorbits\core\adapter.py:472, in wrap_mars_callable.<locals>.wrapped(*args, **kwargs)
    470 @functools.wraps(c)
    471 def wrapped(*args, **kwargs):
--> 472     return from_mars(c(*to_mars(args), **to_mars(kwargs)))

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\xorbits\_mars\dataframe\datasource\read_csv.py:714, in read_csv(path, names, sep, index_col, compression, header, dtype, usecols, nrows, skiprows, chunk_bytes, gpu, head_bytes, head_lines, incremental_index, use_arrow_dtype, storage_options, memory_scale, merge_small_files, merge_small_file_options, **kwargs)
    712 f.seek(head_start)
    713 b = f.read(head_end - head_start)
--> 714 mini_df = pd.read_csv(
    715     BytesIO(b),
    716     sep=sep,
    717     index_col=index_col,
    718     dtype=dtype,
    719     names=names,
    720     header=header,
    721     skiprows=skiprows,
    722 )
    723 if header == "infer" and names is not None:
    724     # ignore header as we always specify names
    725     header = None

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\io\parsers\readers.py:912, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    899 kwds_defaults = _refine_defaults_read(
    900     dialect,
    901     delimiter,
   (...)
    908     dtype_backend=dtype_backend,
    909 )
    910 kwds.update(kwds_defaults)
--> 912 return _read(filepath_or_buffer, kwds)

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\io\parsers\readers.py:577, in _read(filepath_or_buffer, kwds)
    574 _validate_names(kwds.get("names", None))
    576 # Create the parser.
--> 577 parser = TextFileReader(filepath_or_buffer, **kwds)
    579 if chunksize or iterator:
    580     return parser

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\io\parsers\readers.py:1407, in TextFileReader.__init__(self, f, engine, **kwds)
   1404     self.options["has_index_names"] = kwds["has_index_names"]
   1406 self.handles: IOHandles | None = None
-> 1407 self._engine = self._make_engine(f, self.engine)

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\io\parsers\readers.py:1679, in TextFileReader._make_engine(self, f, engine)
   1676     raise ValueError(msg)
   1678 try:
-> 1679     return mapping[engine](f, **self.options)
   1680 except Exception:
   1681     if self.handles is not None:

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:93, in CParserWrapper.__init__(self, src, **kwds)
     90 if kwds["dtype_backend"] == "pyarrow":
     91     # Fail here loudly instead of in cython after reading
     92     import_optional_dependency("pyarrow")
---> 93 self._reader = parsers.TextReader(src, **kwds)
     95 self.unnamed_cols = self._reader.unnamed_cols
     97 # error: Cannot determine type of 'names'

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\_libs\parsers.pyx:548, in pandas._libs.parsers.TextReader.__cinit__()

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\_libs\parsers.pyx:637, in pandas._libs.parsers.TextReader._get_header()

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\_libs\parsers.pyx:848, in pandas._libs.parsers.TextReader._tokenize_rows()

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\_libs\parsers.pyx:859, in pandas._libs.parsers.TextReader._check_tokenize_status()

File c:\Users\###\anaconda3\envs\test_2\lib\site-packages\pandas\_libs\parsers.pyx:2017, in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 53: invalid continuation byt
Thanks for your report, we will fix it ASAP.
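For anyone on an affected version: the traceback shows the header-sniffing call `pd.read_csv(BytesIO(b), ...)` inside `read_csv.py` being made without an `encoding` argument, which is consistent with the bytes being decoded as UTF-8. One possible workaround, sketched below using only the standard library, is to transcode the file to UTF-8 first and then read the converted copy. The helper names `fails_as_utf8`, `transcode_to_utf8`, and `demo`, and the sample rows, are made up for illustration and are not part of xorbits or pandas.

```python
import csv
import tempfile
from pathlib import Path


def fails_as_utf8(data: bytes) -> bool:
    """Return True if the bytes cannot be decoded as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def transcode_to_utf8(src: Path, dst: Path, src_encoding: str = "gbk") -> Path:
    """Copy a text file line by line, re-encoding it from src_encoding to UTF-8."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
    return dst


def demo():
    """Build a small GBK file (hypothetical data), convert it, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "A.csv"
        dst = Path(tmp) / "A_utf8.csv"
        src.write_bytes("站名,出口\n人民广场,1\n".encode("gbk"))

        undecodable = fails_as_utf8(src.read_bytes())
        transcode_to_utf8(src, dst)

        # The converted file could now be passed to xorbits.pandas.read_csv
        # without an encoding argument; here it is read with the csv module
        # so the sketch has no third-party dependencies.
        with open(dst, encoding="utf-8", newline="") as f:
            rows = list(csv.reader(f))
    return undecodable, rows


if __name__ == "__main__":
    print(demo())
```

This costs an extra pass over the file and a temporary copy on disk, so it is only a stopgap for versions where `encoding='gbk'` is not honored.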