xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.readthedocs.io
Apache License 2.0

ENH: xorbits's read_parquet compatible with pandas on pyarrow engine #770

Open luweizheng opened 4 months ago

luweizheng commented 4 months ago

Xorbits integrates the pyarrow backend; see this blog post for more info. We also introduced use_arrow_dtype in read_parquet. If the pyarrow backend is installed, Xorbits detects it and sets use_arrow_dtype to True in the configuration, so parquet files are read with arrow dtypes. Some pyarrow dtypes differ from their pandas counterparts, for example timestamp. Suppose time is a timestamp column. If time has a pandas dtype, we can access it like this: df["time"].dt. But with the pyarrow dtype the column does not have the dt attribute.

If pyarrow is installed, Xorbits uses it and use_arrow_dtype in the configuration is set to True, so the data is read in pyarrow format here: https://github.com/xorbitsai/xorbits/blob/b1f1107af931e9101b22e4f1e000add3820297b5/python/xorbits/_mars/dataframe/datasource/read_parquet.py#L181C1-L201C18

We could either document this behavior or change the default behavior of the ArrowEngine when reading parquet files.