xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.readthedocs.io
Apache License 2.0

ENH: DataFrameNunique has performance issue #536

Closed ChengjieLi28 closed 1 year ago

ChengjieLi28 commented 1 year ago

I am testing `nunique` on a DataFrame with 3 columns and approximately 400 million rows. The first column contains 85,642,283 distinct values. On this data, xorbits is significantly slower than pandas.

On an AWS EC2 instance with 256 GB of memory, pandas took over 8 minutes to complete the calculation, including reading the CSV data, while xorbits took over 10 minutes.

We should introduce a shuffle step in the nunique op for this case.
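
To illustrate the idea, here is a minimal single-process sketch in plain pandas and NumPy. It is not the xorbits implementation; the function names `nunique_gather` and `nunique_shuffle` and the `n_reducers` parameter are hypothetical. The assumption is that without a shuffle, the distinct values from every chunk end up deduplicated in one place, which is expensive when there are tens of millions of them. With a shuffle, each value is routed to a reducer by its hash, so every distinct value lands in exactly one reducer and the per-reducer counts can simply be summed.

```python
import numpy as np
import pandas as pd


def nunique_gather(chunks):
    """Baseline: collect each chunk's distinct values, then deduplicate the union in one place."""
    partial = [pd.unique(chunk) for chunk in chunks]
    return len(pd.unique(np.concatenate(partial)))


def nunique_shuffle(chunks, n_reducers=4):
    """Shuffle-based sketch: hash-partition values so each distinct value
    is owned by exactly one reducer, then sum the per-reducer distinct counts.
    NaN handling is ignored here for brevity."""
    buckets = [[] for _ in range(n_reducers)]
    for chunk in chunks:
        local = pd.unique(chunk)  # map side: deduplicate within the chunk
        # Equal values always hash to the same reducer id.
        reducer_id = pd.util.hash_array(local) % np.uint64(n_reducers)
        for r in range(n_reducers):
            buckets[r].append(local[reducer_id == r])
    # Reduce side: each reducer deduplicates only its own share of the values.
    return sum(len(pd.unique(np.concatenate(b))) for b in buckets)


# Small synthetic data, only to check that both strategies agree with pandas.
rng = np.random.default_rng(0)
values = rng.integers(0, 1000, size=10_000)
chunks = np.array_split(values, 8)

expected = pd.Series(values).nunique()
print(nunique_gather(chunks) == expected, nunique_shuffle(chunks) == expected)
```

In a distributed setting the reducers would run in parallel on different workers, so no single worker has to hold all of the distinct values at once.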