xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.readthedocs.io
Apache License 2.0

FEAT: Add `Dataframe.Groupby.nth` #684

Closed JiaYaobo closed 11 months ago

JiaYaobo commented 12 months ago

What do these changes do?

Add `DataFrame.groupby.nth`, a counterpart to pandas' `GroupBy.nth`: https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.GroupBy.nth.html

Very much in progress

Related issue number

Fixes https://github.com/xorbitsai/xorbits/issues/634

Check code requirements

codecov[bot] commented 12 months ago

Codecov Report

Merging #684 (fbb3200) into main (ebc391f) will increase coverage by 0.02%. The diff coverage is 95.49%.

@@            Coverage Diff             @@
##             main     #684      +/-   ##
==========================================
+ Coverage   93.53%   93.55%   +0.02%     
==========================================
  Files        1025     1026       +1     
  Lines       79405    79516     +111     
  Branches    16453    16475      +22     
==========================================
+ Hits        74271    74392     +121     
+ Misses       3453     3436      -17     
- Partials     1681     1688       +7     
| Flag | Coverage Δ |
|---|---|
| unittests | 93.44% <95.49%> (+0.02%) :arrow_up: |

| Files Changed | Coverage Δ |
|---|---|
| python/xorbits/_mars/dataframe/groupby/nth.py | 95.41% <95.41%> (ø) |
| python/xorbits/_mars/dataframe/groupby/__init__.py | 96.55% <100.00%> (+0.12%) :arrow_up: |

... and 8 files with indirect coverage changes

aresnow1 commented 12 months ago

Thanks for your contribution! Better to add tests to cover these changes.

JiaYaobo commented 11 months ago

Tests have been added. Tagging you here, @aresnow1!

JiaYaobo commented 11 months ago

Hmm, the failures are confusing to me. The test claims that the "real shape" is (5, 3), but it should be (5, 4) since there are four columns, and the CI job ubuntu-latest, _mars/dataframe, 3.9 works fine. Any help is appreciated! @aresnow1 @qinxuye

aresnow1 commented 11 months ago

The failed check is the compatibility check. You can install pandas v1.5.3 and run the tests locally to reproduce it.

JiaYaobo commented 11 months ago

@aresnow1 thanks a lot! It seems that `DataFrame.groupby.nth` behaves differently between pandas versions. For example, with pandas==2.1.0:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        "a": np.random.randint(0, 5, size=20),
        "b": np.random.randint(0, 5, size=20),
        "c": np.random.randint(0, 5, size=20),
        "d": np.random.randint(0, 5, size=20),
    }
)

df1.groupby("b").nth(0)

results:

[Screenshot 2023-09-18 14 43 07]

However, pandas==1.5.3 results in:

[Screenshot 2023-09-18 14 44 18]

So should I modify the code to make it compatible between versions, or just skip the tests on pandas==1.5.3?
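Since the screenshots are not reproduced here, the difference can be sketched with a small deterministic frame. This is a minimal illustration, not code from the PR: the data and the `expected_shape` name are made up, and the version split assumes the change documented in the pandas 2.0 release notes, where `nth` started acting as a filter that keeps the original rows (including the grouping column) instead of returning one row per group indexed by the group key.

```python
import pandas as pd

# Illustrative data (not from the PR): two groups in column "b", four columns.
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [0, 0, 1, 1],
        "c": [5, 6, 7, 8],
        "d": [9, 10, 11, 12],
    }
)

result = df.groupby("b").nth(0)

# Assumption: pandas >= 2.0 keeps the grouping column "b" in the result,
# while pandas 1.x moves it into the index and drops it from the columns.
pandas_major = int(pd.__version__.split(".")[0])
expected_shape = (2, 4) if pandas_major >= 2 else (2, 3)

print(pd.__version__, result.shape, list(result.columns))
```

Under that assumption, the same call yields one column fewer on pandas 1.5.3, which would match the (5, 3) versus (5, 4) shape mismatch reported by the compatibility check.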

aresnow1 commented 11 months ago

Let's skip the tests then. We will later raise the minimum pandas dependency to 2.0.

JiaYaobo commented 11 months ago

@aresnow1 I now skip the tests when the pandas version is <= 1.5.3, and the tests pass as expected. PTAL :)
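A version gate like the one described above could look roughly as follows. This is a sketch under stated assumptions, not the code merged in the PR: `version_tuple` and `should_skip_nth_tests` are hypothetical helper names, and the threshold simply mirrors the "<= 1.5.3" condition mentioned in the comment.

```python
import re


def version_tuple(version: str) -> tuple:
    """Return the leading numeric components, e.g. "1.5.3" -> (1, 5, 3)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first component with no leading digits.
            break
        parts.append(int(match.group()))
    return tuple(parts)


def should_skip_nth_tests(pandas_version: str) -> bool:
    """Hypothetical gate: skip the nth tests on pandas 1.5.3 and older."""
    return version_tuple(pandas_version) <= (1, 5, 3)
```

In a test module, this could be combined with `pytest.mark.skipif(should_skip_nth_tests(pd.__version__), reason="GroupBy.nth behaves differently before pandas 2.0")` so the affected tests are reported as skipped rather than failed on the compatibility job.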