xgcm / xhistogram

Fast, flexible, label-aware histograms for numpy and xarray
https://xhistogram.readthedocs.io
MIT License
89 stars 19 forks source link

feature: multi-dimensional bins #28

Open aaronspring opened 3 years ago

aaronspring commented 3 years ago

currently, xhistogram only allows bins to be one-dimensional.

however, when the bin edges vary in time (seasonality) or space (locations of the globe) xhistogram cannot be used with multi-dim bins. there is a hard-coded requirement for bins elements to be 1-d. One of such multi-dim bin applications is the ranked probability score rps we use in xskillscore.rps, where we want to know how many forecasts fell into which bins. Bins are often defined as terciles of the forecast distribution and the bins for these terciles (forecast_with_lon_lat_time_dims.quantile(q=[.33,.66],dim='time')) depend on lon and lat.

How we solved this in xskillscore.rps: < gives us CDFs, and diff brings it back to histograms. maybe have to throw away the upper edge

Fc = (forecasts < forecasts_edges).mean('member').diff(bin_dim)
Oc = (observations < observations_edges).astype("int").diff(bin_dim)

https://github.com/xarray-contrib/xskillscore/blob/493f9afd7b5acefb4baa47bec6ad65fca19965bd/xskillscore/core/probabilistic.py#L680

I first implemented rps with xhistogram, then with the snippet above, yields same results.

However, I am not sure whether such multi-dimensional bins would be an interesting addition to xhistogram or are out-of-scope.

dcherian commented 3 years ago

I think this dask pR is related: https://github.com/dask/dask/pull/7346

aaronspring commented 3 years ago

after quickly skimming over np.histogram2d and this issue, I think that I am asking here for a different thing: I what that bins can be multi-dimensional, whereas in np.histogram bins is a 1d-array or int.

dougiesquire commented 3 years ago

I think this is a really nice contribution that would be appreciated by many users, though I'm not sure whether it's out of scope for xhistogram (thoughts @rabernat?).

One simple option could be to allow bins to be an xarray object, in which case we use something like the approach you give above to compute the histogram?

TomNicholas commented 3 years ago

@aaronspring I'm trying to make sure I understand what you're proposing here - is this an accurate summary of what you're suggesting?:

If I have data which includes a time dimension, and I currently count along other dimension(s), my result will still have a time dimension, but the resultant bin coordinates are currently only allowed to be one-dimensional. You are proposing to be able to pass a multidimensional bins array (for each variable potentially for an N-D histogram) which has bin edges that are a function of time. The result would have the same shape array for the bin_counts, but the bin coordinates on the output would vary along this time dimension (passed straight through from your input). You would have ended up with a histogram whose counts and bin edges changed over time - effectively a set of separate histograms calculated independently for each point in time.

If that's what you're suggesting then it actually sounds fairly doable - it's basically just allowing the bins arguments to be ND and then making sure they broadcast properly. You would also need an input check that your bins arrays don't vary along any of the dimensions you want to count over, because I think that would be nonsensical.

aaronspring commented 3 years ago

@TomNicholas didnt see that linked PR. thanks for linking again here.

simply put: I want to use nd instead of 1d arrays as bins in xhistogram.

whats the kind of API/code example I was looking for: https://gist.github.com/aaronspring/251553f132202cc91aadde03f2a452f9 (dont focus on the results, just the dimensionality)

dcherian commented 1 year ago

Here's how to do it with flox: https://github.com/xarray-contrib/flox/pull/203 [well hopefully I got it right ;)]

Rendered version

TomNicholas commented 1 year ago

Well this looks great!

I did have a little go in the airport after AMS but didn't get anywhere near this far šŸ˜…

Next step would be to test it against histogram (for the cases xhistogram already handles).

What's the plan with flox integration into Xarray? Will it always be optional? Will it become part of main?

On Wed, Jan 18, 2023, 12:01 AM Deepak Cherian @.***> wrote:

Here's how to do it with flox: xarray-contrib/flox#203 https://github.com/xarray-contrib/flox/pull/203 [well hopefully I got it right ;)]

Rendered version https://flox--203.org.readthedocs.build/en/203/user-stories/nD-bins.html

ā€” Reply to this email directly, view it on GitHub https://github.com/xgcm/xhistogram/issues/28#issuecomment-1386490718, or unsubscribe https://github.com/notifications/unsubscribe-auth/AISNPI4JVVCPS5D2WPUWW5TWS52JNANCNFSM4Y7VCW6Q . You are receiving this because you were mentioned.Message ID: @.***>

TomNicholas commented 1 year ago

What's the plan with flox integration into Xarray? Will it always be optional? Will it become part of main?

Apparently flox will stay optional, so if we moved this functionality into xarray it would rely on an optional import, but that's okay.

TomNicholas commented 1 year ago

@dcherian how might this work for N-dimensional histograms? I.e. placing N variables into N sets of bins. That's obviously one of the main features xhistogram provides. I notice your notebook says

The core factorize_ function (which wraps pd.cut) only handles 1D bins, so we use xr.apply_ufunc to vectorize it for us.

Does that mean we would still have to do a reshape of some kind?

dcherian commented 1 year ago

https://flox.readthedocs.io/en/latest/intro.html#histogramming-binning-by-multiple-variables

Does this make it clear

TomNicholas commented 1 year ago

Ooh!

Checking I've understood this correctly:

xarray_reduce(
    da,                  # weights, here just ones
    "labels1",           # name of 1st variable we want to bin
    "labels2",           # name of 2nd variable we want to bin
    func="count",        # count occurrences falling in bins
    expected_groups=(
        pd.IntervalIndex.from_breaks([-0.5, 4.5, 6.5, 8.9]),  # bins for 1st variable
        pd.IntervalIndex.from_breaks([0.5, 1.5, 1.9]),        # bins for 2nd variable
    ),
)
dcherian commented 1 year ago

Yes. PR to improve that page is very welcome!