Open tomvothecoder opened 2 years ago
I saw the ping at https://github.com/pydata/xarray/issues/6610. Let me know if you run into issues or have questions.
Thanks @dcherian! I'm looking forward to trying out flox.
xarray >= 2024.09.0 now supports grouping by multiple variables: https://xarray.dev/blog/multiple-groupers and https://docs.xarray.dev/en/stable/user-guide/groupby.html#grouping-by-multiple-variables.
Example:
import xarray as xr
import numpy as np
import pandas as pd
# Create time coordinates
time = pd.date_range("2000-01-01", "2003-12-31", freq="D")
# Create lat and lon coordinates
lat = [10, 20]
lon = [30, 40]
# Create dummy air temperature data
data = np.random.rand(len(time), len(lat), len(lon))
# Create the Dataset
ds = xr.Dataset(
{"air_temperature": (["time", "lat", "lon"], data)},
coords={"time": time, "lat": lat, "lon": lon},
)
print(ds)
ds_gb = ds.groupby(["time.year", "time.month"]).mean()
Is your feature request related to a problem?
When this issue was opened, Xarray's GroupBy operations were limited to single variables. Grouping by multiple coordinates (e.g., time.year and time.season) required creating a new set of coordinates before grouping due to the xarray limitations described below (source).
xarray >= 2024.09.0 now supports grouping by multiple variables: https://xarray.dev/blog/multiple-groupers and https://docs.xarray.dev/en/stable/user-guide/groupby.html#grouping-by-multiple-variables.
Related code in xcdat for temporal grouping: https://github.com/xCDAT/xcdat/blob/c9bcbcdb66af916958a79a33177bc43d478e4036/xcdat/temporal.py#L1266-L1322
Current temporal averaging logic (workaround for multi-variable grouping):
- xarray.DataArray to a pandas.DataFrame
  a. Keep only the DataFrame columns needed for grouping (e.g., "year" and "season" for seasonal group averages), essentially "labeling" coordinates with their groups
  b. Process the DataFrame, including:
     - Mapping of months to custom seasons for custom seasonal grouping (now done with Xarray/NumPy via #423)
     - Correction of "DJF" seasons by shifting Decembers over to the next year (now done with Xarray/NumPy via #423)
     - cftime coordinates (season strings aren't supported in cftime/datetime objects)
- cftime objects to represent new time coordinates
Describe the solution you'd like
It would be simpler and possibly more performant to leverage Xarray's newly added support for grouping by multiple variables (e.g., .groupby(["time.year", "time.season"])) instead of using Pandas to store and manipulate datetime components. This solution will reduce a lot of the internal complexity of the temporal averaging API.
Describe alternatives you've considered
Multi-variable grouping was originally done using pd.MultiIndex, but we shifted away from this approach because this object cannot be written out to netcdf4. Also, pd.MultiIndex is not the standard object type for representing time coordinates in xarray. The standard object types are np.datetime64 and cftime.
Additional context
Future solution through xarray + flox:
- xarray version in https://github.com/pydata/xarray/issues/6610, we should be able to do this.
- .groupby() performance significantly.