xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0

[Refactor]: Consider using `flox` and `xr.resample()` to improve temporal averaging grouping logic #217

Open tomvothecoder opened 2 years ago

tomvothecoder commented 2 years ago

Is your feature request related to a problem?

When this issue was opened, Xarray's GroupBy operations were limited to a single variable. Grouping by multiple coordinates (e.g., time.year and time.season) required creating a new set of coordinates before grouping.

xarray >= 2024.09.0 now supports grouping by multiple variables: https://xarray.dev/blog/multiple-groupers and https://docs.xarray.dev/en/stable/user-guide/groupby.html#grouping-by-multiple-variables.

Related code in xcdat for temporal grouping: https://github.com/xCDAT/xcdat/blob/c9bcbcdb66af916958a79a33177bc43d478e4036/xcdat/temporal.py#L1266-L1322

Current temporal averaging logic (workaround for multi-variable grouping):

  1. Preprocess time coordinates (e.g., drop leap days, subset based on reference climatology)
  2. Transform time coordinates from an xarray.DataArray to a pandas.DataFrame:
    a. Keep only the DataFrame columns needed for grouping (e.g., "year" and "season" for seasonal group averages), essentially "labeling" coordinates with their groups
    b. Process the DataFrame, including:
      • Mapping of months to custom seasons for custom seasonal grouping (now done with Xarray/NumPy via #423)
      • Correction of "DJF" seasons by shifting Decembers over to the next year (now done with Xarray/NumPy via #423)
      • Mapping of seasons to their mid months to create cftime coordinates (season strings aren't supported in cftime/datetime objects)
  3. Convert DataFrame to cftime objects to represent new time coordinates
  4. Replace existing time coordinates in the DataArray with new time coordinates
  5. Group DataArray with new time coordinates for the mean
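The steps above can be sketched roughly as follows. This is a simplified illustration, not the xcdat implementation: the variable names and the two lookup dictionaries are hypothetical, only standard (non-custom) seasons are handled, and np.datetime64 values stand in for the cftime objects that the steps describe.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily data covering two years.
time = pd.date_range("2000-01-01", "2001-12-31", freq="D")
da = xr.DataArray(
    np.arange(time.size, dtype=float),
    dims="time",
    coords={"time": time},
    name="tas",
)

# Step 2a: keep only the datetime components needed for grouping.
df = pd.DataFrame(
    {"year": da.time.dt.year.values, "month": da.time.dt.month.values}
)

# Step 2b: shift Decembers to the next year so each DJF is contiguous.
df.loc[df["month"] == 12, "year"] += 1

# Step 2b: map months to seasons, then seasons to their mid months.
# Both lookup tables are illustrative assumptions.
MONTH_TO_SEASON = {
    12: "DJF", 1: "DJF", 2: "DJF",
    3: "MAM", 4: "MAM", 5: "MAM",
    6: "JJA", 7: "JJA", 8: "JJA",
    9: "SON", 10: "SON", 11: "SON",
}
SEASON_TO_MID_MONTH = {"DJF": 1, "MAM": 4, "JJA": 7, "SON": 10}
df["season"] = df["month"].map(MONTH_TO_SEASON)
df["month"] = df["season"].map(SEASON_TO_MID_MONTH)
df["day"] = 1

# Step 3: build the new time coordinates (np.datetime64 here, not cftime).
new_time = pd.to_datetime(df[["year", "month", "day"]])

# Step 4: replace the existing time coordinates with the group labels.
da_labeled = da.assign_coords(time=("time", new_time.values))

# Step 5: group on the relabeled time coordinate and take the mean.
seasonal_mean = da_labeled.groupby("time").mean()
```

Every time step that belongs to the same year and season ends up with the same label, so a single-variable groupby on "time" is enough to produce one value per group.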

Describe the solution you'd like

It would be simpler and possibly more performant to leverage Xarray's newly added support for grouping by multiple variables (e.g., .groupby(["time.year", "time.season"])) instead of using Pandas to store and manipulate datetime components. This would remove much of the internal complexity in the temporal averaging API.

Describe alternatives you've considered

Multi-variable grouping was originally done using pd.MultiIndex, but we shifted away from this approach because this object cannot be written out to netCDF4 files. Also, pd.MultiIndex is not the standard object type for representing time coordinates in xarray; the standard types are np.datetime64 and cftime objects.

Additional context

Future solution through xarray + flox:

dcherian commented 1 year ago

I saw the ping at https://github.com/pydata/xarray/issues/6610. Let me know if you run into issues or have questions.

tomvothecoder commented 1 year ago

Thanks @dcherian! I'm looking forward to trying out flox.

tomvothecoder commented 1 month ago

xarray >= 2024.09.0 now supports grouping by multiple variables: https://xarray.dev/blog/multiple-groupers and https://docs.xarray.dev/en/stable/user-guide/groupby.html#grouping-by-multiple-variables.

Example:

import xarray as xr
import numpy as np
import pandas as pd

# Create time coordinates
time = pd.date_range("2000-01-01", "2003-12-31", freq="D")

# Create lat and lon coordinates
lat = [10, 20]
lon = [30, 40]

# Create dummy air temperature data
data = np.random.rand(len(time), len(lat), len(lon))

# Create the Dataset
ds = xr.Dataset(
    {"air_temperature": (["time", "lat", "lon"], data)},
    coords={"time": time, "lat": lat, "lon": lon},
)

print(ds)

# Group by year and month together and average each group
# (requires xarray >= 2024.09.0)
ds_gb = ds.groupby(["time.year", "time.month"]).mean()