xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0

[Bug]: temporal.group_average daily to monthly assigns wrong time index (and bounds) #586

Open durack1 opened 5 months ago

durack1 commented 5 months ago

What happened?

I have been attempting to generate a monthly mean from daily data, and expected the resulting time value to be an index at the middle of the month. The data example is 1 through 31 January 1998 daily files of SST. I would expect for a monthly average over these days [1998-01-01 00:00 ... 1998-01-31 00:00] that the new time value represents the mean of the period, e.g. [1998-01-16 12:00]; instead I get [1998-01-01 00:00], with no time_bounds assigned.

What did you expect to happen? Are there any possible answers you came across?

I expected a new monthly mean value to have the time axis value [1998-01-16 12:00], which also includes time_bounds [1998-01-01 00:00, 1998-02-01 00:00].

Working example included in the Jupyter Notebook

Minimal Complete Verifiable Example (MVCE)

An example calculating the same January 1998 monthly average is included in the Jupyter Notebook (above).
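The expected behaviour described above can be checked with a short sketch (this is not the notebook itself; the dates are taken from the report, and the variable names are illustrative): the monthly time coordinate should be the midpoint of the month's bounds, not the first daily time stamp.

```python
import pandas as pd

# Hypothetical reconstruction of the reported case: daily time stamps for
# 1-31 January 1998, with each coordinate at 00:00 marking a daily value.
time = pd.date_range("1998-01-01", "1998-01-31", freq="D")

# The monthly-mean bounds should span the whole month...
lower = pd.Timestamp("1998-01-01 00:00")
upper = pd.Timestamp("1998-02-01 00:00")

# ...and the reported time coordinate should be the midpoint of those
# bounds, not the first daily coordinate.
expected_midpoint = lower + (upper - lower) / 2
print(expected_midpoint)  # 1998-01-16 12:00:00
```

The reported bug is that `temporal.group_average` instead returns the first input time stamp (1998-01-01 00:00) and drops the bounds.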

Relevant log output

See Notebook above

Anything else we need to know?

ping @taylor13, @gleckler1, @lee1043 who I have been discussing this behaviour with

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.90.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2023.12.0
pandas: 2.1.4
numpy: 1.26.3
scipy: 1.11.4
netCDF4: 1.6.5
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: 1.3.7
dask: 2023.12.1
distributed: 2023.12.1
matplotlib: 3.8.2
cartopy: 0.22.0
seaborn: None
numbagg: None
fsspec: 2023.12.2
cupy: None
pint: None
sparse: 0.15.1
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 23.3.2
conda: None
pytest: None
mypy: None
IPython: 8.20.0
sphinx: None

pochedls commented 5 months ago

This is essentially the same as #565. PRs welcome.

pochedls commented 5 months ago

@durack1 – as a workaround, you can probably use ds.bounds.add_time_bounds() to create bounds after your temporal averaging operation. If the time point is important, you could also average the resulting time bounds and reassign them to ds.time.
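The second half of this workaround (averaging the bounds and reassigning the result as the time coordinate) can be sketched in plain xarray. The sketch assumes time bounds already exist, e.g. created with xcdat's `ds.bounds.add_time_bounds()`; the dataset and variable names (`tos`, `time_bnds`, `bnds`) are illustrative.

```python
import pandas as pd
import xarray as xr

# Illustrative post-averaging dataset: one monthly value whose time
# coordinate (wrongly) sits at the start of the month.
ds = xr.Dataset(
    {"tos": ("time", [285.0])},
    coords={"time": pd.to_datetime(["1998-01-01"])},
)
# Assume bounds have already been added (e.g. via add_time_bounds()).
ds["time_bnds"] = xr.DataArray(
    pd.to_datetime(["1998-01-01", "1998-02-01"]).values.reshape(1, 2),
    dims=("time", "bnds"),
)

# Average the bounds along the "bnds" dimension and reassign the
# midpoints as the new time coordinate.
midpoints = ds["time_bnds"].mean(dim="bnds")
ds = ds.assign_coords(time=midpoints)
print(pd.Timestamp(ds["time"].values[0]))  # 1998-01-16 12:00:00
```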

taylor13 commented 5 months ago

To be sure I understand, regarding the statement made above: "The data example is 1 through 31 January 1998 daily files of SST. I would expect for a monthly average over these days [1998-01-01 00:00 ... 1998-01-31 00:00]".

I think the monthly average should include all data representing intervals partially or fully within the bounds [1998-01-01 00:00, 1998-02-01 00:00]. So if the coordinate values represent the beginning of each daily mean (rather than the middle of the day), then indeed the data values for the intervals associated with the coordinate values [1998-01-01 00:00 ... 1998-01-31 00:00] can simply be averaged together (no weighting required) to yield a monthly mean extending from 1998-01-01 00:00 to 1998-02-01 00:00. If, however, the coordinates represent the middle of each daily interval, then the first daily sample represents an interval from 1997-12-31 12:00 through 1998-01-01 12:00, and the monthly mean will include 32 daily samples (rather than 31), with the first and last samples included but with half the "weight" of the others. There is further discussion of this at https://github.com/xCDAT/xcdat/issues/565.
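The midpoint-coordinate case above can be checked numerically. This is a sketch with hypothetical daily values, not xcdat's implementation: 32 midpoint-labelled samples overlap January, with the first and last intervals contributing only half a day each, so the total weight still sums to 31 days.

```python
import numpy as np

# Placeholder daily-mean values for the 32 intervals that overlap
# January 1998 when coordinates mark interval midpoints
# (1997-12-31 12:00 through 1998-02-01 00:00, clipped to the month).
daily = np.arange(32, dtype=float)

# First and last intervals lie only half inside the month's bounds,
# so they carry half the weight of the interior days.
weights = np.ones(32)
weights[0] = weights[-1] = 0.5

# Total weight equals the number of days in January.
print(weights.sum())  # 31.0

monthly_mean = np.average(daily, weights=weights)
print(monthly_mean)  # 15.5
```

With these symmetric placeholder values the weighted mean lands exactly mid-range, which is a convenient sanity check on the weighting.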

durack1 commented 5 months ago

This is essentially the same as #565. PRs welcome.

Agreed, this is effectively a duplicate, although this issue focuses on daily -> monthly vs. the monthly -> annual case in #565. duplicate label assigned

pochedls commented 5 months ago

#565 also seems to be a duplicate of #534, which is related to #529.

durack1 commented 5 months ago

#565 also seems to be a duplicate of #534, which is related to #529.

@pochedls it seems that in #529 there was an xarray version dependence. Is there a test that would catch issues in case upstream (xarray) changes impact the temporal averaging functions?

durack1 commented 5 months ago

@durack1 – as a workaround, you can probably use ds.bounds.add_time_bounds() to create bounds after your temporal averaging operation. If the time point is important, you could also average the resulting time bounds and reassign them to ds.time.

I had tried this (in the Notebook) but got the error ValueError: Cannot generate bounds for coordinate variable 'time' which has a length <= 1 (singleton).

The code I attempted was ds.bounds.add_time_bounds(method="freq", freq="month"); not sure if this was some noob error on my part.
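Since `add_time_bounds()` refuses a length-1 time axis, one manual fallback is to construct the monthly bounds by hand and attach them. This is a sketch, not xcdat API; the names (`tos`, `time_bnds`, `bnds`) are illustrative, and it assumes the single time value sits at the start of the month, as in the reported output.

```python
import pandas as pd
import xarray as xr

# Illustrative singleton-time dataset, as produced by the reported bug:
# one monthly mean with its time coordinate at the start of the month.
ds = xr.Dataset(
    {"tos": ("time", [285.0])},
    coords={"time": pd.to_datetime(["1998-01-01"])},
)

# Build month-spanning bounds by hand: start of this month to start of
# the next (pandas' MonthBegin offset rolls forward to 1998-02-01).
start = pd.Timestamp(ds["time"].values[0])
end = start + pd.offsets.MonthBegin(1)
print(end)  # 1998-02-01 00:00:00

ds["time_bnds"] = xr.DataArray(
    pd.to_datetime([start, end]).values.reshape(1, 2),
    dims=("time", "bnds"),
)
# Advertise the bounds variable via the CF "bounds" attribute.
ds["time"].attrs["bounds"] = "time_bnds"
```

The time coordinate could then be recentred by averaging `time_bnds` over its `bnds` dimension, per the workaround suggested earlier in the thread.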

pochedls commented 5 months ago

I think it is generally not possible to automatically generate bounds if there is only one time value, but I think it could be done in this case (you are specifying the frequency). This would require revisiting this block of code.