xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0

[Bug]: spatial average error following temporal resample #583

Closed arfriedman closed 4 months ago

arfriedman commented 6 months ago

What happened?

I am encountering an error when taking an xcdat spatial average following xarray temporal resampling.

I am working with monthly model precipitation output named pr_mon:

In [7]: pr_mon
Out[7]:
<xarray.Dataset>
Dimensions:   (lon: 192, lat: 96, time: 7056, bnds: 2)
Coordinates:
  * lon       (lon) float64 -180.0 -178.1 -176.2 -174.4 ... 174.4 176.2 178.1
  * lat       (lat) float64 88.57 86.72 84.86 83.0 ... -84.86 -86.72 -88.57
  * time      (time) object 1421-01-31 00:00:00 ... 2008-12-31 00:00:00
Dimensions without coordinates: bnds
Data variables:
    lon_bnds  (lon, bnds) float64 -180.9 -179.1 -179.1 ... 177.2 177.2 179.1
    lat_bnds  (lat, bnds) float64 89.5 87.65 87.65 85.79 ... -87.65 -87.65 -89.5
    precip    (time, lat, lon) float64 13.27 13.14 13.05 ... 3.607 3.466 3.355

Taking a spatial average of this monthly variable works fine:

In [8]: pr_mon_avg = pr_mon.spatial.average("precip")

However, after I temporally resample the data using xarray, xcdat spatial averaging returns an error. Here is an example for a three-month temporal sum:


In [10]: pr_3M = pr_mon.resample(time="3ME").sum()
In [11]: pr_3M
Out[11]:
<xarray.Dataset>
Dimensions:   (lon: 192, lat: 96, time: 2353, bnds: 2)
Coordinates:
  * lon       (lon) float64 -180.0 -178.1 -176.2 -174.4 ... 174.4 176.2 178.1
  * lat       (lat) float64 88.57 86.72 84.86 83.0 ... -84.86 -86.72 -88.57
  * time      (time) object 1421-01-31 00:00:00 ... 2009-01-31 00:00:00

In [12]: pr_3M_avg = pr_3M.spatial.average("precip")
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[12], line 1
----> 1 pr_3M_avg = pr_3M.spatial.average("precip")

File ~/miniconda3/envs/Nile-Laki/lib/python3.11/site-packages/xcdat/spatial.py:195, in SpatialAccessor.average(self, data_var, axis, weights, keep_weights, lat_bounds, lon_bounds)
    192 elif isinstance(weights, xr.DataArray):
    193     self._weights = weights
--> 195 self._validate_weights(dv, axis)
    196 ds[dv.name] = self._averager(dv, axis)
    198 if keep_weights:

File ~/miniconda3/envs/Nile-Laki/lib/python3.11/site-packages/xcdat/spatial.py:687, in SpatialAccessor._validate_weights(self, data_var, axis)
    685     dim_name = get_dim_keys(data_var, key)
    686     if dim_name not in self._weights.dims:
--> 687         raise KeyError(
    688             f"The weights DataArray does not include an {key} axis, or the "
    689             "dimension names are not the same."
    690         )
    692 # Check the weight dim sizes equal data var dim sizes.
    693 dim_sizes = {key: data_var.sizes[key] for key in self._weights.sizes.keys()}

KeyError: 'The weights DataArray does not include an X axis, or the dimension names are not the same.'

Do you know what might account for this error?

What did you expect to happen? Are there any possible answers you came across?

No response

Minimal Complete Verifiable Example (MVCE)

No response

Relevant log output

No response

Anything else we need to know?

No response

Environment

xCDAT version 0.6.1

In [13]: xr.show_versions()
/home/andrew/miniconda3/envs/Nile-Laki/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit: None
python: 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-86-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2023.12.0
pandas: 2.1.4
numpy: 1.26.2
scipy: 1.11.4
netCDF4: 1.6.5
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: 2023.12.1
distributed: 2023.12.1
matplotlib: 3.8.2
cartopy: 0.22.0
seaborn: None
numbagg: None
fsspec: 2023.12.2

pochedls commented 6 months ago

Thanks for reporting this. In trying to see if something obvious was wrong, I ended up having a similar (but different) issue.

# grab a random file
wget http://esgf-data.ucar.edu/thredds/fileServer/esg_dataroot/CMIP6/CMIP/NCAR/CESM2-FV2/historical/r1i1p1f1/Amon/pr/gn/v20191120/pr_Amon_CESM2-FV2_historical_r1i1p1f1_gn_185001-189912.nc

I get an issue with the resample command

# ESGF file
fn = 'pr_Amon_CESM2-FV2_historical_r1i1p1f1_gn_185001-189912.nc'
# or, locally, /p/css03/esgf_publish/CMIP6/CMIP/NCAR/CESM2-FV2/historical/r1i1p1f1/Amon/pr/gn/v20191120/pr_Amon_CESM2-FV2_historical_r1i1p1f1_gn_185001-189912.nc
ds = xc.open_dataset(fn)
pr_3M = ds.pr.resample(time="3ME").sum()

ValueError: Invalid frequency string provided

but the following resample string works (and results in a different error message)

pr_3M = ds.resample(time="3M").sum()
pr_3M_avg = pr_3M.spatial.average('pr')

ValueError: More than one grid cell spans prime meridian.

tomvothecoder commented 6 months ago

@arfriedman Can you provide us with a public link to the dataset so we can take a look at it? It looks like the precip variable is either missing the CF "axis" or "standard_name" attributes or they are incorrectly set.

I get an issue with the resample command

# ESGF file
fn = 'pr_Amon_CESM2-FV2_historical_r1i1p1f1_gn_185001-189912.nc'
# or, locally, /p/css03/esgf_publish/CMIP6/CMIP/NCAR/CESM2-FV2/historical/r1i1p1f1/Amon/pr/gn/v20191120/pr_Amon_CESM2-FV2_historical_r1i1p1f1_gn_185001-189912.nc
ds = xc.open_dataset(fn)
pr_3M = ds.pr.resample(time="3ME").sum()

ValueError: Invalid frequency string provided

but the following resample string works (and results in a different error message)

pr_3M = ds.resample(time="3M").sum()
pr_3M_avg = pr_3M.spatial.average('pr')

ValueError: More than one grid cell spans prime meridian.

@pochedls I tried your code but it didn't work for me. The code below works for resampling, but I also get the ValueError for spatial averaging. I resample using "3M" and not "3ME".

import xcdat as xc

fn = "/p/css03/esgf_publish/CMIP6/CMIP/NCAR/CESM2-FV2/historical/r1i1p1f1/Amon/pr/gn/v20191120/pr_Amon_CESM2-FV2_historical_r1i1p1f1_gn_185001-189912.nc"

# Resampling works
ds = xc.open_dataset(fn)
ds = ds.resample(time="3M").sum()

# ValueError: More than one grid cell spans prime meridian.
ds_3m_avg = ds.spatial.average("pr")

pochedls commented 6 months ago

Thanks @tomvothecoder – I pasted in the wrong code excerpt (I also used 3M) – this is fixed now. We're getting the same error.

In looking at this again, I think the longitude bounds get messed up after resample. They have a different shape (the latitude bounds are modified also).

@arfriedman – Do your bounds also change? Is there a way to use resample while retaining the lon_bnds shape? A potential workaround in this case is to copy the original bounds to the dataset after re-sample.

ds_resampled['lon_bnds'] = ds_original.lon_bnds
ds_resampled['lat_bnds'] = ds_original.lat_bnds

@tomvothecoder / @arfriedman – do you think this is an xarray issue / question?

original bounds

ds = xc.open_dataset(fn)
print(ds.lon)

<xarray.DataArray 'lon' (lon: 144)>
array([  0. ,   2.5,   5. ,   7.5,  10. ,  12.5,  15. ,  17.5,  20. ,  22.5,
        25. ,  27.5,  30. ,  32.5,  35. ,  37.5,  40. ,  42.5,  45. ,  47.5,
        50. ,  52.5,  55. ,  57.5,  60. ,  62.5,  65. ,  67.5,  70. ,  72.5,
        75. ,  77.5,  80. ,  82.5,  85. ,  87.5,  90. ,  92.5,  95. ,  97.5,
       100. , 102.5, 105. , 107.5, 110. , 112.5, 115. , 117.5, 120. , 122.5,
       125. , 127.5, 130. , 132.5, 135. , 137.5, 140. , 142.5, 145. , 147.5,
       150. , 152.5, 155. , 157.5, 160. , 162.5, 165. , 167.5, 170. , 172.5,
       175. , 177.5, 180. , 182.5, 185. , 187.5, 190. , 192.5, 195. , 197.5,
       200. , 202.5, 205. , 207.5, 210. , 212.5, 215. , 217.5, 220. , 222.5,
       225. , 227.5, 230. , 232.5, 235. , 237.5, 240. , 242.5, 245. , 247.5,
       250. , 252.5, 255. , 257.5, 260. , 262.5, 265. , 267.5, 270. , 272.5,
       275. , 277.5, 280. , 282.5, 285. , 287.5, 290. , 292.5, 295. , 297.5,
       300. , 302.5, 305. , 307.5, 310. , 312.5, 315. , 317.5, 320. , 322.5,
       325. , 327.5, 330. , 332.5, 335. , 337.5, 340. , 342.5, 345. , 347.5,
       350. , 352.5, 355. , 357.5])
Coordinates:
  * lon      (lon) float64 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5
print(ds.lon_bnds.values)

[[ -1.25   1.25]
 [  1.25   3.75]
 ...
 [356.25 358.75]]

print(ds.lon_bnds.shape)

(144, 2)

Bounds after re-sample

ds2 = ds.resample(time="3M").sum()
print(np.allclose(ds.lon, ds2.lon))

True

print(ds2.lon_bnds.values)

[[[ -1.25   1.25]
  [  1.25   3.75]
  [  3.75   6.25]
  ...
  [351.25 353.75]
  [353.75 356.25]
  [356.25 358.75]]

 [[ -1.25   1.25]
  [  1.25   3.75]
  [  3.75   6.25]
  ...
  [351.25 353.75]
  [353.75 356.25]
  [356.25 358.75]] ...

print(ds2.lon_bnds.shape)

(201, 144, 2)
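The extra leading dimension in the `(201, 144, 2)` shape above is easy to reproduce with a tiny synthetic dataset, using only xarray (all names below are illustrative, no xcdat required):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny CF-style dataset: "pr" depends on time, but "lat_bnds" does not.
times = pd.to_datetime(
    ["2000-01-31", "2000-02-29", "2000-03-31",
     "2000-04-30", "2000-05-31", "2000-06-30"]
)
ds = xr.Dataset(
    data_vars={
        "pr": (("time", "lat"), np.ones((6, 2))),
        "lat_bnds": (("lat", "bnds"), [[-5.0, 5.0], [5.0, 15.0]]),
    },
    coords={"time": times, "lat": [0.0, 10.0]},
)

# resample().sum() is applied to *every* data variable, so "lat_bnds"
# comes back with a new leading time dimension, its values repeated
# verbatim for each 3-month group. ("3MS" is used because the "3MS"
# alias is accepted across pandas versions.)
ds2 = ds.resample(time="3MS").sum()

print(ds["lat_bnds"].dims)   # ('lat', 'bnds')
print(ds2["lat_bnds"].dims)
```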

arfriedman commented 6 months ago

Thanks @tomvothecoder and @pochedls.

@tomvothecoder here's a link to the model output -- apologies for the large file size: https://unibe365-my.sharepoint.com/:u:/g/personal/andrew_friedman_unibe_ch/Ec0kAh1-AthCnUVrdH5oGvwBf2zXmzLH0kENqwPL5iqnuA?e=y1fqW4

@pochedls I confirm that spatial averaging works after applying your work-around of copying the original dataset lat and lon bounds. To me this does suggest that the issue is from xarray.

tomvothecoder commented 6 months ago

@pochedls @arfriedman

xarray.Dataset.resample() adds the time dimension to variables that did not originally have a time dimension. There is an Xarray issue for this topic that has been open since 2018. shoyer's comment explains this undesired behavior and possible fixes that need to be implemented in the Xarray API.

If anybody is interested, reviving the thread sounds like a good idea since the last comment was from early 2022.

Example

Before resampling -- no time dimension on "lat_bnds" and "lon_bnds"

ds = xc.open_dataset(fn)
print(ds.data_vars)
"""
Data variables:
    pr         (time, lat, lon) float32 ...
    time_bnds  (time, nbnd) object ...
    lat_bnds   (lat, nbnd) float64 ...
    lon_bnds   (lon, nbnd) float64 ...
"""

After resampling -- time dimension added to "lat_bnds" and "lon_bnds" -> breaks spatial averaging

ds_3m = ds.resample(time="3M").sum()
"""
Data variables:
    pr        (time, lat, lon) float32 2.919e-06 2.918e-06 ... 4.918e-06
    lat_bnds  (time, lat, nbnd) float64 -90.95 -89.05 -89.05 ... 89.05 90.95
    lon_bnds  (time, lon, nbnd) float64 -1.25 1.25 1.25 ... 356.2 356.2 358.8
"""

Workaround

Use bounds from the original dataset

ds_3m["lat_bnds"] = ds.lat_bnds.copy()
ds_3m["lon_bnds"] = ds.lon_bnds.copy()
ds_3m_avg = ds_3m.spatial.average("pr")
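If the pre-resample dataset is no longer in scope, an alternative sketch is to drop the spurious dimension by selecting the first time slice. This assumes the bounds values are identical at every time step, which is the case here since resampling only repeats them; the dataset below is a synthetic stand-in for the post-resample state:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: "lat_bnds" has gained a time dimension whose
# slices are all identical.
times = pd.to_datetime(["2000-01-01", "2000-04-01"])
bnds = np.broadcast_to([[-5.0, 5.0], [5.0, 15.0]], (2, 2, 2)).copy()
ds_3m = xr.Dataset(
    data_vars={"lat_bnds": (("time", "lat", "bnds"), bnds)},
    coords={"time": times, "lat": [0.0, 10.0]},
)

# Take the first (identical) slice and drop the scalar time coordinate.
ds_3m["lat_bnds"] = ds_3m["lat_bnds"].isel(time=0, drop=True)
print(ds_3m["lat_bnds"].dims)  # ('lat', 'bnds')
```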

tomvothecoder commented 4 months ago

Closing this issue since it is rooted in Xarray; the upstream issue linked above tracks the underlying resample behavior.