xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0
116 stars 12 forks source link

[Bug]: Issues opening up datasets with multiple coord vars for the same axis mapped to single dim/independent dims (IPSL ESM files) #285

Closed oliviermarti closed 1 year ago

oliviermarti commented 2 years ago

What happened?

The IPSL Earth System Model output files has two time variables, and xcdat refuses to open them.

Developer Notes (@tomvothecoder, 8/30/22)

Dataset Info:

How xcdat handles coords:

What did you expect to happen?

No response

Minimal Complete Verifiable Example

# Test1 - Original IPSL file (not CF compliant)

import numpy as np, xarray as xr, xcdat as xc
TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_noCF.nc'
dTS = xc.open_dataset (TS_File, add_bounds=False, use_cftime=True, decode_times=True)

# Test2 - Remove attributes referring to non-existing bounds
# This file is CF compliant (https://pumatest.nerc.ac.uk/cgi-bin/cf-checker.pl)

import numpy as np, xarray as xr, xcdat as xc
TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF.nc'
dTS = xc.open_dataset (TS_File, add_bounds=False, use_cftime=True, decode_times=True)

# Test3 - remove attr 'coordinates: "time_centered nav_lat nav_lon"' from variable 'thetao' attributes

import numpy as np, xarray as xr, xcdat as xc
TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF_1.nc'
dTS = xc.open_dataset (TS_File, add_bounds=False, use_cftime=True, decode_times=True)

tos_year = dTS.temporal.group_average ("tos", freq="year", weighted=True)

Relevant log output

# Test 1
KeyError: "Receive multiple variables for key 'T': {'time_centered', 'time_counter'}. Expected only one. Please pass a list ['T'] instead to get all variables matching 'T'."

# Test 3
KeyError: "Receive multiple variables for key 'T': {'time_centered', 'time_counter'}. Expected only one. Please pass a list ['T'] instead to get all variables matching 'T'."

# Test 3
KeyError: "The coordinate variable 'time_counter' has no 'bounds' attr. Set the 'bounds' attr to the name of the bounds data variable."

We need to use `time_centered` and `time_centered_bounds` to compute the correct weighting. So we need to keep the `coordinates` attributes

Anything else we need to know?

No response

Environment

xcdat.version : '0.3.0'

xr.show_versions() :

INSTALLED VERSIONS

commit: None python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:03:09) [Clang 13.0.1 ] python-bits: 64 OS: Darwin OS-release: 21.5.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1

xarray: 2022.3.0 pandas: 1.4.3 numpy: 1.22.4 scipy: 1.8.1 netCDF4: 1.6.0 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.6.1 nc_time_axis: 1.4.1 PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2022.6.1 distributed: 2022.6.1 matplotlib: 3.5.2 cartopy: 0.20.2 seaborn: None numbagg: None fsspec: 2022.5.0 cupy: None pint: 0.19.2 sparse: 0.13.0 setuptools: 63.1.0 pip: 22.1.2 conda: 4.13.0 pytest: None IPython: 8.4.0 sphinx: None

pochedls commented 2 years ago

Thanks for sharing this – we had not come across this case yet (looking at CMIP6 data). This file seems to have two time variables (with identical time values in each), but I could imagine that some netCDF files would have two distinct time axes.

I see that this issue results because xcdat is expecting one time axis and has logic built around checking/decoding a single time axis, but this file has time_counter and time_centered. Ideally, we could find all time axes and decode them iteratively. One concern is this might effect how we use cfxarray – we may generally expect a single dimension variable.

This is going to need a closer look.

oliviermarti commented 2 years ago

Thank you for your interest. Note that we could specify the name of the right time axis at the call of open_dataset. However, if different variables in the file use different time axis, we are screwed with this method. I'm not sure that IPSL file format is a very good one. But we have Tb of them ... !!

rljacob commented 2 years ago

By specifying the name when you open the dataset, you could at least work with all the variables that share that axis. Then open a new session for the variables with the other time axis.

pochedls commented 2 years ago

I like @rljacob's idea (at least as a possible stopgap solution). I think you could potentially implement this by taking in an optional variable in xcdat.open_[mf]dataset, determining the .dims (or maybe .coords) associated with that variable` and dropping any dim/coord not associated with the variable.

I am hitting a similar error opening CESM2 LE data (which signals this issue may affect E3SM files, too).

import xcdat

fns = '/p/user_pub/climate_work/pochedley1/cesm2le/TREFHT/*1301.020*'
ds = xcdat.open_mfdataset(fns)

AttributeError: cf_xarray can't wrap attribute 'dims' because there are multiple values for 'longitude'. There is no unique mapping from 'longitude' to a value in 'dims'.

As a temporary solution, I am opening this data with xarray. xcdat basically uses xr.open_[mv]dataset, but adds bounds (if they aren't there) and decodes some non-cf-compliant time axes to cftime arrays (if they are non-cf compliant). So xarray open dataset functionality is a reasonable workaround for many datasets.

jypeter commented 2 years ago

@pochedls is there a way you can make the Receive multiple variables for key 'T': {'time_centered', 'time_counter'}. Expected only one error more specific, by specifying where the different time coordinates come from. This could help the user figure out what is wrong. In the present case, I imagine one of the time axes is found thanks to the axis:T attribute, and the other thanks to the coordinates variable attribute

Also, I think adding something in the documentation about the logic of finding the axes and how the coordinates attribute is handled could be useful. I'm guessing you want to follow the CF Convention as much as possible (#292)

Clarifying how you handle the coordinates variable attribute in different cases would probably be useful, but I confess I'm not even sure how this attribute works. I have quickly checked the CF web site and found references to the coordinates attribute in

pochedls commented 2 years ago

I looked into this a little further and have some ideas to address the issue in Test 1 and Test 2.

The immediate problem is when opening the dataset and getting the time axis in the _has_cf_compliant_time routine via ds.cf["T"], which returns:

KeyError: "Receive multiple variables for key 'T': {'time_centered', 'time_counter'}. Expected only one. Please pass a list ['T'] instead to get all variables matching 'T'."

This occurs because the dataset has multiple time axes. There are a couple potentially simple solutions to this:

A more complete solution might look like:

  1. In open_*dataset functions: if decode_times=True then get all temporal coordinates via time_axis_list = ds.cf[["T"]].coords.
  2. Loop over each temporal coordinate and test whether it is cf_compliant (you would need to re-write _has_cf_compliant_time to accept a DataArray rather than a data path). Decode non-cf_compliant axes within the loop and update the original dataset with the decoded axes and gather the cf_compliant DataArrays (and their attributes) into two dictionaries (e.g., variable_dict, attrs_dict).
    • Note that decoding non-cf_compliant axes here would require refactoring decode_non_cf_time to accept an axis DataArray. You would also need to write a new function to replace get_bounds to get the bounds from a specific axis (rather than by key).
  3. Decode the cf_compliant DataArrays using xr.conventions.cf_decoder(variable_dict, attrs_dict). Then update the original dataset with the decoded axes.

I think this would decode all time axes, which I think is what we should do (if possible). I do think temporal operations will still need to be updated to be able to select the appropriate time coordinate (e.g., by selecting the time coordinate associated with a specified data_var).

One note that might simplify the main logic above (at Step 2). Our version of _get_cftime_coords appears to work on both cf_compliant and non-cf_compliant data so a revised version of decode_non_cf_time (e.g., replaced with decode_time) might simply be applied to every time axis (which would save putting the cf_compliant data into a DataArray and having xarray decode). One reason we might not do this is that xarray may have more robust functionality to decode temporal data.

pochedls commented 2 years ago

In looking at Test 3, I see another issue here. The time axis needed for temporal operations is not actually assigned as a dimension. I thought this could be fixed by dropping time_counter and just using time_centered: this would get rid of the multiple time axes which is the issue for Test 1 and 2 and would also be the appropriate coordinate for temporal operations:

import numpy as np, xarray as xr, xcdat as xc
TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF_1.nc'
ds = xr.open_dataset(TS_File, decode_times=False)
ds['time_counter'] = ds['time_centered']
ds = ds.drop_vars('time_centered')
ds = ds.rename({'time_counter': 'time_centered'})
ds.to_netcdf('test.nc', 'w', 'NETCDF3_CLASSIC')

ds = xc.open_dataset('test.nc')
ds.temporal.group_average('thetao', freq='year', weighted=True)

But this yields a different error:

AttributeError: 'DataArray' object has no attribute 'time'

@tomvothecoder - do you think this is a separate issue in the temporal averaging logic?

tomvothecoder commented 2 years ago

In looking at Test 3, I see another issue here. The time axis needed for temporal operations is not actually assigned as a dimension. I thought this could be fixed by dropping time_counter and just using time_centered: this would get rid of the multiple time axes which is the issue for Test 1 and 2 and would also be the appropriate coordinate for temporal operations:

import numpy as np, xarray as xr, xcdat as xc
TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF_1.nc[](https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF_1.nc)'
ds = xr.open_dataset(TS_File, decode_times=False)
ds['time_counter'] = ds['time_centered']
ds = ds.drop_vars('time_centered')
ds = ds.rename({'time_counter': 'time_centered'})
ds.to_netcdf('test.nc', 'w', 'NETCDF3_CLASSIC')

ds = xc.open_dataset('test.nc')
ds.temporal.group_average('thetao', freq='year', weighted=True)

But this yields a different error:

AttributeError: 'DataArray' object has no attribute 'time'

@tomvothecoder - do you think this is a separate issue in the temporal averaging logic?

Yes, this issue is related to the TemporalAccessor class methods incorrectly performing static references to "time" as the name of the time dimension.

I opened up a separate issue and the related PR that fixes this:

pochedls commented 2 years ago

I just wanted to note that I hit this error message when using the regridder (use the code in this issue with one of the two datasets below).

KeyError: "Receive multiple variables for key 'longitude': {'longitude', 'lon'}. Expected only one. Please pass a list ['longitude'] instead to get all variables matching 'longitude'."

* /p/css03/esgf_publish/CMIP6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/SImon/siconc/gn/v20200218/
* /p/css03/esgf_publish/CMIP6/CMIP/BCC/BCC-ESM1/historical/r1i1p1f1/SImon/siconc/gn/v20200218/

It's possible that efforts to address this issue (e.g., here) will fix the regridding issue.

tomvothecoder commented 1 year ago

Hi @oliviermarti and @jypeter,

Thank you for your patience as we worked to solve this issue. I just wanted to inform you that that this issue is now resolved in xcdat=0.4.0 (#343). You should be able to add bounds, decode times, and perform xcdat temporal operations now. Give it a shot and let us know how it goes!

Example code:

import numpy as np,
import xarray as xr

import xcdat as xc

print(xc.__version)
# 0.4.0

TS_File ='https://thredds-su.ipsl.fr/thredds/dodsC/ipsl_thredds/omamce/TestCases/XCDAT/thetao_CF_1.nc'
dTS = xc.open_dataset(TS_File, add_bounds=True, decode_times=True)

print(dTS)
# <xarray.Dataset>
# Dimensions:               (y: 149, x: 182, olevel: 31, time_counter: 12,
#                            axis_nbounds: 2, bnds: 2)
# Coordinates:
#   * olevel                (olevel) float32 5.0 15.0 25.0 ... 4.75e+03 5.25e+03
#   * time_counter          (time_counter) object 1850-01-16 12:00:00 ... 1850-...
# Dimensions without coordinates: y, x, axis_nbounds, bnds
# Data variables:
#     nav_lat               (y, x) float32 ...
#     nav_lon               (y, x) float32 ...
#     thetao                (time_counter, olevel, y, x) float32 ...
#     time_centered         (time_counter) float64 ...
#     time_centered_bounds  (time_counter, axis_nbounds) float64 ...
#     time_counter_bnds     (time_counter, bnds) object 1850-01-01 18:00:00 ......
#     olevel_bnds           (olevel, bnds) float32 -0.000237 10.0 ... 5.5e+03
# Attributes:
#     name:                            CM6-SW-pi-Par01_1m_18500101_18501231_grid_T
#     description:                     Created by xios
#     title:                           Created by xios
#     Conventions:                     CF-1.6
#     timeStamp:                       2021-Oct-22 08:27:34 GMT
#     uuid:                            bf661a39-5d4a-4999-aaae-56f09c3123c5
#     LongName:                        IPSLCM6.1.10-LR
#     NCO:                             netCDF Operators version 4.9.9 (Homepage...
#     _NCProperties:                   version=2,netcdf=4.7.4,hdf5=1.10.6
#     history:                         Thu Jul 28 16:24:47 2022: ncatted -a coo...
#     DODS_EXTRA.Unlimited_Dimension:  time_counter

tos_year = dTS.temporal.group_average("thetao", freq="year", weighted=True)

# <xarray.Dataset>
# Dimensions:       (y: 149, x: 182, olevel: 31, bnds: 2, time_counter: 1)
# Coordinates:
#   * olevel        (olevel) float32 5.0 15.0 25.0 ... 4.25e+03 4.75e+03 5.25e+03
#   * time_counter  (time_counter) object 1850-01-01 00:00:00
# Dimensions without coordinates: y, x, bnds
# Data variables:
#     nav_lat       (y, x) float32 ...
#     nav_lon       (y, x) float32 ...
#     olevel_bnds   (olevel, bnds) float32 -0.000237 10.0 10.0 ... 5e+03 5.5e+03
#     thetao        (time_counter, olevel, y, x) float64 nan nan nan ... nan nan
# Attributes:
#     name:                            CM6-SW-pi-Par01_1m_18500101_18501231_grid_T
#     description:                     Created by xios
#     title:                           Created by xios
#     Conventions:                     CF-1.6
#     timeStamp:                       2021-Oct-22 08:27:34 GMT
#     uuid:                            bf661a39-5d4a-4999-aaae-56f09c3123c5
#     LongName:                        IPSLCM6.1.10-LR
#     NCO:                             netCDF Operators version 4.9.9 (Homepage...
#     _NCProperties:                   version=2,netcdf=4.7.4,hdf5=1.10.6
#     history:                         Thu Jul 28 16:24:47 2022: ncatted -a coo...
#     DODS_EXTRA.Unlimited_Dimension:  time_counter

In your example file:

What changed in xcdat=0.4.0

oliviermarti commented 1 year ago

Le 10 nov. 2022 à 23:46, Tom Vo @.***> a écrit :

Hi @oliviermarti https://github.com/oliviermarti and @jypeter https://github.com/jypeter,

Thank you for your patience as we worked to solve this issue. I just wanted to inform you that that this issue is now resolved in xcdat=0.4.0 (#343 https://github.com/xCDAT/xcdat/pull/343). You should be able to add bounds, decode times, and perform xcdat temporal operations now. Give it a shot and let us know how it goes!

Hi,

Sorry for this late response.

This xcdat revision is OK. I can read and manipulate my data now. Thanks !

Olivier

— Olivier Marti LSCE Bât 714 p. 1049 MERMAID Team Normal situation : +33 1 69 08 77 27 Corona lockdown : +33 6 45 36 43 74 @.***