xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0
112 stars 12 forks source link

[Bug]: Additional dimension for lat_bnds and lon_bnds #390

Closed ShihengDuan closed 1 year ago

ShihengDuan commented 1 year ago

What happened?

When I load data with xcdat, sometimes it generates additional dimensions for lat_bnds and lon_bnds. An example code is here:

MIROC_cmip_tas = xc.open_dataset('tas/tas_MIROC5_piControl.nc')
MIROC_cmip_tas = MIROC_cmip_tas.sel(time=slice("2100-01-01", "2619-01-01"))
MIROC_cmip_tas = MIROC_cmip_tas.drop_vars('height')
print(MIROC_cmip_tas['lat_bnds'])
print(MIROC_cmip_tas['lon_bnds'])

It generates the following output:

<xarray.DataArray 'lat_bnds' (time: 6228, lat: 132, bnds: 2)>
[1644192 values with dtype=float64]
Coordinates:
  * lat      (lat) float64 -88.93 -87.54 -86.14 -84.74 ... 86.14 87.54 88.93
  * time     (time) object 2100-01-16 12:00:00 ... 2618-12-16 12:00:00
Dimensions without coordinates: bnds

For lon_bnds, an additional latitude dimension is generated.

<xarray.DataArray 'lon_bnds' (lat: 132, time: 6228, lon: 256, bnds: 2)>
[420913152 values with dtype=float64]
Coordinates:
  * lat      (lat) float64 -88.93 -87.54 -86.14 -84.74 ... 86.14 87.54 88.93
  * lon      (lon) float64 0.0 1.406 2.812 4.219 ... 354.4 355.8 357.2 358.6
  * time     (time) object 2100-01-16 12:00:00 ... 2618-12-16 12:00:00
Dimensions without coordinates: bnds

The situation changes for different files. Sometimes I got (time, lon, bnds) for lon_bnds.

The version of xCDAT is 0.4.0.

What did you expect to happen?

Keep only latitude/longitude and bnds dimensions for bnds variables.

Minimal Complete Verifiable Example

No response

Relevant log output

No response

Anything else we need to know?

No response

Environment

xCDAT is 0.4.0. xr.show_versions() yields the following

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.71.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.6.0
pandas: 1.4.4
numpy: 1.22.3
scipy: 1.8.1
netCDF4: 1.6.1
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: 2022.9.0
distributed: 2022.9.0
matplotlib: 3.5.2
cartopy: 0.21.0
seaborn: None
numbagg: None
fsspec: 2022.10.0
cupy: None
pint: None
sparse: 0.13.0
flox: None
numpy_groupies: None
setuptools: 65.5.0
pip: 22.2.2
conda: None
pytest: None
IPython: 8.6.0
sphinx: None
pochedls commented 1 year ago

When I look at the what I think is the original version of this dataset, I get the following:

import xcdat as xc
fn = '/p/css03/cmip5_css02/data/cmip5/output1/MIROC/MIROC5/piControl/mon/atmos/Amon/r1i1p1/tas/1/*.nc'
ds = xc.open_mfdataset(fn)

MergeError: conflicting values for variable 'lat_bnds' on objects to be combined. You can skip this check by specifying compat='override'.

See https://github.com/xCDAT/xcdat/issues/162 for more on this.

I think this error arises when the lat_bnds values vary across files, which I suspect may be what is happening here. It looks like the data has been re-saved out in tas/tas_MIROC5_piControl.nc. What does an ncdump -h look like @ShihengDuan?

ShihengDuan commented 1 year ago

Here is the output from ncdump -h:

(duan) /home/duan5/ISIMIP/MIROC5/piControl/tas$ ncdump -h tas_MIROC5_piControl.nc 
netcdf tas_MIROC5_piControl {
dimensions:
    lat = 132 ;
    lon = 256 ;
    time = 10440 ;
    bnds = 2 ;
variables:
    double lat(lat) ;
        lat:_FillValue = NaN ;
        lat:bounds = "lat_bnds" ;
        lat:units = "degrees_north" ;
        lat:axis = "Y" ;
        lat:long_name = "latitude" ;
        lat:standard_name = "latitude" ;
    double lon(lon) ;
        lon:_FillValue = NaN ;
        lon:bounds = "lon_bnds" ;
        lon:units = "degrees_east" ;
        lon:axis = "X" ;
        lon:long_name = "longitude" ;
        lon:standard_name = "longitude" ;
    double time(time) ;
        time:_FillValue = NaN ;
        time:bounds = "time_bnds" ;
        time:axis = "T" ;
        time:long_name = "time" ;
        time:standard_name = "time" ;
        time:units = "days since 2000-01-01" ;
        time:calendar = "noleap" ;
    double time_bnds(lat, time, bnds) ;
        time_bnds:_FillValue = NaN ;
        time_bnds:coordinates = "height" ;
    double lat_bnds(time, lat, bnds) ;
        lat_bnds:_FillValue = NaN ;
        lat_bnds:coordinates = "height" ;
    double lon_bnds(lat, time, lon, bnds) ;
        lon_bnds:_FillValue = NaN ;
        lon_bnds:coordinates = "height" ;
    double height ;
        height:_FillValue = NaN ;
        height:units = "m" ;
        height:axis = "Z" ;
        height:positive = "up" ;
        height:long_name = "height" ;
        height:standard_name = "height" ;
    float tas(time, lat, lon) ;
        tas:_FillValue = 1.e+20f ;
        tas:standard_name = "air_temperature" ;
        tas:long_name = "Near-Surface Air Temperature" ;
        tas:units = "K" ;
        tas:original_name = "T2" ;
        tas:cell_methods = "time: mean" ;
        tas:cell_measures = "area: areacella" ;
        tas:history = "2011-09-15T06:51:00Z altered by CMOR: Treated scalar dimension: \'height\'. 2011-09-15T06:51:00Z altered by CMOR: replaced missing value flag (-999) with standard missing value (1e+20). 2011-09-15T06:51:00Z altered by CMOR: Inverted axis: lat." ;
        tas:associated_files = "baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_MIROC5_piControl_r0i0p0.nc areacella: areacella_fx_MIROC5_piControl_r0i0p0.nc" ;
        tas:coordinates = "height" ;
        tas:missing_value = 1.e+20f ;

// global attributes:
        :institution = "AORI (Atmosphere and Ocean Research Institute, The University of Tokyo, Chiba, Japan), NIES (National Institute for Environmental Studies, Ibaraki, Japan), JAMSTEC (Japan Agency for Marine-Earth Science and Technology, Kanagawa, Japan)" ;
        :institute_id = "MIROC" ;
        :experiment_id = "piControl" ;
        :source = "MIROC5 2010 atmosphere: MIROC-AGCM6 (T85L40); ocean: COCO (COCO4.5, 256x224 L50); sea ice: COCO (COCO4.5); land: MATSIRO (MATSIRO, L6); aerosols: SPRINTARS (SPRINTARS 5.00, T85L40)" ;
        :model_id = "MIROC5" ;
        :forcing = "N/A" ;
        :parent_experiment_id = "N/A" ;
        :parent_experiment_rip = "N/A" ;
        :branch_time = 0. ;
        :contact = "Masahiro Watanabe (hiro@aori.u-tokyo.ac.jp), Seita Emori (emori@nies.go.jp), Masayoshi Ishii (ism@jamstec.go.jp), Masahide Kimoto (kimoto@aori.u-tokyo.ac.jp)" ;
        :references = "Watanabe et al., 2010: Improved climate simulation by MIROC5: Mean states, variability, and climate sensitivity. J. Climate, 23, 6312-6335" ;
        :initialization_method = 1 ;
        :physics_version = 1 ;
        :tracking_id = "d5a62be5-f4e9-4e63-b63f-5fea30c77a52" ;
        :product = "output" ;
        :experiment = "pre-industrial control" ;
        :frequency = "mon" ;
        :creation_date = "2011-09-15T06:51:00Z" ;
        :history = "2011-09-15T06:51:00Z CMOR rewrote data to comply with CF standards and CMIP5 requirements." ;
        :Conventions = "CF-1.4" ;
        :project_id = "CMIP5" ;
        :table_id = "Table Amon (26 July 2011) 976b7fd1d9e1be31dddd28f5dc79b7a1" ;
        :title = "MIROC5 model output prepared for CMIP5 pre-industrial control" ;
        :parent_experiment = "N/A" ;
        :modeling_realm = "atmos" ;
        :realization = 1 ;
        :cmor_version = "2.7.1" ;

I concatenated all the piControl simulation with the following code:

files = glob.glob('data/tas_Amon_*.nc')
# print(sorted(files))
tas = xa.open_mfdataset(sorted(files))
print(tas)
print(np.unique(tas.time, return_counts=True))
print('Done')
print(len(np.unique(tas.time)))
print(len(tas.time))

tas.to_netcdf('tas_MIROC5_piControl.nc')

I forgot why I didn't do tas=tas.compute() before saving to nc files, but when I add this code, it still shows time dimension. See below:

(duan) /home/duan5/ISIMIP/MIROC5/piControl/tas$ python
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xcdat as xc
>>> data = xc.open_dataset('tas_MIROC5_piControl_compute.nc')
>>> data['lat_bnds']
<xarray.DataArray 'lat_bnds' (time: 10440, lat: 132, bnds: 2)>
[2756160 values with dtype=float64]
Coordinates:
  * lat      (lat) float64 -88.93 -87.54 -86.14 -84.74 ... 86.14 87.54 88.93
  * time     (time) object 2000-01-16 12:00:00 ... 2869-12-16 12:00:00
    height   float64 2.0
Dimensions without coordinates: bnds
>>> data = xc.open_dataset('tas_MIROC5_piControl.nc')
>>> data['lat_bnds']
<xarray.DataArray 'lat_bnds' (time: 10440, lat: 132, bnds: 2)>
[2756160 values with dtype=float64]
Coordinates:
  * lat      (lat) float64 -88.93 -87.54 -86.14 -84.74 ... 86.14 87.54 88.93
  * time     (time) object 2000-01-16 12:00:00 ... 2869-12-16 12:00:00
    height   float64 2.0
Dimensions without coordinates: bnds
>>> 

tas_MIROC5_piControl.nc is the original file and tas_MIROC5_piControl_compute.nc is what I get when I do tas.compute() before saving to nc file.

pochedls commented 1 year ago

@ShihengDuan – I realize that when I go back to the original issue on this ticket, if I use xarray instead of xcdat when opening the dataset (or the original data), I think I have the same issue (lat_bnds has time as a dim). I think this may either be an issue with the underlying data that needs to be resolved or an issue with xarray (on which xcdat is based).

Do you agree?

ShihengDuan commented 1 year ago

Oh yes, I didn't realize that. I guess it is related to how the original data file is formatted?

tomvothecoder commented 1 year ago

I think this may either be an issue with the underlying data that needs to be resolved or an issue with xarray (on which xcdat is based).

Based on the error below, I believe this is an issue with the underlying data rather than with xarray.

import xcdat as xc
fn = '/p/css03/cmip5_css02/data/cmip5/output1/MIROC/MIROC5/piControl/mon/atmos/Amon/r1i1p1/tas/1/*.nc'
ds = xc.open_mfdataset(fn)
MergeError: conflicting values for variable 'lat_bnds' on objects to be combined. You can skip this check by specifying compat='override'.

Here is info from the FAQs that covers this specific issue and a workaround with xarray: https://xcdat.readthedocs.io/en/latest/faqs.html#how-do-i-open-a-multi-file-dataset-with-values-that-conflict

In xarray, the default setting for checking compatibility across a multi-file dataset is compat='no_conflicts'. If conflicting values exists between files, xarray raises MergeError: conflicting values for variable <VARIABLE NAME> on objects to be combined. You can skip this check by specifying compat="override".

If you still intend on working with these datasets and recognize the source of the issue (e.g., minor floating point diffs), follow the instructions below. Please understand the potential implications before proceeding!

The xcdat/xarray workaround is to configure open_dataset()/open_mfdataset() with these settings:

import xcdat as xc
fn = '/p/css03/cmip5_css02/data/cmip5/output1/MIROC/MIROC5/piControl/mon/atmos/Amon/r1i1p1/tas/1/*.nc'

ds = xc.open_mfdataset(fn, compat="override", join="override", coords="minimal")

print(ds.lat_bnds)

# Notice how with `coords="minimal"`, time dimension is not being added to lat_bnds and lon_bnds
<xarray.DataArray 'lat_bnds' (lat: 128, bnds: 2)>
dask.array<open_dataset-752fd673e13445722f06a881808577f1lat_bnds, shape=(128, 2), dtype=float64, chunksize=(128, 2), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float64 -88.93 -87.54 -86.14 -84.74 ... 86.14 87.54 88.93
    height   float64 ...
Dimensions without coordinates: bnds
  1. compat="override": skip comparing and pick variable from first
  2. join="override": if indexes are of same size, rewrite indexes to be those of the first object with that dimension. Indexes for the same dimension must have the same size in all objects.
  3. coords="minimal": Only coordinates in which the dimension already appears are included.
    • This avoids dimensions unintentionally being added to variables
ShihengDuan commented 1 year ago

Thanks for the information. This makes sense to me.