xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0

[Bug]: Possible inconsistency with xarray when spanning across files #394

Open gleckler1 opened 1 year ago

gleckler1 commented 1 year ago

What happened?

See attached markdown file prepared by Andrew Manaser (RSS). xcdat_open_mfdataset_issue_README.md

What did you expect to happen?

This problem occurs when using the open_mfdataset method with xcdat (occasionally called xc in this document) to concatenate multiple files of monthly average gridded lat/lon geophysical netCDF4 data along a new axis (in this case time).

Minimal Complete Verifiable Example

No response

Relevant log output

No response

Anything else we need to know?

No response

Environment

Should be included on attached file.

pochedls commented 1 year ago

@gleckler1 - is it possible to point to some publicly available data to look into this? I don't think this can be resolved without more information (or data to investigate the issue).

pochedls commented 1 year ago

Original comment below. Example data can be downloaded from: https://data.remss.com/smap/SSS/V05.0/FINAL/L3/monthly/

XCDAT Issue with open_mfdataset

This document was written to briefly illustrate an issue with xcdat that does not appear to be present in xarray.

Issue

This problem occurs when using the open_mfdataset method with xcdat (occasionally called xc in this document) to concatenate multiple files of monthly average gridded lat/lon geophysical netCDF4 data along a new axis (in this case time). The following section illustrates how the open_mfdataset method with xarray (occasionally called xr in this document) is used in a code whose purpose is to make a dataset from the files mentioned above obs4MIPs compliant.

xarray.open_mfdataset

In the code being used, xr.open_mfdataset is called as follows:

f = xr.open_mfdataset(inputDatasets, mask_and_scale=False, decode_times=False, combine='nested', concat_dim='time', preprocess=extract_date)

Where inputDatasets is a list of netCDF4 files to be concatenated, and the function extract_date is given by:

def extract_date(ds):   # preprocessing function when opening files
    for var in ds.variables:
        if var == 'time':
            dataset_time = ds[var].values
            dataset_units = ds[var].units
            # assign returns a new Dataset, so bind the result back to ds
            ds = ds.assign(time=dataset_time)
            ds["time"].attrs = {"units": dataset_units}
    return ds

Note that each netCDF4 file in inputDatasets contains a time array of length one (i.e., a single value), which corresponds to the time at the center of the month. These values will eventually fill the new time coordinate in the dataset f created by xr.open_mfdataset.

Several keywords are set in xr.open_mfdataset. These are necessary for reading in these files for the following reasons:

  • mask_and_scale: needs to be False for these files, otherwise _FillValue information will not be provided when the files are read in.
  • decode_times: needs to be set to False. Prevents xarray from converting time data (in 'seconds since...' or 'days since...') to datetime.
  • combine: needs to be set to 'nested' per the xarray documentation in order to assign concat_dim (https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html; see below).
  • concat_dim: a string with the name of the axis/dimension that the gridded lat/lon coordinate variables will be concatenated along. Simply set to 'time' in this code, but could be any string the user wants, in theory.
  • preprocess: set to extract_date. Setting this keyword tells xr.open_mfdataset to run each dataset through the extract_date function before doing anything else.

This call appropriately reads in all monthly average gridded lat/lon data in the netCDF4 files given by the inputDatasets list and concatenates these data along a new time dimension. This results in an output dataset f with arrays of gridded variable maps having the dimensions time, lat, and lon, where the length of time corresponds to the total number of months in the dataset. However, when xc.open_mfdataset is used instead of xr.open_mfdataset, several issues arise. These are detailed below.

xcdat.open_mfdataset

When the same call to open_mfdataset shown above is made with xcdat:

f = xc.open_mfdataset(inputDatasets, mask_and_scale=False, decode_times=False, combine='nested', concat_dim='time', preprocess=extract_date)

The following error is returned:

xarray.core.merge.MergeError: conflicting values for variable `nobs` on objects to be combined.  You can skip this check by specifying compat=`override`

In this particular error, nobs refers to one of the lat/lon gridded fields in the netCDF4 files to be combined (I believe it is one of the first fields read by this call). When compat='override' is set, the following error is returned:

ValueError: Cannot specify both coords='different' and compat='override'

This error refers to the default value of the coords keyword. When coords is changed to either all or minimal (options per the xarray.open_mfdataset documentation given above), the xc.open_mfdataset call runs and creates a dataset f with a new time coordinate as expected. However, instead of the output dataset containing arrays of gridded maps with dimensions of [time, lat, lon], there is no time dimension, simply maps with two spatial dimensions (lat and lon). I believe these maps correspond to the gridded data in the first or last file in the inputDatasets file list.
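The coords/compat interaction can be reproduced with plain xarray directly (toy data; the variable name is illustrative, not from the original files):

```python
import numpy as np
import xarray as xr

# A minimal dataset with a length-one time coordinate, mimicking one monthly file.
ds = xr.Dataset({"pr": (("time",), np.zeros(1))}, coords={"time": [0]})

# concat's coords keyword defaults to "different", which xarray rejects
# in combination with compat="override".
try:
    xr.concat([ds, ds], dim="time", compat="override")
except ValueError as err:
    print(err)  # Cannot specify both coords='different' and compat='override'

# Relaxing coords (e.g. coords="minimal") lets the call proceed.
out = xr.concat([ds, ds], dim="time", compat="override", coords="minimal")
print(out.sizes["time"])
```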

Concluding

It is possible that there is a sequence of keywords using xcdat.open_mfdataset that will produce the same output dataset as the xarray.open_mfdataset call given above. However, the purpose of this document is to note how the same method using the same keywords in xcdat and xarray produces different results. Therefore, we feel it is best to bring this to the attention of the developers of xcdat.

pochedls commented 1 year ago

I think this issue arises because xarray uses data_vars="all" when opening multifile datasets and xcdat uses data_vars="minimal".

I think this issue can be resolved with:

f = xr.open_mfdataset(inputDatasets, mask_and_scale=False, decode_times=False, combine='nested', concat_dim='time', preprocess=extract_date, data_vars='all')

I think specifying data_vars="all" solves the immediate issue. I guess a question is whether xcdat should continue to use data_vars="minimal" as the default (I don't remember why this was done, but it presumably was useful at some point).
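A toy sketch of the suspected mechanism with plain xarray (illustrative variable names): a field like nobs that differs between files but lacks a time dimension cannot be merged under data_vars="minimal", producing the MergeError quoted above, while data_vars="all" concatenates it along the new dimension instead:

```python
import numpy as np
import xarray as xr

# Two "monthly files" whose `nobs` map differs but has no `time` dimension.
def make_ds(t, nobs_value):
    return xr.Dataset(
        {"nobs": (("lat",), np.full(3, nobs_value))},
        coords={"time": [t], "lat": [-1.0, 0.0, 1.0]},
    )

ds_a, ds_b = make_ds(0, 5.0), make_ds(1, 7.0)

# With data_vars="minimal", `nobs` is not concatenated; the differing copies
# must be merged instead, which raises a MergeError.
try:
    xr.concat([ds_a, ds_b], dim="time", data_vars="minimal")
except xr.MergeError as err:
    print("MergeError:", err)

# With data_vars="all", `nobs` gains the `time` dim and concatenation succeeds.
out = xr.concat([ds_a, ds_b], dim="time", data_vars="all")
print(out.nobs.dims)  # ('time', 'lat')
```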

chengzhuzhang commented 1 year ago

@pochedls Thank you for summarizing this issue and troubleshooting. Regarding data_vars="minimal", I recall occasionally needing to specify it when using open_mfdataset. @tomvothecoder documented the issue here

chengzhuzhang commented 1 year ago

With data_vars='minimal', if the time dim does not appear in a data variable, the variable won't be stacked. If I understand correctly, in this use case the expected behavior is to construct a time dimension even if the variable is invariant in time?
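A small illustration of this point with plain xarray (toy data; names are illustrative): under data_vars="minimal" a variable lacking the concat dimension keeps its original shape, while data_vars="all" broadcasts it along the new dimension:

```python
import numpy as np
import xarray as xr

# Two toy "monthly" datasets: `pr` has a time dim, `lat_bnds` does not.
def make_ds(t):
    return xr.Dataset(
        {
            "pr": (("time", "lat"), np.zeros((1, 3))),
            "lat_bnds": (("lat", "bnds"), np.zeros((3, 2))),
        },
        coords={"time": [t], "lat": [-1.0, 0.0, 1.0]},
    )

ds_a, ds_b = make_ds(0), make_ds(1)

# data_vars="minimal": only variables that already have `time` are concatenated.
minimal = xr.concat([ds_a, ds_b], dim="time", data_vars="minimal")
print(minimal.lat_bnds.dims)  # ('lat', 'bnds') -- unchanged

# data_vars="all" (xarray's default): every data variable gains the concat dim.
allvars = xr.concat([ds_a, ds_b], dim="time", data_vars="all")
print(allvars.lat_bnds.dims)  # ('time', 'lat', 'bnds')
```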

tomvothecoder commented 1 year ago

Hi @pochedls and @chengzhuzhang, here's what I came up with based on your comments.

Suspected Root Cause

As Jill pointed out in #143, data_vars="all" concatenates the time dimension onto variables such as lat_bnds and lon_bnds. This seems like unexpected/unwanted behavior that also breaks spatial averaging. We had to set data_vars="minimal" to stop this from happening.

There is a closed xarray issue that discusses adding data_vars="minimal" to address this issue.

xarray docs on data_vars kwarg:

  • data_vars ({"minimal", "different", "all"} or list of str, default: "all") –

    These data variables will be concatenated together:

    • “minimal”: Only data variables in which the dimension already appears are included.
    • “different”: Data variables which are not equal (ignoring attributes) across all datasets are also concatenated (as well as all for which dimension already appears). Beware: this option may load the data payload of data variables into memory if they are not already loaded.
    • “all”: All data variables will be concatenated.
    • list of str: The listed data variables will be concatenated, in addition to the “minimal” data variables.

Example:

import os
import xcdat as xc

dir = "/p/user_pub/PCMDIobs/PCMDIobs2/atmos/3hr/pr/TRMM-3B43v-7/gn/v20200707"
file_pattern = "pr_3hr_TRMM-3B43v-7_BE_gn_v20200707_1998*.nc"
files = os.path.join(dir, file_pattern)

# `data_vars="minimal"` (xcdat default)
# ------------------------------------------------------------
ds1 = xc.open_mfdataset(files, data_vars="minimal")

# Notice how "lat_bnds" retains its original dimensions (no "time" dimension)
print(ds1.lat_bnds)
"""
<xarray.DataArray 'lat_bnds' (lat: 400, bnds: 2)>
dask.array<where, shape=(400, 2), dtype=float64, chunksize=(400, 2), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float64 -49.88 -49.62 -49.38 -49.12 ... 49.38 49.62 49.88
Dimensions without coordinates: bnds
"""

# Spatial averaging works as expected
pr_global1 = ds1.spatial.average("pr", axis=["X", "Y"])

# `data_vars="all"` (xarray default)
# ------------------------------------------------------------
ds2 = xc.open_mfdataset(files, data_vars="all")

# Notice how "lat_bnds" now has a "time" dimension (unexpected)
print(ds2.lat_bnds)
"""
<xarray.DataArray 'lat_bnds' (time: 2920, lat: 400, bnds: 2)>
dask.array<concatenate, shape=(2920, 400, 2), dtype=float64, chunksize=(248, 400, 2), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object 1998-01-01 00:00:00 ... 1998-12-31 21:00:00
  * lat      (lat) float64 -49.88 -49.62 -49.38 -49.12 ... 49.38 49.62 49.88
Dimensions without coordinates: bnds
"""

# This results in the error below with spatial averaging
pr_global2 = ds2.spatial.average("pr", axis=["X", "Y"])
"""
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/home/vo13/xCDAT/xcdat/qa/issue-394-data-vars/qa.py in line 5
      15 #%%
      16
      17 # `data_vars="all"` (xarray default)
      18 ds2 = xc.open_mfdataset(files, data_vars="all")
----> 19 pr_global2 = ds2.spatial.average("pr", axis=["X", "Y"])

File ~/xCDAT/xcdat/xcdat/spatial.py:195, in SpatialAccessor.average(self, data_var, axis, weights, keep_weights, lat_bounds, lon_bounds)
    192 elif isinstance(weights, xr.DataArray):
    193     self._weights = weights
--> 195 self._validate_weights(dv, axis)
    196 ds[dv.name] = self._averager(dv, axis)
    198 if keep_weights:

File ~/xCDAT/xcdat/xcdat/spatial.py:707, in SpatialAccessor._validate_weights(self, data_var, axis)
    705     dim_name = get_dim_keys(data_var, key)
    706     if dim_name not in self._weights.dims:
--> 707         raise KeyError(
    708             f"The weights DataArray does not include an {key} axis, or the "
    709             "dimension names are not the same."
    710         )
    712 # Check the weight dim sizes equal data var dim sizes.
    713 dim_sizes = {key: data_var.sizes[key] for key in self._weights.sizes.keys()}

KeyError: 'The weights DataArray does not include an X axis, or the dimension names are not the same.'
"""

Potential Solution(s)

In this specific issue, it sounds like there are cases where we do want to concatenate on dimensions that are not in the original data var (e.g., time).

With this knowledge, we can either:

  1. Keep default data_vars="minimal"
    • User can override data_vars as needed (e.g., data_vars="all")
    • It seems like in most cases we do not want to concatenate dimensions on variables unexpectedly
  2. Update default data_vars="all"
    • I wonder if users might raise issues regarding dimensions unexpectedly being concatenated
    • This is the xarray default, so the behaviors will be consistent (even if unexpected)

I'll think about this more, but do either of you have any thoughts here?

pochedls commented 1 year ago

I think we should keep this as is (default data_vars="minimal"), but perhaps add some notes about this case/error in open_mfdataset(). [I haven't thought enough about "what" to add – I see there are already good notes on data_vars].