gleckler1 opened 1 year ago
@gleckler1 - is it possible to point to some publicly available data to look into this? I don't think this can be resolved without more information (or data to investigate the issue).
Original comment below. Example data can be downloaded from: https://data.remss.com/smap/SSS/V05.0/FINAL/L3/monthly/
This document was written to briefly illustrate an issue with `xcdat` that does not appear to be present in `xarray`.
This problem occurs when using the `open_mfdataset` method with `xcdat` (occasionally called `xc` in this document) to concatenate multiple files of monthly average gridded `lat`/`lon` geophysical netCDF4 data along a new axis (in this case `time`). The following section illustrates how the `open_mfdataset` method with `xarray` (occasionally called `xr` in this document) is used in a code whose purpose is to make a dataset from the files mentioned above obs4MIPs compliant.
In the code being used, `xr.open_mfdataset` is called as follows:

```python
f = xr.open_mfdataset(inputDatasets, mask_and_scale=False, decode_times=False,
                      combine='nested', concat_dim='time', preprocess=extract_date)
```
Where `inputDatasets` is a list of netCDF4 files to be concatenated, and the function `extract_date` is given by:

```python
def extract_date(ds):  # preprocessing function when opening files
    for var in ds.variables:
        if var == 'time':
            dataset_time = ds[var].values
            dataset_units = ds[var].units
            # Note: Dataset.assign returns a new Dataset, so its result
            # must be captured for the assignment to take effect.
            ds = ds.assign(time=dataset_time)
            ds["time"].attrs = {"units": dataset_units}
    return ds
```
Note that each netCDF4 file in `inputDatasets` contains a `time` array of length one (i.e., a scalar value), which corresponds to the time at the center of the month. These values will eventually fill the new `time` coordinate in the dataset `f` created by `xr.open_mfdataset`.
Several keywords are set in `xr.open_mfdataset`. These are necessary for reading in these files for the following reasons:

- `mask_and_scale`: needs to be `False` for these files, otherwise `_FillValue` information will not be provided when the files are read in.
- `decode_times`: needs to be set to `False`. This prevents `xarray` from converting time data (in 'seconds since...' or 'days since...') to datetime.
- `combine`: needs to be set to `'nested'` per the xarray documentation in order to assign `concat_dim` (https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html; see below).
- `concat_dim`: a string with the name of the axis/dimension that the gridded lat/lon coordinate variables will be concatenated along. Simply set to `'time'` in this code, but it could in theory be any string the user wants.
- `preprocess`: set to `extract_date`. Setting this keyword tells `xr.open_mfdataset` to run each dataset through the `extract_date` function before doing anything else.
This call appropriately reads in all monthly average gridded `lat`/`lon` data in the netCDF4 files given by the `inputDatasets` list and concatenates these data along a new `time` dimension. This results in an output dataset `f` with arrays of gridded variable maps having the dimensions `time`, `lat`, and `lon`, where the length of `time` corresponds to the total number of months in the dataset. However, when `xc.open_mfdataset` is used instead of `xr.open_mfdataset`, several issues arise. These are detailed below.
When the same call to `open_mfdataset` shown above is made with `xcdat`:

```python
f = xc.open_mfdataset(inputDatasets, mask_and_scale=False, decode_times=False,
                      combine='nested', concat_dim='time', preprocess=extract_date)
```

The following error is returned:

```
xarray.core.merge.MergeError: conflicting values for variable 'nobs' on objects to be combined. You can skip this check by specifying compat='override'
```

In this particular error, `nobs` refers to one of the lat/lon gridded fields in the netCDF4 files to be combined (I believe it is one of the first fields read by this call). When `compat='override'` is set, the following error is returned:

```
ValueError: Cannot specify both coords='different' and compat='override'
```
This error refers to the default value of the `coords` keyword. When the `coords` keyword is changed to either `'all'` or `'minimal'` (options per the `xarray.open_mfdataset` documentation linked above), the `xc.open_mfdataset` call runs and creates a dataset `f` with a new `time` coordinate as expected. However, instead of the output dataset containing arrays of gridded maps with dimensions of [`time`, `lat`, `lon`], there is no `time` dimension, simply maps with two spatial dimensions (`lat` and `lon`). I believe these maps correspond to the gridded data in the first or last file in the `inputDatasets` file list.
It is possible that there is a combination of keywords for `xcdat.open_mfdataset` that will produce the same output dataset as the `xarray.open_mfdataset` call given above. However, the purpose of this document is to note how the same method with the same keywords produces different results in `xcdat` and `xarray`. Therefore, we feel it is best to bring this to the attention of the developers of `xcdat`.
I think this issue arises because `xarray` uses `data_vars="all"` when opening multifile datasets and `xcdat` uses `data_vars="minimal"`.
I think this issue can be resolved with:

```python
f = xc.open_mfdataset(inputDatasets, mask_and_scale=False, decode_times=False,
                      combine='nested', concat_dim='time', preprocess=extract_date,
                      data_vars='all')
```

I think specifying `data_vars="all"` solves the immediate issue. I guess a question is whether `xcdat` should continue to use `data_vars="minimal"` as the default (I don't remember why this was done, but it presumably was useful at some point).
@pochedls Thank you for summarizing this issue and troubleshooting.

Regarding `data_vars="minimal"`, I recall that occasionally I need to specify it when using `open_mfdataset`. @tomvothecoder documented the issue here.

With `data_vars='minimal'`, for instance, if the `time` dim does not appear in a data variable, the variable won't be stacked. If I understand correctly, in this use case, the expected behavior is to construct a time dimension even if the variable is invariant in time?
Hi @pochedls and @chengzhuzhang, here's what I came up with based on your comments.

As Jill pointed out in #143, `data_vars="all"` concatenates the `time` dimension to variables such as `lat_bnds` and `lon_bnds`. This seems like unexpected/unwanted behavior that also breaks spatial averaging. We had to set `data_vars="minimal"` to stop this from happening.

There is a closed xarray issue that discusses adding `data_vars="minimal"` to address this issue.
xarray docs on the `data_vars` kwarg:

> **data_vars** (`{"minimal", "different", "all"}` or list of str, default: `"all"`) – These data variables will be concatenated together:
>
> - "minimal": Only data variables in which the dimension already appears are included.
> - "different": Data variables which are not equal (ignoring attributes) across all datasets are also concatenated (as well as all for which the dimension already appears). Beware: this option may load the data payload of data variables into memory if they are not already loaded.
> - "all": All data variables will be concatenated.
> - list of str: The listed data variables will be concatenated, in addition to the "minimal" data variables.
Example:

```python
import os

import xcdat as xc

dir = "/p/user_pub/PCMDIobs/PCMDIobs2/atmos/3hr/pr/TRMM-3B43v-7/gn/v20200707"
file_pattern = "pr_3hr_TRMM-3B43v-7_BE_gn_v20200707_1998*.nc"
files = os.path.join(dir, file_pattern)

# `data_vars="minimal"` (xcdat default)
# ------------------------------------------------------------
ds1 = xc.open_mfdataset(files, data_vars="minimal")

# Notice how "lat_bnds" retains its original dimensions (no "time" dimension)
print(ds1.lat_bnds)
"""
<xarray.DataArray 'lat_bnds' (lat: 400, bnds: 2)>
dask.array<where, shape=(400, 2), dtype=float64, chunksize=(400, 2), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float64 -49.88 -49.62 -49.38 -49.12 ... 49.38 49.62 49.88
Dimensions without coordinates: bnds
"""

# Spatial averaging works as expected
pr_global1 = ds1.spatial.average("pr", axis=["X", "Y"])

# `data_vars="all"` (xarray default)
# ------------------------------------------------------------
ds2 = xc.open_mfdataset(files, data_vars="all")

# Notice how "lat_bnds" now has a "time" dimension (unexpected)
print(ds2.lat_bnds)
"""
<xarray.DataArray 'lat_bnds' (time: 2920, lat: 400, bnds: 2)>
dask.array<concatenate, shape=(2920, 400, 2), dtype=float64, chunksize=(248, 400, 2), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object 1998-01-01 00:00:00 ... 1998-12-31 21:00:00
  * lat      (lat) float64 -49.88 -49.62 -49.38 -49.12 ... 49.38 49.62 49.88
Dimensions without coordinates: bnds
"""

# This results in the error below with spatial averaging
pr_global2 = ds2.spatial.average("pr", axis=["X", "Y"])
"""
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/home/vo13/xCDAT/xcdat/qa/issue-394-data-vars/qa.py in line 5
     15 #%%
     16
     17 # `data_vars="all"` (xarray default)
     18 ds2 = xc.open_mfdataset(files, data_vars="all")
---> 19 pr_global2 = ds2.spatial.average("pr", axis=["X", "Y"])

File ~/xCDAT/xcdat/xcdat/spatial.py:195, in SpatialAccessor.average(self, data_var, axis, weights, keep_weights, lat_bounds, lon_bounds)
    192 elif isinstance(weights, xr.DataArray):
    193     self._weights = weights
--> 195 self._validate_weights(dv, axis)
    196 ds[dv.name] = self._averager(dv, axis)
    198 if keep_weights:

File ~/xCDAT/xcdat/xcdat/spatial.py:707, in SpatialAccessor._validate_weights(self, data_var, axis)
    705 dim_name = get_dim_keys(data_var, key)
    706 if dim_name not in self._weights.dims:
--> 707     raise KeyError(
    708         f"The weights DataArray does not include an {key} axis, or the "
    709         "dimension names are not the same."
    710     )
    712 # Check the weight dim sizes equal data var dim sizes.
    713 dim_sizes = {key: data_var.sizes[key] for key in self._weights.sizes.keys()}

KeyError: 'The weights DataArray does not include an X axis, or the dimension names are not the same.'
"""
```
In this specific issue, it sounds like there are cases where we do want to concatenate on dimensions that are not in the original data var (e.g., `time`).
With this knowledge, we can either:

1. Keep `data_vars="minimal"` as the default and have users override `data_vars` as needed (e.g., `data_vars="all"`)
2. Change the default to `data_vars="all"`
I'll think about this more, but do either of you have any thoughts here?
I think we should keep this as is (default `data_vars="minimal"`), but perhaps add some notes about this case/error in `open_mfdataset()`. [I haven't thought enough about "what" to add – I see there are already good notes on `data_vars`.]
What happened?
See attached markdown file prepared by Andrew Manaser (RSS). xcdat_open_mfdataset_issue_README.md
What did you expect to happen?
> This problem occurs when using the `open_mfdataset` method with `xcdat` (occasionally called `xc` in this document) to concatenate multiple files of monthly average gridded `lat`/`lon` geophysical netCDF4 data along a new axis (in this case `time`)

Minimal Complete Verifiable Example
No response
Relevant log output
No response
Anything else we need to know?
No response
Environment
Should be included in the attached file.