Is your feature request related to a problem?

xarray.open_mfdataset() has a few issues related to: (1) incorrectly concatenating coords on variables (e.g., "time" gets added to "lat_bnds"), and (2) performance. xCDAT addresses (1) by defaulting to data_vars="minimal". To address (2), the post and docs below suggest also setting coords="minimal" and compat="override".

https://github.com/pydata/xarray/issues/1385#issuecomment-1700325001
https://github.com/pydata/xarray/issues/1385#issuecomment-1958761334
It is very common for different netCDF files in a "dataset" (a folder) to be encoded differently, so we can't set decode_cf=False by default.
There's probably something else going on under the hood that's causing the slowness of open_mfdataset at present. There's:

1. slowness in reading coordinate information from every file. We have parallel to help a bit here.
2. slowness in combining each file to form a single Xarray dataset. By default, we do lots of consistency checking by reading data. Xarray allows you to control this; data_vars="minimal", coords="minimal", compat="override" is a common choice.
What you're describing sounds like a failure of lazy decoding or a cftime slowdown (example), which should be fixed. If you can provide a reproducible example, that would help.
This is an amazing bug. The defaults say data_vars="all", coords="different" which means always concatenate all data_vars along the concat dimensions (here inferred to be "time") but only concatenate coords if they differ in the different files.
When decode_cf=False, lat and lon are data_vars and get concatenated without any checking or reading. When decode_cf=True, lat and lon are promoted to coords, then get checked for equality across all files. The two variables get read sequentially from all files. This is the slowdown you see.
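The coordinate promotion described above can be reproduced with a small in-memory dataset. This is only a sketch: the variable names (tas, lat, lon) are made up, but the mechanism — decoding moves variables named in the CF "coordinates" attribute from data_vars into coords — is what triggers the per-file equality checks:

```python
import numpy as np
import xarray as xr

# Build an undecoded dataset: "lat" and "lon" are plain data variables,
# and "tas" points at them via the CF "coordinates" attribute.
ds = xr.Dataset(
    {
        "tas": ("x", np.zeros(3), {"coordinates": "lat lon"}),
        "lat": ("x", np.array([10.0, 20.0, 30.0])),
        "lon": ("x", np.array([100.0, 110.0, 120.0])),
    }
)

# Undecoded (decode_cf=False semantics): lat/lon are ordinary data_vars,
# so concat leaves them alone.
print(sorted(ds.data_vars))  # ['lat', 'lon', 'tas']

# Decoding promotes them to coordinates, which is what makes concat
# read and compare them across every file.
decoded = xr.decode_cf(ds)
print(sorted(decoded.coords))     # ['lat', 'lon']
print(list(decoded.data_vars))    # ['tas']
```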
Once again, this is a consequence of bad defaults for concat and open_mfdataset.
I would follow docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets and use data_vars="minimal", coords="minimal", compat="override" which will only concatenate those variables with the time dimension, and skip any checking for variables that don't have a time dimension (simply pick the variable from the first file).
Describe the solution you'd like

From the Xarray documentation:

A common use-case involves a dataset distributed across a large number of files with each file containing a large number of variables. Commonly, a few of these variables need to be concatenated along a dimension (say "time"), while the rest are equal across the datasets (ignoring floating point differences). The following command with suitable modifications (such as parallel=True) works well with such datasets:
This command concatenates variables along the "time" dimension, but only those that already contain the "time" dimension (data_vars='minimal', coords='minimal'). Variables that lack the "time" dimension are taken from the first dataset (compat='override').
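The effect of this keyword combination can be exercised in memory with xr.concat, which the multi-file combine path uses under the hood. This is a sketch with made-up variables (tas, lat_bnds): only the time-dependent variable is concatenated, while the time-invariant bounds variable is taken from the first dataset without equality checks:

```python
import numpy as np
import xarray as xr

def make_file(t0):
    # One simulated "file": a time-dependent variable plus a
    # time-invariant bounds variable (hypothetical names).
    return xr.Dataset(
        {
            "tas": (("time", "lat"), np.zeros((2, 3))),
            "lat_bnds": (("lat", "bnds"), np.zeros((3, 2))),
        },
        coords={"time": [t0, t0 + 1], "lat": [10.0, 20.0, 30.0]},
    )

parts = [make_file(0), make_file(2)]

ds = xr.concat(
    parts,
    dim="time",
    data_vars="minimal",   # only concat variables that already have "time"
    coords="minimal",      # likewise for coordinate variables
    compat="override",     # take everything else from the first dataset
)

print(ds.sizes["time"])     # 4
print(ds["lat_bnds"].dims)  # ('lat', 'bnds') -- no spurious "time" dimension
```

With the defaults (data_vars="all"), "time" would be broadcast onto "lat_bnds", which is exactly the incorrect-concatenation problem described at the top of this issue.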
Describe alternatives you've considered
No response
Additional context
I don't know how reliable parallel=True is for speeding up reading coordinate information. There is a Xarray GitHub issue #7079 with comments suggesting that parallel=True is not thread-safe and might cause resource locking on some filesystems, unlike the default parallel=False. Tony B and I ran into this in e3sm_to_cmip (related issue).
In this e3sm_diags PR, we are getting a TimeoutError: Timed out when using xcdat.open_mfdataset(). There might be some performance issues with the underlying call to xarray.open_mfdataset(). I think this e3sm_diags issue is actually related to compatibility with the multiprocessing scheduler manually defined in e3sm_diags (related issue).
@tomvothecoder – it seems like adding these defaults could be helpful, but this would ideally be tested across many datasets (e.g., in the PMP) before it is rolled out.