xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0

[Enhancement]: Update `coords="minimal"` and `compat="minimal"` as defaults to improve performance of `xc.open_mfdataset()`? #641

Open tomvothecoder opened 5 months ago

tomvothecoder commented 5 months ago

Is your feature request related to a problem?

xarray.open_mfdataset() has two issues: (1) it incorrectly concatenates coordinates onto variables (e.g., "time" gets added to "lat_bnds"), and (2) performance. xCDAT addresses (1) by defaulting to data_vars="minimal". To address (2), the post and docs below suggest also defaulting to coords="minimal" and compat="override".
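Issue (1) can be reproduced in memory with `xr.concat()`, which `open_mfdataset()` uses under the hood. This is a minimal sketch (synthetic data, illustrative variable names): with the default `data_vars="all"`, a bounds variable that has no "time" dimension still gets concatenated along "time".

```python
import numpy as np
import xarray as xr

def make_ds(t):
    """Build one synthetic single-timestep dataset with a bounds variable."""
    return xr.Dataset(
        {
            "tas": (("time", "lat"), np.zeros((1, 2))),
            "lat_bnds": (("lat", "bnds"), np.array([[-90.0, 0.0], [0.0, 90.0]])),
        },
        coords={"time": [t], "lat": [-45.0, 45.0]},
    )

# Default data_vars="all": lat_bnds picks up a spurious "time" dimension.
bad = xr.concat([make_ds(0), make_ds(1)], dim="time")

# With the "minimal"/"override" settings, only variables that already have
# the "time" dimension are concatenated; lat_bnds is taken from the first
# dataset unchanged.
good = xr.concat(
    [make_ds(0), make_ds(1)],
    dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
)

print(bad["lat_bnds"].dims)   # ('time', 'lat', 'bnds') -- spurious time dim
print(good["lat_bnds"].dims)  # ('lat', 'bnds')
```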

https://github.com/pydata/xarray/issues/1385#issuecomment-1700325001

It is very common for different netCDF files in a "dataset" (a folder) to be encoded differently so we can't set decode_cf=False by default.

there's probably something else going on under the hood that's causing the slowness of open_mfdataset at present.

There's

  1. slowness in reading coordinate information from every file. We have parallel to help a bit here.
  2. slowness in combining each file to form a single Xarray dataset. By default, we do lots of consistency checking by reading data. Xarray allows you to control this, data_vars='minimal', coords='minimal', compat='override' is a common choice.

What you're describing sounds like a failure of lazy decoding or a cftime slowdown (example), which should be fixed. If you can provide a reproducible example, that would help.

https://github.com/pydata/xarray/issues/1385#issuecomment-1958761334

This is an amazing bug. The defaults say data_vars="all", coords="different" which means always concatenate all data_vars along the concat dimensions (here inferred to be "time") but only concatenate coords if they differ in the different files.

When decode_cf=False, lat, lon are data_vars and get concatenated without any checking or reading. When decode_cf=True, lat, lon are promoted to coords, then get checked for equality across all files. The two variables get read sequentially from all files. This is the slowdown you see.
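The promotion described above can be illustrated in memory (a sketch with synthetic 2D curvilinear-style coordinates): before CF decoding, auxiliary "lat"/"lon" variables listed in a variable's CF "coordinates" attribute are ordinary data variables; decoding promotes them to coordinates, where the default compat/coords settings then compare them across every file.

```python
import numpy as np
import xarray as xr

# Undecoded dataset: lat/lon are plain data variables, referenced only via
# the CF "coordinates" attribute on tas.
raw = xr.Dataset(
    {
        "tas": (("y", "x"), np.zeros((2, 2)), {"coordinates": "lat lon"}),
        "lat": (("y", "x"), np.array([[0.0, 0.0], [1.0, 1.0]])),
        "lon": (("y", "x"), np.array([[10.0, 11.0], [10.0, 11.0]])),
    }
)

# decode_cf (with the default decode_coords=True) promotes the variables
# named in the "coordinates" attribute to coordinates.
decoded = xr.decode_cf(raw)

print(sorted(raw.data_vars))   # ['lat', 'lon', 'tas']
print(sorted(decoded.coords))  # ['lat', 'lon']
```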

Once again, this is a consequence of bad defaults for concat and open_mfdataset.

I would follow docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets and use data_vars="minimal", coords="minimal", compat="override" which will only concatenate those variables with the time dimension, and skip any checking for variables that don't have a time dimension (simply pick the variable from the first file).
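One caveat worth noting when weighing these defaults: compat="override" skips the equality checks entirely, so a genuine disagreement between files goes unnoticed. A small sketch with synthetic data shows the combined result silently keeping the first dataset's values:

```python
import xarray as xr

# Two datasets whose non-time variable "lat_bnds" genuinely disagrees.
a = xr.Dataset(
    {"tas": (("time",), [0.0]), "lat_bnds": (("bnds",), [1.0, 2.0])},
    coords={"time": [0]},
)
b = xr.Dataset(
    {"tas": (("time",), [1.0]), "lat_bnds": (("bnds",), [999.0, 999.0])},
    coords={"time": [1]},
)

# compat="override" takes lat_bnds from the first dataset without reading
# or comparing the others -- fast, but the conflict is never reported.
merged = xr.concat(
    [a, b],
    dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
)

print(merged["lat_bnds"].values.tolist())  # [1.0, 2.0] -- from the first dataset
```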

Describe the solution you'd like

Xarray documentation

A common use-case involves a dataset distributed across a large number of files with each file containing a large number of variables. Commonly, a few of these variables need to be concatenated along a dimension (say "time"), while the rest are equal across the datasets (ignoring floating point differences). The following command with suitable modifications (such as parallel=True) works well with such datasets:

xr.open_mfdataset('my/files/*.nc', concat_dim="time", combine="nested",
                  data_vars='minimal', coords='minimal', compat='override')

This command concatenates variables along the "time" dimension, but only those that already contain the "time" dimension (data_vars='minimal', coords='minimal'). Variables that lack the "time" dimension are taken from the first dataset (compat='override').

Describe alternatives you've considered

No response

Additional context

I don't know how reliable parallel=True is for speeding up reading coordinate information. An xarray GitHub issue (#7079) has comments suggesting that parallel=True is not thread-safe and might cause resource locking on some filesystems, unlike the default parallel=False. Tony B and I ran into this in e3sm_to_cmip (related issue).

In this e3sm_diags PR, we are getting a TimeoutError: Timed out when using xcdat.open_mfdataset(). There might be some performance issues with the underlying call to xarray.open_mfdataset(). I think this e3sm_diags issue is actually related to compatibility with the multiprocessing scheduler manually defined in e3sm_diags (related issue).

pochedls commented 5 months ago

@tomvothecoder – it seems like adding these defaults could be helpful, but this would ideally be tested across many datasets (e.g., in the PMP) before it is rolled out.