Closed — slevang closed this 2 months ago
Thanks, and yes for sure, hopefully next week I will find time to add tests and benchmarks.
Did some quick profiling on a ~4GB array of 1/4deg global data coarsening to 1deg, as a dask array on a 32 CPU node. Results:

- `skipna=False`: 32s
- `skipna=True`: 64s
- `main`: 96s

So adding skipna forces roughly one additional pass through the array for the weight renormalization. The reason this PR is faster than `main` is that the current code has the `np.any(np.isnan())` check, which forces computation, plus the separately calculated `isnan` array, which forces three passes through the data. If I cut out the NaN-checking logic branch on `main` and go straight to the einsum, we recover the ~32s run above.
Made the modification to take `notnull.any(non_regrid_dims)`, which leaves us at about a 3x performance penalty for `skipna=True` in the benchmarks I've run. I think this should maybe be a configurable arg, though, for cases where you want to track NaNs very carefully throughout the dataset.
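For intuition on what that reduction does, here is a numpy stand-in (the array shape and dim choices are illustrative, not taken from the PR): collapsing the notnull mask over the non-regridded axes means the renormalization only needs the 2-D lat/lon footprint of the NaNs rather than a mask the full size of the data.

```python
import numpy as np

data = np.ones((4, 3, 2))      # (time, lat, lon)
data[0, 1, 1] = np.nan         # NaN at a single time step
data[:, 2, 0] = np.nan         # NaN at every time step

notnull = ~np.isnan(data)
# Equivalent of notnull.any(non_regrid_dims): reduce over time,
# keeping True wherever any time step has valid data.
footprint = notnull.any(axis=0)
```

This is also where the precision trade-off comes from: a cell with an isolated NaN at one time step still shows up as valid in the footprint.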
Merged as part of #41
Nice package! In testing it out where I've previously used `xesmf`, I noticed two features lacking from the conservative method. Number 1 is easy, number 2 is trickier. I added a naive implementation of the `nan_threshold` capability of xesmf here for discussion.

As noted in the previous issue, to do this 100% correctly we would need to track the NaN fraction as we reduce over each dimension, which I'm not doing here. The `nan_threshold` value doesn't translate directly to the total fraction of NaN cells due to the sequential reduction. It would also get complicated for isolated NaNs in the temporal dimension.

I'm not sure any of this matters much for a dataset with consistent NaNs, e.g. SST. Here's an example of the new functionality used on the MUR dataset. Note this is a 33TB array, but we can now generate the (lazily) regridded dataset instantaneously.
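To make the naive approach concrete, here is a sketch over a single regridded axis (the function name and dense weight matrix are illustrative assumptions, not the PR's API). A target cell is masked when the weighted NaN fraction of its source cells exceeds the threshold; applied sequentially per dimension, this is exactly where the threshold stops mapping directly onto the total NaN fraction:

```python
import numpy as np

def regrid_with_nan_threshold(data, weights, nan_threshold=1.0):
    """Naive xesmf-style nan_threshold over one regridded axis.

    data:    (..., n_src) array, may contain NaNs
    weights: (n_src, n_tgt) conservative weights, columns summing to 1
    nan_threshold: keep a target cell only if its weighted NaN
                   fraction is <= this value (0 -> any NaN masks the
                   cell; 1 -> never mask)
    """
    notnull = ~np.isnan(data)
    num = np.einsum("...i,ij->...j", np.where(notnull, data, 0.0), weights)
    valid_frac = np.einsum("...i,ij->...j", notnull.astype(data.dtype), weights)
    with np.errstate(invalid="ignore", divide="ignore"):
        out = num / valid_frac
    # 1 - valid_frac is the NaN-covered fraction of each target cell.
    return np.where(1.0 - valid_frac <= nan_threshold, out, np.nan)
```

Because the operation stays a pair of einsums plus a mask, it maps onto lazy dask arrays without forcing computation, which is what makes the instantaneous lazy regrid of the 33TB MUR array possible.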