xgcm / aerobulk-python

A python wrapper for aerobulk (https://github.com/brodeau/aerobulk)
GNU General Public License v3.0

Adding an xarray wrapper with apply_ufunc #15

Closed jbusecke closed 2 years ago

jbusecke commented 2 years ago

This draft PR represents the progress @TomNicholas and I made today.

We achieved the following:

Testing the output for given shapes of input

We wrote a little script (dev_numpy_wrapper.py) that tests the shapes returned by the 'raw' f2py function for various input shapes.

The results can be summarized like this:

original:(200, 200, 10)    | output:(200, 200, 10)
original:(200, 200, 1)     | output:(200, 200, 1)
original:(200, 200)        | output:(200, 200, 1)
original:(200,)            | output:(200, 1, 1)
original:()                | output:(1, 1, 1)
original:(1, 1, 1)         | output:(1, 1, 1)
original:(200, 200, 10, 4) | ValueError: too many axes: 4 (effrank=4), expected rank=3
original:(200, 200, 0)     | ValueError: unexpected array size: new_size=40000, got array with arr_size=0
original:(200, 0, 3)       | ValueError: unexpected array size: new_size=40000, got array with arr_size=0

From this you can see that any input with <=3 dims/axes is returned as a 3d array (missing axes are padded out to length 1), while more than 3 dims/axes or a zero-size axis raises an error.
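For reference, a self-contained sketch of such a shape probe; raw_fortran_func here is a stub that mimics the observed padding/error behavior, standing in for the real f2py routine (the actual calls live in dev_numpy_wrapper.py):

import numpy as np

def raw_fortran_func(arr):
    # Stub standing in for the raw f2py routine: pads any <=3d input
    # out to 3 axes and rejects rank-4 or zero-size input, mimicking
    # the behavior observed above.
    arr = np.asarray(arr)
    if arr.ndim > 3:
        raise ValueError(f"too many axes: {arr.ndim}, expected rank=3")
    if arr.size == 0:
        raise ValueError("unexpected array size: got array with arr_size=0")
    return arr.reshape(arr.shape + (1,) * (3 - arr.ndim))

for shape in [(200, 200, 10), (200, 200), (200,), (), (200, 200, 10, 4), (200, 200, 0)]:
    try:
        out = raw_fortran_func(np.ones(shape))
        print(f"original:{shape} | output:{out.shape}")
    except ValueError as e:
        print(f"original:{shape} | {e}")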

We concluded from this that the easiest way to use apply_ufunc is to ensure len(dims) == 3 before invoking xr.apply_ufunc, because handling changes in array dimensionality is complicated. Our high-level approach for now is to simply assert that input arrays have 3 dimensions after broadcasting. In the future we could expand along dummy dimensions, pass the result to apply_ufunc, and squeeze the dummy dimensions out again, so the user would get back the same shape of array that they put in (see the sketch below).
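A minimal sketch of that future padding/squeezing approach (the helper name pad_to_3d is hypothetical, not part of this PR):

import xarray as xr

def pad_to_3d(da):
    # Expand a <3d DataArray with dummy dimensions so the core
    # computation always sees exactly 3 dims; return the padded
    # array plus the dummy dim names to squeeze out afterwards.
    dummy_dims = [f"dummy_{i}" for i in range(3 - da.ndim)]
    for dim in dummy_dims:
        da = da.expand_dims(dim)
    return da, dummy_dims

# usage: pad, apply, then undo the padding
# padded, dummies = pad_to_3d(da)
# out = xr.apply_ufunc(..., padded, ...)
# out = out.squeeze(dummies)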

Working with chunked dask arrays

We have an initial test running in dev_xarray_wrapper.py that seems to successfully parallelize over a chunked array.
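Conceptually, that test boils down to something like the following (argument handling simplified; the real code is in dev_xarray_wrapper.py):

import xarray as xr

def xarray_wrapper(func, *args):
    # Broadcast inputs to a common (3d) shape, then let apply_ufunc
    # map `func` over each dask chunk independently.
    args = xr.broadcast(*args)
    return xr.apply_ufunc(
        func,
        *args,
        dask="parallelized",           # run per-chunk, lazily
        output_dtypes=[args[0].dtype],
    )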

Outstanding issues:

jbusecke commented 2 years ago

UPDATE: I have added the xarray wrapper to flux.py along with some initial tests.

I believe the tests confirm that this is indeed true:

Check that results are 'transpose invariant'. If I assume correctly that the calculation is in principle pointwise, it doesn't matter for the results which dimension the fortran code internally iterates over. I will write a test that confirms that with some synthetic data. If this turns out to be right, we could tweak the performance of the fortran code by transposing the arrays appropriately outside of the wrapper (and add a comment about this).

This is great news! It means we can already use this wrapper to process our data in the cloud (provided we are careful about units).
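A sketch of what such a transpose-invariance test could look like (np.tanh stands in for the actual wrapper here, since any pointwise function has the same property):

import numpy as np
import xarray as xr

def wrapper_func(da):
    # Stand-in for the aerobulk wrapper; any pointwise op will do.
    return np.tanh(da)

def test_transpose_invariance():
    da = xr.DataArray(np.random.rand(4, 5, 6), dims=["lon", "lat", "time"])
    out = wrapper_func(da)
    out_t = wrapper_func(da.transpose("time", "lon", "lat"))
    # values must agree once dims are put back in the same order
    xr.testing.assert_allclose(out, out_t.transpose(*out.dims))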

jbusecke commented 2 years ago

Hmm, interesting. The CI seems to be failing, but it worked for me locally. Could this be platform-dependent?

jbusecke commented 2 years ago

Just expanded the testing matrix to include macOS to rule that out.

jbusecke commented 2 years ago

Yeah sure. You can make suggestions or push directly. I will probably try to work on it later. Just let me know when you break for the weekend.

Oh, I think the CI problems are caused by the None kwargs in the numpy wrapper, so handling this with positional arguments only might fix it.
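A hedged sketch of that idea: resolve concrete defaults in Python and call the Fortran routine with positionals only, never forwarding None (names and defaults here are illustrative, not the PR's actual API):

DEFAULTS = {"algo": "ncar", "zt": 2.0, "zu": 10.0, "niter": 6}

def resolve_args(**user_kwargs):
    # Drop None values so they never reach f2py, then fill in defaults
    # and return arguments in a fixed positional order.
    given = {k: v for k, v in user_kwargs.items() if v is not None}
    resolved = {**DEFAULTS, **given}
    return [resolved[k] for k in ("algo", "zt", "zu", "niter")]

# fortran_func(*required_fields, *resolve_args(niter=10))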

jbusecke commented 2 years ago

Sorry, this should not have been closed, my bad.

jbusecke commented 2 years ago

Ooof, this is really messed up. I am getting weird errors during the test execution that nonetheless show up as passed, even though only a subset of the tests actually ran. I need to check this very carefully before this gets merged. More tomorrow.


jbusecke commented 2 years ago

I made #28 to serve as a reminder of @rabernat's concerns. Do you think we can merge this for now? I'm keen to get this released before our hack tomorrow.

cc @TomNicholas

jbusecke commented 2 years ago

We need this to be available on conda for today's hack, so we will merge now and record the concerns raised in issues.