xarray-contrib / xoak

xarray extension that provides tree-based indexes used for selecting irregular, n-dimensional data.
https://xoak.readthedocs.io
MIT License
57 stars 4 forks source link

Tolerance is not working #11

Closed willirath closed 3 years ago

willirath commented 4 years ago

This https://github.com/ESM-VFC/xoak/blob/53a877582077fe076e35d13df6925167ef839a4c/tests/test_catesian_coords_random_data.py#L65 fails because currently, https://github.com/ESM-VFC/xoak/blob/ffa933406461256eced4a4bc776e098147b5045f/src/xoak/core.py#L95 modifies the size of the index array.

willirath commented 4 years ago

I suggest we change the API here and drop the tolerance keyword, but allow for requesting the distances along with the selected data so that .xoak.sel(..., return_distances=True) will contain an additional coord called sel_distances with the distances calculated based on the metric.

One could argue that xarray.sel() has a tolerance keyword and that hence .xoak.sel() should have one, too. But the tolerance keyword of xarray.sel() is difficult to use in a meaningful way, because it just raises a KeyError whenever any of the indexer points violates the tolerance. So in real applications, I guess what people will do is calculate the distances and do masking / dropping outside of xarray.sel() anyway.

willirath commented 4 years ago

The reason why I think allowing for .xoak.sel(..., return_distances=True) rather than just dropping the tolerance keyword is that in our case, the metric may be difficult to access and apply after the fact. (Users would need to import it from sklearn and figure out how to apply it themselves.)

benbovy commented 4 years ago

Ah good catch.

I agree that returning distances as a coordinate (or data variable?) is useful. I'm not a big fan of methods returning new variables with hard-coded names, so I would rather lean towards .xoak.sel(..., neighbor_distances=None), and replace None by any string if one wants to get the distances.

That said, I think that supporting a tolerance might still be useful and possible. We could track the invalid indices and then set data as missing values (and maybe drop) after calling isel, e.g., using .where(reshaped_dist <= tolerance, drop=True).

willirath commented 4 years ago

I agree that returning distances as a coordinate (or data variable?) is useful. I'm not a big fan of methods returning new variables with hard-coded names, so I would rather lean towards .xoak.sel(..., neighbor_distances=None), and replace None by any string if one wants to get the distances.

I like this.

set data as missing values (and maybe drop) after calling isel

This is also a good solution.

, e.g., using .where(drop=True).

IMO, dropping should be up to the user. Otherwise, it's not easy to tell which points of the indexer were / weren't used.

benbovy commented 4 years ago

IMO, dropping should be up to the user

Agreed.

A. Another keyword argument .xoak.sel(drop=False)? B. let users call .dropna() on the returned Dataset / DataArray?

Option A might be confusing, as there is already a drop argument in .sel() that has a different meaning.

It might be worth to try xarray advanced indexing, i.e., ds.sel(x=da, method='nearest')) to see what's the behavior...

benbovy commented 4 years ago

It might be worth to try xarray advanced indexing, i.e., ds.sel(x=da, method='nearest')) to see what's the behavior...

Ah I could have read your comment above more carefully:

One could argue that xarray.sel() has a tolerance keyword and that hence .xoak.sel() should have one, too. But the tolerance keyword of xarray.sel() is difficult to use in a meaningful way, because it just raises a KeyError whenever any of the indexer points violates the tolerance.

Agreed. I think that this behavior will eventually need to be addressed in xarray, at least for point-wise indexing. It occurs to me too that returning elements with missing value seems more useful than raising a KeyError.

benbovy commented 3 years ago

Tolerance has been removed in #18. I also opened #26 for the suggestion of returning the distances to the nearest neighbors.