Closed willirath closed 3 years ago
I suggest we change the API here and drop the tolerance keyword, but allow for requesting the distances along with the selected data so that .xoak.sel(..., return_distances=True)
will contain an additional coord called sel_distances
with the distances calculated based on the metric.
One could argue that xarray.sel()
has a tolerance keyword and that hence .xoak.sel()
should have one, too. But the tolerance
keyword of xarray.sel()
is difficult to use in a meaningful way, because it just raises a KeyError
whenever any of the indexer points violates the tolerance. So in real applications, I guess what people will do is calculate the distances and do masking / dropping outside of xarray.sel()
anyway.
The reason why I think allowing for .xoak.sel(..., return_distances=True)
rather than just dropping the tolerance
keyword is that in our case, the metric may be difficult to access and apply after the fact. (Users would need to import it from sklearn
and figure out how to apply it themselves.)
Ah good catch.
I agree that returning distances as a coordinate (or data variable?) is useful. I'm not a big fan of methods returning new variables with hard-coded names, so I would rather lean towards .xoak.sel(..., neighbor_distances=None)
, and replace None
by any string if one wants to get the distances.
That said, I think that supporting a tolerance
might still be useful and possible. We could track the invalid indices and then set data as missing values (and maybe drop) after calling isel
, e.g., using .where(reshaped_dist <= tolerance, drop=True)
.
I agree that returning distances as a coordinate (or data variable?) is useful. I'm not a big fan of methods returning new variables with hard-coded names, so I would rather lean towards .xoak.sel(..., neighbor_distances=None), and replace None by any string if one wants to get the distances.
I like this.
set data as missing values (and maybe drop) after calling isel
This is also a good solution.
, e.g., using .where(drop=True).
IMO, dropping should be up to the user. Otherwise, it's not easy to tell which points of the indexer were / weren't used.
IMO, dropping should be up to the user
Agreed.
A. Another keyword argument .xoak.sel(drop=False)
?
B. let users call .dropna()
on the returned Dataset / DataArray?
Option A might be confusing, as there is already a drop
argument in .sel()
that has a different meaning.
It might be worth to try xarray advanced indexing, i.e., ds.sel(x=da, method='nearest')
) to see what's the behavior...
It might be worth to try xarray advanced indexing, i.e., ds.sel(x=da, method='nearest')) to see what's the behavior...
Ah I could have read your comment above more carefully:
One could argue that xarray.sel() has a tolerance keyword and that hence .xoak.sel() should have one, too. But the tolerance keyword of xarray.sel() is difficult to use in a meaningful way, because it just raises a KeyError whenever any of the indexer points violates the tolerance.
Agreed. I think that this behavior will eventually need to be addressed in xarray, at least for point-wise indexing. It occurs to me too that returning elements with missing value seems more useful than raising a KeyError.
Tolerance has been removed in #18. I also opened #26 for the suggestion of returning the distances to the nearest neighbors.
This https://github.com/ESM-VFC/xoak/blob/53a877582077fe076e35d13df6925167ef839a4c/tests/test_catesian_coords_random_data.py#L65 fails because currently, https://github.com/ESM-VFC/xoak/blob/ffa933406461256eced4a4bc776e098147b5045f/src/xoak/core.py#L95 modifies the size of the index array.