Open kthyng opened 1 year ago
Sorry for all the additions. I have continued to use the code in the project I'm working on and have realized more about the existing code and better approaches as I go.
I forgot to say this was meant to address https://github.com/xarray-contrib/xoak/issues/26.
@benbovy or @willirath are either of you available to take a look at this? I'm hoping to be able to use these changes in a version update of xoak with a package of mine that uses xoak. Thanks for your time.
@benbovy I'm completely open to the API and detailed choices in the code if you want to change anything. I just need access to the distances! I am aware of the new indices in xarray but haven't seen them in action to see how they work or how to use them yet.
I am aware of the new indices in xarray but haven't seen them in action to see how they work or how to use them yet.
Yes this is pretty new and still largely undocumented. Xoak is a good place to try this feature, but we need a smooth transition so I don't expect it to be straightforward.
I'm completely open to the API and detailed choices in the code if you want to change anything.
Sorry I could have provided more details in my previous comment. I was thinking about this alternative:
Keep Dataset.xoak.sel(indexers=None, **indexers_kwargs)
unchanged so that it will be easier to later switch to using Dataset.sel((indexers=None, **indexers_kwargs)
with a custom Xarray index implemented in xoak. We might want eventually provide custom options to Dataset.sel
but it is still not clear how would look like the API (https://github.com/pydata/xarray/issues/7099).
Add a Dataset.query(indexers=None, **indexers_kwargs)
that returns both distances and indices as Xarray DataArray and Dataset objects (#32).
This would look like this (reusing the examples in the documentation):
>>> ds_mesh
# <xarray.Dataset>
# Dimensions: (x: 100, y: 100)
# Coordinates:
# lat (x, y) float64 ...
# lon (x, y) float64 ...
# Dimensions without coordinates: x, y
# Data variables:
# field (x, y) float64 ...
>>> ds_trajectory
# <xarray.Dataset>
# Dimensions: (trajectory: 30)
# Dimensions without coordinates: trajectory
# Data variables:
# latitude (trajectory) ...
# longitude (trajectory) ...
>>> ds_mesh.xoak.set_index(['lat', 'lon'], 'sklearn_geo_balltree')
>>> distances, indices = ds_mesh.xoak.query(
... lat=ds_trajectory.latitude,
... lon=ds_trajectory.longitude,
... )
>>> distances
# <xarray.DataArray (trajectory: 30)>
# ...
# Dimensions without coordinates: trajectory
>>> indices
# <xarray.Dataset>
# Dimensions: (trajectory: 30)
# Dimensions without coordinates: trajectory
# Data variables:
# x (trajectory) int64 ...
# y (trajectory) int64 ...
Having a query
method that returns a (distances, indices)
tuple is consistent with scipy.spatial.KDTree.query()
, sklearn.neighbors.KDTree.query()
, sklearn.neighbors.BallTree.query()
, etc.
The distances
and indices
Xarray objects returned here are a bit confusing as they don't have the same type and structure, but it allows doing:
>>> results = ds_mesh.isel(indices).assign(distances=distances)
>>> results
# <xarray.Dataset>
# Dimensions: (trajectory: 30)
# Coordinates:
# lat (trajectory) float64 ...
# lon (trajectory) float64 ...
# Dimensions without coordinates: trajectory
# Data variables:
# field (trajectory) float64 ...
# distances (trajectory) float64 ...
Now, this is clearly more verbose than what is proposed in this PR:
>>> ds_mesh.xoak.sel(
... lat=ds_trajectory.latitude,
... lon=ds_trajectory.longitude,
... distances_name="distances"
... )
Also, because Dataset.xoak.query()
doesn't return a single Xarray object it wouldn't be easy to chain it with other Dataset / DataArray methods (it is possible using pipe -> merge
but this is less straightforward than just using Dataset.xoak.sel()
). So maybe we could support returning the distances via both .xoak.query()
and xoak.sel()
after all? What do you think @kthyng @willirath?
So maybe we could support returning the distances via both .xoak.query() and xoak.sel() after all?
If we're going to support both, I think we can either merge this PR and then refactor #32 or the other way around.
@benbovy I like having the indices and distances explicitly available with
distances, indices = ds_mesh.xoak.query( ...
as you suggest. One thing I didn't like about my change is that when a DataArray is input but distances are returned, a Dataset is returned to contain both variables. It's been messing up my code here and there.
Maybe it would be better to keep it simple with .sel
(no changes) and if you want to have the distances and indices explicitly, you have to do the two step process you laid out of
distances, indices = ds_mesh.xoak.query(
... lat=ds_trajectory.latitude,
... lon=ds_trajectory.longitude,
... )
results = ds_mesh.isel(indices).assign(distances=distances)
But I do think it should be up to you @benbovy and @willirath so let me know what you prefer and I can try to implement, or you can go ahead with it, either way.
I think leaving .xoak.sel()
unchanged and as close to the standard xarray .sel()
API as possible is what we should do. The proposal to have .query()
returning, both, distances and indices sounds like the best way to implement this.
Changes include:
Notes: