xarray-contrib / xoak

xarray extension that provides tree-based indexes used for selecting irregular, n-dimensional data.
https://xoak.readthedocs.io
MIT License
57 stars 4 forks source link

PROPOSAL: pygeos & rtree engine for geometry based selection #34

Open snowman2 opened 3 years ago

snowman2 commented 3 years ago

I am wondering if adding support for these R-Tree based selectors would be of interest:

It would allow for selecting the points with a geometry as input.

snowman2 commented 3 years ago

Closing for #12

benbovy commented 3 years ago

I am wondering if adding support for these R-Tree based selectors would be of interest

Yes this would be nice additions! We could add adapters for those R-Tree indexes either in xoak or in 3rd party packages. Adding it in xoak would be a reasonable option for now (dependencies are optional for all adapters implemented in xoak). I'm not sure how the number of adapters would grow in the future, though.

It would allow for selecting the points with a geometry as input.

This would be great too. That would require some work, though, as we need to extend the API of the index adapters and the xoak xarray accessors in order to support this. Open questions, I guess, are which "conventions" to use for representing those geometries within the xarray data model, and what are the advantages of using xarray instead of, e.g., geopandas for this use case.

(I'm re-opening this issue as I think it's worth discussing geometry based selection here - the topic #12 is too broad)

snowman2 commented 3 years ago

I guess, are which "conventions" to use for representing those geometries within the xarray data model, and what are the advantages of using xarray instead of, e.g., geopandas for this use case.

This stems from: https://github.com/corteva/rioxarray/issues/202

Currently in rioxarray you can clip the data using a list of geometries: https://corteva.github.io/rioxarray/stable/examples/clip_geom.html

However, this only works for regular grids (i.e. 1 dimensional x and y coordinates that are equally spaced)

Coordinates:
  * y            (y) float64 ...
  * x            (x) float64 ...

The idea of this would be to enable clipping with 2 dimensional coordinates (usually latitude and longitude) with irregular grids:

  * latitude            (latitude, longitude) float64 ...
  * longitude           (latitude, longitude) float64 ... 

With the pygeos.STRTree, you can query for the coordinates that intersect the input geometries and get the indexes back to generate a mask you could use to clip/select data.

>>> import pygeos
>>> tree = pygeos.STRtree(pygeos.creation,points(xds.longitude, xds.latitude))
>>> tree.query_bulk([pygeos.box(2, 2, 4, 4), pygeos.box(5, 5, 6, 6)]).tolist()

In this scenario, there isn't really a need/benefit to use geopandas.

benbovy commented 3 years ago

I see, clipping irregular data with a list of geometries is a nice example that xoak should support directly or indirectly.

xoak currently exposes a .sel method that is similar to xarray's .sel. Currently only point-wise indexing (i.e., using xarray objects as indexers) is possible, but range or rectangular selection (using slices) like in the simple examples given in https://corteva.github.io/rioxarray/stable/examples/clip_geom.html should be supported later. xoak's xarray accessor API could be extended with extra methods for more advanced but still common operations like clip with compound or complex geometries (although it might be too specific to fall in xoak's scope).

In addition, there's ongoing discussion/work in #32 to expose xarray-friendly wrappers for the query methods of the adapted index objects. The challenge here would be to support the variety of the queries supported by each index (e.g., pygeos.STRTree.query_bulk, sklearn.neighbors.KDTree.query_radius, etc.) through a common API.

BTW, I started writing bindings for the s2geometry library here that may provide similar capabilities than pygeos.STRTree for spherical coordinates.

benbovy commented 3 years ago

The challenge here would be to support the variety of the queries supported by each index (e.g., pygeos.STRTree.query_bulk, sklearn.neighbors.KDTree.query_radius, etc.) through a common API.

By common API, I mean a small set of query methods, one for each kind of query, rather than a single query method for all kinds of queries. Still, for the same kind of query there might be subtle differences from one index object to another.