xarray-contrib / xoak

xarray extension that provides tree-based indexes used for selecting irregular, n-dimensional data.
https://xoak.readthedocs.io
MIT License

Implement 2d sel via Xarray accessor? #1

Closed willirath closed 3 years ago

willirath commented 4 years ago

We have https://git.geomar.de/python/xorca_lonlat2ij which basically

(xorca_lonlat2ij.get_ij is a good entry point to get an overview of the logic.)

Then, we usually use these indices for a vector .sel() as seen here https://github.com/ESM-VFC/esm-vfc-workflows/blob/7edd9bbc7cd3532c0537f89f667154a9dd27156d/meteor_m85.1_vs_NEMO_FOCI_hydrography/compare_m81.1_hydrography_with_ORCA05_FOCI_Test_data.ipynb

What I'd like to have is this implemented as an Xarray accessor or something that is similarly easy to use.

This would not act on the staggered C grids (we'll need XGCM for that), but could be easily generalized to XGCM-compatible datasets if https://github.com/xgcm/xgcm/issues/200 became reality at a later point.

(@benbovy This could be a first thing to work on.)

willirath commented 4 years ago

Here's a discussion about distributed KDTrees (https://gist.github.com/brendancol/a3dd4a35ecd94660411112999923d561) which points here (https://doi.org/10.1111/j.1467-8659.2007.01062.x) for a more distributed implementation.

willirath commented 4 years ago

@benbovy Here's a gist (better view on nbviewer) with a very simple ndim selection example.

benbovy commented 4 years ago

Thanks @willirath. Some thoughts:

willirath commented 4 years ago

I think there are two aspects here, both of which we can understand and optimize separately (if we design this properly):

  1. What's the external API?

    # haversine metric
    points["temp"] = ds.ndsel.sel(x=points["x"], y=points["y"], metric="haversine")["temp"]
    # arbitrary metric
    points["temp"] = ds.ndsel.sel(x=points["x"], y=points["y"], metric=metric_func)["temp"]
    # or even
    points["temp"] = ds.ndsel.sel(points, metric="haversine")["temp"]

    For this, we can afford really inefficient internals.

  2. Internal optimizations: This is mostly about the points discussed below. But we should make sure to have abstractions that make it easy to swap the ideas you outlined in and out.

I think we should go for the implementation of the API first, because that makes it easier to play with / compare different optimizations.
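To make that concrete, here's a minimal sketch of a deliberately naive implementation behind such an API (the function name, the points/metric_func usage and the "points" dim are hypothetical); it brute-forces the nearest-neighbour search with an arbitrary metric and only serves to pin down the interface:

import numpy as np
import xarray as xr


def naive_ndsel(ds, metric, **target_coords):
    """Brute-force nearest-neighbour selection on n-d coordinates.

    O(n_grid * n_targets) -- only useful to fix the external API.
    """
    names = list(target_coords)
    grid = np.column_stack([ds[name].values.ravel() for name in names])
    targets = np.column_stack(
        [np.asarray(target_coords[name]).ravel() for name in names]
    )
    # for each target point, find the index of the closest grid point
    flat_idx = np.array(
        [np.argmin([metric(g, t) for g in grid]) for t in targets]
    )
    # map flat indices back to the n-d grid dims and select point-wise
    idx = np.unravel_index(flat_idx, ds[names[0]].shape)
    pos = {d: xr.DataArray(i, dims="points") for d, i in zip(ds[names[0]].dims, idx)}
    return ds.isel(**pos)


# e.g.: points["temp"] = naive_ndsel(ds, metric_func, x=points["x"], y=points["y"])["temp"]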

I wonder if we couldn't directly use lat/lon coordinates when building a data structure for indexing. I've been looking again at this blog post by Jake VanderPlas about the implementation of a ball tree algorithm compatible with any distance metric. We could then use the haversine metric, or maybe a simpler/faster formula considering that the computed distances are relatively short (<= NEMO/FESOM grid resolution). Not sure whether that would be faster or slower, though.

Thanks for pointing me to this blog post. Sounds like this is a good way to go for the trees. As Jake's version seems to support arbitrary Python functions as metric, we can generalize this later.

The case in your gist rather corresponds to the worst-case scenario where the selection points are scattered across the whole domain. I wonder if this really reflects the more realistic cases where the selection points have specific spatial patterns (e.g., ship track) and/or very limited extent. In those cases, there might be simple tricks to first shrink the amount of data used to build an indexing tree (e.g., a first selection using a bounding box or convex hull).

Our target positions can get quite messy. If you think of all the autonomous floats and drifters, you get points that are not really easy to order, because these devices are scattered all across the globe. Here are the latest positions of the ARGO array: (image of the latest ARGO float positions)

This (with completely new positions every ~10 days) is about the most difficult use case we'll have to handle.

Depending on the computational cost of those operations, this might avoid the need of distributed trees, which seem much more complex to me. I haven't read the paper linked in your comment above yet, so I might be wrong, but based on the title/abstract it seems to address the parallel construction of kd‐trees, not their distributed storage.

I think going for fully distributed tree construction does not really matter. But (for chunked coords and / (x)or chunked ship sections) we could do something like:

If we can afford storing the whole tree in memory for each published grid, that's even simpler: we could build those trees ahead of time (the computational cost doesn't really matter).

Yes, we should think about this. It's probably difficult to do this in a way that survives on disk and across versions. But we could aim at only calculating trees once per session.

benbovy commented 4 years ago

For the public API, I was thinking of something like

ds_selected = ds.xsomething.sel_nearest(lat=lat_points, lon=lon_points)

The API should be compatible with Xarray's advanced indexing using DataArray objects. I think it will be compatible if we use .isel internally.
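For illustration, a small self-contained sketch of that last point (the toy grid, dim names and flat_idx values are made up); the nearest-neighbour lookup only has to produce positional indices, and vectorized .isel with DataArray indexers does the rest:

import numpy as np
import xarray as xr

# toy 2-d curvilinear grid with dims ("y", "x")
ds = xr.Dataset(
    {"temp": (("y", "x"), np.arange(12.0).reshape(3, 4))},
    coords={
        "lat": (("y", "x"), np.linspace(40.0, 42.0, 12).reshape(3, 4)),
        "lon": (("y", "x"), np.linspace(5.0, 8.0, 12).reshape(3, 4)),
    },
)

# pretend these flat positions came out of the nearest-neighbour query
flat_idx = np.array([0, 5, 11])
j, i = np.unravel_index(flat_idx, ds["lat"].shape)

# vectorized indexing: the result gets a new "points" dimension
ds_selected = ds.isel(
    y=xr.DataArray(j, dims="points"),
    x=xr.DataArray(i, dims="points"),
)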

We could also provide a convenient method for the case where we have a set of points (observations) already in a Dataset or DataArray object:

ds_selected = ds.xsomething.reindex_like(ds_points)

Not sure reindex_like is a good name, though, as the behavior is different than Xarray's reindex_like.

benbovy commented 4 years ago

Our target positions can get quite messy. If you think of all the autonomous floats and drifters, you get points that are not really easy to order, because these devices are scattered all across the globe.

Ah, that's good to know, I hadn't thought about this case!

But (for chunked coords and / (x)or chunked ship sections) we could do something like ...

So, do we want to support the case where the number of observation points is very high? I haven't thought about that either.

benbovy commented 4 years ago

I quickly tested sklearn's BallTree implementation here; it works pretty well (fast and low memory footprint for 1e6 points).
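Roughly, the test looks like this (a sketch; sizes, leaf_size and the random data are arbitrary). Note that sklearn's haversine metric expects (lat, lon) in radians and returns great-circle distances in radians:

import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(42)

# 1e6 random "grid" points and 1e4 query points, converted to radians
lat = np.deg2rad(rng.uniform(-90, 90, 1_000_000))
lon = np.deg2rad(rng.uniform(-180, 180, 1_000_000))
qlat = np.deg2rad(rng.uniform(-90, 90, 10_000))
qlon = np.deg2rad(rng.uniform(-180, 180, 10_000))

tree = BallTree(np.column_stack([lat, lon]), metric="haversine", leaf_size=40)

dist, idx = tree.query(np.column_stack([qlat, qlon]), k=1)
dist_km = dist.ravel() * 6371.0  # radians on the unit sphere -> approx. kilometers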

willirath commented 4 years ago

Hence, do we really need to expose a metric parameter?

While our use case is positions on the globe, I'd strongly favor some implementation that uses arbitrary ndim indexing with an arbitrary metric and a simple wrapper for georeferenced data.

Firstly, there are use cases that are very close to global and realistic Ocean model grids (although they're not really applicable to observational data from the real world): idealized models with rectangular boxes are often used by the same people that are also interested in the realistic data.

Secondly, the ndim selection could easily add value for people that are not Ocean or climate modellers, without really adding complexity on our side. (I'm thinking of people who do stuff like plasma simulations in toroidal geometry.)

Thirdly (not sure people say this...), clearly separating out the metric ensures we design and code in a generalizable way, while hard-coding dim names lat and lon somewhere deep in the stack is an easy mistake to slip in if we rely too much on being on the Earth.

willirath commented 4 years ago

I quickly tested sklearn's BallTree implementation here; it works pretty well (fast and low memory footprint for 1e6 points).

Looks good. I was initially feeling a little hesitant to rely on sklearn rather than something that is lower in the stack (like scipy). But this is only a gut feeling.

willirath commented 4 years ago

I think it would be better to provide an accessor that is not too generic (i.e., nearest neighbor lookup in an n-d space), but general enough to support NEMO, FESOM2 and possibly other model grids working with lat/lon coordinates.

xgeosel?

(Although I'm still liking the idea of going for a wider use case. But this is probably something that will eventually be built elsewhere and closer to the Xarray core devs?)

willirath commented 4 years ago

geondsel?

benbovy commented 4 years ago

I'd also be in favor of a generic solution for indexing n-d coordinates! However, I wouldn't spend too much time trying to settle the API now.

We are going to work very soon on Xarray's flexible indexes refactoring. Once indexes are considered first-class citizens of the Xarray data model, the recommended approach for addressing our problem will be to reuse/provide a custom index class and then directly use Dataset's or DataArray's .sel() method.

Using an accessor should be a temporary solution until flexible indexes are implemented (hopefully within the next 12 months), so it's probably best to avoid generalizing it too much. Ideally, the accessor methods should mimic the current .sel methods to facilitate a smooth transition in the future.

Providing generic index classes for some use-cases including n-d irregular data indexing is also part of the Xarray development roadmap. I think this will require input from the broader community.

there are use cases that are very close to global and realistic Ocean model grids (although they're not really applicable to observational data from the real world): idealized models with rectangular boxes are often used by the same people that are also interested in the realistic data.

If this is something we want to support as soon as possible (< 12 months), and if those idealized models use non-regular grids, then flexible metrics make perfect sense. But this is something we could still easily support later, if needed.

benbovy commented 4 years ago

For the accessor API, we'll need to think about how to handle index construction and index queries. There are two options:

A. implicit index construction (cache)

# 1st time call: triggers index construction 
ds.balltree.sel(lat=[...], lon=[...])

# reuse the built index
ds.balltree.sel(lat=[...], lon=[...])

B. explicit index construction

ds.balltree.set_index(['lat', 'lon'], metric='haversine')

ds.balltree.sel(lat=[...], lon=[...])

I'm in favor of option B, which is closer to what it will (most probably) look like after the indexes refactor in Xarray. Additionally, we could provide a ds.balltree.index property for accessing the underlying index.
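A rough sketch of what option B could look like as an accessor (names follow the examples above; coordinate stacking is simplified, and the deg -> rad transform needed for a haversine metric is left out):

import numpy as np
import xarray as xr
from sklearn.neighbors import BallTree


@xr.register_dataset_accessor("balltree")
class BallTreeAccessor:
    def __init__(self, ds):
        self._ds = ds
        self._index = None
        self._coord_names = None

    def set_index(self, coord_names, **tree_kwargs):
        """Explicitly build the ball tree from the given n-d coordinates."""
        self._coord_names = list(coord_names)
        X = np.column_stack(
            [self._ds[name].values.ravel() for name in self._coord_names]
        )
        self._index = BallTree(X, **tree_kwargs)
        return self

    @property
    def index(self):
        """The underlying BallTree (None until set_index is called)."""
        return self._index

    def sel(self, **indexers):
        if self._index is None:
            raise ValueError("no index found, call set_index() first")
        Q = np.column_stack(
            [np.asarray(indexers[name]).ravel() for name in self._coord_names]
        )
        _, flat_idx = self._index.query(Q, k=1)
        template = self._ds[self._coord_names[0]]
        idx = np.unravel_index(flat_idx.ravel(), template.shape)
        pos = {d: xr.DataArray(i, dims="points") for d, i in zip(template.dims, idx)}
        return self._ds.isel(**pos)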

benbovy commented 4 years ago

If we are going for a more generic solution (which is probably not a bad idea after all :-) ), then I would rather choose a descriptive index name for the name of the accessor, e.g., balltree or btree.

I was initially feeling a little hesitant to rely on sklearn rather than something that is lower in the stack (like scipy). But this is only a gut feeling.

I was too, and I'm still a bit hesitant about depending on the whole scikit-learn library just for reusing its balltree implementation. I'll have a look at the code to see if it would make sense to copy it instead.

willirath commented 4 years ago

Are you available for a quick (5 min) skype call? I'll be in meetings for most of the rest of the day, then.

willirath commented 4 years ago

I'll have a look at the code to see if it would make sense to copy it instead.

For the initial development, depending is fine, I think.

benbovy commented 4 years ago

So this gist contains a preliminary implementation of an xarray extension for point-wise indexing of irregular data using a ball tree. A few comments:

willirath commented 4 years ago

This is extremely cool!

I'll add a few tests later today or tomorrow in the morning.

Also, I'll play with a few cases where Dask is used for the coords and for ds_indexer_2d, to see what performance we can expect without parallelizing the balltree part itself. First impression is very good: swapping np --> dask.array everywhere except in the BallTreeAccessor class and the transform just works.

benbovy commented 4 years ago

I'm wondering if transform shouldn't be applied to tolerance too in sel. Probably yes.

willirath commented 4 years ago

I'm wondering if transform shouldn't be applied to tolerance too in sel. Probably yes.

According to the docs, the haversine distances are returned in units of radians too. But this is something that could be inconsistent across different implementations of the metrics?
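If so, the tolerance (and any returned distances) would just need the same conversion as the coordinates, e.g. (a sketch, assuming sklearn's convention of haversine distances in radians on the unit sphere):

EARTH_RADIUS_KM = 6371.0

# user-facing tolerance in kilometers -> radians, to compare with BallTree.query output
tolerance_rad = 50.0 / EARTH_RADIUS_KM

# and the other way around for reporting distances back to the user
# dist_km = dist_rad * EARTH_RADIUS_KM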

willirath commented 4 years ago

@benbovy I've parametrized your gist: https://nbviewer.jupyter.org/gist/willirath/e0380c02da41568eb91bfaf509faefff to

benbovy commented 4 years ago

Nice!

What are the largest grid sizes for NEMO/FESOM? What would be the largest number of indexer points?

I'm wondering if/where we should put effort in to enable dask support (out-of-core / parallel / lazy computation). Possible issues/bottlenecks are:

  1. loading all coordinate data into memory
  2. ball tree construction time
  3. ball tree storage in memory
  4. loading all indexer data into memory

We could reduce 2 and 3 by increasing the leaf size of the ball tree. This would make queries less efficient, but this is an embarrassingly parallel problem and supporting queries with dask might be as simple as inserting a map_blocks somewhere (this would solve 4 too -- I guess it would also enable lazy selection?).

Unfortunately, I have no idea how to address 1 with a simple and efficient solution. The solution that you referenced above looks quite easy to implement, but it is not optimal. Instead of searching in all trees, considering that NEMO/FESOM grids are dense, we could build a forest of ball trees from overlapping regions, then wrap the query to first determine which tree(s) to use. This is getting quite complicated and it is still not a robust solution, though.

willirath commented 4 years ago

Size of the problem

What are the largest grid sizes for NEMO/FESOM? What would be the largest number of indexer points?

Max Number of points on model grid

Our current "big" models have O(3_000 x 3_000) ~ O(10_000_000) points. There are / will be Ocean-only and possibly climate models with much higher resolution, though. The LLC4320 model has approx. 1/48 deg resolution. That'd amount to 48**2 * 360 * 180 ~ 150_000_000 horizontal points.

Max size of indexers

A worst / most expensive case would be a study where we want to sample all the surface positions of all the autonomous devices observing the Ocean. The two biggest efforts are: ~1300 surface drifters measuring surface data continuously and ~4000 ARGO floats sampling profiles every ~10 days.

Just handling all the positions for one point in time would need indexers with a length of approx. 10_000 positions. Vessel based measurements might be in the same range as well.

But it gets hard when we go for the high time resolution that is available from, e.g., the surface drifters: with hourly positions from the real drifters, we'd have approx. 24 * 365 * 1300 ~ 10_000_000 unique positions for only one year.

Taking this to the next level (even though this is beyond the virtual field campaign scope) would be going for millions of virtual particles, each sampled at high time resolution, as is proposed in a paper by Ryan.

Bottom line

We should estimate how much work it is to get to 100_000_000 grid points and selectors of length 10_000_000 in minutes.

Dask / other optimizations

I'm wondering if/where we should put effort in to enable dask support (out-of-core / parallel / lazy computation). Possible issues/bottlenecks are:

  1. loading all coordinate data into memory
  2. ball tree construction time
  3. ball tree storage in memory
  4. loading all indexer data into memory

We could reduce 2 and 3 by increasing the leaf size of the ball tree. This would make queries less efficient, but this is an

That's an interesting thought. We should at least see how far we get there. Queries are easy to parallelize, because we just need to split the selector ds. But it won't leverage much of the parallel resources we usually have.

embarrassingly parallel problem and supporting queries with dask might be as simple as inserting a map_blocks somewhere (this would solve 4 too -- I guess it would also enable lazy selection?).

I think we can assume that the number of blocks / chunks of the grid is always much smaller than the number of grid points per chunk. (Example: with 150_000_000 grid points at double precision, we'd have ~ 2.5 GB for lat + lon. With Dask's default chunking, this would be ~ 10 to 20 chunks of ~ 125 to 250 MB each.)

As from a user's perspective, only wall time matters, we'll need to know the timings for

I agree that it's probably just a few map_blocks calls. I'll see if I get to trying this today.

Unfortunately, I have no idea on how to address 1 with a simple and efficient solution.

Loading the data from disk is nothing we can optimize in a development setting (laptop or other single machine). In real use cases with truly distributed Dask workers, we see that the IO bandwidth scales linearly with the number of workers / nodes we're on.

The solution that you referenced above looks quite easy to implement, but it is not optimal. Instead of searching in all trees, considering that NEMO/FESOM grids are dense, we could build a forest of ball trees from overlapping regions, then wrap the query to first determine which tree(s) to use. This is getting quite complicated and it is still not a robust solution, though.

We can (sort of) assume dense grids for NEMO. But there are exceptions, like folding the Med Sea onto Africa so that a few CPUs in an MPI setting, which would otherwise handle very few grid points, are kept busy, etc. Also, on the FESOM side there is a way of using space-filling curves to number the nodes, but from what I gathered from talking with AWI people, this cannot be taken for granted. So we should be able to handle the worst case: completely unstructured grid positions.

benbovy commented 4 years ago

Thanks for all the details about the size of the problem, that helps a lot!

Queries are easy to parallelize, because we just need to split the selector ds. But it won't leverage much of the parallel resources we usually have.

I'm not sure I fully understand your last point. It all depends on whether/how the indexers are chunked, right? I was thinking of just reusing the indexer chunks (if it is chunked) to perform the queries in parallel. It requires that the chunks of each indexer (lat, lon) are aligned (we could re-chunk internally, but it would be better if it is done explicitly and externally IMO).
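Something like this, roughly (a self-contained sketch with random data; chunk sizes and the deg -> rad handling are arbitrary): one independent query per indexer chunk against a shared tree.

import dask.array as da
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)

# in-memory grid coordinates, tree built once (in radians for haversine)
grid_lat = rng.uniform(-90, 90, 100_000)
grid_lon = rng.uniform(-180, 180, 100_000)
tree = BallTree(np.deg2rad(np.column_stack([grid_lat, grid_lon])), metric="haversine")

# chunked indexer coordinates with aligned chunks
lat_points = da.from_array(rng.uniform(-90, 90, 40_000), chunks=10_000)
lon_points = da.from_array(rng.uniform(-180, 180, 40_000), chunks=10_000)


def query_block(lat_block, lon_block):
    # nearest-neighbour query for one chunk of indexer points
    _, idx = tree.query(np.deg2rad(np.column_stack([lat_block, lon_block])), k=1)
    return idx.ravel()


flat_idx = da.map_blocks(query_block, lat_points, lon_points, dtype=np.intp)
flat_idx.compute()  # chunks are queried in parallel; the tree is shared by all tasks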

One potential big limitation with those parallel queries is in a distributed setting. Because the ball tree structure is not distributed, this would result in poor performance (lots of data transfer) and/or high memory consumption (ball tree index data copied to all workers).

I agree that it's probably just a few map_blocks calls. I'll see if I get to trying this today.

Are you going to try implementing chunked queries on a single tree? Or constructing a forest of ball trees and performing queries using this forest? Or both?

I'm not sure how dask arrays would speed-up the construction of a single ball tree. Coordinate transform (if any) and stack operations are cheap compared to the sequential computation of the ball tree.

Loading the data from disk is nothing we can optimize in a development setting

Before considering disk IO optimization, I was rather thinking about the memory limits / out-of-core issues of (1) on small machines.

willirath commented 4 years ago

I was thinking of just reusing the indexer chunks (if it is chunked) to perform the queries in parallel. It requires that the chunks of each indexer (lat, lon) are aligned

Yes, that's what I had in mind. Assuming alignment of lat and lon is something we should do.

Are you going to try implementing chunked queries on a single tree?

I'm still working on the timings. So far: building a tree from 125 MB of lat+lon takes 20-30 seconds (on a single MacBook CPU core), while querying 125 MB of positions takes forever (not sure how long; still running... :/).

So the biggest benefit would be from chunked queries on a single tree.

I'm not sure how dask arrays would speed-up the construction of a single ball tree.

I'm thinking in the direction of "have a flat collection of single-machine kd-trees and, when we need to query something, we check them all". This is not optimal, but we can gain a lot more (wrt wall time) if we just make use of easy scaling to multiple CPUs and accept the overhead of finding the actual nearest neighbors among N_trees many candidates.
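A sketch of that "check them all" idea (assuming the grid coordinates were split into contiguous chunks; chunk_offsets are the start positions of each chunk in the global flat index, and all names here are hypothetical):

import numpy as np
from sklearn.neighbors import BallTree


def build_forest(coord_chunks):
    """One tree per chunk of stacked (n_points_i, n_dims) grid coordinates."""
    return [BallTree(chunk, metric="haversine") for chunk in coord_chunks]


def query_forest(trees, chunk_offsets, targets):
    """Query every tree, then keep the overall nearest candidate per target."""
    dists, flat_idx = [], []
    for tree, offset in zip(trees, chunk_offsets):
        d, i = tree.query(targets, k=1)
        dists.append(d.ravel())
        flat_idx.append(i.ravel() + offset)  # back to the global flat index
    dists = np.stack(dists)                  # shape: (n_trees, n_targets)
    flat_idx = np.stack(flat_idx)
    best = np.argmin(dists, axis=0)          # nearest among the N_trees candidates
    return flat_idx[best, np.arange(targets.shape[0])]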

Before considering disk IO optimization, I was rather thinking about the memory limits / out-of-core issues of (1) on small machines.

Ah. Not sure memory use for the coords is the first wall we'll hit. Coords are tiny compared to the data. Horizontal positions of the ENORMOUS LLC4320 cost 2.5 GB.

benbovy commented 4 years ago

So far: building a tree from 125 MB of lat+lon takes 20-30 seconds (on a single MacBook CPU core), while querying 125 MB of positions takes forever (not sure how long; still running... :/).

I'm curious how this would go if you reduce the leaf size. Probably the size of the tree would explode :-)

But we can gain a lot more (wrt wall time) if we just make use of easy scaling to multiple CPUs and accept the overhead of finding the actual nearest neighbors among N_trees many candidates.

Yes I agree, in the end this could take more advantage of a highly distributed computing environment. Ideally, we should provide different solutions (i.e., different indexes / accessors) in a package so that we can pick up the best one depending on the problem and the computing environment.

Maybe other structures are more adapted? Here we don't take full advantage of the ball tree, which out-performs other tree-based indexing structures especially for high-dimensional data. Maybe R-Trees would help? I've found this paper about a distributed R-Tree structure. R-Trees require more storage than kd-trees or ball-trees, but in our 2-dimensional case that's not a big issue. I don't know if it's compatible with any metric, though (e.g., haversine). Also, we would have to implement it.

benbovy commented 4 years ago

Another question, closer to the scientific aspect: what are the potential implications of doing unconstrained / naive nearest neighbor lookup? What if the selected nearest neighbor corresponds to a node where you need to cross land area before reaching it? Is that a big issue for the scientific applications?

willirath commented 4 years ago

What if the selected nearest neighbor corresponds to a node where you need to cross land area before reaching it?

This can be a problem. We could probably solve this by implementing a metric that checks for this, but this easily gets really complicated. Also, for the FESOM data, there are no land points in the grid. So we'd start handling this with shape files, and then we'd need a way of ensuring that these shapes really represent the model geometry rather than that of the real Earth (people use small modifications to the model topography, especially in regions where the model resolution is of the same order as geographical features like seamounts, islands, straits, etc.).

So I think we should handle these cases outside of the balltree activities for now. What could be helpful, would be a way of explicitly returning the distances. (Like the tolerance approach but with the user being able to check this outside of the accessor.)

benbovy commented 4 years ago

Yes, I don't know whether or not it's an edge case, but this could be addressed later.

In the mid/long-term, it will be nice to have implementations of different tree-based indexes (xarray compatible) in a dedicated package. What would be nice is a flexible numba implementation that would allow plugging in user-defined, jitted metric functions.

What could be helpful, would be a way of explicitly returning the distances.

Maybe like this: ds.balltree.sel(..., dist_varname="distances") that would add a distances data variable in the returned dataset.

But I'm not sure if we really need it. ds.balltree.sel() returns the adjusted coordinates, so it should be pretty easy (though maybe expensive) to re-compute the distances if needed. Also, the distances might not be helpful to solve the issue of land vs. sea.

willirath commented 4 years ago

In the mid/long-term, it will be nice to have implementations of different tree-based indexes (xarray compatible) in a dedicated package.

It would at least be very easy to support KD trees and ball trees from sklearn as they have a very similar or even identical API.

Maybe it would be a good design decision to make the tree construction and the query completely pluggable? If I see that correctly, we'd only have to make

self._index = BallTree(X, **kwargs)

from ds.balltree.set_index() configurable via a kwarg?
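Something along these lines, reusing the set_index shape from the gist (tree_cls and its default are assumptions; sklearn's BallTree and KDTree share the same construction / query API):

import numpy as np
from sklearn.neighbors import BallTree, KDTree


def set_index(self, coord_names, tree_cls=BallTree, **tree_kwargs):
    """Build the index with a pluggable tree implementation."""
    self._coord_names = list(coord_names)
    X = np.column_stack(
        [self._ds[name].values.ravel() for name in self._coord_names]
    )
    self._index = tree_cls(X, **tree_kwargs)
    return self


# e.g.: ds.balltree.set_index(["lat", "lon"], tree_cls=KDTree, leaf_size=100)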

What would be nice is a flexible numba implementation that would allow plugging in user-defined, jitted metric functions.

This is a good idea.

willirath commented 4 years ago

Looking through what we learned in this issue discussion, I wonder if we should move ds.balltree into a repo and compile a roadmap or a design doc?

Also, we could rename it to ds.nntree to make it easy to swap in different trees? Not sure if this is too soon, but the public API defined in the gist feels mature enough to be fixed for now.

willirath commented 4 years ago

NNTree is already taken for a neural-network-based tree. But what about ANTS?

benbovy commented 4 years ago

Maybe it would be a good design decision to make the tree construction and the query completely pluggable?

I think it would be a good design choice, unless the logic currently put in _query, get_pos_indexers, etc. differs too much from one tree to another, or unless it would need slightly different signatures for sel. In that case, it would be best to maintain separate accessors with a common interface.

I wonder if we should move ds.balltree into a repo and compile a roadmap or a design doc

Yes!

But what about ANTS?

I like it. Although the library scope could be extended to other kinds of selection than nearest neighbor lookup. For example, a range tree implementation would be useful for orthogonal indexing. We could also use kd-trees and ball trees for point radius selection.

What about xoak? Or x<any-fancy-tree-species>?

"<repo-name> is a xarray extension that provides tree-based indexes used for selecting irregular, n-dimensional data"

benbovy commented 4 years ago

unless the logic currently put in _query, get_pos_indexers, etc. differs too much from one tree to another

This would probably happen if we use tricks to parallelize / distribute the trees.

willirath commented 4 years ago

I like xoak, because it is about the shortest and easiest to type of all possible variants of x<tree>.

willirath commented 4 years ago

Just a quick heads-up: I'll create a repo with the latest version from your Gist.

benbovy commented 4 years ago

Great, thanks!

I don't have a strong opinion about the package structure. I haven't used poetry yet. Recently I've started putting source files under src/ and I'm happy with it.

Do you mind if I have commit rights on this repo? I'll continue contributing through PRs anyway.

willirath commented 4 years ago

I think you have them per your status in ESM-VFC. Let me check...

willirath commented 4 years ago

@benbovy Can you see if you could (in principle) merge #3?

benbovy commented 4 years ago

There's no visible merge button on my side.

benbovy commented 4 years ago

It works now, thanks!

willirath commented 4 years ago

Needed to add write access per repository. I've added you as full admin.