xarray-contrib / xdggs

Xarray extension for DGGS
Apache License 2.0

Convolution #8

Open benbovy opened 8 months ago

benbovy commented 8 months ago

One way to achieve convolution is a "map-reduce"-ish two step procedure on a Dataset / DataArray:

  1. Add one cell "neighbors" dimension:
     a. get the neighbor indices for each grid cell
        • Healpix: healpy.pixelfunc.get_all_neighbours()
        • H3: h3.grid_disk() (unfortunately not yet vectorized)
     b. retrieve the (data) variable values along the new neighbor dimension (vectorized indexing, potentially expensive; see the sketch below)
  2. Aggregate along the neighbors dimension. This step is trivial and can be done with the xarray API (e.g., .mean(), etc.)
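To make step 1 concrete, here is a minimal sketch for Healpix (names and sizes are illustrative, not part of any xdggs API); healpy returns -1 for missing neighbors, which we mask as NaN before reducing:

```python
import healpy as hp
import numpy as np
import xarray as xr

# toy healpix grid and data, purely illustrative
nside = 8
cell_ids = np.arange(hp.nside2npix(nside))
data = xr.DataArray(
    np.random.default_rng(0).random(cell_ids.size),
    coords={"cell_ids": ("cells", cell_ids)},
    dims="cells",
)

# step 1a: neighbor indices, shape (8, ncells); -1 marks missing neighbors
neighbors = hp.pixelfunc.get_all_neighbours(nside, cell_ids, nest=True)

# step 1b: vectorized indexing along a new "neighbor" dimension,
# masking the -1 entries as NaN
indexer = xr.DataArray(neighbors, dims=("neighbor", "cells"))
stacked = data.isel(cells=indexer.clip(min=0)).where(indexer >= 0)

# step 2: reduce over the neighbor dimension (here, a mean filter)
convolved = stacked.mean("neighbor", skipna=True)
```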
tinaok commented 8 months ago

This question is linked to how we want to optimise the convolution with NaN values (land, for oceanography users).

If we keep a vector index, we can save RAM, but the neighbor search may be expensive (but how much? benchmarking?).

If we transform to a sparse matrix, we can't save RAM during computation (though we can compress when saving to disk).

benbovy commented 8 months ago

@tinaok I'm not sure I fully grasp your high-level use case. Do you have a concrete example?

I was rather thinking here about something like image convolution, using a fixed-size kernel representing the direct & contiguous neighborhood of the DGGS cells. For simplicity I assume only a level-1 neighborhood (direct neighbors), but maybe we can expand that up to an n-neighbors level or a fixed distance, depending on backend support (h3 and h3ronpy support that, but I'm not sure about healpy?).

Regarding masking and nan values (I guess you have a dataset with contiguous coverage in ocean and a land mask?): I think it would be possible to first drop all the masked land cells before applying the convolution (in xdggs the DGGS cell coordinate is 1-dimensional and full-earth coverage is not mandatory). For step 1b in my comment above it is possible to have nan values for neighbor cell ids that are not found in the Dataset / DataArray (just like when doing xarray reindexing). Finally for step 2 I guess we could use a reduce method that ignores nan values. This would already be memory and storage friendly.

Or do I misunderstand and you start with a Dataset that has a very sparse distribution of DGGS cells and you want true nearest-neighbor search (i.e., "dynamic-sized" convolution kernel)?

keewis commented 8 months ago

From what I can tell, the idea is that we perform the neighbor search once and then construct a sparse matrix that encodes the convolution operation (or really, any operation that is linear in the input cells). This is the expensive part, and also where you would apply any additional conditions, like ignoring NaN values (though dropping those does sound like a better idea, even if it makes the neighbor search slightly more complex) or performing a more realistic neighbor search that treats land as a barrier.

Once we have that, we can simply do a sparse matrix multiplication (using xr.dot backed by opt_einsum), which can usually be done without exploding memory or loading the entire array into memory. What could happen, though, is that the matrix itself exceeds memory even in the sparse format, but I think that can be avoided by chunking the sparse matrix, since usually each output cell is independent of every other output cell.
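A minimal sketch of that idea, assuming a sparse release whose arrays dispatch through xr.dot / opt_einsum; the ring-shaped toy neighborhood below stands in for a real DGGS neighbor search:

```python
import numpy as np
import sparse
import xarray as xr

ncells = 6
data = xr.DataArray(np.arange(ncells, dtype="float64"), dims="cells")

# toy convolution matrix: each output cell averages itself and its two
# neighbors on a ring (indices are illustrative, not a real DGGS layout)
rows = np.repeat(np.arange(ncells), 3)
cols = (rows + np.tile([-1, 0, 1], ncells)) % ncells
weights = sparse.COO(np.stack([rows, cols]), np.full(rows.size, 1 / 3),
                     shape=(ncells, ncells))
matrix = xr.DataArray(weights, dims=("output_cells", "cells"))

# contract over the shared "cells" dimension; with sparse weights this
# should stay memory-friendly
convolved = xr.dot(matrix, data, dims="cells").rename(output_cells="cells")
```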

The advantage is that this makes caching easier, but it also means that the tricky part is now the construction of the matrix (plus, non-linear operations might not work?).

benbovy commented 8 months ago

Ah I see. In the case of a basic ("grid contiguous") neighbor search, I think that the construction of the matrix could be done by:

  1. select only cells that are part of the domain of interest (e.g., ocean)
  2. for each of the selected cells get the neighbor cell ids like in step 1a above
  3. maybe get the distances between cell centroids or any metric that is relevant for encoding the operation
  4. based on the DGGSIndex's underlying pandas index, get the position (integer) of each neighbor along the DGGS dimension (or -1 if no cell is found in the index), using something similar to xarray.core.indexes.get_indexer_nd()
  5. construct the sparse matrix using the preferred encoding scheme

xdggs could "easily" expose some utility functions or methods for steps 2, 3 and 4.
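For illustration, a hedged sketch of steps 4 and 5 (hypothetical cell ids, uniform weights standing in for a real kernel):

```python
import numpy as np
import pandas as pd
import sparse

# step 1: cells kept after masking; the ids below are purely illustrative
cell_ids = pd.Index([10, 15, 19, 40])

# step 2 output: neighbor cell ids per cell (39 was dropped, e.g. land)
neighbors = np.array([[15, 19, 39, 40],
                      [10, 19, 39, 40],
                      [10, 15, 39, 40],
                      [10, 15, 19, 39]])

# step 4: integer positions along the DGGS dimension, -1 if not found
# (mirroring what xarray.core.indexes.get_indexer_nd() does internally)
pos = cell_ids.get_indexer(neighbors.ravel()).reshape(neighbors.shape)

# step 5: COO matrix with uniform weights, skipping missing neighbors
valid = pos >= 0
rows = np.broadcast_to(np.arange(len(cell_ids))[:, None], pos.shape)[valid]
weights = sparse.COO(np.stack([rows, pos[valid]]),
                     np.full(rows.size, 0.25),
                     shape=(len(cell_ids), len(cell_ids)))
```

Applying `weights` to the data is then the xr.dot multiplication sketched above.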

Neighbor search that is more advanced than simply excluding masked neighbor cells (e.g., using propagation algorithms with barriers) looks much trickier and more specific IMO, especially if you explicitly consider the DGGS cell geometry (polygon) and not only the centroid.

EDIT: relying on a pandas index may be limited when using a grid with very high resolution combined with a large domain of interest, but this is a more general issue (same for spatial indexing that is hard to scale horizontally).

keewis commented 8 months ago

the algorithm certainly is somewhat specific, but it can be abstracted by a mask and a propagation-based distance metric built on that mask. That metric should probably not be included in this package, but maybe we can provide a function to calculate a sparse matrix for the standard case? That might be easier to understand than splitting the whole thing up (though if we later find this can be done without too much complication, we can still add it). Applying the operation to the data can already be done using xr.dot.

combined with a large domain of interest

What do you mean by this? Does that translate to the item size of the convolution kernel, such that the sparse matrix would be a lot less sparse? If so, I agree that this is tricky, and we'd probably need some sort of spatial map_overlap. However, most of the time kernels are small enough, so I'd say let's skip that issue for now.

benbovy commented 8 months ago

maybe we can provide a function to calculate a sparse matrix for the standard case?

Yes sure! Better if it is something that can be supported across more than one DGGS.

By high resolution combined with a large domain (spatial extent) I simply mean a lot (tens of millions? hundreds of millions?) of cells to index, which might eventually reach the limits of using a pandas index or any spatial index that is not distributed / partitioned.

danlooo commented 8 months ago

I think this is why some DGGSs have the neighborhood relationships encoded directly in the cell id (e.g. Uber H3 or DGGRID Q2DI). Therefore, there is no need for a dedicated spatial index that does not scale well horizontally. Neighbor functions, e.g. h3.grid_disk, are based on this and don't rely on an external (spatial) index or pandas.Index.
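For illustration, a tiny sketch assuming the h3-py v4 API (location and resolution are arbitrary):

```python
import h3

# the neighborhood is derived from the cell id itself: no external
# spatial index or pandas.Index involved
cell = h3.latlng_to_cell(48.86, 2.35, 9)  # arbitrary location, resolution 9
disk = h3.grid_disk(cell, 1)              # the cell plus its direct neighbors
```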

benbovy commented 8 months ago

Yes, indeed, for certain cases a pandas or spatial index might not even be needed, although I guess this also assumes a number of other things, e.g., that the cell ids of the DGGS coordinate are ordered (two contiguous coordinate labels represent two spatially contiguous cells), so that we can rely on simpler indexing rules to retrieve and handle neighbor data?

We may want to support those cases in xdggs. We could probably make the creation of a pandas index optional when setting a new DGGSIndex? There is actually some discussion on the Xarray side about supporting alternative default indexes in cases where a pandas index is not really needed (https://github.com/pydata/xarray/discussions/7041#discussioncomment-3662179). It would be nice for xdggs to integrate well with Xarray in this regard.

For a number of other cases (e.g., discontinuous or messy DGGS cell id coordinates, precise intersection between DGGS cells and arbitrary vector features, etc.), the spatial and/or topological information encoded in the cell ids will likely not be enough, so a pandas or spatial index will still be relevant.

tinaok commented 8 months ago

Would it be possible for XDGGS, serving as an index, to hold the indices of the first neighbors for each cell (similar to what has been implemented for latitude and longitude coordinates)?

For instance: for index 10, would 1st_neighbour_index[10] == [15, 19, 39, 40] (4 neighbors in this healpix case, but more with h3) be feasible?

Furthermore, this could dynamically adapt in scenarios where one of the neighbors is land, resulting in: 1st_neighbour_index[10] == [15, 19, NaN, 40]

This enhancement might accelerate convolution processes and optimize root search?

benbovy commented 8 months ago

@tinaok yes, this is what I propose in the top comment (step 1a). The result could be another DataArray or Dataset with a new "neighbor" dimension whose size depends on the grid type. One limitation is that Xarray only works with fixed dimension sizes, while this might not be the case for all DGGSs (e.g., H3's hexagons vs. pentagons), but I think using NaNs is fine (also in the presence of a mask like land).
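A minimal sketch of that padding, assuming the h3-py v4 API (h3.get_pentagons, h3.grid_disk); the fill value stands in for NaN in the data arrays:

```python
import numpy as np
import h3

# one hexagon and one pentagon: 6 vs. 5 direct neighbors
cells = [h3.latlng_to_cell(48.86, 2.35, 1), next(iter(h3.get_pentagons(1)))]
rings = [sorted(set(h3.grid_disk(c, 1)) - {c}) for c in cells]

# pad to a fixed-size (cells, neighbor) array; data retrieved for the
# padded slots would then be set to NaN (as in step 1b above)
width = max(map(len, rings))
neighbors = np.full((len(cells), width), None, dtype=object)
for i, ring in enumerate(rings):
    neighbors[i, : len(ring)] = ring
```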