benbovy opened this issue 1 year ago
This question is linked to how we want to optimise the convolution with NaN values (land, for oceanography users).
If we keep a vector index, we can save RAM, but the neighbor search may be expensive (but how much? benchmarking needed?)
If we transform to a sparse matrix, we can't save RAM for the computation (though we can compress when saving to disk).
@tinaok I'm not sure I fully grasp your high-level use case. Do you have a concrete example?
I was rather thinking here about something like image convolution, using a fixed-size kernel representing the direct & contiguous neighborhood of the DGGS cells. For simplicity I assume only a level-1 neighborhood (direct neighbors), but maybe we could expand that up to an n-neighbors level or a fixed distance, depending on backend support (h3 and h3ronpy support that, but I'm not sure about healpy?).
Regarding masking and NaN values (I guess you have a dataset with contiguous coverage in the ocean and a land mask?): I think it would be possible to first drop all the masked land cells before applying the convolution (in xdggs the DGGS cell coordinate is 1-dimensional and full-earth coverage is not mandatory). For step 1b in my comment above it is possible to have NaN values for neighbor cell ids that are not found in the Dataset / DataArray (just like when doing xarray reindexing). Finally, for step 2 I guess we could use a reduce method that ignores NaN values. This would already be memory and storage friendly.
Or do I misunderstand and you start with a Dataset that has a very sparse distribution of DGGS cells and you want true nearest-neighbor search (i.e., "dynamic-sized" convolution kernel)?
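For illustration, a minimal sketch of the "drop the masked land cells first" idea (the dataset layout and the `land_mask` / `sst` variable names are made up for this example):

```python
import numpy as np
import xarray as xr

# hypothetical dataset: a 1-d "cells" dimension with a DGGS cell id
# coordinate and a boolean land mask (all names are illustrative)
ds = xr.Dataset(
    {
        "sst": ("cells", np.array([20.5, 21.0, np.nan, 19.8])),
        "land_mask": ("cells", np.array([False, False, True, False])),
    },
    coords={"cell_ids": ("cells", np.array([10, 15, 19, 40]))},
)

# drop the masked land cells before the convolution; full-earth coverage
# is not mandatory in xdggs, so the result is still valid input
ocean = ds.where(~ds.land_mask, drop=True)
```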
From what I can tell, the idea is that we perform the neighbors search once and then construct a sparse matrix that encodes the convolution operation (or really, any operation that is linear in the input cells). This is the expensive part, and also where you would apply any additional conditions, like ignoring NaN values (but dropping those does sound like a better idea, even though it would make the neighbors search slightly more complex) or performing a more realistic neighbors search that considers land as barriers.
Once we have that, we can simply do a sparse matrix multiplication (using `xr.dot`, backed by `opt_einsum`), which can usually be done without exploding memory or loading the entire array into memory. What could happen, though, is that the matrix itself exceeds the memory even with the sparse format, but I think that can be avoided by chunking the sparse matrix, and usually each output cell is independent of any other output cell.
The advantage is that this makes caching easier, but it also means that the tricky part is now the construction of the matrix (plus, non-linear operations might not work?).
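To make the matrix-multiplication step concrete, here is a minimal sketch following the `xr.dot` suggestion above (the 5-cell layout and weights are made up; it assumes the `sparse` package, with `opt_einsum` installed for the dispatch):

```python
import numpy as np
import sparse
import xarray as xr

# toy weights matrix for 5 cells: each output cell averages itself and
# its (pre-computed) neighbors; indices and values are made up
rows = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 4])
cols = np.array([0, 1, 0, 1, 2, 1, 2, 2, 3, 4])
vals = np.array([0.5, 0.5, 1 / 3, 1 / 3, 1 / 3, 0.5, 0.5, 0.5, 0.5, 1.0])
weights = xr.DataArray(
    sparse.COO(np.stack([rows, cols]), vals, shape=(5, 5)),
    dims=("output_cells", "input_cells"),
)

data = xr.DataArray(np.arange(5.0), dims="input_cells")

# contract over the shared "input_cells" dimension: the convolution
# reduces to a sparse matrix multiplication
smoothed = xr.dot(weights, data)
```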
Ah I see. In the case of a basic ("grid contiguous") neighbor search, I think that the construction of the matrix could be done in a few steps, e.g., looking up the integer positions of the neighbor cell ids with `xarray.core.indexes.get_indexer_nd()` (sketched below). xdggs could "easily" expose some utility functions or methods for most of those steps.
Neighbor search that is more advanced than simply excluding masked neighbor cells (e.g., using propagation algorithms with barriers) looks much trickier and more specific IMO, especially if you explicitly consider the DGGS cell geometry (polygon) and not only the centroid.
EDIT: relying on a pandas index may be limiting when using a grid with very high resolution combined with a large domain of interest, but this is a more general issue (the same goes for spatial indexing, which is hard to scale horizontally).
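A minimal sketch of that matrix construction, assuming the neighbor cell ids were already obtained from the grid backend (the `build_neighbor_mean_matrix` helper and the uniform weights are made up; note that `get_indexer_nd` is internal xarray API and may change):

```python
import numpy as np
import pandas as pd
import sparse
from xarray.core.indexes import get_indexer_nd


def build_neighbor_mean_matrix(cell_ids, neighbor_ids):
    # cell_ids: (n_cells,) ids present in the dataset
    # neighbor_ids: (n_cells, k) ids from e.g. h3.grid_disk or healpy
    index = pd.Index(cell_ids)
    # integer positions of each neighbor id; -1 where the id is not found
    # (e.g. dropped land cells), mirroring xarray reindexing behaviour
    positions = get_indexer_nd(index, neighbor_ids)
    rows, k = np.nonzero(positions >= 0)
    cols = positions[rows, k]
    # uniform weights over the neighbors that were actually found
    weights = 1.0 / (positions >= 0).sum(axis=1)[rows]
    n = len(cell_ids)
    return sparse.COO(np.stack([rows, cols]), weights, shape=(n, n))
```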
The algorithm certainly is somewhat specific, but it can be abstracted by a mask and by providing a propagation-based distance metric based on that. That metric should probably not be included in this package, but maybe we can provide a function to calculate a sparse matrix for the standard case? That might be easier to understand than splitting the whole thing up (though if we later discover that this can be done without being too complicated, we can still add it). Applying the operation to the data can already be done using `xr.dot`.
> combined with a large domain of interest

What do you mean by this? Does that translate to the size of the convolution kernel, such that the sparse matrix would be a lot less sparse? If so, I agree that this is tricky, and we'd probably need some sort of spatial `map_overlap`. However, most of the time I think kernels are small enough, so I'd say let's skip that issue for now.
> maybe we can provide a function to calculate a sparse matrix for the standard case?
Yes sure! Better if it is something that can be supported across more than one DGGS.
By high resolution combined with a large domain (spatial extent) I simply mean a lot (tens of millions? hundreds of millions?) of cells to index, which might eventually reach the limits of using a pandas index or any spatial index that is not distributed / partitioned.
I think this is why some DGGS encode the neighborhood relationships directly in the cell id (e.g. Uber H3 or DGGRID Q2DI).
Therefore, there is no need for a dedicated spatial index that does not scale well horizontally. Neighbor functions, e.g., `h3.grid_disk`, are based on this and don't rely on an external (spatial) index or `pandas.Index`.
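For reference, a minimal illustration of such an index-free neighbor lookup with h3-py (v4 API; the coordinates and resolution are arbitrary):

```python
import h3  # h3-py >= 4

# the cell id itself encodes position and resolution, so the neighbor
# lookup needs no external spatial index
cell = h3.latlng_to_cell(48.85, 2.35, 7)  # arbitrary point, resolution 7
neighbors = h3.grid_disk(cell, 1)  # the cell itself plus its direct neighbors
```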
Yes indeed, for certain cases a pandas or spatial index might not even be needed, although I guess this also assumes a number of other things, e.g., that the cell ids of the DGGS coordinate are ordered (two contiguous coordinate labels represent two spatially contiguous cells), so that we can rely on simpler indexing rules to retrieve and handle neighbor data?
We may want to support those cases in xdggs. We could probably make the creation of a pandas index optional when setting a new DGGSIndex? There is actually some discussion on the Xarray side about supporting alternative default indexes in cases where a pandas index is not really needed (https://github.com/pydata/xarray/discussions/7041#discussioncomment-3662179). It would be nice for xdggs to integrate well with Xarray in this regard.
For a number of other cases (e.g., discontinuous or messy DGGS cell id coordinates, precise intersection between DGGS cells and arbitrary vector features, etc.), the spatial and/or topological information encoded in cell ids will likely not be enough, so a pandas or spatial index will still be relevant.
Would it be possible for xdggs, serving as an index, to hold the indices of the first neighbors for each cell (similar to what has been implemented for latitude and longitude coordinates)?
For instance, for index 10, would `1st_neighbour_index[10] -> [15, 19, 39, 40]`
(4 neighbors in the case of healpix, but more with h3) be feasible?
Furthermore, this could adapt dynamically in scenarios where one of the neighbors is land, resulting in:
`1st_neighbour_index[10] -> [15, 19, NaN, 40]`
This enhancement might accelerate convolution processes and optimize root search?
@tinaok yes, this is what I propose in the top comment (step 1a). The result could be another DataArray or Dataset with a new "neighbor" dimension whose size depends on the grid type. One limitation is that Xarray only works with fixed dimension sizes while this might not be the case for all DGGSs (e.g., H3's hexagons vs. pentagons), but I think using NaNs is fine (also in the presence of a mask like land).
One way to achieve convolution is a "map-reduce"-ish two-step procedure on a Dataset / DataArray:

1. "map" step:
   a. get the neighbor cell ids for each cell of the DGGS coordinate, e.g., with `healpy.pixelfunc.get_all_neighbours()` or `h3.grid_disk()` (unfortunately not yet vectorized)
   b. retrieve (data) variable data along the new neighbor dimension (vectorized indexing, potentially expensive)
2. "reduce" step: reduce along the new neighbor dimension (`.mean()`, etc.)
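A minimal sketch of this two-step procedure using healpy (toy full-sphere data at an arbitrary `nside`; purely illustrative):

```python
import healpy as hp
import numpy as np
import xarray as xr

# toy data covering the full sphere at nside=16 (arbitrary resolution)
nside = 16
cell_ids = np.arange(hp.nside2npix(nside))
data = xr.DataArray(
    np.random.default_rng(0).random(cell_ids.size),
    coords={"cell_ids": ("cells", cell_ids)},
    dims="cells",
)

# step 1a: neighbor cell ids, shape (8, n_cells); -1 marks missing neighbors
neighbors = hp.pixelfunc.get_all_neighbours(nside, cell_ids)

# step 1b: vectorized indexing along a new "neighbor" dimension, masking
# the missing neighbors with NaN (like xarray reindexing would)
indexer = xr.DataArray(neighbors, dims=("neighbor", "cells"))
values = data.isel(cells=indexer.clip(min=0)).where(indexer >= 0)

# step 2: reduce over the neighbor dimension, ignoring NaN values
smoothed = values.mean("neighbor")
```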