zarr-developers / geozarr-spec

This document aims to provide a geospatial extension to the Zarr specification. Zarr specifies a protocol and format for storing Zarr arrays, while the present extension defines conventions and recommendations for storing multidimensional georeferenced grids of geospatial observations (including rasters).

Per-chunk metadata (e.g., bbox) #4

Closed benbovy closed 1 year ago

benbovy commented 1 year ago

@rabernat mentioned in https://twitter.com/rabernat/status/1617209410702696449 the idea of attaching a "GeoBox" (i.e., bbox + CRS + grid metadata) to a dataset, which is implemented in odc-geo and which is indeed useful for indexing.

Now I'm wondering if it would be possible to reconstruct such a GeoBox for each chunk of a Zarr array or dataset. This would require storing a bbox per chunk. I'm not very familiar with the Zarr specs, though. Is it possible/easy to store arbitrary metadata per chunk?

One potential use case would be a scalable (e.g., dask-friendly) implementation of spatial regridding / resampling algorithms that would work with non-trivial datasets (e.g., curvilinear grids).

There is an interesting, somewhat related discussion in the geoarrow-specs repository: https://github.com/geoarrow/geoarrow/issues/19. As far as I understand, geospatial vector datasets are currently partitioned using multiple parquet files (dask-geopandas parquet IO, dask-geopandas.read_parquet). For GeoZarr, however, I guess we don't want one Zarr dataset per spatial partition.

benbovy commented 1 year ago

It could also be defined at the shard level...

briannapagan commented 1 year ago

Maybe we can have @joshmoore's input here?

rabernat commented 1 year ago

If images have different spatial extents, I think it would make more sense to store them as distinct arrays, rather than as chunks of the same array.

rabernat commented 1 year ago

I want to close this as out of scope.

Zarr does not allow per-chunk metadata, and we are not making any Zarr extensions here, so we need to find a different solution to this use case. The obvious one to me is to just store images with different bboxes in separate arrays.
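A minimal sketch of what that could look like with zarr-python (assuming the v2 API; the group layout, attribute keys, and values below are purely illustrative, not an agreed GeoZarr convention):

```python
import numpy as np
import zarr

# Two images with different spatial extents, each stored as its own array in a
# single group; bbox and CRS live in ordinary array attributes.
root = zarr.open_group("images.zarr", mode="w")

items = {
    "item_001": [500000.0, 4100000.0, 510240.0, 4110240.0],
    "item_002": [520000.0, 4100000.0, 530240.0, 4110240.0],
}

for name, bbox in items.items():
    data = np.zeros((1024, 1024), dtype="uint16")          # placeholder pixels
    arr = root.create_dataset(name, data=data, chunks=(512, 512))
    arr.attrs["bbox"] = bbox          # [xmin, ymin, xmax, ymax]
    arr.attrs["crs"] = "EPSG:32633"   # per-array CRS; no per-chunk metadata needed
```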

benbovy commented 1 year ago

I think that at this stage this is still a good place (or better, a new issue) for discussing whether/how in general we can facilitate spatial indexing and/or partitioning of large datasets, even if this would require multiple Zarr arrays (groups?) or some kind of Zarr extension.

TomAugspurger commented 1 year ago

I might be missing something, but this should be possible today without the need for per-chunk metadata. As long as you have something like the geotransform, so that you know where the "origin" pixel is and the spacing between pixels, and the size of each chunk, you should be able to get the bbox of each chunk with a bit of math.

This should be exactly the same as how GDAL / COG handles reading a single block out of a larger COG, just using multiple files / chunks.

Perhaps it isn't safe to assume that every chunk of this dataset is on the same grid / projection. But in that case, I'd recommend storing them in separate arrays.
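A minimal sketch of that chunk-bbox arithmetic, assuming a north-up, non-rotated grid described by an affine geotransform (the function and values here are illustrative only):

```python
from affine import Affine

def chunk_bbox(transform, chunk_shape, chunk_index):
    """Bounding box of one chunk, given the array's geotransform."""
    ny, nx = chunk_shape                 # chunk size as (rows, cols)
    i, j = chunk_index                   # chunk index along (y, x)
    x0, y0 = transform * (j * nx, i * ny)                # one corner
    x1, y1 = transform * ((j + 1) * nx, (i + 1) * ny)    # opposite corner
    # Note: the last chunk along each axis may be partial, so a real
    # implementation should clip to the array shape.
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))

# 10 m pixels, origin at (500000, 4200000), 512 x 512 chunks
gt = Affine(10.0, 0.0, 500000.0, 0.0, -10.0, 4200000.0)
print(chunk_bbox(gt, (512, 512), (0, 0)))   # (500000.0, 4194880.0, 505120.0, 4200000.0)
```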

rabernat commented 1 year ago

As long as you have something like the geotransform, so that you know where the "origin" pixel is and the spacing between pixels, and the size of each chunk, you should be able to get the bbox of each chunk with a bit of math

This is interesting. 🤔 So the idea is that you would have an array stack with dimensions

image[item, y, x, band]

and then coordinate variables like

x_origin[item]
y_origin[item]

Then you could construct the geotransforms on the fly for the entire collection, create a geodataframe, etc.
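A rough sketch of that on-the-fly construction, assuming for the moment that all items share the same pixel size and x/y dimensions (every name and value below is illustrative):

```python
import geopandas as gpd
import numpy as np
from affine import Affine
from shapely.geometry import box

pixel_size = 10.0
ny, nx = 1024, 1024                                    # shared image shape

x_origin = np.array([500000.0, 520000.0, 540000.0])    # one value per `item`
y_origin = np.array([4200000.0, 4200000.0, 4200000.0])

# Geotransform per item, built on the fly from the origin coordinates
transforms = [
    Affine(pixel_size, 0.0, x0, 0.0, -pixel_size, y0)
    for x0, y0 in zip(x_origin, y_origin)
]

# Footprint per item, collected into a GeoDataFrame for spatial queries
footprints = [
    box(x0, y0 - ny * pixel_size, x0 + nx * pixel_size, y0)
    for x0, y0 in zip(x_origin, y_origin)
]
gdf = gpd.GeoDataFrame({"item": range(len(footprints))}, geometry=footprints, crs="EPSG:32633")
print(gdf)
```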

For the image collections we are talking about, is it safe to assume that the images all have the same x and y dimensions? Or could they possibly be different sizes?

joshmoore commented 1 year ago

briannapagan commented 2 days ago: Maybe we can have @joshmoore's input here?

Sorry for the slow response. I don't have any definitive responses, but ...

rabernat commented 18 hours ago: I want to close this as out of scope. Zarr does not allow per-chunk metadata, and we are not making any Zarr extensions here, so we need to find a different solution to this use case.

Big :+1: for this strategy on this repo, with the caveat that the individual convention efforts (GeoZarr, NGFF, etc.) will likely identify things that need to make it into zarr-specs.

As long as you have something like the geotransform, so that you know where the "origin" pixel

This reminds me somewhat of https://github.com/ome/ngff/pull/138 (which also triggered a discussion in NGFF space about the use of CF conventions...)