zarr-developers / geozarr-spec

This document aims to provide a geospatial extension to the Zarr specification. Zarr specifies a protocol and format used for storing Zarr arrays, while the present extension defines conventions and recommendations for storing multidimensional georeferenced grids of geospatial observations (including rasters).

GeoZarr to Support 2D RGB and Multi-Dimensional EO Data #51

Open christophenoel opened 1 week ago

christophenoel commented 1 week ago

Context

Consider an Earth Observation (EO) scene raster made available as a COG or via a Map Tiling Service:

Concretely, OpenLayers can easily display an RGB thumbnail/visualisation on a map.

Considerations for GeoZarr

  1. RGB Display in GeoZarr:

    • GeoZarr should enable encoding 2D rasters in an RGB format in a standardized way. The idea is that coordinates (latitude, longitude, and bands like RGB) are explicitly named to facilitate their interpretation. This way, a client wouldn’t need to deeply parse all metadata to understand these dimensions. Additionally, it could be useful to indicate explicitly that this data is of type 2D RGB in a main attribute so that users can easily identify it.
  2. Multi-Dimensional Data Support (3D/4D+):

    • GeoZarr already allows encoding 3D and 4D+ rasters, with dimensions like time, wavelength, and altitude. There should be a syntax convention (usable within GeoZarr or in any metadata format such as STAC) to express a GeoZarr subset without having to parse all metadata (e.g. [time=1],[altitude=2]).
    • For 3D time series, there should be a convention for the time dimension, and probably a type or requirement class advertised for such GeoZarr.
    • For other 3D+ data, there should be a convention for a client application to express a visualisation/preview by identifying the subset for R, G, B.
  3. Variable Identification for Multi-Layer Data:

    • Similar to formats like NetCDF, GeoZarr may contain multiple data variables at the same level. A standard method for identifying these variables within GeoZarr is needed, especially in STAC items, to specify how to retrieve and display the data effectively.
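
A minimal sketch of how a subset expression such as [time=1],[altitude=2] from point 2 could be turned into an array index without reading more than the dimension names. No such syntax is defined by GeoZarr or STAC today; the grammar (integer positions only) and the function names `parse_subset` and `to_indexer` are assumptions made for illustration.

```python
import re

# Hypothetical grammar: one "[name=integer]" selector per comma-separated part.
_SELECTOR = re.compile(r"\[\s*([A-Za-z_]\w*)\s*=\s*(-?\d+)\s*\]")


def parse_subset(expr):
    """Parse '[time=1],[altitude=2]' into {'time': 1, 'altitude': 2}."""
    selection = {}
    for part in expr.split(","):
        part = part.strip()
        if not part:
            continue
        match = _SELECTOR.fullmatch(part)
        if match is None:
            raise ValueError(f"invalid selector: {part!r}")
        selection[match.group(1)] = int(match.group(2))
    return selection


def to_indexer(dims, selection):
    """Build a positional index tuple; unselected dimensions are kept whole."""
    unknown = set(selection) - set(dims)
    if unknown:
        raise KeyError(f"unknown dimensions: {sorted(unknown)}")
    return tuple(selection.get(name, slice(None)) for name in dims)


# Dimension names as they might be declared on a 4D array (assumed order).
dims = ("time", "altitude", "y", "x")
subset = parse_subset("[time=1],[altitude=2]")
indexer = to_indexer(dims, subset)
print(subset)
print(indexer)
```

The resulting tuple can be passed directly to array indexing (for example `array[indexer]` on a NumPy-style array), which leaves a 2D y/x slice for display.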

These are only very basic initial thoughts, but I think that the OGC GeoDataCube SWG may be working on similar challenges.

rbavery commented 1 day ago

Glad an equivalence between GeoZarr and STAC Catalogs is being proposed. It's common for STAC Collections to contain STAC Items with different CRSs. Do those working on the spec think this is within scope? Right now I only see examples online for resampling every raster to a common CRS before saving to Zarr, like https://earthmover.io/blog/serverless-datacube-pipeline/.

This section of the spec makes me think this may be within scope?

If multiple Array Variables share heterogeneous dimensions or coordinates, a primary homogeneous set of variables MUST be located at root level, and the other sets declared in children datasets. https://github.com/zarr-developers/geozarr-spec/blob/main/geozarr-spec.md#geozarr-dataset

I would love to be able to turn STAC Collections or Catalogs into GeoZarrs with two geospatial index levels: one common CRS for all rasters that indexes each raster by its extent in the common CRS, and another index level that is particular to each raster group sharing a common CRS.

My use case is inference on rasters and georeferencing the results. I might want to load a GeoZarr, filter rasters that intersect an area of interest using the top-level CRS, run model inference on individual rasters, and then use each raster's individual CRS for georeferencing the model results. I'd like to avoid reprojecting all the raster pixels to a global projection throughout this process, since it is an expensive operation and compromises on equal area.
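
A rough sketch of the two-level lookup described above, assuming each raster carries a bounding box in the common CRS alongside the identifier of its native CRS. The `RasterEntry` structure, the function names, and the sample scenes and coordinates are all hypothetical; nothing here is defined by GeoZarr or STAC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RasterEntry:
    name: str
    native_crs: str     # CRS of the stored pixels, e.g. a UTM zone
    bbox_common: tuple  # (min_x, min_y, max_x, max_y) in the common CRS


def intersects(a, b):
    """True when two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def select(entries, aoi):
    """Level 1: filter by extent in the common CRS, without touching pixels."""
    return [e for e in entries if intersects(e.bbox_common, aoi)]


def group_by_native_crs(entries):
    """Level 2: group the hits by the CRS their pixels are stored in."""
    groups = {}
    for entry in entries:
        groups.setdefault(entry.native_crs, []).append(entry.name)
    return groups


# Illustrative entries only; extents are lon/lat boxes in an assumed EPSG:4326 index.
entries = [
    RasterEntry("scene_a", "EPSG:32631", (2.0, 48.0, 3.0, 49.0)),
    RasterEntry("scene_b", "EPSG:32632", (8.0, 47.0, 9.0, 48.0)),
    RasterEntry("scene_c", "EPSG:32631", (2.5, 48.5, 3.5, 49.5)),
]
aoi = (2.2, 48.2, 2.8, 48.8)

hits = select(entries, aoi)
groups = group_by_native_crs(hits)
print(groups)
```

Inference would then run per group in the native CRS, so only the footprints, not the pixels, are ever expressed in the common CRS.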

I hope there is a way to handle the above with GeoZarr, while also getting the benefits of Zarr v3 sharding for performant data loading and cloud storage.

christophenoel commented 22 hours ago

"It's common for STAC Collections to contain STAC Items with different CRSs." I expect the same with GeoZarr when simply converting a "product" (scene) to that format. Resampling would be applied to obtain aggregation (datacube), analysis-ready data, or to generate Level 3+ data.

Your idea seems very interesting but quite challenging. As a first step, I would simply aim to expose and access some typical product types in STAC, taking into account bands and extra dimensions.