zarr-developers / geozarr-spec

This document aims to provide a geospatial extension to the Zarr specification. Zarr specifies a protocol and format used for storing Zarr arrays, while the present extension defines conventions and recommendations for storing multidimensional georeferenced grids of geospatial observations (including rasters).

Initial OGC spec - Table of content #47

Closed christophenoel closed 6 days ago

christophenoel commented 1 month ago

I have prepared a first draft of the table of contents for the OGC-based specification.

Our primary goals are to ensure interoperability with most client and mapping tools, while also maximising compatibility with various source formats.

I believe we should focus on:

  1. Initiating a flexible model that facilitates conversion from any data source (Annexes with mappings will demonstrate that).
  2. Identifying specific cases that cannot be mapped to GeoZarr and creating specification requirements to address these gaps (for example, if origin/offset encoding of coordinates is not supported, we can extend our model and describe the encoding).
  3. Defining optimal classes of requirements that will aid in the interpretation of data through standard mapping tools and clients. For example, the Dataset requirement class outlines how to store Geo2D data for easy display on a mapping tool, as well as the tiling and pyramiding classes.

I also propose implementing a Python script that generates a small dummy GeoZarr store representing all the requirements provided in the specification.
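As a rough illustration of what such a script might start from, here is a minimal sketch that writes a tiny Zarr v2 store by hand using only the standard library. It is not a GeoZarr conformance example, since the requirements are not yet defined in this thread. The function names (`write_array`, `write_dummy_store`), the variable name `temperature`, and the sample values are all hypothetical; `_ARRAY_DIMENSIONS` is the Xarray convention for naming dimensions in Zarr v2.

```python
import json
import struct
from pathlib import Path


def write_array(root, name, dims, shape, values, attrs=None):
    """Write one uncompressed, single-chunk float32 array in Zarr v2 layout."""
    array_dir = Path(root) / name
    array_dir.mkdir(parents=True, exist_ok=True)
    meta = {
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(shape),   # one chunk holds the whole array
        "dtype": "<f4",          # little-endian float32
        "compressor": None,      # chunk bytes are stored raw
        "filters": None,
        "fill_value": None,
        "order": "C",
    }
    (array_dir / ".zarray").write_text(json.dumps(meta))
    # Dimension names follow the Xarray convention; extra attrs are illustrative.
    zattrs = {"_ARRAY_DIMENSIONS": list(dims), **(attrs or {})}
    (array_dir / ".zattrs").write_text(json.dumps(zattrs))
    # The only chunk sits at index 0 along every dimension: "0" or "0.0".
    chunk_key = ".".join("0" for _ in shape)
    (array_dir / chunk_key).write_bytes(struct.pack(f"<{len(values)}f", *values))


def write_dummy_store(root):
    """Write a group with labelled x/y coordinates and one 2x3 data variable."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    (root / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    write_array(root, "x", ["x"], (3,), [0.5, 1.5, 2.5])
    write_array(root, "y", ["y"], (2,), [1.5, 0.5])
    write_array(root, "temperature", ["y", "x"], (2, 3),
                [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], {"units": "K"})
    return root
```

Calling `write_dummy_store("dummy.zarr")` creates the directory in the current working directory. A real script would more likely use the `zarr` library and add whatever metadata the specification ends up requiring.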

christophenoel commented 1 month ago

Updated - refined and motivated the concept of Gridded Data (similar to the concept of a dataset in NetCDF/Xarray). From my point of view, the primary asset of GeoZarr, NetCDF, GeoTIFF and other cloud-native formats is Gridded Data, for which we should provide requirements that facilitate discovery and display in map tools.

GeoZarr places special emphasis on Gridded Datasets, which might be discovered, interpreted and displayed on a map. A Gridded Dataset is spatial data structured as a matrix of cells or pixels organized in a regular grid, where each cell holds a value representing a specific geographic area. Gridded Datasets include 2D rasters, raster time series and geo-datacubes (with dimensions such as time, light spectrum, altitude, etc.).

All or most source formats might be encoded in Zarr, but GeoZarr must focus on standardising how to encode datasets from the GDAL and CF ecosystems (the two are compatible most of the time, but sometimes a compromise is needed). In particular, some GDAL assets (e.g. OriginOffset) do not yet have a clear mapping to Zarr/GeoZarr.

mdsumner commented 1 month ago

I don't like how this talks about "coordinates". Please talk about georeferencing, and whether (and how) an element has an extent or is just a point. Materialized coordinates are for blob vector geometry or pure columns; meshes need a more nuanced abstraction (a raster is a special-case mesh). A raster, a regular grid in 2D, is defined abstractly by six numbers. Rectilinear or curvilinear grids need more numbers (ncol + nrow, or ncol * nrow, or their edge+1 equivalents). It's very very normal to have a regular grid with a compact description; I feel like the netcdf/xarray community has just advanced too far with the lowest common denominator of labelling arrays.
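To make the "six numbers" point concrete, here is a minimal sketch of an affine georeferencing of a regular 2D grid. It assumes the GDAL geotransform ordering of the six coefficients. The function names and the sample values are hypothetical and chosen only for illustration.

```python
def pixel_to_world(gt, col, row):
    """Map a (col, row) grid position to (x, y) with a six-number affine.

    gt uses the GDAL geotransform ordering:
    (x_origin, x_per_col, x_per_row, y_origin, y_per_col, y_per_row).
    """
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y


def cell_centres(gt, ncol, nrow):
    """Materialize 1D centre coordinates for an unrotated grid.

    Only valid when the rotation terms gt[2] and gt[4] are zero, so that
    x depends on the column alone and y on the row alone.
    """
    if gt[2] != 0 or gt[4] != 0:
        raise ValueError("rotated grids cannot be described by 1D x/y arrays")
    xs = [pixel_to_world(gt, c + 0.5, 0.5)[0] for c in range(ncol)]
    ys = [pixel_to_world(gt, 0.5, r + 0.5)[1] for r in range(nrow)]
    return xs, ys


# Hypothetical 4 x 3 grid of 10-unit cells with its top-left corner at (100, 500).
gt = (100.0, 10.0, 0.0, 500.0, 0.0, -10.0)
xs, ys = cell_centres(gt, ncol=4, nrow=3)
print(xs)  # [105.0, 115.0, 125.0, 135.0]
print(ys)  # [495.0, 485.0, 475.0]
```

The six coefficients stay the same however large the grid is, whereas the materialized form stores ncol + nrow values for this unrotated case.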

I'm also not comfy in this (mainly python) space, but I'm reading and listening and trying to get across it. (It used to be a MATLAB and Fortran heavy space). I'm commenting only with good intentions, I see how much work and angst is going into this. I hope I can help.

christophenoel commented 1 month ago

> I don't like how this talks about "coordinates".

Thank you for the detailed feedback.

Here's my perspective on the use of the term "coordinates" and the broader context of georeferencing in our specifications:

GeoZarr, like NetCDF/CF, is not limited to 2D: it supports n-dimensional arrays (tensors, datacubes) with multiple dimensions. In this context, "coordinates" refers not only to georeferencing but, more generally, to the mechanism that maps a position within these arrays to a value along each dimension.

While the term "coordinates" might seem simplistic, it is widely understood across various domains. One could also prefer "coverage domain", as per ISO/OGC terminology.

The current approach of labelling arrays serves as a foundation, with the advantage that the libraries already support it. It provides a simple and effective way to manage and access multidimensional data along any dimension (not only lat/lon). However, our intention is to extend it by describing coordinates with origin/offset, vectors, and other proposed encodings.
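As a sketch of how an origin/offset description relates to explicit labels along a single dimension, consider the following. The class name `OriginOffset` and its fields are hypothetical and are not taken from any draft of the specification. The point is only that the compact form can be expanded into the labelled form that libraries already understand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OriginOffset:
    """Hypothetical compact description of one regularly spaced dimension."""
    origin: float  # label at index 0
    offset: float  # step between consecutive labels
    size: int      # number of positions along the dimension

    def label(self, index):
        """Return the label at one index without materializing the array."""
        if not 0 <= index < self.size:
            raise IndexError(index)
        return self.origin + index * self.offset

    def labels(self):
        """Expand to the explicit labelled-array form."""
        return [self.label(i) for i in range(self.size)]


# Three numbers stand in for 1440 stored labels on a 0.25-degree axis.
lon = OriginOffset(origin=-180.0, offset=0.25, size=1440)
explicit = lon.labels()
print(len(explicit), explicit[0], explicit[-1])  # 1440 -180.0 179.75
```

The same idea applies to any dimension, such as time or altitude, provided its labels are evenly spaced.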

christophenoel commented 1 month ago

NOTE: the focus of this PR is the table of contents. The definitions (mostly from the CF/NetCDF/Xarray ecosystem) are provided as examples to illustrate the intention of the sections.

mdsumner commented 1 month ago

> The current approach of labelling arrays serves as a foundation, with the advantage that the libraries already support it. It provides a simple and effective way to manage and access multidimensional data along any dimension (not only lat/lon). However, our intention is to extend it by describing coordinates with origin/offset, vectors, and other proposed encodings.

Awesome, thank you. And apologies for my "blurt" up there, it's well-intentioned but I am sure must seem wildly out of place. I'm not sure where to engage yet in this Python-heavy landscape but I'm learning.

I understand the n-dimensional-ness. You can mix compact representations with fully-materialized ones in an array, and an affine transform can be any dimensional. One of the reasons I'm agitating is because, as well as the degenerate rectilinear norm in netcdf (where the x and y dims could be described by offset and scale, or range and shape), you also often have "cryptic" situations where a perfectly regular map projection was devolved to longitude latitude arrays and the original situation (six numbers, and a string) was not recorded - so it looks curvilinear but really isn't, or doesn't need to be (I'm working on communicating this within xarray as specific identifiable cases ... FWIW).
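The degenerate rectilinear case can be detected mechanically. Below is a minimal sketch that checks whether a materialized 1D coordinate array is evenly spaced and, if so, recovers its offset and scale. The function name and the tolerance are assumptions. It does not handle the "cryptic" case of a projected grid stored as 2D longitude/latitude arrays, which would also require recovering the original CRS.

```python
import math


def as_offset_scale(values, rel_tol=1e-9):
    """Return (offset, scale) if values are evenly spaced, otherwise None."""
    n = len(values)
    if n < 2:
        return None
    offset = values[0]
    scale = (values[-1] - values[0]) / (n - 1)
    if scale == 0:
        return None
    for i, v in enumerate(values):
        expected = offset + i * scale
        # The absolute tolerance is scaled by the step so labels near zero still compare sensibly.
        if not math.isclose(v, expected, rel_tol=rel_tol, abs_tol=abs(scale) * rel_tol):
            return None
    return offset, scale


print(as_offset_scale([0.0, 0.5, 1.0, 1.5]))  # (0.0, 0.5)
print(as_offset_scale([0.0, 0.5, 1.5]))       # None
```

When the check succeeds, the array can be replaced by two numbers plus its length without losing information, up to the chosen tolerance.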

Also, again this is out of place but I'm finding my way and really appreciate you responding in detail - there's a floating assumption here and there that "x and y" are "longitude and latitude", but sometimes they are paired coordinates in a map projection, so I wonder if the text here is really meant to be so coordinate-system-specific? We have the same problem with "easting and northing": they often don't actually point in those canonical directions, so "x" and "y" is really more normal and safe, I think, no matter what the CRS is.

christophenoel commented 1 month ago

@mdsumner : Thank you for your feedback. Note that I'm not personally an expert in GIS: I come from neither the NetCDF/Python ecosystem nor GDAL. My background is in geospatial data platforms and catalogues, and I have a significant interest in data access standards (OGC API Coverage, Tiling).

christophenoel commented 1 month ago

Note: add "geographic control point" to better explain "labelled array" coordinates

mdsumner commented 6 days ago

> @mdsumner : Thank you for your feedback. Note that I'm not personally an expert in GIS: I come from neither the NetCDF/Python ecosystem nor GDAL. My background is in geospatial data platforms and catalogues, and I have a significant interest in data access standards (OGC API Coverage, Tiling).

Cool, I hope this software gets the ideas right. It's bigger than standards and requires people with deep experience to really cut through and not just in one language. I hope Python can reach out wide enough but I don't see it happening yet. Apologies if this is not an appropriate forum to express concerns but I don't know what is yet. 🙏