zarr-developers / geozarr-spec

This document aims to provide a geospatial extension to the Zarr specification. Zarr specifies a protocol and format used for storing Zarr arrays, while the present extension defines conventions and recommendations for storing multidimensional georeferenced grids of geospatial observations (including rasters).

Spatial Coordinate Variable Support #7

Closed dblodgett-usgs closed 1 year ago

dblodgett-usgs commented 1 year ago

We have use cases for coordinate variables encoded as:

  1. "raster": origin/offset
  2. "COARDS": vector
  3. "2D": array

Any other spatial coordinate variable types that need to be supported?

All three will need to be supported in geozarr. The CF style for COARDS and 2D seems to be a clear initial candidate. Is there a Zarr encoding for raster coordinates to consider?

(edit -- the original post had "coords" instead of "COARDS" see https://cfconventions.org/cf-conventions/cf-conventions.html#coards-relationship for more on COARDS and CF)

rouault commented 1 year ago

Some thoughts / questions:

  1. "raster": origin/offset. If rotational terms are desired, then a 2D affine geotransformation matrix with 6 terms is a potential solution (as in the GDAL data model). That said, such a rotation is an affine transformation in the CRS, and is different from a rotated CRS, such as a rotated longitude/latitude CRS, which is a 3D Euler rotation involving trigonometry. Leaving aside rotational terms, and just taking into consideration origin/offset, the GDAL netCDF driver looks at the values taken by the X/lon and Y/lat arrays to see if they are regularly spaced. This tends to be a bit error-prone, because sometimes the variables are encoded in single-precision float and, due to the lack of precision, the spacing tends to be non-constant, so you need some tolerance margin. Explicit metadata giving origin_x, origin_y, step_x, step_y would be clearer. One usual issue with the raster model is what origin_x, origin_y represent: center of pixel or top-left corner of pixel? The raster model also tends to favor the image data model where y=0 is the top of the screen, and thus you use a negative step_y to encode georeferencing. But this is more a practice than a constraint of the model. It is perfectly possible to have a positive step_y if line 0 is meant to be the southernmost one.

  2. "coords": vector. Could you give an example of what you mean by that?

  3. "2D": array. I assume you're thinking of a geospatial variable referenced by (j, i) dimensions, with the X/longitude and Y/latitude arrays themselves referenced by (j, i)? This is the "geolocation array" concept in GDAL: https://gdal.org/development/rfc/rfc4_geolocate.html
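To make item 1 concrete, here is a minimal Python sketch of the two ideas it describes; it is an illustration, not GDAL's actual code. `apply_geotransform` uses the six terms in the order GDAL documents for its geotransform, where pixel (0, 0) maps to the top-left corner of the top-left pixel. `to_float32`, `infer_regular_axis` and the `rel_tol` default are hypothetical names and values chosen only for this example.

```python
import struct


def to_float32(value):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def apply_geotransform(gt, col, row):
    """Map a (col, row) pixel position to CRS coordinates.

    gt = (origin_x, step_x, rot_x, origin_y, rot_y, step_y),
    following the GDAL geotransform term order.
    """
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y


def infer_regular_axis(values, rel_tol=1e-3):
    """Return (origin, step) if the values are evenly spaced within a
    relative tolerance, else None. rel_tol is an assumed default."""
    if len(values) < 2:
        return None
    step = (values[-1] - values[0]) / (len(values) - 1)
    if step == 0:
        return None
    for a, b in zip(values, values[1:]):
        if abs((b - a) - step) > rel_tol * abs(step):
            return None
    return values[0], step


# North-up raster: no rotation, negative step_y (row 0 is the top row).
gt = (100.0, 0.1, 0.0, 50.0, 0.0, -0.1)
corner = apply_geotransform(gt, 0, 0)      # top-left corner of pixel (0, 0)
center = apply_geotransform(gt, 0.5, 0.5)  # center of the same pixel

# Longitudes stored as float32: successive differences are not all equal,
# so an exact comparison fails while the tolerant check still succeeds.
lon = [to_float32(100 + 0.1 * i) for i in range(5)]
exact_steps = {b - a for a, b in zip(lon, lon[1:])}
regular = infer_regular_axis(lon)
```

The half-pixel offset between `corner` and `center` is exactly the ambiguity raised above: explicit origin/step metadata only helps if it also states which of the two the origin refers to.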

Any other spatial coordinate variable types that need to be supported?

I don't know if that needs to be supported, but I've seen something exotic lately, where the geospatial variable was indexed by a single dimension "node", and the "lat" and "lon" arrays were indexed by "node" itself. So this is scattered point data / ungridded. Cf indexed. As far as I read it, the current draft of GeoZarr excludes such a scenario, since it mandates each dimension to be indexed by a 1D variable of the same name (https://github.com/zarr-developers/geozarr-spec/blob/main/geozarr-spec.md#geozarr-coordinates). It would also exclude the "2D": array scenario.
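A small Python sketch of that reading of the draft rule, assuming a hypothetical in-memory model where each variable name maps to its tuple of dimension names (this is not an API of any Zarr library, and the function name is made up for illustration):

```python
def dims_without_1d_coordinate(variables):
    """Return the dimension names that have no 1D variable of the same name.

    variables: dict mapping variable name -> tuple of dimension names
    (a hypothetical model used only for this illustration).
    """
    missing = set()
    for dims in variables.values():
        for dim in dims:
            # The rule as read above: dimension "d" needs a variable "d"
            # whose only dimension is "d".
            if variables.get(dim) != (dim,):
                missing.add(dim)
    return sorted(missing)


# COARDS-style: 1D lat/lon vectors named after their dimensions.
coards = {"temp": ("lat", "lon"), "lat": ("lat",), "lon": ("lon",)}

# 2D geolocation arrays: lat/lon are indexed by (j, i).
curvilinear = {"temp": ("j", "i"), "lat": ("j", "i"), "lon": ("j", "i")}

# Scattered points: everything is indexed by "node".
scattered = {"temp": ("node",), "lat": ("node",), "lon": ("node",)}

for name, layout in [("coards", coards),
                     ("curvilinear", curvilinear),
                     ("scattered", scattered)]:
    print(name, dims_without_1d_coordinate(layout))
```

Under this reading only the COARDS-style layout passes; the other two leave "i"/"j" or "node" without a matching 1D coordinate variable.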

I would say that if GeoZarr wants to support many different use cases, close to what all of netCDF CF allows (and netCDF CF allows pretty much anything), then it might be best that GeoZarr == netCDF CF (or a subset of it) translated to JSON without any semantic change. Reinventing something somewhat similar to, but different from, netCDF CF would just be a waste of time IMHO.

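This is an important decision that must be taken early in the process:

  • is GeoZarr meant to be simple for consumers, especially generic-purpose consumers that don't know anything about the specificities of the dataset, and thus restrict the possibilities for data producers, and possibly oblige them to do processing to fit their data into what GeoZarr allows,
  • or does it want to be friendly with data producers and thus harder for (generic-purpose) consumers?

Based on my recent experience with a totally unrelated standard, you cannot make both parties happy at the same time.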

briannapagan commented 1 year ago

Not adding much to this thread, just wanted to point to https://docs.ogc.org/per/21-032.html#toc23, which I see @rouault contributed to and which I found super interesting.

christophenoel commented 1 year ago

From my understanding, tools supporting Zarr are currently all based on 2D array coordinates. Note it was well supported by xarray for super big datacubes.

I assume this should be the baseline, even if I believe vector and offset-based coordinates might be great.

briannapagan commented 1 year ago

  • is GeoZarr meant to be simple for consumers, especially generic-purpose consumers that don't know anything about the specificities of the dataset, and thus restrict the possibilities for data producers, and possibly oblige them to do processing to fit their data into what GeoZarr allows,
  • or does it want to be friendly with data producers and thus harder for (generic-purpose) consumers?

Based on my recent experience with a totally unrelated standard, you cannot make both parties happy at the same time.

Great discussion points here. My answer to the question above is the first option. That is already how any other standard works: putting the onus on data producers.

christophenoel commented 1 year ago

I'm not sure what the actual assumption behind "simple for consumers" is (in particular for the present topic of variables).

More generally, I agree with the statement "oblige them to do processing to fit their data" and with being generic. But I would expect Geozarr to easily support a wide range of data (across domains):

Moreover, if the user doesn't know about the specificities of the dataset, we must provide recommendations for how various types of data should be encoded based on standard conventions: for example, formats supporting multispectral bands might be encoded in a single DataArray with a dimension for band (subsidiary questions: how to encode such data if the bands are not all provided at the same resolution? how to discover the various parts from a parent Zarr dataset?).

dblodgett-usgs commented 1 year ago

There is a real trade-off here, but it's maybe not quite as simple as the dichotomy that @rouault laid out.

I agree with @christophenoel that "I would expect Geozarr to easily support a wide range of data (across domains)".

The extension to that statement should be that it [Geozarr] should do that with a minimum set of design patterns chosen from those that are readily available in software with both read and write functionality.

Apologies for losing the thread @rouault -- I see you had a question about my original issue.

"coords": vector. Could you give an example of what you mean by that?

That should have been COARDS, which is what the CF convention was largely based on.

I had not included discrete geometry in my original list... that could certainly be considered in scope so the potential list would be:

  1. "raster": origin/offset
  2. "COARDS": vector
  3. "2D - curvilinear": array
  4. "discrete geometry": indexed array
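As a rough illustration of how the four styles in this list differ, the Python sketch below shows how a consumer would look up the coordinates of one data value under each of them. All function and parameter names are hypothetical, and the raster case assumes the origin is the coordinate of cell (0, 0), which is one of the conventions that would have to be stated explicitly.

```python
def raster_coords(origin_x, origin_y, step_x, step_y, i, j):
    """1. raster: coordinates are computed from origin and offset."""
    return origin_x + i * step_x, origin_y + j * step_y


def coards_coords(x, y, i, j):
    """2. COARDS: one 1D coordinate vector per dimension."""
    return x[i], y[j]


def curvilinear_coords(lon2d, lat2d, i, j):
    """3. 2D curvilinear: lon/lat arrays share the data's (j, i) dims."""
    return lon2d[j][i], lat2d[j][i]


def node_coords(lon, lat, node):
    """4. discrete geometry: lon/lat are indexed by a single node index."""
    return lon[node], lat[node]


# The same 2 x 3 grid expressed in the first three styles.
x = [0.0, 0.5, 1.0]
y = [10.0, 9.5]
lon2d = [x, x]                             # every row repeats the x vector
lat2d = [[10.0] * 3, [9.5] * 3]            # every column repeats the y vector

# The same six points flattened into a node-indexed layout (row-major).
node_lon = [lon2d[j][i] for j in range(2) for i in range(3)]
node_lat = [lat2d[j][i] for j in range(2) for i in range(3)]
```

For a regular grid all four return the same position; the styles differ in how much is stored (four numbers, two vectors, two full arrays, or two point lists) and in how much a generic reader has to know to interpret it.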

I tend to agree with @rouault that if we were to try to encompass all of that scope, which is supported by CF, we would probably want to adopt more or less all of CF. I could see an argument for a CF clone that used WKT/PROJJSON and had support for origin/offset coordinates.

dblodgett-usgs commented 1 year ago

I think I will call this issue overcome by events and close it in favor of #17

We may want to support auxiliary coordinate variables as is done in CF, but that should be brought up in a separate, more specific issue.