zarr-developers / geozarr-spec

This document aims to provide a geospatial extension to the Zarr specification. Zarr specifies a protocol and format used for storing Zarr arrays, while the present extension defines conventions and recommendations for storing multidimensional georeferenced grids of geospatial observations (including rasters).

Canonical Dimensions #8

Closed dblodgett-usgs closed 1 year ago

dblodgett-usgs commented 1 year ago

The "COARDS" and, by extension, CF conventions typically support XYZT axes. But we also have key use cases for (at least) three other canonical dimensions.

  1. band, e.g. spectral or color.
  2. ensemble member, e.g. monte-carlo simulation
  3. scenario member, e.g. good, bad, ugly

How do we want to recognize (and support?) the existence of relatively common non spatio-temporal axes?

Do we want to consider any of these non spatio-temporal axes to be "canonical" in the sense that they are actually built into the convention and supported in general?

rouault commented 1 year ago

If WKT (or PROJJSON) is considered to express the CRS, then the scope of which dimensions WKT/PROJJSON covers should be decided. Traditionally, one thinks of a WKT CRS as being only about the "horizontal" coordinates: latitude/longitude, easting/northing. But WKT CRS also allows for compound CRSs that are horizontal+vertical, horizontal+temporal, or horizontal+parametric+temporal (cf. https://docs.ogc.org/is/18-010r7/18-010r7.html#125 and https://docs.opengeospatial.org/as/18-005r4/18-005r4.html#34 ). That said, such compound CRSs involving temporal and parametric axes are quite exotic, and could be poorly supported by coordinate transformation libraries (although from a quick test transforming from a compound horizontal+parametric or horizontal+temporal CRS to another horizontal CRS, I see the PROJ library behaves OK by just taking into account the horizontal component of the compound CRS).

rabernat commented 1 year ago

Do we want to consider any of these non spatio-temporal axes to be "canonical" in the sense that they are actually built into the convention and supported in general?

To answer this, it would be helpful if you could elaborate what you mean by "supported". There are two flavors of possible interpretation:

dblodgett-usgs commented 1 year ago

I think it is a given that we will have a scheme that accommodates dimensions not governed by the spec in the same way that CF and GDAL do. In the case of CF, you can add dimensions after XYZT or you can overload variable names with information that encodes an additional dimension into the data (both are done in the wild). In the case of geotiff, typical use of "band" is supported as a tiff image is inherently multiband in core use cases. But people overload "band" in practice as an additional dimension (like time or variable) or encode additional dimensions in file names.

Let's just zoom in on one important instance to clarify what I'm asking. What about "band"? I think a lot of people consider that core to interpretation of "geo" data.

Some points of reference to think about how use-case creep has been dealt with in other conventions:

In spatial geometry coordinates, we usually have XYZM, where M is a "measure" that can carry any number of additional quantities at the geometry node level.

In data cubes that follow COARDS/CF, we have XYZTABC where ABC would be additional dimensions that are not governed by convention.

In geotiff we have XYB where B is some extra dimension like time or band that is not strictly governed by convention but accommodated in practice as bands of a tiff image.

The real question is if we are going to try to recognize additional dimensions beyond XYZ at all?

CF, for example, doesn't support "band" as a dimension but geotiff (tiff really?) has various "band" accommodations. CF does support "time" as a dimension but geotiff does not.

So to summarize and clarify the question:

Let's assume XYZ are in scope for "geo".

  1. Is time in scope?
  2. Is band in scope?
  3. Do we want to give informative (non-normative) guidance that might lead to enhanced interoperability for ensembles and scenarios?

christophenoel commented 1 year ago

CF, for example, doesn't support "band" as a dimension but geotiff (tiff really?) has various "band" accommodations. CF does support "time" as a dimension but geotiff does not.

CF defines that a variable may have any number of dimensions, and that dimensions other than those of space and time may be included. CF also defines a standard name for bands (e.g., sensor_band_identifier), and thus allows defining dimensions of any type.

christophenoel commented 1 year ago

However, I also don't really follow the discussion comparing GDAL (a tool) with CF (data conventions): when you convert source data (e.g., in GeoTiff, NetCDF, JP2K, etc.) to Zarr, the resulting metadata encoding in Zarr does not depend on the conversion tool (GDAL, Xarray, etc.) but on the source data format:

dblodgett-usgs commented 1 year ago

Fair point on the "standard_name" accommodation for bands -- from a software implementers' point of view, that is akin to an overloaded variable name and more of a hack than what I would call "support". The point stands that there is no canonical dimension or in-built structural support for dimensions other than XYZT. While you can define additional dimensions, a software implementer has to divine the semantic meaning of the additional dimensions outside the scope of the convention.

I shouldn't have used "GDAL" in that example... Really, it should be "2D raster as typically supported by GDAL".

With regards to coordinate reference systems, that is not really core to the question posed in this issue, but it is a useful factor to the decision. CF does not lump time into its "grid_mapping" paradigm, favoring attributes carried on a time variable. As @rouault points out, in WKT, time and parametric coordinates are only used in exotic use cases.

This discussion really points to the answer being that:

Like CF and geotiff, geozarr will not have direct semantic support for dimensions beyond the traditional spatio-temporal XYZT but will have support for additional dimensions as generic additional dimensions.

Does anyone disagree with that and think the specification should, for instance, define a "band" or "ensemble" dimension and how to specify which dimension of a data variable is one of those?

christophenoel commented 1 year ago

While you can define additional dimensions, a software implementer has to divine the semantic meaning of the additional dimensions outside the scope of the convention.

I'm not entirely clear on your perspective (from my understanding, CF specifically supports XYZT, but standard names enable the unambiguous identification of the semantics of any other unit).

The significant advantage of using standard names for all GeoZarr variables is that it accommodates the description of most possible units used in dimensions, including wavelength, latitude, latitude_grid, sensor_band_identifier (and all units of measure for observed data), allowing users or clients to comprehend and process the data. This aspect is crucial for GeoZarr since we already have COG for 2D raster, and GeoZarr introduces the capability to support complex n-D arrays and ARD based on additional dimensions.

dblodgett-usgs commented 1 year ago

My point there is that if additional dimensions are used to store band or ensemble, you don't know which dimension is which unless you use process of elimination. In the case that you had even more dimensions -- as in scenario AND ensemble -- process of elimination becomes impossible.

The standard_name is very powerful in that, as a data producer, you can jam pretty much anything into a file. But as a data consumer, if you want to have general support for the semantics implied by a standard_name, it is very very difficult to accommodate in all cases. It's fine with a human in the loop or on a case by case basis, but in general, it becomes very difficult to scale.
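A minimal sketch of the process-of-elimination problem described above. The function name and the set of "recognized" dimension names are hypothetical; nothing here is part of CF or GeoZarr.

```python
def infer_by_elimination(dims, recognized):
    """Return the single dimension left once recognized ones are removed.

    Returns None when zero or several dimensions are left over, because
    elimination alone cannot tell them apart.
    """
    leftover = [d for d in dims if d not in recognized]
    return leftover[0] if len(leftover) == 1 else None


# Dimension names a client might already understand (an assumption).
spatiotemporal = {"time", "lat", "lon"}

# One unknown dimension: elimination gives a candidate, though not its meaning.
one_extra = infer_by_elimination(["band", "time", "lat", "lon"], spatiotemporal)

# Two unknown dimensions: elimination cannot decide which is which.
two_extra = infer_by_elimination(
    ["scenario", "ensemble", "time", "lat", "lon"], spatiotemporal
)

print(one_extra)  # band
print(two_extra)  # None
```

Even in the single-leftover case the client only learns that a dimension is "not XYZT"; it still has no basis for treating it as a band rather than an ensemble member.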

christophenoel commented 1 year ago

@dblodgett-usgs : I think I see your point if you mean the client doesn't know the corresponding wavelength for a specific band identifier. I suppose my example is not the best one.

Let me revise my example with our number one use case in the HDSA project: for hyperspectral data, the wavelength can be indicated in a dimension such as [sensor_band_central_radiation_wavelength]. Any client supporting this kind of hyperspectral data will be able to discover which coordinate maps to the wavelength and to process the data across its wavelength and other dimensions.

(It might be a way to address bands universally: if, instead of the identifier, you always encode the corresponding wavelength, it would become very standard.)

christophenoel commented 1 year ago

My point there is that if additional dimensions are used to store band or ensemble, you don't know which dimension is which

If I understand your point differently, then the key difference between NetCDF (CF) and GeoZarr is that all CF arrays have dimension names while Zarr arrays do not.

GeoZarr follows xarray conventions which define a special GeoZarr array attribute: _ARRAY_DIMENSIONS. The value of this attribute is a list of dimension names (strings), for example ["time", "lon", "lat"]

Therefore, you can discover as many dimensions as you wish.

christophenoel commented 1 year ago

(And for each dimension, you have a corresponding variable which defines the semantics using the standard name type, so you can interpret all dimensions correctly.)
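As a rough illustration of the lookup described above, the sketch below works on plain dictionaries standing in for per-array attributes. The array names, the `temperature` variable and the helper function are made up for the example; only the `_ARRAY_DIMENSIONS` and `standard_name` attribute names come from the discussion.

```python
# Hypothetical attributes keyed by array name, standing in for what a client
# would read from each array's attribute metadata.
attrs_by_array = {
    "temperature": {
        "_ARRAY_DIMENSIONS": ["time", "lon", "lat"],
        "standard_name": "air_temperature",
    },
    "time": {"_ARRAY_DIMENSIONS": ["time"], "standard_name": "time"},
    "lon": {"_ARRAY_DIMENSIONS": ["lon"], "standard_name": "longitude"},
    "lat": {"_ARRAY_DIMENSIONS": ["lat"], "standard_name": "latitude"},
}


def dimension_semantics(attrs_by_array, data_array):
    """Map each dimension of data_array to the standard_name of the
    same-named coordinate array, or None if there is no such array."""
    dims = attrs_by_array[data_array]["_ARRAY_DIMENSIONS"]
    return {
        dim: attrs_by_array.get(dim, {}).get("standard_name") for dim in dims
    }


semantics = dimension_semantics(attrs_by_array, "temperature")
print(semantics)
# {'time': 'time', 'lon': 'longitude', 'lat': 'latitude'}
```

This only works when the coordinate array shares its name with the dimension; it says nothing about coordinates stored under a different name.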

dblodgett-usgs commented 1 year ago

@christophenoel -- I'm not really following your example.

Let me see if I can clarify. In the following example, suppose you wanted to interpret x as a time-varying multiband data variable:

Lat, lon, and time are XYT because of the units. Since this follows COARDS, where dimensions and coordinate variables share names, you at least know that band is a coordinate variable, but you don't know how to interpret it. How would we know that band is to be interpreted as bands of an image without a human user specifying that they want to use the band dimension in that way?

dimensions:
  lat = 180 ;
  lon = 360 ;
  band = 10;
  time = UNLIMITED ;
variables:
  double x(band,time,lat,lon);
  double time(time);
    time:units = "hours since 1999-01-01 00:00" ;
  double lon(lon) ;
    lon:units = "degrees_east";
  double lat(lat) ;
    lat:units = "degrees_north" ;
  double band(band);
    band:[attribute] = "something to distinguish that this is to be interpreted as a multiband dimension";
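A small sketch of the units-based identification described above, assuming the unit spellings that CF lists for latitude and longitude and the "<unit> since <reference>" form for time. The function and variable names are hypothetical.

```python
# Unit strings CF accepts for latitude and longitude coordinates.
LAT_UNITS = {"degrees_north", "degree_north", "degree_N",
             "degrees_N", "degreeN", "degreesN"}
LON_UNITS = {"degrees_east", "degree_east", "degree_E",
             "degrees_E", "degreeE", "degreesE"}


def axis_from_units(units):
    """Guess X, Y or T from a units string; return None when units say nothing."""
    if units in LAT_UNITS:
        return "Y"
    if units in LON_UNITS:
        return "X"
    if units and " since " in units:
        return "T"
    return None


# Units of the coordinate variables in the CDL example; band carries none.
coordinate_units = {
    "time": "hours since 1999-01-01 00:00",
    "lon": "degrees_east",
    "lat": "degrees_north",
    "band": None,
}

axes = {name: axis_from_units(units) for name, units in coordinate_units.items()}
print(axes)
# {'time': 'T', 'lon': 'X', 'lat': 'Y', 'band': None}
```

The `None` for band is the gap in question: units identify XYT, but some other attribute is needed before software can treat band as image bands.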

christophenoel commented 1 year ago

@dblodgett-usgs : Ok, good approach.

When I talk about standard names, I'm referring to the canonical units defined in the CF Standard Name Table, which standardizes the identifiers of most known units: https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html

We encoded a multispectral PRISMA data product (not bands, but wavelengths) as follows in GeoZarr. Using the standard_name attribute, the client can interpret the various dimensions without ambiguity.

Dimensions:      (latitude: 1220, longitude: 1244, wavelength: 234)
Coordinate variables:
  float64 longitude(longitude) ;
    longitude:units = "degrees_east" ;
    longitude:standard_name = "longitude" ;
  float64 latitude(latitude) ;
    latitude:units = "degrees_north" ;
    latitude:standard_name = "latitude" ;
  float64 wavelength(wavelength) ;
    wavelength:units = "m" ;
    wavelength:standard_name = "radiation_wavelength" ;
Data variables:
  float32 reflectance(wavelength, latitude, longitude) ;
    reflectance:units = "1" ;
    reflectance:standard_name = "surface_bidirectional_reflectance" ;

Same example for the timeseries of Sentinel-2 data (with bands in this case)

Dimensions:      (band: 12, time: 739, y: 5490, x: 5490)
Coordinates:
  datetime64 time(time) ;
  float64 x(x) ;
  float64 y(y) ;
  string band(band) ;
    band:standard_name = "sensor_band_identifier" ;
    (values: 'AOT' 'B02' 'B03' 'B04' ... 'B12' 'B8A' 'SCL' 'WVP')
Data variables:
  float32 data(time, band, y, x) ;
    data:units = "1" ;
    data:standard_name = "surface_bidirectional_reflectance" ;
Therefore, the client (of course adapted for hyperspectral or multispectral data) can read the standard_name attribute to interpret the semantics of the various dimensions.
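To make that concrete, here is a hedged sketch of what such a client might do once it has the coordinate attributes in hand. The data layout, helper names and wavelength values are all made up for illustration; only the `radiation_wavelength` standard name and the metre units come from the example above.

```python
def find_coordinate(coords, standard_name):
    """Return the name of the first coordinate with the given standard_name."""
    for name, coord in coords.items():
        if coord["attrs"].get("standard_name") == standard_name:
            return name
    return None


def indices_in_range(values, low, high):
    """Return the positions of values that fall within [low, high]."""
    return [i for i, v in enumerate(values) if low <= v <= high]


# Made-up coordinate metadata and values (wavelengths in metres).
coords = {
    "latitude": {"attrs": {"standard_name": "latitude"}, "values": [45.0, 45.1]},
    "wavelength": {
        "attrs": {"standard_name": "radiation_wavelength", "units": "m"},
        "values": [450e-9, 550e-9, 650e-9, 850e-9],
    },
}

# Locate the spectral coordinate by its semantics, not by its name.
spectral = find_coordinate(coords, "radiation_wavelength")

# Select the samples between 400 nm and 700 nm along that coordinate.
visible = indices_in_range(coords[spectral]["values"], 400e-9, 700e-9)

print(spectral)  # wavelength
print(visible)   # [0, 1, 2]
```

The same lookup with `sensor_band_identifier` would find the band coordinate, but a client would then need its own knowledge of what each identifier means.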

dblodgett-usgs commented 1 year ago

OK -- thanks for that clarification.

So GeoZARR should support canonical dimensions for Wavelength/Band via coordinate variable standard names, "radiation_wavelength" and "sensor_band_identifier"?

What happens in the case of an auxiliary coordinate variable where the name of the variable is different than the dimension name?

There are two additional accommodations in CF for XYZT, the axis attribute and the coordinates attribute. In the case below, how would a client know that the wavelength variable is a coordinate for the band dimension of the x variable? I think we would use the coordinates attribute as I've shown below? Note that since time, lon, and lat are not auxiliary coordinate variables (they are COARDS style coordinate variables) we do not add them to the coordinates attribute. Note though that this seems to go beyond the intent of CF... quoting from the section linked just above:

Any longitude, latitude, vertical or time coordinate which depends on more than one spatiotemporal dimension must be identified by the coordinates attribute of the data variable. The value of the coordinates attribute is a blank separated list of the names of auxiliary coordinate variables.

This is the source of my statements that:

The point stands that there is no canonical dimension or in-built structural support for dimensions other than XYZT.

I like the idea of a proposed extension for band, and perhaps some other non-spatiotemporal dimensions like ensemble member, where we would point to auxiliary coordinate variables in a coordinates attribute and introduce additional valid values for the axis attribute, e.g. XYZTB..., for cases where you have more than one coordinate variable in the file and you want to express which is the primary axis to be used. (Axis specification is here.)

dimensions:
  lat = 180 ;
  lon = 360 ;
  band = 10;
  time = UNLIMITED ;
variables:
  double x(band,time,lat,lon);
    x:coordinates = "wavelength" ;
  double time(time);
    time:units = "hours since 1999-01-01 00:00" ;
  double lon(lon) ;
    lon:units = "degrees_east";
  double lat(lat) ;
    lat:units = "degrees_north" ;
  double wavelength(band);
    wavelength:standard_name = "radiation_wavelength";
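One way a client could resolve the coordinate for a dimension under this proposal is sketched below, assuming it tries the COARDS-style same-name variable first and then falls back to the names listed in the data variable's coordinates attribute. The dictionary layout and function name are hypothetical, and applying coordinates to a non-spatiotemporal dimension is the extension being discussed, not established CF behaviour.

```python
# In-memory stand-in for the CDL example: dimensions and attributes per variable.
variables = {
    "x": {"dims": ("band", "time", "lat", "lon"),
          "attrs": {"coordinates": "wavelength"}},
    "time": {"dims": ("time",),
             "attrs": {"units": "hours since 1999-01-01 00:00"}},
    "lon": {"dims": ("lon",), "attrs": {"units": "degrees_east"}},
    "lat": {"dims": ("lat",), "attrs": {"units": "degrees_north"}},
    "wavelength": {"dims": ("band",),
                   "attrs": {"standard_name": "radiation_wavelength"}},
}


def coordinate_for_dimension(variables, data_var, dim):
    """Find the variable that supplies coordinate values along dim."""
    # COARDS-style coordinate variable: same name as the dimension it spans.
    if dim in variables and variables[dim]["dims"] == (dim,):
        return dim
    # Otherwise look through the blank-separated names in `coordinates`.
    listed = variables[data_var]["attrs"].get("coordinates", "").split()
    for name in listed:
        if name in variables and variables[name]["dims"] == (dim,):
            return name
    return None


band_coord = coordinate_for_dimension(variables, "x", "band")
time_coord = coordinate_for_dimension(variables, "x", "time")
print(band_coord)  # wavelength
print(time_coord)  # time
```

Without the coordinates attribute on x, the band lookup returns None, which is the ambiguity the attribute is meant to remove.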

christophenoel commented 1 year ago

So GeoZARR should support canonical dimensions for Wavelength/Band via coordinate variable standard names, "radiation_wavelength" and "sensor_band_identifier"?

GeoZarr should support ANY standard name for the dimension coordinates of a data variable.

The currently used convention (which doesn't care about CF) is:

christophenoel commented 1 year ago

Moreover, going into details, I think that a ZEP-4 recommendation should provide a set of recommended standard names for various popular product types (e.g. multispectral, hyperspectral, multi-tile cubes, etc.)

dblodgett-usgs commented 1 year ago

It would be very useful to pull out a list of standard_names that are used to indicate a variable is to be interpreted as a coordinate variable of a specific type.

What about my questions about the coordinates and axis attribute detail that I call out above --- I don't see a response to those questions.

I'm not quite following:

currently used convention

What currently used convention?

for each defined dimension name: corresponding variable definition indicating the standard name (= semantic)

Does the variable name have to be the same as the dimension name? What if you had multiple variables that could be interpreted as coordinate variables?

christophenoel commented 1 year ago

What currently used convention?

The conventions described by current GeoZarr draft (partially based on xarray conventions).

It would be very useful to pull out list of standard_names that are used to indicate a variable is to be interpreted as a coordinate variable of a specific type.

The GeoZarr/Xarray convention is that a variable declared as a dimension is a coordinate (see xarray: https://docs.xarray.dev/en/stable/user-guide/terminology.html ). I don't remember exactly, but maybe the class is present somewhere in the Zarr metadata.

Does the variable name have to be the same as the dimension name? What if you had multiple variables that could be interpreted as coordinate variables?

Yes, for each dimension name listed in the data, you have Geozarr Coordinates (see classes defined in https://github.com/zarr-developers/geozarr-spec/blob/main/geozarr-spec.md)

What about my questions about the coordinates and axis attribute detail that I call out above --- I don't see a response to those questions.

Sorry, I have probably not understood the actual question.

dblodgett-usgs commented 1 year ago

OK -- so, we need to start linking to lines in the specification. I've been missing what's already there because we haven't been talking about specific text. Apologies.

So, according to: https://github.com/zarr-developers/geozarr-spec/blob/main/geozarr-spec.md?plain=1#L66

The geozarr "spec" (convention) incorporates XYZTBW already. So Band and Wavelength are canonical in the sense that a geozarr compliant software implementation would know how to work with data according to those dimensions.

With regards to coordinates and axis -- these two attributes do not appear in the current geozarr "spec" (convention). coordinates is used to connect a data variable to the appropriate coordinate variables for it in the case that the coordinate variable does not share a name with the dimension it describes. The axis attribute is similar to coordinates but is carried by the coordinate variable itself and indicates what canonical axis (XYZT currently in CF) a coordinate variable is intended to describe.

christophenoel commented 1 year ago

The geozarr convention (and xarray) incorporates every possible standard name as a dimension, which is something I really like --> this doesn't restrict you to encoding rasters. In various missions, we have many arrays for which dimensions are not XYZT at all, and this allows a wide range of data to be encoded.

I would expect an OGC-equivalent of "requirements-class" to potentially restrict/declare the typical allowed standard names for a set of usual domains (as you suggest: wavelength, altitude, band for satellite products).

dblodgett-usgs commented 1 year ago

From a data consumer's point of view, no restriction on what can be a coordinate of data is daunting. I understand how it is nice for a data provider, but we need to be careful not to create something that is nearly unsupportable.

OK, so getting back to the point of this issue:

geozarr will support XYZTBW. Where B is band and W is wavelength.

geozarr will not support Ensemble or Scenario dimensions except as generic dimensions that must be interpreted manually or treated as semantically ambiguous in client software that does not recognize them.

Is that agreeable in terms of intended scope?
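To illustrate what that scope could mean for a client, here is a sketch that splits a variable's dimensions into recognized XYZTBW axes and generic leftovers based on standard_name. The lookup table is an assumption made for the example, not a list taken from the geozarr draft; the letters B and W simply follow the shorthand used in this thread.

```python
# Assumed mapping from CF standard names to canonical axis letters.
CANONICAL_AXES = {
    "longitude": "X",
    "projection_x_coordinate": "X",
    "latitude": "Y",
    "projection_y_coordinate": "Y",
    "altitude": "Z",
    "height": "Z",
    "depth": "Z",
    "time": "T",
    "sensor_band_identifier": "B",
    "radiation_wavelength": "W",
}


def split_dimensions(standard_names_by_dim):
    """Separate dimensions with a recognized axis from generic ones."""
    canonical, generic = {}, []
    for dim, standard_name in standard_names_by_dim.items():
        axis = CANONICAL_AXES.get(standard_name)
        if axis is None:
            generic.append(dim)
        else:
            canonical[dim] = axis
    return canonical, generic


# Hypothetical data variable with one dimension the client does not recognize.
canonical, generic = split_dimensions({
    "x": "projection_x_coordinate",
    "y": "projection_y_coordinate",
    "time": "time",
    "band": "sensor_band_identifier",
    "ensemble": None,
})
print(canonical)  # {'x': 'X', 'y': 'Y', 'time': 'T', 'band': 'B'}
print(generic)    # ['ensemble']
```

Under this reading the ensemble dimension is still readable and sliceable; the client just has no built-in behaviour for it.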

christophenoel commented 1 year ago

While I understand your perspective, I believe that making a decision at this stage would be premature without a more comprehensive view of the proposed specification or convention.

It's important to recognize that space data extends beyond raster XYZT data and often includes auxiliary data with varying coordinate systems. Our customers plan to convert data from numerous missions into the Zarr format, and as such, Geozarr is not simply another 2D raster container like GeoTiff.

On my side, I'm definitely against such a restriction, but again, I would appreciate a convention based on profiles / requirements classes as in various OGC specs.

dblodgett-usgs commented 1 year ago

I don't think I'm being clear. It is very important that we fully understand the scope of this work up front.

It is a given that geozarr will allow n-dimensional variables with coordinate variables for dimensions other than XYZT that may not be a-priori known according to the convention. I think that is agreed? i.e. we will not limit what is allowable in terms of dimensions people want to include.

It is not a given that we will recognize dimensions that should be treated in a particular way in software. e.g. an "ensemble" dimension needs to be treated very differently from a "band" dimension in the use cases that are appropriate to those kinds of dimensions. A client software developer would benefit from consistent handling of these extra dimensions' metadata in geozarr so we don't end up in the place NetCDF-CF is, where some people put extra dimensions in variable names and others make custom dimensions that most software fails to represent.

You say:

space data extends beyond raster XYZT data and often includes auxiliary data with varying coordinate systems

Noting that there is no intention to limit inclusion of a-priori unknown dimension types, the real question I'm asking is can and/or should we enumerate what some of those additional dimensions are so that the semantics of the data is known in the convention?

christophenoel commented 1 year ago

Thank you for the clarification, I better see the intent.

christophenoel commented 1 year ago

Noting that there is no intention to limit inclusion of a-priori unknown dimension types, the real question I'm asking is can and/or should we enumerate what some of those additional dimensions are so that the semantics of the data is known in the convention?

Surely yes. In the upcoming months, we will probably identify more representative use cases with their related data types. So there might be a core set of dimensions (XYZTBW?) and additional recommendations for a particular type of data (e.g. satellite, hyperspectral).

dblodgett-usgs commented 1 year ago

Great. I'm all for the typical remote sensing additional dimensions as being in scope for a "Geo" convention. What about ensemble and scenario? Those are modelling-specific concepts that are common in spatially referenced modeling data, but are they sufficiently core that we would specify how they should be referenced in GeoZarr?

My gut says yes with the caveat that it would need to be a very light hand approach -- e.g. via standard_name and perhaps use of a "coordinates" attribute for auxiliary coordinate variables.

christophenoel commented 1 year ago

On my side, I'm not familiar with those concepts, but why not, this is probably very relevant.

tylere commented 1 year ago

In terms of identifying representative use cases, I found the Sentinel-2 example stated earlier to be unexpected:

Same example for the timeseries of Sentinel-2 data (with bands in this case)

Dimensions: (band: 12, time: 739, y: 5490, x: 5490)

Sentinel-2 image bands have three different pixel grids (10m, 20m, 60m) so I would not expect them all to have the same y and x dimensions. @christophenoel does this assume that all bands had been resampled to a consistent (~20m) pixel size? If so, how would a multi-band image containing bands with differing pixel sizes be encoded?

(I am intentionally not using the term "spatial resolution" for the pixel grid spacing, even though the Sentinel-2 docs use that term.)

christophenoel commented 1 year ago

@tylere: Thank you for your insightful comment! Your suggestion is in line with my desire to incorporate data-specific recommendations in future iterations (multispectral data for example).

During the HDSA project, our primary objective was to make Sentinel-2 data more "processing-friendly" while avoiding the creation of separate Zarr files for each resolution (pixel grid). As the conversion to Zarr involves costs, the aim is to maximize the efficiency of the target data by implementing additional, yet sustainable, encoding adjustments.

The consortium explored various approaches, such as defining subgroups or multiple arrays. In the end, we decided to downsample bands 2, 3, and 4, and upsample bands 1, 9, and 10. Due to time and budget constraints, we weren't able to fully develop the proof-of-concept. However, the ultimate aim is to offer all bands across all three pixel grid sizes through multiscale functionality, always using the best available resolution.
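A rough sketch of the resampling decision described above, assuming a 20 m target grid (inferred from the 5490 x 5490 shape quoted earlier in the thread). The native resolutions and the 109.8 km tile side are recalled from the Sentinel-2 product documentation and should be treated as assumptions; the function names are hypothetical and only a subset of bands is listed.

```python
# Assumed side length of a Sentinel-2 tile, in metres.
TILE_SIDE_M = 109_800

# Assumed native pixel spacing per band, in metres (subset of bands only).
NATIVE_RESOLUTION_M = {
    "B01": 60, "B02": 10, "B03": 10, "B04": 10,
    "B05": 20, "B09": 60, "B10": 60,
}


def grid_size(resolution_m):
    """Number of pixels along one tile side at the given pixel spacing."""
    return TILE_SIDE_M // resolution_m


def resampling_action(native_m, target_m):
    """Say what has to happen to bring a band onto the target grid."""
    if native_m < target_m:
        return "downsample"
    if native_m > target_m:
        return "upsample"
    return "keep"


TARGET_M = 20
plan = {
    band: resampling_action(native, TARGET_M)
    for band, native in NATIVE_RESOLUTION_M.items()
}

print(grid_size(TARGET_M))  # 5490
print(plan["B02"], plan["B01"], plan["B05"])  # downsample upsample keep
```

Storing every band on one grid keeps a single (time, band, y, x) array at the cost of discarding detail in the 10 m bands, which is why the multiscale approach is mentioned as the longer-term goal.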

dblodgett-usgs commented 1 year ago

I think this discussion is mostly tapped out. We don't have strong consensus other than that we will largely support what's in CF, where standard_names imply that a coordinate variable should be interpreted as a given canonical dimension.