zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

v3: dimension_names issues #268

Open DennisHeimbigner opened 1 year ago

DennisHeimbigner commented 1 year ago

I am in the process of implementing Zarr V3 in netcdf-c. The "dimension_names" for arrays leaves a lot to be desired from a netcdf-c point of view. The whole point of "shared dimensions" (i.e. the netcdf-c version of named dimensions) is that a reader can tell by looking at the metadata which array dimensions are semantically related because they use the same name. Note that this approach requires that a dimension be an object defined at a single point in the metadata. Further, because those dimension objects can be in different groups, all references to a dimension must use a fully qualified name.

The "dimension_names" approach makes it technically impossible for a user to tell if two occurrences of a name that happen to have the same size (as determined by the "shape") are the same by accident or the same by design.
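To make the ambiguity concrete, here is a minimal sketch in Python. The two metadata fragments are made up for illustration and show only the "shape" and "dimension_names" keys; the array names and the helper `matching_dimensions` are hypothetical.

```python
# Two hypothetical Zarr v3 array metadata fragments (only the relevant keys).
temperature = {"shape": [365, 180], "dimension_names": ["time", "lat"]}
station_log = {"shape": [365, 12], "dimension_names": ["time", "station"]}


def matching_dimensions(a, b):
    """Return the (name, size) pairs that appear in both arrays."""
    dims_a = set(zip(a["dimension_names"], a["shape"]))
    dims_b = set(zip(b["dimension_names"], b["shape"]))
    return sorted(dims_a & dims_b)


shared = matching_dimensions(temperature, station_log)
print(shared)  # [('time', 365)]
```

Both arrays carry a "time" dimension of size 365, but the metadata alone does not say whether that match is intentional or a coincidence.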

I propose that:

  1. "dimension_names" be eliminated
  2. the "shape" array be allowed to be a mix of integers and fully qualified dimension names.
  3. one of two things be added:
     a. each zarr.info in a group contains a "dimensions" key, similar to the way "attributes" is handled, or
     b. the zarr.info for the root group contains all dimensions, using fully qualified names instead of simple names.

The 3.b proposal is an attempt to address the complaint about limiting the amount of metadata that must be read in order to access an array.
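As a rough sketch of how option 3.b might look to a reader, assuming a "dimensions" key in the root group metadata that maps fully qualified names to sizes: the key name, the paths, and the helper `resolve_shape` are all hypothetical and are not part of any published spec.

```python
# Hypothetical root-group metadata under option 3.b: every dimension in the
# hierarchy is declared once, keyed by its fully qualified name.
root_metadata = {
    "dimensions": {"/time": 365, "/ocean/depth": 40},
}

# Hypothetical array metadata: "shape" mixes integers and dimension names.
array_metadata = {"shape": ["/time", "/ocean/depth", 3]}


def resolve_shape(shape, dimensions):
    """Replace each dimension name in shape with its declared size."""
    resolved = []
    for entry in shape:
        if isinstance(entry, str):
            if entry not in dimensions:
                raise KeyError(f"undeclared dimension: {entry}")
            resolved.append(dimensions[entry])
        else:
            resolved.append(int(entry))
    return resolved


resolved = resolve_shape(array_metadata["shape"], root_metadata["dimensions"])
print(resolved)  # [365, 40, 3]
```

With this layout a reader needs only the root group metadata and the array's own metadata to recover the concrete shape.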

jbms commented 1 year ago

Would it be sufficient for your implementation to just impose some additional semantics and constraints on dimension names? For example, you could require that if the same name is used for two different arrays within a group (or entire hierarchy), that it refer to the same dimension.

Or you could use some special syntax for dimension names, to distinguish between dimension names that are local to a single array, and dimension names that are shared within a group/hierarchy.

In Neuroglancer, for example, if multiple arrays are added to a given viewer, any matching dimension names are assumed to correspond, but the user can also override this, and also uses a special syntax to indicate "local" vs "global" dimensions.
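The first suggestion, treating a repeated name within a group as the same dimension, could be enforced with a check along these lines. This is only a sketch: `check_shared_names` and the sample arrays are hypothetical, and it assumes the implementation has already loaded each array's "shape" and "dimension_names".

```python
def check_shared_names(arrays):
    """Map each dimension name in a group to its size.

    arrays maps an array name to a metadata dict with "shape" and
    "dimension_names"; a None entry means the dimension is unnamed.
    Raises ValueError if one name is used with two different sizes.
    """
    sizes = {}
    for array_name, meta in arrays.items():
        for dim_name, size in zip(meta["dimension_names"], meta["shape"]):
            if dim_name is None:
                continue  # unnamed dimensions are never shared
            if sizes.setdefault(dim_name, size) != size:
                raise ValueError(
                    f"{array_name}: dimension {dim_name!r} has size {size}, "
                    f"expected {sizes[dim_name]}"
                )
    return sizes


group = {
    "temperature": {"shape": [365, 180], "dimension_names": ["time", "lat"]},
    "pressure": {"shape": [365, 4], "dimension_names": ["time", None]},
}
sizes = check_shared_names(group)
print(sizes)  # {'time': 365, 'lat': 180}
```

A size check like this catches conflicting uses of a name, but it cannot by itself prove that two equal-sized uses were meant to be the same dimension.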

DennisHeimbigner commented 1 year ago

I am afraid I do not fully understand your comment. There are, I suppose, two issues here. First is how NCZarr will handle shared dimensions. My current plan is to do pretty much what I did in NCZarr version 2.

The second and more important question is properly defining dimensions in the Zarr Version 3 spec. As it is, "dimension_names" are more or less macros for the shape values. They have no meaning whatsoever. I am concerned about this.

One more note. It is clear that "dimension_names" was an attempt to include the xarray "_ARRAY_DIMENSIONS" attribute. But the important point to know about that is that xarray always assumed there were no groups in the metadata, only a single root group. Extending the concept to metadata with multiple groups is not straightforward.

shoyer commented 10 months ago

dimension_names was intended to be a minimal domain-agnostic version of named dimensions, without the full complexity of netCDF-style explicit dimension objects, e.g., something that would make sense both for weather and neuroscience data in generic plotting software. In my experience, the non-weather/climate research community is not excited about implementing the full complexity of explicit dimension names.

For the full netCDF data model, I would suggest adding separate metadata fields for creating and referring to dimension objects, in addition to filling out dimension_names.
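A minimal sketch of that layering, assuming the extra references live under a made-up "dimension_refs" attribute: the attribute name, the paths, and `build_array_metadata` are hypothetical and are not taken from NCZarr or the Zarr spec.

```python
def build_array_metadata(shape, dimension_refs):
    """Build array metadata that fills in both kinds of dimension info.

    dimension_refs holds fully qualified dimension paths; the plain
    "dimension_names" entries are derived from the last path component.
    """
    names = [ref.rsplit("/", 1)[-1] for ref in dimension_refs]
    return {
        "shape": list(shape),
        "dimension_names": names,
        # Hypothetical extension field, ignored by generic Zarr readers.
        "attributes": {"dimension_refs": list(dimension_refs)},
    }


meta = build_array_metadata([365, 40], ["/time", "/ocean/depth"])
print(meta["dimension_names"])  # ['time', 'depth']
```

Generic tools would read only "dimension_names", while a netCDF-aware reader could follow the extra references to the shared dimension objects.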

DennisHeimbigner commented 10 months ago

That is pretty much what I did.