zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Spec version vs `zarr_format` #299

Open d-v-b opened 5 months ago

d-v-b commented 5 months ago

The zarr_format metadata is an integer, but the spec document uses a string identifier that can represent major and minor versions. So, unlike the spec document, the zarr_format metadata cannot ever represent a minor version. Is this a problem? It seems like skew between the spec version and zarr_format is a recipe for trouble, but I don't see how to fix this without some disruption.
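To make the mismatch concrete, here is a minimal sketch. The spec version strings and the helper `zarr_format_for` are hypothetical, chosen only for illustration. It shows that several major.minor spec versions collapse onto the same integer `zarr_format`, so the metadata alone cannot tell them apart.

```python
# Hypothetical helper: maps a "major.minor" spec version string to the
# integer that would be stored under "zarr_format" in the metadata.
def zarr_format_for(spec_version: str) -> int:
    major, _, _minor = spec_version.partition(".")
    return int(major)


# Illustrative metadata fragment; only the integer major version is recorded.
metadata = {"zarr_format": 3, "node_type": "array"}

# Two different (illustrative) spec versions yield the same zarr_format.
same_format = zarr_format_for("3.0") == zarr_format_for("3.1") == metadata["zarr_format"]
```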

cc @WardF, as this relates to some of our conversations from the community meeting the other day, and I think the netcdf perspective would be useful here.

jbms commented 5 months ago

It was intentional that zarr_format not include a precise version number of the spec. Instead, it is intended that the spec defines some broad compatibility guarantees, and zarr_format only needs to be updated when we need to step outside of those guarantees.

The rationale is:

  1. If we include a precise version in the spec, then when creating the metadata, the implementation will need to choose which version to specify. For maximum compatibility, we want this version to be as low as possible, but it also needs to be high enough to support all of the features that are used. Therefore we need some logic to figure out the minimum version of the zarr spec that supports all of the features that are used. Furthermore, if some of these features are experimental/still in the process of being standardized then the minimum version in some sense doesn't even exist yet.
  2. When reading the metadata, if the implementation encounters a newer version number than is known, there is also the question of what to do. One option is to just fail with an error immediately. However, it seems likely that at least some writers may not carefully choose the minimum version that supports all features actually used in the metadata, and instead may just pick the latest known version. In that case, it would be better when reading to just ignore the version number and attempt to parse the metadata anyway, and only fail if an unsupported feature is encountered.

If we don't include a precise version number, then when creating an array we don't have to worry about picking a version number, and when reading an array, we can still just validate the metadata according to the actual features in use.
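The reading side of this approach could look roughly like the sketch below. It is an illustration under stated assumptions, not how any particular implementation works: the names `SUPPORTED_ZARR_FORMAT`, `KNOWN_CODECS`, `UnsupportedMetadataError`, and `validate_metadata` are hypothetical, and the set of known codecs stands in for whatever features a given reader actually supports. The reader checks the integer `zarr_format` once, then accepts or rejects the metadata based only on the features it actually finds.

```python
# Hypothetical reader configuration: the one major format this reader
# understands, and the codec names it happens to implement.
SUPPORTED_ZARR_FORMAT = 3
KNOWN_CODECS = {"bytes", "gzip"}


class UnsupportedMetadataError(ValueError):
    """Raised when the metadata uses something this reader cannot handle."""


def validate_metadata(meta: dict) -> None:
    # Coarse check: only the integer major format is compared.
    fmt = meta.get("zarr_format")
    if fmt != SUPPORTED_ZARR_FORMAT:
        raise UnsupportedMetadataError(f"unsupported zarr_format: {fmt!r}")

    # Fine-grained check: fail only on a feature that is actually in use
    # and not understood, rather than on a minor version number.
    for codec in meta.get("codecs", []):
        name = codec.get("name")
        if name not in KNOWN_CODECS:
            raise UnsupportedMetadataError(f"unsupported codec: {name!r}")
```

With this shape, a writer never has to compute a minimum spec version, and a reader keeps working with newer metadata as long as every feature present is one it knows.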

d-v-b commented 5 months ago

thanks @jbms, that's helpful. I think it would be good to write this logic into the spec. I will ping you if I submit a PR to that effect.

yarikoptic commented 2 months ago

if I got it right, this relates to

as a formalization of those "features" used/present in any given Zarr of a "major" zarr_format version. Is my understanding correct?

d-v-b commented 2 months ago

I think this discussion and #262 concern different levels of abstraction.

The properties of a particular version of zarr are formalized by the relevant zarr specification. See the specification for zarr version 2, or the specification for zarr version 3. I raised this issue to discuss a particular detail about how the zarr v3 specification defines the metadata that declares which version of zarr it is.

By contrast, ZEP 4 is at a higher level of abstraction: it concerns formalizing specifications of conventions that contain zarr data. To quote that ZEP, "A Zarr implementation itself should not even be aware of the existence of the convention."