zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/

Spec version vs `zarr_format` #299

Open d-v-b opened 3 weeks ago

d-v-b commented 3 weeks ago

The zarr_format metadata is an integer, but the spec document uses a string identifier that can represent major and minor versions. So, unlike the spec document, the zarr_format metadata cannot ever represent a minor version. Is this a problem? It seems like skew between the spec version and zarr_format is a recipe for trouble, but I don't see how to fix this without some disruption.

cc @WardF, as this relates to some of our conversations from the community meeting the other day, and I think the netcdf perspective would be useful here.

jbms commented 3 weeks ago

It was intentional that zarr_format not include a precise version number of the spec. Instead, it is intended that the spec defines some broad compatibility guarantees, and zarr_format only needs to be updated when we need to step outside of those guarantees.

The rationale is:

  1. If we include a precise version in the spec, then when creating the metadata, the implementation will need to choose which version to specify. For maximum compatibility, we want this version to be as low as possible, but it also needs to be high enough to support all of the features that are used. Therefore we need some logic to figure out the minimum version of the zarr spec that supports all of the features that are used. Furthermore, if some of these features are experimental/still in the process of being standardized, then the minimum version in some sense doesn't even exist yet.
  2. When reading the metadata, if the implementation encounters a newer version number than is known, there is also the question of what to do. One option is to just fail with an error immediately. However, it seems likely that at least some writers may not carefully choose the minimum version that supports all features actually used in the metadata, and instead may just pick the latest known version. In that case, it would be better when reading to just ignore the version number and attempt to parse the metadata anyway, and only fail if an unsupported feature is encountered.
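
To make point 1 concrete, here is a rough sketch of the bookkeeping a writer would need if metadata carried a precise spec version. Everything in it is an assumption for illustration: the feature names, the `(major, minor)` version tuples, and the `minimum_spec_version` helper are hypothetical and do not come from the spec.

```python
# Hypothetical table: the spec version (major, minor) that introduced each feature.
# None marks a feature that is still experimental, so no version exists for it yet.
FEATURE_INTRODUCED_IN = {
    "feature_a": (3, 0),
    "feature_b": (3, 1),
    "feature_experimental": None,
}


def minimum_spec_version(features_used):
    """Return the lowest spec version that supports every feature used.

    Returns None when any feature is experimental, because in that case
    the minimum version is not defined yet.
    """
    versions = []
    for feature in features_used:
        introduced = FEATURE_INTRODUCED_IN[feature]
        if introduced is None:
            return None
        versions.append(introduced)
    # With no features in use, fall back to the (assumed) baseline version.
    return max(versions, default=(3, 0))
```

Every writer would have to maintain a table like this and keep it in sync with the spec, and the `None` case has no good answer.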

If we don't include a precise version number, then when creating an array we don't have to worry about picking a version number, and when reading an array, we can still just validate the metadata according to the actual features in use.
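
A minimal sketch of what reading could look like under this approach, assuming a simplified metadata document with an integer `zarr_format` and a `codecs` list of objects that each have a `name`. The `validate_metadata` function, the `UnsupportedMetadataError` class, and the set of supported codecs are hypothetical and only stand in for whatever a real implementation supports.

```python
SUPPORTED_ZARR_FORMAT = 3
# Hypothetical: the codecs this particular reader happens to implement.
SUPPORTED_CODECS = {"bytes", "gzip"}


class UnsupportedMetadataError(ValueError):
    """Raised when the metadata uses something this reader cannot handle."""


def validate_metadata(metadata: dict) -> None:
    # zarr_format only signals the broad compatibility class, so an exact
    # match on the integer is enough; there is no minor version to compare.
    zarr_format = metadata.get("zarr_format")
    if zarr_format != SUPPORTED_ZARR_FORMAT:
        raise UnsupportedMetadataError(f"unsupported zarr_format: {zarr_format!r}")

    # Beyond that, validate by the features actually in use and fail only
    # when one of them is unknown to this reader.
    for codec in metadata.get("codecs", []):
        name = codec.get("name")
        if name not in SUPPORTED_CODECS:
            raise UnsupportedMetadataError(f"unsupported codec: {name!r}")
```

For example, `validate_metadata({"zarr_format": 3, "codecs": [{"name": "bytes"}]})` passes silently, while a document that names a codec outside `SUPPORTED_CODECS` raises `UnsupportedMetadataError` regardless of which spec revision produced it.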

d-v-b commented 3 weeks ago

thanks @jbms, that's helpful. I think it would be good to write this logic into the spec. I will ping you if I submit a PR to that effect.