zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Codec: clarify expectations regarding unknown metadata fields #270

Open sbesson opened 1 year ago

sbesson commented 1 year ago

As a preamble, I wanted to highlight that the Zarr v3 specification provides a list of officially supported codecs, each with its own specification, e.g. blosc. Even though https://zarr-specs.readthedocs.io/en/latest/v3/codecs.html is still marked as under construction, this is a noticeable improvement over the Zarr v2 specification. Having an official registry of codecs also allows new additions to be proposed using the standard Zarr Enhancement Proposals process.

This issue is motivated by a compatibility problem initially raised in https://github.com/zarr-developers/jzarr/issues/14: a new feature of the dev.zarr:jzarr:0.4.0 implementation added an extra key (numThreads) to the blosc object, which in turn prevented the Zarr data from being opened using zarr-python due to stricter semantics when reading the blosc dictionary. In that case, the extra key is not essential and a fix is under review to remove it.
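To make the failure mode concrete, here is a minimal sketch of strict versus lenient handling of a codec object that carries an extra key. It is not the actual zarr-python or jzarr code: the function name check_codec_config, the KNOWN_BLOSC_KEYS set and the sample values are assumptions chosen for illustration only.

```python
# Hypothetical set of keys a reader recognises in a blosc codec object.
KNOWN_BLOSC_KEYS = {"id", "cname", "clevel", "shuffle", "blocksize"}


def check_codec_config(config, known_keys, strict=True):
    """Return the recognised part of a codec object.

    With strict=True, any unknown key raises an error (the behaviour that
    broke reading). With strict=False, unknown keys are silently dropped.
    """
    unknown = sorted(set(config) - set(known_keys))
    if unknown and strict:
        raise ValueError(f"unknown codec metadata fields: {unknown}")
    return {k: v for k, v in config.items() if k in known_keys}


# Illustrative codec object with the extra numThreads key.
blosc_config = {
    "id": "blosc",
    "cname": "lz4",
    "clevel": 5,
    "shuffle": 1,
    "blocksize": 0,
    "numThreads": 1,
}

# A lenient reader keeps only the keys it knows about.
lenient_result = check_codec_config(blosc_config, KNOWN_BLOSC_KEYS, strict=False)

# A strict reader refuses to open the data.
try:
    check_codec_config(blosc_config, KNOWN_BLOSC_KEYS, strict=True)
    strict_failed = False
except ValueError:
    strict_failed = True
```

Both behaviours are defensible today, which is exactly why two implementations can disagree on the same metadata.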

This issue raises the wider question of how implementations should deal with codec objects containing unknown metadata fields. The must_understand key/value pair introduced in the v3 specification aims to handle similar scenarios. However, as per the current terms

Future versions of this specification may also add new core features by adding new top-level metadata keys. Such features are required by default. However, if the value of an unknown feature is an object containing the key-value pair "must_understand": false, it can be ignored.
...
The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.
...
The group metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open zarr hierarchies or groups with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.

the scope of this key seems to be limited to unspecified top-level objects.
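One possible reading of the quoted rule, sketched in Python under stated assumptions: only top-level names of the metadata object are checked, and an unknown name is tolerated only when its value is an object containing "must_understand": false. The helper unknown_required_keys, the KNOWN_ARRAY_KEYS set and the sample metadata are hypothetical and not taken from any implementation.

```python
# Illustrative set of top-level array metadata names a reader knows about.
KNOWN_ARRAY_KEYS = {
    "zarr_format", "node_type", "shape", "data_type", "chunk_grid",
    "chunk_key_encoding", "fill_value", "codecs", "attributes",
    "storage_transformers", "dimension_names",
}


def unknown_required_keys(metadata, known_keys=KNOWN_ARRAY_KEYS):
    """List unknown top-level names that may not be ignored.

    An unknown name is ignorable only if its value is an object holding
    the key-value pair "must_understand": false.
    """
    required = []
    for name, value in metadata.items():
        if name in known_keys:
            continue
        ignorable = isinstance(value, dict) and value.get("must_understand") is False
        if not ignorable:
            required.append(name)
    return required


# Hypothetical metadata: two unknown top-level names plus an unknown key
# nested inside a codec configuration.
metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "codecs": [
        {"name": "blosc", "configuration": {"cname": "lz4", "numThreads": 1}},
    ],
    "my_extension": {"must_understand": False, "foo": 1},
    "other_extension": {"foo": 1},
}

# Only top-level names are inspected; the nested numThreads key is never seen.
blocking = unknown_required_keys(metadata)
```

Under this reading, the nested numThreads key falls outside the rule entirely, so the check neither rejects nor explicitly permits it.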

Ideally, the expectations regarding unspecified codec metadata fields should be enforced at the specification level. Note also that there is an ongoing discussion in https://github.com/zarr-developers/zarr-specs/issues/72#issuecomment-1781645725 about whether must_understand should be defined and supported at arbitrary levels, which might be relevant to this issue.