Open d-v-b opened 6 months ago
I think one of the biggest shortfalls of Zarr V2 is the lack of codec standardisation. Numcodecs has many codecs, but they are not very useful if they are unsupported by other zarr implementations and data viewers.
A zarr implementation does not need to support every codec to be conformant, but spec'ing codecs and supporting them across more than just one implementation is essential to move forward and increase adoption. What better place to put zarr codec specs than alongside the zarr spec?
We cannot require PRs against the spec for every new codec. If writing a new codec started with getting a PR accepted in zarr-specs, nobody would ever write a new codec.
A codec does not have to start with a spec, it can start with an experimental implementation. That is basically what most of the codecs in numcodecs are. Similarly, I have multiple experimental Zarr V3 codecs implemented in zarrs that I plan to put forward once the new ZEP process has been figured out.
I think one of the biggest shortfalls of Zarr V2 is the lack of codec standardisation. Numcodecs has many codecs, but they are not very useful if they are unsupported by other zarr implementations and data viewers.
I agree with this completely. My concern here is not whether we should standardize codecs; it's whether we should standardize codecs inside the Zarr specification document, or in a separate specification document.
What better place to put zarr codec specs than alongside the zarr spec?
I think outside the Zarr spec entirely is the best place to put the codec specs. The codecs don't depend on Zarr; instead, Zarr depends on them.
A codec does not have to start with a spec, it can start with an experimental implementation.
That's a good idea, but technically your codecs cannot start with an experimental implementation. According to the text of the spec, your experimental codec is only valid when it is defined in a separate specification, and you give your codec a URI that resolves to a human-readable specification of the codec. Personally I don't think this is a reasonable requirement for experimental codecs.
Just copying my response from the zarr-python thread here:
I think it is useful to have a minimal set of codecs that we expect any zarr impl to support (e.g. bytes, transpose, blosc). Other codecs can be optional. I think the zarr specification is actually a good place to list available codec specs.
I feel quite strongly that non-standard codecs need to be labeled as such (e.g. through URI-style naming instead of short names). Having multiple codecs (even if the encoded format is only slightly different) with the same name would be a disaster. Perhaps zarr-python should even enforce that (i.e. don't allow short names for non-standard codecs).
@normanrz could you elaborate on these points a bit? Do you think the spec should require or merely suggest that implementations support a fixed set of codecs? If you want this to be a requirement, how would we enforce it?
Given that the spec currently requires that all codecs have a specification, how do we formally distinguish "standard" from "non-standard" codecs? What is the process for converting a "non-standard" codec to a "standard codec", or vice versa?
Do you think the spec should require or merely suggest that implementations support a fixed set of codecs?
Some codecs are essential to how Zarr works and should be required by all implementations. Most minimally, that is the bytes codec. Other codecs are so popular and general that all implementations should implement them, e.g. blosc, transpose, gzip, zstd, sharding_indexed. Then, there might be codecs that are only relevant for a subset of the community, such as image or segmentation compression codecs. These might be optional from a Zarr pov but required by a higher-level format (e.g. OME-Zarr).
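To make that concrete, here is a minimal sketch of what such a codec chain looks like in v3 array metadata (other required fields like chunk_grid, chunk_key_encoding and fill_value are omitted for brevity): bytes handles the array-to-bytes step, and gzip is an optional bytes-to-bytes compressor.

```json
{
  "zarr_format": 3,
  "node_type": "array",
  "shape": [10000, 10000],
  "data_type": "float32",
  "codecs": [
    {"name": "bytes", "configuration": {"endian": "little"}},
    {"name": "gzip", "configuration": {"level": 5}}
  ]
}
```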
If you want this to be a requirement, how would we enforce it?
I like to think that enforcement of the Zarr spec comes through validation from multiple implementations. When opening an array or group, implementations parse the metadata and therefore implicitly or explicitly validate the metadata. If you only ever use your data with a single implementation, you might not get that validation. But then you also might not care about the interoperability that the spec provides. Of course, we could (and maybe should) also provide validation tools alongside the spec (e.g. json schema).
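To illustrate, such a validation tool could ship something like the following JSON Schema fragment for the codecs field (just a sketch, not an official schema):

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "codecs": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["name"],
        "properties": {
          "name": {"type": "string"},
          "configuration": {"type": "object"}
        }
      }
    }
  }
}
```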
Given that the spec currently requires that all codecs have a specification, how do we formally distinguish "standard" from "non-standard" codecs?
"Standard" codec get a short name assigned by the Zarr spec (e.g. bytes
). "Non-standard" codecs have a URI-style name (e.g. https://zarr.dev/numcodecs/lz4
). That way, we minimize the risk of non-standard codecs conflicting each other. I think we can drop the requirement that the URI points to a human readable codec spec. A unique name should suffice for my concerns.
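In metadata, the two naming styles could then sit side by side in the codecs list, roughly like this (a sketch; the lz4 URI is just the example name from above, and acceleration is the parameter that the numcodecs LZ4 codec exposes, not a settled configuration schema):

```json
{
  "codecs": [
    {"name": "bytes", "configuration": {"endian": "little"}},
    {"name": "https://zarr.dev/numcodecs/lz4", "configuration": {"acceleration": 1}}
  ]
}
```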
What is the process for converting a "non-standard" codec to a "standard codec", or vice versa?
I think we can use the ZEP process for that. Implementations that support non-standard codecs might need to support both names once a codec becomes standardized.
What better place to put zarr codec specs than alongside the zarr spec?
I think outside the Zarr spec entirely is the best place to put the codec specs. The codecs don't depend on Zarr; instead, Zarr depends on them.
From a theoretical pov, I can see that splitting the codec spec from Zarr might make sense. From a practical pov, I don't see how that would make anything easier or facilitate interoperability among the Zarr impls. I think it is best to keep the codec spec in the Zarr spec.
I think it is best to keep the codec spec in the Zarr spec.
Is the current set of codecs inside the zarr spec? I think this is actually the root of my concern.
Given that the zarr v3 spec document itself says that it doesn't define a list of codecs (and this claim is internally consistent -- that document does not in fact define a list of codecs), what spec are the codec definitions part of?
Is the current set of codecs inside the zarr spec?
I think they are.
Given that the zarr v3 spec document itself says that it doesn't define a list of codecs (and this claim is internally consistent -- that document does not in fact define a list of codecs), what spec are the codec definitions part of?
I think it is unfortunate that the paragraph you cite did not get updated during the v3 spec process (a quick git blame shows that). I agree that it is inconsistent because the spec actually lists codecs. Most implementations have implemented this list of codecs. We should certainly revise this paragraph.
I think one of the biggest shortfalls of Zarr V2 is the lack of codec standardisation. Numcodecs has many codecs, but they are not very useful if they are unsupported by other zarr implementations and data viewers.
I agree with this from two points of view:
So I am in favour of having a finite set of codecs included in the zarr spec that implementations must support.
To come back to some of the concerns above:
If writing a new codec started with getting a PR accepted in zarr-specs, nobody would ever write a new codec.
I'm not sure this is true - most (if not all?) of the codecs currently used by zarr were developed independently of the zarr spec by teams outside the zarr developers, so they would exist regardless of zarr existing.
Requirements in the spec should be restricted to essential features, but supporting the Gzip compressor is simply not essential, for users who don't work with Gzip-compressed data. So any list of codecs should be a recommendation, not a requirement.
Supporting sharding is not essential for users who don't want sharded data, but it is a useful enough feature for enough people that it's worth mandating it as part of the spec, so that users who want to use it know it is guaranteed to be supported. I think the same argument holds for a list of standard codecs - I might not want to use all of them, but I want to be guaranteed that the ones I do use are supported by all implementations.
There is no enforcement mechanism
Well, there's no 'enforcement mechanism' for any of the spec, but if someone wants to claim they have written an implementation then they have to implement the whole spec. I'm not sure why codecs would be any different here?
So it seems like most people in this conversation believe that the v3 spec should specify a set of codecs that Zarr implementations must support. This is at variance with the language of the spec today:
To allow for flexibility to define and implement new codecs, this specification does not define any codecs, nor restrict the set of codecs that may be used. Each codec must be defined via a separate specification. In order to refer to codecs in array metadata documents, each codec must have a unique identifier, which is a URI that dereferences to a human-readable specification of the codec. A codec specification must declare the codec identifier, and describe (or cite documents that describe) the encoding and decoding algorithms and the format of the encoded data. ... The Zarr core development team maintains a repository of codec specifications, which are hosted alongside this specification in the zarr-specs GitHub repository, and which are published on the zarr-specs documentation Web site. For ease of discovery, it is recommended that codec specifications are contributed to the zarr-specs GitHub repository. However, codec specifications may be maintained by any group or organisation and published in any location on the Web. For further details of the process for contributing a codec specification to the zarr-specs GitHub repository, see ZEP 0 which describes the process for Zarr specification changes.
To make the spec document match the general opinion expressed in this issue (i.e., that the spec should list a required set of codecs), we need to make the following changes:
Do these changes seem sufficient? If so, we can start writing up a ZEP.
Regarding the Zstd draft spec in https://github.com/zarr-developers/zarr-specs/pull/256, is the checksum parameter really necessary? I went through the list of implementations in different languages and it seems the large majority do not support adding a checksum to the compressed output. Wouldn't this limit how many Zarr implementations can support this codec's spec? It sounds to me that we need to be careful not to define a codec's spec based mostly on the features provided by its Python implementation, and also consider what features many other languages offer.
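For reference, the codec metadata under discussion would look roughly like this (my sketch based on the draft spec in that PR; the exact parameter names may differ in the final version):

```json
{
  "name": "zstd",
  "configuration": {
    "level": 5,
    "checksum": false
  }
}
```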
I will summarize a few concerns I have about the way codecs are handled in the v3 spec, and propose some changes that I think could improve this situation.
the codec problem space
We need Zarr implementations across multiple languages to agree on standard JSON serialization for different codecs. This protects users from fragmentation, e.g. a situation where we end up with multiple flavors of JSON serialization for the same popular codec. At the same time, we want to make it easy for users to experiment with and create new codecs; this enables users to get the most from Zarr.
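As a concrete example of that fragmentation risk: the same gzip compressor is already serialized one way in Zarr v2 / numcodecs metadata and another way in Zarr v3 metadata, and nothing prevents further flavors from appearing for newer codecs:

```json
[
  {"id": "gzip", "level": 5},
  {"name": "gzip", "configuration": {"level": 5}}
]
```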
Also, codecs are generally useful for users outside of Zarr. There are plenty of non-Zarr use cases for compressing / rearranging array data. So I think the codec standardization should support these non-Zarr use cases.
concerns with codecs in the v3 spec
The v3 spec requires that every codec, even an experimental one that only exists in a library like zarr-python, be defined in a separate specification and named by a URI that resolves to a human-readable spec. That effectively makes zarr-specs a gatekeeper for new codecs, and if writing a new codec started with getting a PR accepted in zarr-specs, nobody would ever write a new codec.
how to resolve these concerns
I don't think naming a closed set of "official codecs" in the spec is realistic. There is no enforcement mechanism, and ultimately users don't care if an implementation doesn't support a codec they don't use. That is, if an implementation doesn't support codec X, and none of the users of that implementation use codec X, then IMO this is fine.
To express this differently, I think the Zarr spec should not enumerate the features / behavior an implementation must have. The Zarr spec should just describe the Zarr format, and we leave it to implementations to choose how they implement that format.
Extending this logic, the Zarr format is actually agnostic with respect to particular codecs. So specific codecs should not appear in the Zarr spec! I actually think codecs should be defined entirely in another spec, and we refer to this spec in the Zarr spec, e.g. "codecs is a JSON array of JSON objects that implement the Numcodecs spec (link to the numcodecs spec)" (we can choose a different name for the codecs spec, but it shouldn't refer to zarr).
Recall that in Zarr v2, codecs were basically standardized by the behavior of the numcodecs python library, which was a stand-alone library with no Zarr dependency. I think this illustrates the right relationship between codecs and the zarr format, but we shouldn't rely on a python library to define a standard for a cross-language concern. Zarr v3 tries to fix the latter problem by folding codec definition inside the spec itself, but as I have argued, this introduces a different set of problems. The solution is to define codecs separately, and make the zarr spec depend on that codec spec. The codec specification can manage a registry of codecs, etc., thereby abstracting the current behavior of numcodecs in a language-agnostic way.
Another advantage of a separate spec for codecs is that this spec could be used by any project that wants to compress arrays in a standard way. There is nothing Zarr-specific about serializing Gzip parameters to JSON, so let's reflect this in the structure of the specification document.
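To make the registry idea concrete, an entry in such a language-agnostic codec registry could look something like this (purely hypothetical; the field names and URI are illustrative, not a proposed schema):

```json
{
  "name": "gzip",
  "uri": "https://example.org/codec-registry/gzip",
  "kind": "bytes-to-bytes",
  "configuration_parameters": {
    "level": {"type": "integer", "minimum": 0, "maximum": 9}
  }
}
```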
tl;dr: I think the list of codecs in v3 is trying to solve a problem (a language-agnostic list of codecs) that we can solve in a better way: by migrating the codec specification from Zarr v3 into its own spec.
is this too much churn in the spec
I know it sucks to hear complaints about the spec after it's been finalized. Sorry. But I want zarr v3 to be really good, and I think the way we do codecs in v3 right now is very problematic; if my concerns are valid, then we owe it to users to get this resolved as soon as possible.