zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International
88 stars 28 forks source link

Representation of AA, AB, BB codecs #242

Closed clbarnes closed 1 year ago

clbarnes commented 1 year ago

There are hard order constraints on array->array (AA), array->bytes (AB), and bytes->bytes (BB) codecs. There can only be one AB codec. Additionally, they practically have different signatures:

From an implementation standpoint, it doesn't make a lot of sense to me to squash them into one interface where the distinctions between types of codecs are so clear-cut. It just adds more boilerplate and error sources in the code, and confusion for users mixing and matching codecs.

In order to keep the metadata arrays homogeneous, would it make sense to split the serialised codec config into those 3 types, as they would be in the implementation? For example

{
  ...,
  codecs: {
    "array_array": [aa_codec1, aa_codec2],
    "array_bytes": ab_codec_or_null,
    "bytes_bytes": [bb_codec1, bb_codec2]
  },
  ...
}

Fortunately the lexicographic sorting of those keys mirrors the order in which they're applied!

jbms commented 1 year ago

Certainly from an implementation I would agree that it is probably not very helpful to have a single "Codec" type, and instead this type of split representation makes more sense.

With the current combined list representation in the metadata, it seems most natural to me for an implementation to split them into the 3 codec types while parsing the metadata, which means from that point on, there isn't really any difference.

It seems that there are pros and cons to splitting vs combining the codec types in the metadata:

joshmoore commented 1 year ago

A small vote from my side for the linear representation: though I don't have a concrete example, conceivably there could also be a new codec input or output type, essentially making that part of the system extensible. That would be fine in the linear chain as long as the sequence of inputs/outputs collapses to the expected type (here bytes).

(If there's not sufficient metadata in the codecs currently in order to do that type checking on the chain, then that seems like an issue we might want to address.)

clbarnes commented 1 year ago

(If there's not sufficient metadata in the codecs currently in order to do that type checking on the chain, then that seems like an issue we might want to address.)

I suppose any implementation would know the input and output type (based on the documentation at the canonical URL of the codec), and any unimplemented (i.e. unknown) codec would just fail to open, so it doesn't matter too much.

clbarnes commented 1 year ago

I'm happy with the answers presented here!