Representation of AA, AB, BB codecs

clbarnes commented 1 year ago

There are hard order constraints on array->array (AA), array->bytes (AB), and bytes->bytes (BB) codecs. There can only be one AB codec. Additionally, they practically have different signatures:

c.compute_encoded_representation(decoded_representation_type) is only needed on AA (because for the others it's a simple return Bytes)
The decoded_representation_type argument of c.decode(..., decoded_representation_type) is only needed on AB, as far as I can tell
Whether the methods take byte readers or array readers is consistent within each class, not between classes
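
To make the difference in signatures concrete, here is a minimal sketch of three separate interfaces. Everything in it is an assumption made for illustration, not the API of any existing implementation: the class names, the `ArrayRepr` stand-in for a decoded representation, and the toy codecs (a 2D transpose on nested lists, a little-endian int32 packer, and zlib compression) are all hypothetical.

```python
import struct
import zlib
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayRepr:
    """Hypothetical stand-in for a decoded representation (2D only here)."""
    shape: tuple  # (rows, cols)
    dtype: str


class ArrayArrayCodec(ABC):
    # Only AA codecs need to report how they change the representation.
    @abstractmethod
    def compute_encoded_representation(self, decoded: ArrayRepr) -> ArrayRepr: ...
    @abstractmethod
    def encode(self, array): ...
    @abstractmethod
    def decode(self, array): ...


class ArrayBytesCodec(ABC):
    @abstractmethod
    def encode(self, array) -> bytes: ...
    # Only AB decoding needs the representation, to rebuild the array from bytes.
    @abstractmethod
    def decode(self, data: bytes, decoded: ArrayRepr): ...


class BytesBytesCodec(ABC):
    @abstractmethod
    def encode(self, data: bytes) -> bytes: ...
    @abstractmethod
    def decode(self, data: bytes) -> bytes: ...


class Transpose2D(ArrayArrayCodec):
    def compute_encoded_representation(self, decoded):
        return ArrayRepr(decoded.shape[::-1], decoded.dtype)

    def encode(self, array):
        return [list(col) for col in zip(*array)]

    def decode(self, array):
        # Transposing a 2D array twice restores the original.
        return [list(col) for col in zip(*array)]


class Int32LittleEndian(ArrayBytesCodec):
    def encode(self, array):
        flat = [v for row in array for v in row]
        return struct.pack(f"<{len(flat)}i", *flat)

    def decode(self, data, decoded):
        rows, cols = decoded.shape
        flat = struct.unpack(f"<{rows * cols}i", data)
        return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


class Zlib(BytesBytesCodec):
    def encode(self, data):
        return zlib.compress(data)

    def decode(self, data):
        return zlib.decompress(data)


def roundtrip(array, decoded_repr):
    """Encode AA -> AB -> BB, then decode in reverse order."""
    aa, ab, bb = Transpose2D(), Int32LittleEndian(), Zlib()
    encoded_repr = aa.compute_encoded_representation(decoded_repr)
    stored = bb.encode(ab.encode(aa.encode(array)))
    return aa.decode(ab.decode(bb.decode(stored), encoded_repr))
```

Calling `roundtrip([[1, 2, 3], [4, 5, 6]], ArrayRepr((2, 3), "int32"))` should give back the input array. Note that only the AB decode step consumes the representation computed by the AA codec.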

From an implementation standpoint, it doesn't make a lot of sense to me to squash them into one interface where the distinctions between types of codecs are so clear-cut. It just adds more boilerplate and error sources in the code, and confusion for users mixing and matching codecs.

In order to keep the metadata arrays homogeneous, would it make sense to split the serialised codec config into those 3 types, as they would be in the implementation? For example

{
  ...,
  "codecs": {
    "array_array": [aa_codec1, aa_codec2],
    "array_bytes": ab_codec_or_null,
    "bytes_bytes": [bb_codec1, bb_codec2]
  },
  ...
}

Fortunately the lexicographic sorting of those keys mirrors the order in which they're applied!
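
As a rough illustration of that ordering property, the sketch below flattens a split config back into application order simply by sorting the keys. The function name `flatten_split_codecs` is hypothetical, and the codec entries are placeholder strings rather than real codec metadata.

```python
def flatten_split_codecs(codecs: dict) -> list:
    """Return codecs in the order they are applied when encoding.

    Relies on "array_array" < "array_bytes" < "bytes_bytes" under
    lexicographic sorting, which matches the AA -> AB -> BB order.
    """
    ordered = []
    for key in sorted(codecs):
        value = codecs[key]
        if value is None:
            # An unspecified array_bytes codec contributes nothing here.
            continue
        if isinstance(value, list):
            ordered.extend(value)
        else:
            ordered.append(value)
    return ordered


split = {
    "bytes_bytes": ["bb_codec1", "bb_codec2"],
    "array_array": ["aa_codec1", "aa_codec2"],
    "array_bytes": "ab_codec",
}
print(flatten_split_codecs(split))
```

The keys are deliberately given out of order in the example dict to show that the result depends only on the sort, not on insertion order.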

jbms commented 1 year ago

Certainly from an implementation standpoint I would agree that it is probably not very helpful to have a single "Codec" type, and instead this type of split representation makes more sense.

With the current combined list representation in the metadata, it seems most natural to me for an implementation to split them into the 3 codec types while parsing the metadata, which means from that point on, there isn't really any difference.
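
A minimal sketch of that parse-time split might look like the following. The `CODEC_KINDS` registry, the function name `split_codecs`, and the assumption that each codec entry is a dict with a `"name"` key are all illustrative choices, not taken from any particular implementation.

```python
KIND_ORDER = ("array_array", "array_bytes", "bytes_bytes")

# Hypothetical registry: an implementation knows the kind of each codec it supports.
CODEC_KINDS = {
    "transpose": "array_array",
    "endian": "array_bytes",
    "sharding_indexed": "array_bytes",
    "gzip": "bytes_bytes",
}


def split_codecs(codecs: list) -> dict:
    """Split a combined codec list into the three kinds, validating order."""
    split = {"array_array": [], "array_bytes": None, "bytes_bytes": []}
    stage = 0  # index into KIND_ORDER of the latest kind seen so far
    for codec in codecs:
        name = codec["name"]
        if name not in CODEC_KINDS:
            raise ValueError(f"unsupported codec: {name!r}")
        kind = CODEC_KINDS[name]
        idx = KIND_ORDER.index(kind)
        if idx < stage:
            raise ValueError(
                f"{kind} codec {name!r} cannot follow a {KIND_ORDER[stage]} codec"
            )
        stage = idx
        if kind == "array_bytes":
            if split["array_bytes"] is not None:
                raise ValueError("only one array_bytes codec is allowed")
            split["array_bytes"] = codec
        else:
            split[kind].append(codec)
    return split
```

For example, `split_codecs([{"name": "transpose"}, {"name": "endian"}, {"name": "gzip"}])` places one codec in each slot, while putting `"gzip"` before `"endian"` raises a `ValueError` that names the offending kind.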

It seems that there are pros and cons to splitting vs combining the codec types in the metadata:

Pro split: Makes the 3 separate codec types obvious in the metadata, and perhaps makes users more aware of the distinction.
Pro combine: Split representation is more verbose, and forces users to be aware of this distinction. The ordering constraint on codec types may be intuitively obvious to users without the need for them to be explicitly aware of the codec types. For example, attempting to apply the "endian" codec, the transpose codec, or the sharding codec after "gzip" logically makes no sense anyway. There may also be other constraints on codecs, such as on data types or ranks, that would be less intuitively obvious to users and which will also not be captured by this split. Users may often just specify a single "bytes -> bytes" codec, in which case the distinction does not matter.
Splitting would theoretically allow the same codec name to be used for multiple different codec kinds, but obviously that would be very confusing.
With the split representation, if a user specifies a codec as the wrong kind, it would probably be helpful for the implementation to indicate that in the error, rather than just say that the codec is not supported. This basically means the implementation needs to do just as much work for the split representation as for the combined representation.
Arguably we should require that the array -> bytes codec is always specified in the metadata (e.g. currently either endian or sharding_indexed), but when creating an array the implementation could supply the default if not specified. With the split representation, this just amounts to filling in an unspecified "array_bytes" field, while with the combined representation, this amounts to inserting an extra codec in the middle of the list, which may be more confusing.
Some codecs other than sharding may accept a chain of codecs as configuration options. For example, an "array -> bytes" codec for variable-length strings might generate both an array of sizes and the concatenated strings, and have as separate configuration options the sequences of codecs to apply to the array of sizes, and the sequence of codecs to apply to the concatenated string data. Here, the array of sizes, being an array, would support all 3 codec kinds (like the top-level array itself), while the concatenated string data would support just "bytes -> bytes" codecs. I am unsure which representation is preferable in this case, though.
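
The point above about supplying a default array -> bytes codec can be sketched for both representations. This is only an illustration under stated assumptions: the `CODEC_KINDS` registry, both function names, and the choice of a little-endian `"endian"` codec as the default (including its configuration layout) are hypothetical.

```python
# Hypothetical registry of known codec kinds.
CODEC_KINDS = {
    "transpose": "array_array",
    "endian": "array_bytes",
    "sharding_indexed": "array_bytes",
    "gzip": "bytes_bytes",
}

# Assumed default; a real implementation would choose its own.
DEFAULT_ARRAY_BYTES = {"name": "endian", "configuration": {"endian": "little"}}


def fill_default_split(codecs: dict) -> dict:
    """Split representation: fill in the unspecified "array_bytes" field."""
    filled = dict(codecs)
    if filled.get("array_bytes") is None:
        filled["array_bytes"] = DEFAULT_ARRAY_BYTES
    return filled


def fill_default_combined(codecs: list) -> list:
    """Combined representation: insert the default before the first BB codec.

    Assumes the list is already in a valid AA -> AB -> BB order.
    """
    kinds = [CODEC_KINDS[c["name"]] for c in codecs]
    if "array_bytes" in kinds:
        return list(codecs)
    insert_at = next(
        (i for i, kind in enumerate(kinds) if kind == "bytes_bytes"),
        len(codecs),
    )
    return codecs[:insert_at] + [DEFAULT_ARRAY_BYTES] + codecs[insert_at:]
```

In the combined case the default lands in the middle of the user's list, for example between `"transpose"` and `"gzip"`, which is the potentially confusing behaviour described above.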

joshmoore commented 1 year ago

A small vote from my side for the linear representation: though I don't have a concrete example, conceivably there could also be a new codec input or output type, essentially making that part of the system extensible. That would be fine in the linear chain as long as the sequence of inputs/outputs collapses to the expected type (here bytes).

(If there's not sufficient metadata in the codecs currently in order to do that type checking on the chain, then that seems like an issue we might want to address.)
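
One way such type checking on a linear chain could work is sketched below, assuming each codec declares an input and an output type. The `CODEC_SIGNATURES` table and the `check_chain` function are hypothetical, and the `"table"` type in the usage example is invented purely to show extensibility.

```python
# Hypothetical (input type, output type) declarations per codec name.
CODEC_SIGNATURES = {
    "transpose": ("array", "array"),
    "endian": ("array", "bytes"),
    "gzip": ("bytes", "bytes"),
}


def check_chain(names, signatures=CODEC_SIGNATURES, source="array", sink="bytes"):
    """Check that each codec accepts the previous output and the chain ends in `sink`."""
    current = source
    for name in names:
        in_type, out_type = signatures[name]
        if in_type != current:
            raise TypeError(
                f"codec {name!r} expects {in_type} but would receive {current}"
            )
        current = out_type
    if current != sink:
        raise TypeError(f"chain ends in {current}, expected {sink}")
    return current


# The existing three kinds pass the check.
check_chain(["transpose", "endian", "gzip"])

# A made-up intermediate type also works, as long as the chain collapses to bytes.
extended = dict(CODEC_SIGNATURES)
extended["to_table"] = ("array", "table")
extended["table_pack"] = ("table", "bytes")
check_chain(["to_table", "table_pack", "gzip"], signatures=extended)
```

Note that this check is looser than the AA/AB/BB rule: it only requires adjacent types to match, so it would not by itself enforce that there is exactly one array -> bytes codec if, say, a bytes -> array codec were ever defined.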

clbarnes commented 1 year ago

> (If there's not sufficient metadata in the codecs currently in order to do that type checking on the chain, then that seems like an issue we might want to address.)

I suppose any implementation would know the input and output type (based on the documentation at the canonical URL of the codec), and any unimplemented (i.e. unknown) codec would just fail to open, so it doesn't matter too much.

clbarnes commented 1 year ago

I'm happy with the answers presented here!

zarr-developers / zarr-specs

Representation of AA, AB, BB codecs #242