clbarnes closed this issue 1 year ago.
Certainly from an implementation standpoint I would agree that it is probably not very helpful to have a single "Codec" type, and that this kind of split representation makes more sense.
With the current combined list representation in the metadata, it seems most natural to me for an implementation to split them into the 3 codec types while parsing the metadata, which means from that point on, there isn't really any difference.
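As a rough illustration of what that parse-time split might look like, here is a minimal sketch; the codec names and their kind assignments are assumptions for illustration, not taken from the spec:

```python
# Hypothetical sketch: partition a combined codec list (as parsed from
# array metadata) into the three codec kinds at parse time.
# Which kind each codec name belongs to is assumed here for illustration.
CODEC_KINDS = {
    "transpose": "array_array",
    "endian": "array_bytes",
    "sharding_indexed": "array_bytes",
    "gzip": "bytes_bytes",
    "crc32c": "bytes_bytes",
}

def partition_codecs(codecs: list[dict]) -> tuple[list[dict], dict, list[dict]]:
    """Split a combined codec list into (array->array, array->bytes, bytes->bytes),
    enforcing the ordering constraint and the single array->bytes codec."""
    aa, ab, bb = [], None, []
    for codec in codecs:
        kind = CODEC_KINDS[codec["name"]]
        if kind == "array_array":
            if ab is not None or bb:
                raise ValueError("array->array codec after array->bytes codec")
            aa.append(codec)
        elif kind == "array_bytes":
            if ab is not None:
                raise ValueError("more than one array->bytes codec")
            ab = codec
        else:
            bb.append(codec)
    if ab is None:
        raise ValueError("exactly one array->bytes codec is required")
    return aa, ab, bb
```

After this runs, the rest of the implementation only ever sees the three typed groups, so the combined and split metadata representations are indistinguishable internally.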
It seems that there are pros and cons to splitting vs combining the codec types in the metadata:
The array->bytes codec could have a default (e.g. `endian`, or `sharding_indexed`), but when creating an array the implementation could supply the default if not specified. With the split representation, this just amounts to filling in an unspecified "array_bytes" field, while with the combined representation, this amounts to inserting an extra codec in the middle of the list, which may be more confusing.

A small vote from my side for the linear representation: though I don't have a concrete example, conceivably there could also be a new codec input or output type, essentially making that part of the system extensible. That would be fine in the linear chain as long as the sequence of inputs/outputs collapses to the expected type (here `bytes`).
(If there's not sufficient metadata in the codecs currently in order to do that type checking on the chain, then that seems like an issue we might want to address.)
I suppose any implementation would know the input and output type (based on the documentation at the canonical URL of the codec), and any unimplemented (i.e. unknown) codec would just fail to open, so it doesn't matter too much.
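That chain type-check can be sketched as follows; the type names and the idea of each codec declaring an (input, output) pair are assumptions for illustration:

```python
# Hypothetical sketch of type-checking a linear codec chain: each codec
# declares its input and output types, and the chain is valid only if the
# types line up pairwise and the whole chain collapses to array -> bytes.
def validate_chain(chain: list[tuple[str, str]]) -> None:
    """Each element is (input_type, output_type) for one codec, in order."""
    if chain[0][0] != "array" or chain[-1][1] != "bytes":
        raise TypeError("chain must start at 'array' and end at 'bytes'")
    for (_, out_type), (in_type, _) in zip(chain, chain[1:]):
        if out_type != in_type:
            raise TypeError(f"codec output {out_type!r} != next input {in_type!r}")

# e.g. transpose (array->array), endian (array->bytes), gzip (bytes->bytes)
validate_chain([("array", "array"), ("array", "bytes"), ("bytes", "bytes")])
```

A new codec input/output type would slot into this scheme unchanged: the chain is still valid as long as adjacent types match and the ends collapse to the expected types.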
I'm happy with the answers presented here!
There are hard order constraints on array->array (AA), array->bytes (AB), and bytes->bytes (BB) codecs. There can only be one AB codec. Additionally, they practically have different signatures:
- `c.compute_encoded_representation(decoded_representation_type)` is only needed on AA (because for the others it's a simple `return Bytes`)
- the `decoded_representation_type` argument to `c.decode(...)` is only needed on AB, as far as I can tell

From an implementation standpoint, it doesn't make a lot of sense to me to squash them into one interface where the distinctions between types of codecs are so clear-cut. It just adds more boilerplate and error sources in the code, and confusion for users mixing and matching codecs.
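To illustrate the three-interface point, a hypothetical sketch (not the actual zarr codec API; class and method names are assumptions that follow the signatures mentioned above):

```python
# Hypothetical sketch: three separate codec interfaces instead of one
# squashed "Codec" type, so each kind only carries the methods it needs.
from abc import ABC, abstractmethod

class ArrayArrayCodec(ABC):
    @abstractmethod
    def encode(self, chunk): ...
    @abstractmethod
    def decode(self, chunk): ...
    @abstractmethod
    def compute_encoded_representation(self, decoded_representation_type): ...

class ArrayBytesCodec(ABC):
    @abstractmethod
    def encode(self, chunk) -> bytes: ...
    @abstractmethod
    def decode(self, data: bytes, decoded_representation_type): ...
    # no compute_encoded_representation: the encoded form is always bytes

class BytesBytesCodec(ABC):
    @abstractmethod
    def encode(self, data: bytes) -> bytes: ...
    @abstractmethod
    def decode(self, data: bytes) -> bytes: ...
```

With this split, a user implementing e.g. a new compressor subclasses `BytesBytesCodec` and never sees the array-shaped machinery at all.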
In order to keep the metadata arrays homogeneous, would it make sense to split the serialised codec config into those 3 types, as they would be in the implementation? For example
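One possible shape (the codec names and configurations here are illustrative assumptions, not from the spec):

```json
{
  "codecs": {
    "array_array": [{"name": "transpose", "configuration": {"order": [1, 0]}}],
    "array_bytes": {"name": "endian", "configuration": {"endian": "little"}},
    "bytes_bytes": [{"name": "gzip", "configuration": {"level": 1}}]
  }
}
```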
Fortunately the lexicographic sorting of those keys (`array_array` < `array_bytes` < `bytes_bytes`) mirrors the order in which they're applied!