zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International
88 stars 28 forks source link

Require transpose codec `order` option to be specified as an explicit permutation #264

Closed jbms closed 1 year ago

jbms commented 1 year ago

The "C" and "F" constants were added to emphasize the relationship with the zarr v2 order metadata field. However, these contants are somewhat confusing:

"codecs": [
  {"name": "transpose", {"configuration": {"order": "F"}}},
  {"name": "transpose", {"configuration": {"order": "C"}}},
  {"name": "bytes"}
]

or

"codecs": [
  {"name": "transpose", {"configuration": {"order": "C"}}},
  {"name": "transpose", {"configuration": {"order": "F"}}},
  {"name": "bytes"}
]

the storage order is actually "F", while in the following:

"codecs": [
  {"name": "transpose", {"configuration": {"order": "F"}}},
  {"name": "transpose", {"configuration": {"order": "F"}}},
  {"name": "bytes"}
]

the storage order is actually "C".

mkitti commented 1 year ago

To me this sounds like a new codec called permutedims which is reminiscent of a julia function of the same name. There's also a MATLAB function permute and a R function aperm.

From your example, it does seem like that use of multiple transpose codecs in a row should not be permitted and perhaps that it should only be applied first.

jbms commented 1 year ago

To me this sounds like a new codec called permutedims which is reminiscent of a julia function of the same name. There's also a MATLAB function permute and a R function aperm.

From your example, it does seem like that use of multiple transpose codecs in a row should not be permitted and perhaps that it should only be applied first.

To be clear, the existing spec already allows an explicit permutation, I am just proposing to not allow "C" and "F" as aliases.

I would agree that permutedims would be a fine name, but happened to choose transpose to match the numpy function.

While applying the transpose codec multiple times consecutively could always be equivalently represented as a single application of transpose with the composed permutation, and implementations might indeed wish to perform that optimization, I don't think it would be helpful to actually disallow specifying transpose multiple times; that might reasonably occur if an automated tool modifies the metadata and doesn't handle composing the transposes itself. In any case even if multiple consecutive transposes are disallowed the issue can still occur if there are other codecs in between, such as delta encoding, sharding, etc.

normanrz commented 1 year ago

After giving it some thought, I agree to this change. In fact, I also had some headaches to make F and C aliases work in zarrita. Allowing just the array notation would make it clearer. I'm fine with sticking with transpose as a name for the codec.