zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Information available to sharding codec #241

Open clbarnes opened 1 year ago

clbarnes commented 1 year ago

Currently the sharding codec spec states that "each integer must be divisible by the chunk_shape of the array as defined in the chunk_grid Array metadata". However, array->array codecs can change the shape of the requested chunk (e.g. the transpose codec permuting dimensions); in this case, the sharding codec's chunk shape should correspond to the permuted shape rather than the array's shape.
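To make the shape problem concrete, here is a minimal sketch in plain Python. The helper names `transpose_encoded_shape` and `validate_inner_chunk_shape` are hypothetical and not taken from the spec or any implementation; the check assumes that each inner chunk dimension must evenly divide the corresponding dimension of the chunk the sharding codec actually receives.

```python
from typing import Sequence, Tuple


def transpose_encoded_shape(shape: Sequence[int], order: Sequence[int]) -> Tuple[int, ...]:
    """Shape seen by the next codec after a transpose with the given axis order."""
    if sorted(order) != list(range(len(shape))):
        raise ValueError(f"order {tuple(order)} is not a permutation of the axes")
    return tuple(shape[axis] for axis in order)


def validate_inner_chunk_shape(outer_shape: Sequence[int], inner_shape: Sequence[int]) -> None:
    """Raise ValueError unless every inner dimension evenly divides the outer one."""
    if len(outer_shape) != len(inner_shape):
        raise ValueError("inner and outer chunk shapes differ in dimensionality")
    for axis, (outer, inner) in enumerate(zip(outer_shape, inner_shape)):
        if inner <= 0 or outer % inner != 0:
            raise ValueError(f"axis {axis}: {inner} does not evenly divide {outer}")


array_chunk_shape = (64, 32, 16)   # chunk_shape from the array's chunk_grid
order = (2, 0, 1)                  # transpose codec placed before the sharding codec
inner_chunk_shape = (8, 64, 32)    # sharding codec's configured chunk_shape

# The sharding codec receives the permuted chunk, so that is the shape to check.
permuted_shape = transpose_encoded_shape(array_chunk_shape, order)
validate_inner_chunk_shape(permuted_shape, inner_chunk_shape)  # passes

# Checking against the array's own chunk shape rejects this valid configuration.
try:
    validate_inner_chunk_shape(array_chunk_shape, inner_chunk_shape)
    rejected_against_array_shape = False
except ValueError:
    rejected_against_array_shape = True
```

In this sketch the same configuration is accepted or rejected depending only on which shape the check uses, which is the ambiguity the current wording leaves open.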

The sharding codec will also need to duplicate empty-chunk logic from the array class, which also means it will need access to the array's fill value. However, as array->array codecs could feasibly change values in the array (could they also change the data type?), the fill value may also need these transformations applied to it.
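A rough illustration of why the fill value matters here, using plain Python lists in place of real chunk buffers. The function names and the idea of a preceding codec that adds 5 to every value are assumptions made for the example only.

```python
import math
from typing import Dict, List, Tuple, Union

Scalar = Union[int, float]
Coords = Tuple[int, ...]


def same_value(a: Scalar, b: Scalar) -> bool:
    """Equality that also treats NaN as equal to NaN, as a fill-value check needs."""
    if isinstance(a, float) and isinstance(b, float) and math.isnan(a) and math.isnan(b):
        return True
    return a == b


def is_empty_chunk(values: List[Scalar], fill_value: Scalar) -> bool:
    """A chunk is empty when every element equals the fill value."""
    return all(same_value(v, fill_value) for v in values)


def drop_empty_inner_chunks(
    inner_chunks: Dict[Coords, List[Scalar]], fill_value: Scalar
) -> Dict[Coords, List[Scalar]]:
    """Keep only the inner chunks that a shard would actually need to store."""
    return {
        coords: values
        for coords, values in inner_chunks.items()
        if not is_empty_chunk(values, fill_value)
    }


# Inner chunks as the sharding codec sees them, after a hypothetical
# array->array codec added 5 to every value. The array's fill value is 0.
inner_chunks = {(0, 0): [5, 5, 5, 5], (0, 1): [5, 7, 5, 5]}

# Comparing against the array's fill value finds no empty chunks.
kept_with_array_fill = drop_empty_inner_chunks(inner_chunks, fill_value=0)

# Comparing against the transformed fill value drops the empty chunk.
kept_with_encoded_fill = drop_empty_inner_chunks(inner_chunks, fill_value=5)
```

With the untransformed fill value both inner chunks are written; with the transformed one only `(0, 1)` is.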

This probably all falls under the codec.compute_encoded_representation_type method in the spec.

normanrz commented 1 year ago

In zarrita I pass a subset of the array metadata to the codecs: https://github.com/scalableminds/zarrita/blob/v3/zarrita/array.py#L101-L110

jbms commented 1 year ago

As for the chunk shape being affected by prior codecs (which also applies to nested sharding), you are correct. I tried to fix the wording to address that in this (otherwise unrelated) PR: https://github.com/zarr-developers/zarr-specs/pull/237/files

As for the fill value, that is indeed an issue that I hadn't considered. I agree that we should resolve it by saying that, for "array -> array" codecs, compute_encoded_representation_type also needs to compute the new fill value for the "encoded" representation.
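One possible shape for that proposal, as a sketch only. The method name `compute_encoded_representation_type` comes from the spec discussion, but its signature, the `ArrayRepresentation` dataclass, and the `OffsetCodec` class are all assumptions made for illustration and are not part of the spec.

```python
from dataclasses import dataclass, replace
from typing import Sequence, Tuple, Union

Scalar = Union[int, float]


@dataclass(frozen=True)
class ArrayRepresentation:
    """Hypothetical bundle of what a codec knows about its input."""
    shape: Tuple[int, ...]
    dtype: str
    fill_value: Scalar


class TransposeCodec:
    """Permutes axes: the shape changes, dtype and fill value do not."""

    def __init__(self, order: Sequence[int]) -> None:
        self.order = tuple(order)

    def compute_encoded_representation_type(self, decoded: ArrayRepresentation) -> ArrayRepresentation:
        return replace(decoded, shape=tuple(decoded.shape[axis] for axis in self.order))


class OffsetCodec:
    """Hypothetical codec that adds a constant, so the fill value shifts too."""

    def __init__(self, offset: Scalar) -> None:
        self.offset = offset

    def compute_encoded_representation_type(self, decoded: ArrayRepresentation) -> ArrayRepresentation:
        return replace(decoded, fill_value=decoded.fill_value + self.offset)


def representation_after(array_repr: ArrayRepresentation, codecs: Sequence) -> ArrayRepresentation:
    """Fold the array's representation through the preceding array->array codecs."""
    current = array_repr
    for codec in codecs:
        current = codec.compute_encoded_representation_type(current)
    return current


array_repr = ArrayRepresentation(shape=(64, 32), dtype="int32", fill_value=0)
preceding_codecs = [TransposeCodec((1, 0)), OffsetCodec(5)]

# What a sharding codec placed after these two codecs would be given.
seen_by_sharding = representation_after(array_repr, preceding_codecs)
```

Here the sharding codec would validate its `chunk_shape` against `(32, 64)` and treat `5` as the fill value for empty-chunk detection, without reading the array metadata directly.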