zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

[v3] Inner chunk size validation behavior for `ShardingCodec` when downstream of `TransposeCodec` #2050

Open bogovicj opened 2 months ago

bogovicj commented 2 months ago

Zarr version

3.0.0a0

Numcodecs version

0.13.0

Python Version

3.12.4

Operating System

Linux

Installation

using pip into conda env

Description

I expect a ShardingCodec downstream of a TransposeCodec to consume the transposed array. As a result, I would expect that the inner chunk shape has to be "transposed" in the same way that the array was transposed.

If the shard/chunk sizes along different dimensions do not share a common factor, there is currently no way to save a transposed and sharded array.

The code below reproduces the error.

Steps to reproduce

import numpy as np
import zarr

# data is an array with shape [15, 32]
data = np.arange(15*32, dtype=np.single).reshape(15, 32)

# first codec (array -> array) transposes the array of shape [15, 32] to shape [32, 15]
transpose = zarr.codecs.TransposeCodec(order=[1, 0])

# second codec (array -> bytes) is sharding
# I expect it to operate on the [32, 15] array output by the TransposeCodec;
# as a result, its chunk shape should evenly divide [32, 15], so [16, 5] should work
sharding = zarr.codecs.ShardingCodec(chunk_shape=(16, 5), codecs=[zarr.codecs.BytesCodec()], index_codecs=[zarr.codecs.BytesCodec()])
codecs = [transpose, sharding]

store = zarr.store.LocalStore('<path>/test.zarr', mode='w')
z = zarr.array(data, path='transposed_sharded', chunk_shape=(15, 32), codecs=codecs, store=store)
# ValueError: The array's `chunk_shape` needs to be divisible by the shard's inner `chunk_shape`.

If we instead leave the sharding codec's chunk_shape untransposed, it passes validation but crashes later with a ZeroDivisionError:


data = np.arange(15*32, dtype=np.single).reshape(15, 32)
transpose = zarr.codecs.TransposeCodec(order=[1, 0])
sharding = zarr.codecs.ShardingCodec(chunk_shape=(5, 16), codecs=[zarr.codecs.BytesCodec()], index_codecs=[zarr.codecs.BytesCodec()])
codecs = [transpose, sharding]

store = zarr.store.LocalStore('<path>/test.zarr', mode='w')
z = zarr.array(data, path='transposed_sharded_uhoh', chunk_shape=(15, 32), codecs=codecs, store=store)
# ZeroDivisionError: integer modulo by zero
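
Both failures are consistent with the inner chunk shape being checked against the untransposed chunk shape (15, 32) but applied to the transposed one (32, 15). The sketch below only reproduces that arithmetic in plain Python. `divides` is a hypothetical helper, not part of zarr, and the idea that the zero comes from a chunks-per-shard count is an assumption, not something confirmed against the zarr source.

```python
from math import gcd


def divides(inner, outer):
    """Return True if every inner length evenly divides the matching outer length."""
    return all(o % i == 0 for i, o in zip(inner, outer))


array_chunk = (15, 32)        # chunk_shape passed to zarr.array
transposed_chunk = (32, 15)   # the same chunk after TransposeCodec(order=[1, 0])

# First example: inner chunk (16, 5) fits the transposed shape only.
first = (divides((16, 5), array_chunk), divides((16, 5), transposed_chunk))

# Second example: inner chunk (5, 16) fits the untransposed shape only.
second = (divides((5, 16), array_chunk), divides((5, 16), transposed_chunk))

# Applying (5, 16) to the transposed shape gives a zero count along one axis,
# which is one plausible source of a later modulo-by-zero.
chunks_per_shard = tuple(o // i for o, i in zip(transposed_chunk, (5, 16)))

# An inner chunk that fits both views must divide gcd(15, 32) on each axis.
common = tuple(gcd(a, b) for a, b in zip(array_chunk, transposed_chunk))

print(first, second, chunks_per_shard, common)
```

Since gcd(15, 32) is 1, the only inner chunk shape that fits both views is (1, 1), which is why no useful inner chunk shape is available for this array.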

Additional output

No response

dchen116 commented 1 month ago

watching

d-v-b commented 1 week ago

I am guessing this happens because we validate the sharding codec against the static shape of the zarr array, not against the shape of the array that codec will receive in the data encoding process, which might depend on any number of array -> array codecs. So to properly validate the sharding codecs, we need to be able to statically resolve the shape of its input, based on all previous codecs. Right now I think we only have the transpose codec to worry about, so this shouldn't be too hard to fix.

edit: we would need to change this routine: https://github.com/zarr-developers/zarr-python/blob/726fdfbf569c144310893440a40ee8ee05e6524e/src/zarr/core/metadata.py#L226-L228

The replacement should probably be a stand-alone function that takes an input shape, dtype, and a list of codecs and internally tracks the shape changes through the chain of codecs.
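
A minimal sketch of what such a function could look like, assuming each codec can report the shape it produces from a given input shape. `Transpose`, `Sharding`, `resolve_shape`, `validate`, and `validate_codec_chain` are hypothetical stand-ins written for illustration, not the zarr-python API, and the dtype argument is omitted because neither stand-in changes it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transpose:
    """Hypothetical stand-in for an array -> array transpose codec."""
    order: tuple[int, ...]

    def resolve_shape(self, shape):
        # Output axis k takes its length from input axis order[k].
        return tuple(shape[axis] for axis in self.order)


@dataclass(frozen=True)
class Sharding:
    """Hypothetical stand-in for a sharding codec with an inner chunk shape."""
    chunk_shape: tuple[int, ...]

    def validate(self, shape):
        # Validate against the shape this codec actually receives.
        if len(shape) != len(self.chunk_shape) or any(
            s % c != 0 for s, c in zip(shape, self.chunk_shape)
        ):
            raise ValueError(
                f"input shape {shape} is not divisible by inner chunk shape {self.chunk_shape}"
            )


def validate_codec_chain(shape, codecs):
    """Walk the codecs in order, validating each against its resolved input shape."""
    current = tuple(shape)
    for codec in codecs:
        if hasattr(codec, "validate"):
            codec.validate(current)
        if hasattr(codec, "resolve_shape"):
            current = codec.resolve_shape(current)
    return current


# The first example from the report passes once the transpose is taken into account.
resolved = validate_codec_chain((15, 32), [Transpose((1, 0)), Sharding((16, 5))])
print(resolved)
```

With this approach the untransposed inner chunk shape (5, 16) would be rejected at validation time with a ValueError instead of failing later during encoding.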

LDeakin commented 1 week ago

There is some related discussion here: https://ossci.zulipchat.com/#narrow/stream/423692-Zarr/topic/Transpose.20codec.20interpretation.