zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
http://zarr.readthedocs.io/
MIT License

v3 array creation: codecs #1943

Open d-v-b opened 2 months ago

d-v-b commented 2 months ago

One thing about zarr v2 -> v3 that might surprise users is the change from the v2 compressor metadata (a single thing) + filters (an ordered collection) to the v3 codecs metadata (an ordered collection with a special required element).

I suspect most users coming from v2 won't use array-array or bytes-bytes codecs. These users will think in terms of a single compressor for their data, if they worry about the compressor at all. For such users, the codecs keyword argument in v3 array creation will be confusing, because a) it's not called "compressor", and b) it's an iterable. Users who do use filters will wonder where the filters keyword argument went, and they will have to discover that their filters are now called "codecs", and that these codecs must be placed before the thing that used to be called the compressor.

I wonder if we could smooth out some of this confusion by adding an abstraction on top of the v3 codecs metadata in our array creation routines, and returning to v2 terminology. Specifically, we could use the keyword "filters" to denote array-array codecs, "compressor" to denote the required array-bytes compressor, and introduce a new, v3-array-only keyword "post_compressor" to denote any bytes-bytes codecs. I'm not wedded to this name, feel free to suggest something better.

It would be an error to request a v2 array with a post_compressor; otherwise the exact same keywords would work for both v2 and v3 array creation routines. Ergonomically this feels like an improvement, and it would simplify today's chimeric AsyncArray.create function, which is burdened with supporting the mutually exclusive codecs and compressor / filters keyword arguments.

e.g.

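from collections.abc import Iterable

def create(
    shape,
    dtype,
    filters: Iterable[ArrayArrayCodec],          # array-array codecs
    compressor: ArrayBytesCodec,                 # the required array-bytes codec
    post_compressor: Iterable[BytesBytesCodec],  # bytes-bytes codecs, v3 only
    zarr_format,
    **kwargs,  # stands in for the remaining arguments, elided here
) -> AsyncArray: ...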
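
To make the proposal concrete, here is a minimal sketch of how the three keywords could be folded into the single ordered sequence that the v3 codecs metadata expects. The three codec classes are hypothetical stand-ins rather than the real zarr-python base classes, and build_codecs is an illustrative helper name, not an existing function.

```python
from collections.abc import Iterable
from dataclasses import dataclass


# Hypothetical stand-ins for the three v3 codec categories.
# They only carry a name so the resulting order can be inspected.
@dataclass(frozen=True)
class ArrayArrayCodec:
    name: str


@dataclass(frozen=True)
class ArrayBytesCodec:
    name: str


@dataclass(frozen=True)
class BytesBytesCodec:
    name: str


def build_codecs(
    compressor: ArrayBytesCodec,
    filters: Iterable[ArrayArrayCodec] = (),
    post_compressor: Iterable[BytesBytesCodec] = (),
    zarr_format: int = 3,
) -> tuple:
    """Illustrative helper: order the keywords as filters, compressor, post_compressor."""
    filters = tuple(filters)
    post_compressor = tuple(post_compressor)
    # Per the proposal, a post_compressor is only meaningful for v3 arrays.
    if zarr_format == 2 and post_compressor:
        raise ValueError("post_compressor is only supported for zarr_format=3")
    return (*filters, compressor, *post_compressor)


codecs = build_codecs(
    ArrayBytesCodec("bytes"),
    filters=[ArrayArrayCodec("transpose")],
    post_compressor=[BytesBytesCodec("gzip")],
)
print([c.name for c in codecs])
```

The caller never has to know where the required element sits in the sequence; the helper places it between the filters and any bytes-bytes codecs, and the v2 restriction is enforced in one place.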

Thoughts? Especially from people kicking the tires on the v3 array API (@rabernat).