zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License

Supporting Zarr-Python 3 Codec API #502

Open jhamman opened 8 months ago

jhamman commented 8 months ago

Over in Zarr-Python, we are working on a new major version (v3). This version will have a slightly different Codec API and will expose a new set of codec types (ArrayArrayCodec, ArrayBytesCodec, BytesBytesCodec, etc.). These codec classes are not wildly different from the existing Numcodecs API (except perhaps for the partial encode/decode options), but they are not in perfect alignment. With this in mind, some questions for discussion:

  1. Can Numcodecs conform to the Zarr-Python API?
    • I don't think this has to be seen as a breaking change, but if it is, how do we weigh the potential costs and benefits?
  2. Would Numcodecs be able to register codecs via the Zarr-Python entrypoint mechanism? (e.g. https://github.com/zarr-developers/zarr-python/pull/1588)

jhamman commented 8 months ago

cc @zarr-developers/python-core-devs

jni commented 8 months ago

@jhamman the link to the new codec API is 404...

normanrz commented 8 months ago

I fixed the OP.

jhamman commented 7 months ago

We discussed this in the Zarr-Python refactor meeting today. The outstanding task here is to experiment with the Zarr v3 codec API by exposing this library's compression codecs and pre-compression filters through the BytesBytesCodec and ArrayArrayCodec interfaces. If that can be done effectively, these can be registered through the entrypoint mechanism described above.

This would be a good project for someone interested in getting involved in zarr-python 3's development.
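
As a rough illustration of the experiment described above, here is a minimal sketch that wraps a numcodecs-style codec (an object with `encode` and `decode`) behind a bytes-to-bytes interface and looks up codecs by entry point group. Everything named here is an assumption for illustration: `ZlibLike` is a stdlib stand-in for a real numcodecs compressor, `BytesBytesAdapter` and its method names are hypothetical and are not the zarr-python `BytesBytesCodec` base class, and the `zarr.codecs` group name is assumed rather than confirmed in this thread.

```python
import zlib
from importlib.metadata import entry_points


class ZlibLike:
    """Stand-in for a numcodecs-style compressor (encode/decode on buffers)."""

    def __init__(self, level=1):
        self.level = level

    def encode(self, buf):
        return zlib.compress(bytes(buf), self.level)

    def decode(self, buf):
        return zlib.decompress(bytes(buf))


class BytesBytesAdapter:
    """Hypothetical bytes -> bytes wrapper; not the real zarr-python class."""

    def __init__(self, inner):
        self.inner = inner

    def encode_chunk(self, chunk: bytes) -> bytes:
        # Normalise whatever buffer type the inner codec returns to bytes.
        return bytes(self.inner.encode(chunk))

    def decode_chunk(self, chunk: bytes) -> bytes:
        return bytes(self.inner.decode(chunk))


def discover_codecs(group="zarr.codecs"):
    """Return {name: entry point} for one entry point group.

    The default group name is an assumption for this sketch.
    """
    eps = entry_points()
    # Newer Pythons expose .select(); older ones return a plain dict.
    if hasattr(eps, "select"):
        found = eps.select(group=group)
    else:
        found = eps.get(group, [])
    return {ep.name: ep for ep in found}


if __name__ == "__main__":
    codec = BytesBytesAdapter(ZlibLike(level=5))
    payload = b"zarr" * 1000
    assert codec.decode_chunk(codec.encode_chunk(payload)) == payload
    print(sorted(discover_codecs()))
```

A package would advertise its wrappers under the chosen group in its packaging metadata, and each discovered entry point could then be loaded lazily with `ep.load()`.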

martindurant commented 5 months ago

I would add here that I think having some fallback support for numcodecs (plus other packages that make codecs following its API) is important to maintain readability of datasets in v3. We need to decide whether we can assume they work on bytes, which is by far the most common case, or whether we can otherwise tell from the signature (or try/except) if they accept/produce arrays. That doesn't seem too hard.

Question: codecs are of course CPU-bound and will be run in threads, in the hope that the GIL is released. Does the to_thread call live in zarr-python?

If all this is true, I don't see any reason to rewrite any codecs for v3, except where we wish to state the bytes vs. array nature of a codec.
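
To make the two ideas above concrete, here is a hedged sketch of a try/except probe for whether a codec accepts raw bytes, plus running `encode` calls in worker threads with `asyncio.to_thread`. The names `ZlibLike`, `DeltaLike`, `accepts_bytes`, and `encode_chunks` are hypothetical stand-ins, not numcodecs or zarr-python APIs, and where the thread dispatch actually lives is exactly the open question, so treat this as one possible shape only.

```python
import asyncio
import zlib
from array import array


class ZlibLike:
    """Stand-in for a bytes-oriented compressor."""

    def encode(self, buf):
        return zlib.compress(bytes(buf))

    def decode(self, buf):
        return zlib.decompress(bytes(buf))


class DeltaLike:
    """Stand-in for an array filter that needs typed input."""

    def encode(self, buf):
        if not isinstance(buf, array):
            raise TypeError("DeltaLike needs a typed array, not raw bytes")
        out = array(buf.typecode, buf)
        # Walk backwards so each difference uses the original neighbour.
        for i in range(len(buf) - 1, 0, -1):
            out[i] = buf[i] - buf[i - 1]
        return out

    def decode(self, buf):
        out = array(buf.typecode, buf)
        for i in range(1, len(out)):
            out[i] = out[i] + out[i - 1]
        return out


def accepts_bytes(codec) -> bool:
    """Probe with a small bytes payload; a TypeError means 'array codec'."""
    try:
        codec.encode(b"\x00" * 8)
    except TypeError:
        return False
    return True


async def encode_chunks(codec, chunks):
    """Encode each chunk in a worker thread (asyncio.to_thread, Python 3.9+)."""
    tasks = [asyncio.to_thread(codec.encode, chunk) for chunk in chunks]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(accepts_bytes(ZlibLike()), accepts_bytes(DeltaLike()))
    chunks = [bytes([i]) * 4096 for i in range(4)]
    encoded = asyncio.run(encode_chunks(ZlibLike(), chunks))
    assert [zlib.decompress(e) for e in encoded] == chunks
```

Threads only help when the codec's native code releases the GIL while it works. A probe like `accepts_bytes` also relies on the codec raising a clear error for unsupported input, which real codecs may not do consistently, so an explicit declaration of the bytes vs. array nature would be more robust.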

normanrz commented 4 months ago

There is now a PR that adds the numcodecs.zarr3 module which contains Zarr v3 wrappers for the numcodecs codecs: #524

dstansby commented 1 month ago

Removing the "good first issue" label since there's now a PR for this