zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License

Add wrappers for zarr v3 #524

Open normanrz opened 4 months ago

normanrz commented 4 months ago

The Zarr v3 specification only lists a few codecs that are officially supported. However, it is desirable to expose the codecs in numcodecs for use with v3 arrays as well. This PR adds wrapper classes for numcodecs support.

Codec names are prefixed with `numcodecs.` to avoid naming collisions in case some numcodecs codecs are later added to the Zarr spec. Also, there is a warning that numcodecs codecs are not officially supported and will likely not work in any other Zarr implementation.

Most array-to-array ("filters") and bytes-to-bytes codecs are supported. Absent are the variable-length codecs as well as json, msgpack and pickle.

Here is an example of the persisted configuration:

{
  "name": "numcodecs.fixedoffsetscale",
  "configuration": {"offset": 0, "scale": 51, "astype": "uint16"}
}
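
To make the shape of the persisted configuration concrete, here is a minimal sketch of how a prefixed v3 entry could be mapped to and from a numcodecs-style config dict (which carries the codec identifier under `"id"`). The helper names `v3_to_v2_config` and `v2_to_v3_metadata` are hypothetical and are not taken from this PR; the mapping is purely string-level and does not import numcodecs.

```python
import json

PREFIX = "numcodecs."  # prefix described in this PR


def v3_to_v2_config(metadata: dict) -> dict:
    """Hypothetical helper: turn a persisted v3 entry into a config with "id"."""
    name = metadata["name"]
    if not name.startswith(PREFIX):
        raise ValueError(f"not a numcodecs-wrapped codec: {name!r}")
    # Everything after the prefix is treated as the numcodecs codec id.
    return {"id": name[len(PREFIX):], **metadata.get("configuration", {})}


def v2_to_v3_metadata(config: dict) -> dict:
    """Hypothetical inverse: move "id" into a prefixed "name"."""
    rest = {key: value for key, value in config.items() if key != "id"}
    return {"name": PREFIX + config["id"], "configuration": rest}


# The example entry from the PR description, parsed from JSON.
persisted = json.loads(
    '{"name": "numcodecs.fixedoffsetscale",'
    ' "configuration": {"offset": 0, "scale": 51, "astype": "uint16"}}'
)
v2_config = v3_to_v2_config(persisted)
roundtrip = v2_to_v3_metadata(v2_config)
```

Under these assumptions the round trip returns the original entry unchanged, and a name without the prefix (for example a codec defined by the Zarr spec itself) is rejected rather than silently wrapped.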

Use of numcodecs in v2 arrays is not affected.

Fixes https://github.com/zarr-developers/numcodecs/issues/502

pep8speaks commented 4 months ago

Hello @normanrz! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2024-05-08 20:53:05 UTC

rabernat commented 3 months ago

> The name of the codecs is prefixed with https://zarr.dev/numcodecs/ to avoid naming collisions in case some codecs of numcodecs get added to the Zarr spec

I am not sure about the idea of using a URL that does not actually resolve to anything useful.

rabernat commented 3 months ago

pcodec is actually an "Array to Bytes" codec: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/pcodec.py

How would that fit in here?

martindurant commented 3 months ago

Any thoughts about what to do with numcodecs codecs not defined in this repo, but currently used via entrypoints?

d-v-b commented 3 months ago

> > The name of the codecs is prefixed with https://zarr.dev/numcodecs/ to avoid naming collisions in case some codecs of numcodecs get added to the Zarr spec
>
> I am not sure about the idea of using a URL that does not actually resolve to anything useful.

Seconding this sentiment: a URL that doesn't resolve to anything is rather confusing. I think `numcodecs.<codec_name>` or `numcodecs/<codec_name>` are simpler templates for a numcodecs-qualified name.

rabernat commented 3 months ago

> Any thoughts about what to do with numcodecs codecs not defined in this repo, but currently used via entrypoints?

Could we ask those codecs to implement Zarr codec entrypoints directly? Which codecs do you have in mind?

The challenge is that the V3 codecs are quite a bit more explicit in their typing (Array to Bytes, Bytes to Bytes, etc.) than legacy numcodecs codecs. So automatically translating an arbitrary numcodecs codec to a V3 codec is not possible.
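
As a rough illustration of that typing gap, the sketch below models the three V3 codec kinds and a hand-maintained lookup from codec id to kind. The names `CodecKind`, `KNOWN_KINDS` and `classify` are hypothetical and not part of numcodecs or zarr; the table entries are only examples (the `fixedoffsetscale` id is copied from the PR description, `zstd` stands in for a compressor).

```python
from enum import Enum


class CodecKind(Enum):
    ARRAY_TO_ARRAY = "array -> array"
    ARRAY_TO_BYTES = "array -> bytes"
    BYTES_TO_BYTES = "bytes -> bytes"


# Illustrative, hand-written table. A legacy codec only exposes
# encode/decode on buffers, so its kind has to be declared by someone;
# it cannot be read off the codec object.
KNOWN_KINDS = {
    "fixedoffsetscale": CodecKind.ARRAY_TO_ARRAY,  # a "filter"
    "pcodec": CodecKind.ARRAY_TO_BYTES,
    "zstd": CodecKind.BYTES_TO_BYTES,  # a compressor
}


def classify(codec_id: str) -> CodecKind:
    """Return the declared kind, or fail for codecs nobody has classified."""
    try:
        return KNOWN_KINDS[codec_id]
    except KeyError:
        raise LookupError(
            f"no declared V3 kind for codec {codec_id!r}; "
            "it needs an explicit wrapper"
        ) from None
```

A third-party codec that is absent from such a table has no safe default, which is the reason a generic automatic translation does not work.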

martindurant commented 3 months ago

I am thinking of https://github.com/fsspec/kerchunk/blob/main/kerchunk/codecs.py and imagecodecs. There are probably others.

normanrz commented 3 months ago

> > The name of the codecs is prefixed with https://zarr.dev/numcodecs/ to avoid naming collisions in case some codecs of numcodecs get added to the Zarr spec
>
> I am not sure about the idea of using a URL that does not actually resolve to anything useful.

I had asked @MSanKeys963 to set up the respective redirects to the numcodecs docs. That should solve that.

normanrz commented 3 months ago

> pcodec is actually an "Array to Bytes" codec: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/pcodec.py
>
> How would that fit in here?

Must have missed pcodec. I'll add it.

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 42.00000% with 145 lines in your changes missing coverage. Please review.

Project coverage is 94.31%. Comparing base (42f89d2) to head (b75e41e). Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| numcodecs/zarr3.py | 57.22% | 74 Missing :warning: |
| numcodecs/tests/test_zarr3.py | 7.79% | 71 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #524      +/-   ##
==========================================
- Coverage   99.91%   94.31%   -5.61%
==========================================
  Files          59       61       +2
  Lines        2334     2584     +250
==========================================
+ Hits         2332     2437     +105
- Misses          2      147     +145
```

| [Files with missing lines](https://app.codecov.io/gh/zarr-developers/numcodecs/pull/524?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zarr-developers) | Coverage Δ | |
|---|---|---|
| [numcodecs/tests/test\_zarr3.py](https://app.codecov.io/gh/zarr-developers/numcodecs/pull/524?src=pr&el=tree&filepath=numcodecs%2Ftests%2Ftest_zarr3.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zarr-developers#diff-bnVtY29kZWNzL3Rlc3RzL3Rlc3RfemFycjMucHk=) | `7.79% <7.79%> (ø)` | |
| [numcodecs/zarr3.py](https://app.codecov.io/gh/zarr-developers/numcodecs/pull/524?src=pr&el=tree&filepath=numcodecs%2Fzarr3.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zarr-developers#diff-bnVtY29kZWNzL3phcnIzLnB5) | `57.22% <57.22%> (ø)` | |

dstansby commented 2 weeks ago

Could you say a bit about why it makes sense for this code to live in numcodecs instead of zarr-python? My first thought is that numcodecs sees much less development/maintenance than zarr-python at the moment, so unless there's a good reason, maybe this code should live in zarr-python?

normanrz commented 2 weeks ago

The idea was that the zarr package contains the "official" Zarr3 codecs and other libraries can add other codecs via the entrypoint mechanism. The zarr package provides base classes for doing so. numcodecs is one of these libraries that can provide additional codecs. There is quite a bit of glue code to make the existing v2 codecs work with the new v3 base classes. This glue code is tightly coupled with the codecs themselves, which is why I think it makes more sense to have it in numcodecs rather than zarr. If the glue code lived in zarr instead, adding a new codec to numcodecs would require a new zarr release to support it. I realize this is a bit odd, because zarr itself depends on numcodecs.
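
For readers unfamiliar with the entrypoint mechanism mentioned here, the following is a minimal sketch of how a host package could list codecs advertised by other installed packages, using only the standard library. The function name `discover_codecs` is hypothetical, and the default group name `"zarr.codecs"` is an assumption for illustration, not something confirmed in this thread.

```python
from importlib.metadata import entry_points


def discover_codecs(group: str = "zarr.codecs") -> dict:
    """Map entry point names to their target strings for one group.

    The group name is an assumption; nothing is imported here, the
    targets are returned as "module:attribute" strings.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


# A group that no installed package declares yields an empty mapping.
found = discover_codecs("example.nonexistent.group")
```

With this kind of lookup, a library such as numcodecs can register new codecs in its own packaging metadata, and the host package picks them up without needing a release of its own.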

dstansby commented 2 weeks ago

That makes sense - I'll try and give this a proper review in the next couple of days!