zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License
121 stars, 82 forks

Use blosc2 package instead of bundled blosc #538

Closed: dstansby closed this 2 weeks ago

dstansby commented 2 weeks ago

This is a first attempt at fixing https://github.com/zarr-developers/numcodecs/issues/262 by replacing the bundled blosc library with the blosc2 package available on PyPI.

Several tests are currently marked as xfail and need investigating and potentially fixing. I thought it was worth opening this PR anyway, both to stop anyone else duplicating the work so far and to see if anyone wants to help investigate the pytest.xfail markers I had to add to get the tests passing on my local machine.

normanrz commented 2 weeks ago

I wonder if blosc2 should rather be a separate codec because of the missing forward compatibility. From the readme:

Note: Python-Blosc2 is meant to be backward compatible with Python-Blosc data. That means that it can read data generated with Python-Blosc, but the opposite is not true (i.e. there is no forward compatibility).

Also, the Blosc C code base is used for other codecs that are bundled with Blosc, e.g. zstd and lz4.

dstansby commented 2 weeks ago

In that case I might close this PR. I couldn't get the blosc package to install on my local machine, because it's no longer maintained (last upload Dec '22) and wheels aren't available for Python 3.12.

d-v-b commented 2 weeks ago

In that case I might close this PR. I couldn't get the blosc package to install on my local machine, because it's no longer maintained (last upload Dec '22) and wheels aren't available for Python 3.12.

I think the blosc developers want people to use blosc2, and understandably don't have much interest in blosc1 maintenance. We should definitely get blosc2 set up as a zarr codec for this reason.

mkitti commented 2 weeks ago

We really should not be encoding new data in the Blosc1 chunk format, given that upstream support is sparse and the upstream authors are strongly encouraging us to migrate.

Blosc1 chunk format: https://github.com/Blosc/c-blosc/blob/main/README_CHUNK_FORMAT.rst

Blosc2 contiguous frame format: https://www.blosc.org/c-blosc2/format/cframe_format.html
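
For orientation, the Blosc1 chunk format linked above begins every chunk with a 16-byte header. The following is a minimal sketch of reading that header using only the standard library. The names `Blosc1Header` and `parse_blosc1_header` are hypothetical and are not part of numcodecs or of either Blosc package.

```python
import struct
from typing import NamedTuple


class Blosc1Header(NamedTuple):
    """The 16-byte header at the start of a Blosc1 chunk (hypothetical helper)."""

    version: int    # Blosc format version
    versionlz: int  # version of the internal compressor format
    flags: int      # bit flags, see the properties below
    typesize: int   # size in bytes of one element
    nbytes: int     # uncompressed size
    blocksize: int  # size of the internal blocks
    cbytes: int     # compressed size, header included

    @property
    def byte_shuffle(self) -> bool:
        # Bit 0 of the flags byte marks the byte-shuffle filter.
        return bool(self.flags & 0x01)

    @property
    def compressor_code(self) -> int:
        # Bits 5-7 of the flags byte hold the compressor format code.
        return self.flags >> 5


def parse_blosc1_header(chunk: bytes) -> Blosc1Header:
    """Decode the leading 16 bytes of a buffer as a Blosc1 header."""
    if len(chunk) < 16:
        raise ValueError("a Blosc1 chunk starts with a 16-byte header")
    # Four single-byte fields followed by three little-endian uint32 fields.
    return Blosc1Header(*struct.unpack("<BBBBIII", bytes(chunk[:16])))
```

A reader written against this 16-byte layout alone should not be assumed to handle the Blosc2 frame format described in the second link, which is one way to picture the missing forward compatibility.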

martindurant commented 2 weeks ago

blosc c code base is used for other codecs that are bundled with blosc, e.g. zstd, lz4.

But these codecs are also available without Blosc, although in an incompatible way because of the extra framing that Blosc adds.
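
To illustrate the framing point, here is a minimal sketch that checks whether a buffer starts with the Zstandard frame magic number (0xFD2FB528, stored little-endian). Output from a standalone zstd codec begins with those four bytes, while a Blosc chunk begins with Blosc's own header, so the same check fails on it. The function name `starts_with_zstd_frame` is hypothetical, and the sample buffers are made up for illustration rather than taken from real compressed data.

```python
# Magic number that opens a standard Zstandard frame, as stored on disk.
ZSTD_FRAME_MAGIC = (0xFD2FB528).to_bytes(4, "little")


def starts_with_zstd_frame(buf: bytes) -> bool:
    """Return True if the buffer begins with the Zstandard frame magic."""
    return bytes(buf[:4]) == ZSTD_FRAME_MAGIC


# Made-up stand-in for the output of a standalone zstd codec.
bare = ZSTD_FRAME_MAGIC + b"opaque payload"

# Made-up stand-in for a Blosc chunk: 16 header-like bytes, then an opaque payload.
blosc_framed = bytes([2, 1, 0x81, 8]) + b"\x00" * 12 + b"opaque payload"

print(starts_with_zstd_frame(bare))
print(starts_with_zstd_frame(blosc_framed))
```

A decoder that expects a bare zstd frame would reject the second buffer at its first bytes, which is why data written through Blosc cannot simply be handed to the standalone codec.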

normanrz commented 2 weeks ago

I would love to see a ZEP that adds blosc2 to the Zarr spec.