zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License

Blosc2 Codec? #413

Open rabernat opened 1 year ago

rabernat commented 1 year ago

I recently noticed that Blosc2 2.0 has been released: https://twitter.com/Blosc2/status/1605529031780311041. This made me wonder whether we should revisit the idea of adding blosc2 support to Numcodecs and Zarr.

Obviously blosc2 has gone in the direction of adding more features--it's now much more than just a compression codec, and includes I/O, metadata, plugins, etc.--such that there is no clear boundary between Zarr's features and blosc2's features. So we would first want to decide which parts of blosc2 would be advantageous to expose as a codec in numcodecs. The main question is whether there is a benefit to using the blosc2 superchunk feature to store multiple chunks in a single shard. If so, we will quickly resurrect the discussion about Caterva in Zarr (https://github.com/zarr-developers/zarr-python/issues/713).

JackKelly commented 1 year ago

Ooh, yes, I'd be really interested to see blosc2 integrated with numcodecs / Zarr.

For now, I believe that imagecodecs supports blosc2. But deep integration with numcodecs & Zarr would be very interesting.
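For anyone who wants to try that route right away, here is a minimal sketch, assuming a recent imagecodecs with its numcodecs-compatible wrappers (the module path, register_codecs, and the Blosc2 wrapper class are my assumptions about that API):

import numpy as np
from imagecodecs import numcodecs as icd  # assumed submodule

icd.register_codecs()  # assumed to register e.g. 'imagecodecs_blosc2'

codec = icd.Blosc2()  # assumed wrapper class, default parameters
buf = np.arange(100, dtype='int32').tobytes()
assert bytes(codec.decode(codec.encode(buf))) == buf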

jakirkham commented 1 year ago

@FrancescAlted and I discussed adding Blosc2 support to Zarr during the 2022 NumFOCUS Summit. I think we concluded it should be possible. Note this is a different approach from what is outlined in the blogpost above.

The relevant thing for this discussion is the Blosc2 chunk format. This is the thing (I think) we would want to interact with.

The Blosc2 chunk format has blocks, which in Zarr we would call shards. AFAICT these are the same, but the terminology is different. I'll use the term shards (as that is what we are familiar with), but keep this in mind when reading the spec.

Blosc2 tracks start offsets into shards, which we would likely want to extract and add to the metadata. This overlaps a bit with the sharding work @jstriebel has been doing (https://github.com/zarr-developers/zarr-python/pull/1111), so it could potentially work with that approach. I think we would want to remove the header, as that information would already be stored in Zarr metadata. As a first pass this would likely entail some copying to add/remove the header. Longer term we might want the option to pass in the header separately (or something like this).
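To make the offsets idea concrete, here is a minimal sketch (all names are hypothetical, not the Blosc2 or Zarr API) of reading one inner block out of a shard using offsets kept in array metadata:

def read_inner_block(shard: bytes, offsets: list, i: int) -> bytes:
    # Hypothetical layout: `offsets` holds one start offset per compressed
    # block plus a final end offset, extracted from the Blosc2 chunk and
    # stored in Zarr metadata instead of in the (stripped) Blosc2 header.
    return shard[offsets[i]:offsets[i + 1]]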

Anyways this is my recollection of that conversation, which is not as fresh as it was. I may very well have forgotten/misunderstood things.

jakirkham commented 1 year ago

Perhaps another way to go about this would be to look at using Kerchunk with Blosc2.

martindurant commented 1 year ago

Yes, kerchunk is interested in accessing the chunks within a compressed stream. You could regard the compressed blocks as chunks, but they would presumably not be equal length, so additional logic would be needed.

With the release of indexed_gzip, we may be able to do something similar across all implementations. There is some tradeoff here between writing lots of references for individual chunks, versus storing "shard" information elsewhere, versus just requesting the exact matrix offsets from the storage and having the compression layer figure out what to actually read (I don't know if the third is actually possible).
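For reference, the "lots of references" option maps directly onto kerchunk's reference JSON, where each chunk key points at a (url, offset, length) triple, so unequal block lengths are fine, just verbose. A sketch with made-up offsets and lengths:

refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        # one entry per compressed block; offsets/lengths are illustrative
        "data/0.0": ["s3://bucket/file.b2frame", 512, 19123],
        "data/0.1": ["s3://bucket/file.b2frame", 19635, 17001],
    },
}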

FrancescAlted commented 1 year ago

Numcodecs adopting Blosc2 would be great. BTW, what we recently released as 2.0 is Python-Blosc2, not C-Blosc2 (whose 2.0 release happened 1.5 years ago).

For what it's worth, we have just merged Caterva into the main branch of C-Blosc2, so the latter has gained multidimensional capabilities. During the merge the API changed a bit (mainly to adapt to the Blosc way of doing things), but the functionality in the new C-Blosc2 is the same. We will let the new API rest a bit, and once the dust settles, we will release C-Blosc2 (probably 2.7.0) pretty soon.
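To illustrate the multidimensional side from Python, a minimal sketch, assuming a python-blosc2 version that exposes the NDArray API (the asarray signature with chunks/blocks is my assumption about that release):

import numpy as np
import blosc2

a = np.arange(1_000_000, dtype='int64').reshape(1000, 1000)
# Two-level partitioning: chunks, plus blocks inside each chunk.
nd = blosc2.asarray(a, chunks=(250, 250), blocks=(50, 50))
np.testing.assert_array_equal(nd[...], a)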

mkitti commented 1 year ago

Just so the situation is clear: Blosc2-compressed data is not decompressible by Blosc1. On the other hand, Blosc1-compressed data can be decompressed by Blosc2.

https://github.com/Blosc/hdf5-blosc/issues/29#issuecomment-1030773468

For this reason Blosc1 and Blosc2 are registered as separate HDF5 filter plugins: https://portal.hdfgroup.org/display/support/Filters#Filters-32026

I suspect numcodecs will need to support both Blosc1 and Blosc2 compression, simultaneously, for the sake of backwards compatibility.

You may also want to consider deprecating Blosc1 compression in favor of Blosc2 compression.
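That one-way compatibility is easy to check, assuming both numcodecs and python-blosc2 are installed (a sketch; blosc2.decompress accepting Blosc1 chunks follows from C-Blosc2's backward compatibility):

import numpy as np
import blosc2
from numcodecs import Blosc

data = np.arange(10_000, dtype='float64').tobytes()

# Compress with the existing (Blosc1-based) numcodecs codec ...
c1 = Blosc(cname='lz4', clevel=5, shuffle=Blosc.SHUFFLE).encode(data)

# ... and decompress with python-blosc2; the reverse direction fails.
assert bytes(blosc2.decompress(c1)) == data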

fschwar4 commented 1 year ago

Hi all,

If anyone really wants the Blosc2 compressors, they can check out the Python implementation of Blosc2 and easily register it as a new numcodecs codec. A first test showed improved behaviour over Blosc1 in most cases; I will do some more rigorous testing next week.

import blosc2
import numcodecs
import numcodecs.abc

# Map codec names onto the blosc2.Codec enum.
enum_dict = {
    'blosclz': blosc2.Codec.BLOSCLZ,
    'lz4': blosc2.Codec.LZ4,
    'lz4hc': blosc2.Codec.LZ4HC,
    'zlib': blosc2.Codec.ZLIB,
    'zstd': blosc2.Codec.ZSTD,
    'NDLZ': blosc2.Codec.NDLZ,
    'ZFP_ACC': blosc2.Codec.ZFP_ACC,
    'ZFP_PREC': blosc2.Codec.ZFP_PREC,
    'ZFP_RATE': blosc2.Codec.ZFP_RATE,
}

class Blosc2(numcodecs.abc.Codec):

    codec_id = 'blosc2'

    def __init__(self, cname='blosclz', clevel=5, shuffle=1, blocksize=0):
        # Default 'blosclz' matches the lowercase keys in enum_dict
        # (an uppercase 'BLOSCLZ' default would raise a KeyError).
        self.cname = cname
        self.clevel = clevel
        self.shuffle = shuffle
        self.blocksize = blocksize

    def encode(self, data):
        # compress2 expects a sequence of filters, not a single `filter=`.
        return blosc2.compress2(
            data,
            codec=enum_dict[self.cname],
            clevel=self.clevel,
            filters=[blosc2.Filter(self.shuffle)],
            blocksize=self.blocksize,
        )

    def decode(self, data):
        return blosc2.decompress(data)

numcodecs.register_codec(Blosc2, 'blosc2')
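With the codec registered, it can be plugged straight into Zarr as a compressor (a sketch using the zarr v2 API and the Blosc2 class defined above):

import numpy as np
import zarr

z = zarr.array(
    np.random.random((1000, 1000)),
    chunks=(100, 100),
    compressor=Blosc2(cname='zstd', clevel=5),
)
print(z.info)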

(Benchmark plots attached: zarr_blosc1_vs_blosc2 for shapes (60, 10000), (60, 45000000), and (60, 10000000) with 100,000 chunks.)

joshmoore commented 1 year ago

Wow, thanks for the info, @fschwar4. (And of course @FrancescAlted for the PR! :wink:)