rabernat opened this issue 3 years ago
Definitely. We have designed Caterva as a multidimensional building block with the intention that other libraries can leverage it, so I think it makes total sense (and we would be very happy) if Zarr can do so. Just a couple of remarks:
1) Caterva supports persistence either with a single file or with a directory (i.e. à la Zarr). This is a consequence of the recent implementation of sparse frames in the C-Blosc2 library (we blogged about it: https://www.blosc.org/posts/introducing-sparse-frames/)
2) Caterva brings way more features than filters and codecs. It is meant to become a full-fledged container for binary data, and in particular, it implements a two-level chunking scheme that allows for finer granularity when slicing (https://github.com/Blosc/cat4py/blob/master/notebooks/slicing-performance.ipynb).
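To make the two-level chunking point concrete, here is a small sketch (the function name and shapes are ours, not part of the Caterva API): with a single level of chunking, reading a small slice means decompressing a whole chunk, whereas with a second level of blocks inside each chunk, only the blocks that intersect the slice need decompressing.

```python
import itertools
import math

def blocks_touched(slice_start, slice_stop, block_shape):
    """Return the grid coordinates of the blocks an ND slice intersects."""
    ranges = [
        range(start // b, math.ceil(stop / b))
        for start, stop, b in zip(slice_start, slice_stop, block_shape)
    ]
    return list(itertools.product(*ranges))

# A 100x100 read out of a (1000, 10000) chunk split into (100, 100) blocks
# only touches a handful of blocks instead of the whole chunk:
touched = blocks_touched((450, 450), (550, 550), (100, 100))
print(len(touched))  # 4
```

This is exactly the granularity argument behind the slicing-performance notebook linked above.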
Finally, Caterva has a well-established roadmap that we will be trying to follow: https://github.com/Blosc/Caterva/blob/master/ROADMAP.rst. If you think that Zarr can benefit from any of these planned features, we will be glad to accept contributions (in the form of suggestions/code/grants).
cc @joshmoore @shoyer (in case you find this interesting ;)
I started playing with this today. As a first step, I am just trying to implement encoding / decoding of numpy data into caterva, as needed by numcodecs.
But I immediately hit a roadblock: I can't figure out how to get the encoded bytes / buffer out of caterva. For example, to encode an array, I am doing:
```python
import caterva as cat
import numpy as np

data = np.random.rand(10000, 10000)
c = cat.from_buffer(
    data.tobytes(),
    shape=data.shape,
    itemsize=data.dtype.itemsize,
    chunks=(1000, 10000),
    blocks=(100, 100),
)
# encoded_data = ?
```
The `c.to_buffer()` method returns the uncompressed data. I could persist the caterva data to disk, e.g. by passing `filename='some/string/path'`, but this is not what numcodecs needs. It just wants the encoded bytes. As far as I can tell, caterva does not expose this in its public API.
Am I missing something?
AFAIK we have not yet implemented an accessor to the compressed data in python-caterva, but even if we had, I am afraid that you couldn't immediately leverage it, because Caterva uses C-Blosc2 frames to store the compressed data, plus a metalayer for dimensionality. The frames then contain the C-Blosc2 chunks. It goes like this:
You can find more info on the Caterva metalayer here: https://caterva.readthedocs.io/en/latest/getting_started/overview.html.
In case you still want to access raw Caterva buffers, you can do that using the C API. First, in order to avoid copies, you need to create a contiguous buffer by setting `caterva_storage_properties_blosc_t.sequencial` to `true`, and then you can access that buffer with `blosc2_schunk_to_buffer(cat_array->sc, ...)`.
Thanks for the tips, Francesc. It sounds like we will probably have to create a Cython wrapper for Caterva in numcodecs, similar to what we currently do with Blosc.
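For reference, a Caterva codec would have to satisfy the numcodecs contract of byte-buffer-in / byte-buffer-out. Here is a minimal sketch of that shape; the class and method names follow the `numcodecs.abc.Codec` convention, but zlib is just a stand-in for the eventual Cython-wrapped Caterva calls, and the `codec_id` is hypothetical:

```python
import zlib

class CatervaLikeCodec:
    """Sketch of the encode/decode interface numcodecs expects."""

    codec_id = "caterva"  # hypothetical id, not registered anywhere

    def encode(self, buf):
        # A real wrapper would hand `buf` to Caterva and return the
        # serialized frame; zlib is only a placeholder here.
        return zlib.compress(bytes(buf))

    def decode(self, buf, out=None):
        data = zlib.decompress(bytes(buf))
        if out is not None:
            out[: len(data)] = data
            return out
        return data

codec = CatervaLikeCodec()
payload = b"\x00\x01\x02\x03" * 2500
assert codec.decode(codec.encode(payload)) == payload
```

The hard part, as discussed below, is that this interface only sees opaque 1-D buffers, while Caterva's value is in its ND awareness.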
Understanding how to best leverage Caterva for Zarr is going to be a bit trickier than I hoped, because the Numcodecs API only defines `decompress_partial` for a single contiguous byte range, which is what we use in Zarr-Python. That implementation is basically hard-coded to Blosc. In order to leverage the ND-slicing capabilities of Caterva, we would need to further refactor the interface between Numcodecs and Zarr.
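To illustrate the kind of refactor this implies, here is a hedged sketch of what an ND-aware partial-decode hook might look like. Today's contiguous-range `decompress_partial` takes a byte offset and length; an ND-capable codec would instead accept a tuple of slices. The function name and signature below are hypothetical, and the zlib round-trip fakes the storage layer (a Caterva-backed implementation would decompress only the blocks the selection touches, rather than the whole chunk):

```python
import zlib
import numpy as np

def decode_partial_nd(encoded, shape, dtype, selection):
    """Hypothetical hook: return only `selection` from an encoded chunk."""
    # Placeholder: decompress everything, then slice. Caterva would
    # instead decompress only the intersecting blocks.
    full = np.frombuffer(zlib.decompress(encoded), dtype=dtype).reshape(shape)
    return full[selection].copy()

chunk = np.arange(100, dtype="i8").reshape(10, 10)
encoded = zlib.compress(chunk.tobytes())
sub = decode_partial_nd(encoded, chunk.shape, chunk.dtype, (slice(2, 4), slice(3, 5)))
print(sub.tolist())  # [[23, 24], [33, 34]]
```

The interesting design question is where the `selection` argument plumbs through: Zarr-Python would have to pass it down through the codec pipeline instead of slicing after decode.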
Raised an issue (https://github.com/intake/filesystem_spec/issues/766) about supporting range queries in fsspec. That seems relevant here, but I'm still thinking through exactly how we would use it.
I've been reading about Caterva and have chatted a few times about it with @FrancescAlted. Caterva clearly has some overlap with Zarr, but I think it would be great if we could find some points for collaboration. A key difference is that Caterva stores everything in a single file, and consequently it is aimed at "not-so-big data". By combining Zarr with Caterva, we may get the best of both worlds.
The specific idea would be to encode a Zarr chunk as a Caterva array. This would allow us to leverage Caterva's efficient sub-slicing for partial chunk reads.
Does this make sense? I think so. @FrancescAlted suggests this explicitly in these slides https://www.blosc.org/docs/Caterva-HDF5-Workshop.pdf.
The path forward would be to create a numcodecs codec for Caterva.