zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

Caterva inside Zarr #713

Open rabernat opened 3 years ago

rabernat commented 3 years ago

I've been reading about Caterva and have chatted a few times about it with @FrancescAlted. Caterva clearly has some overlap with Zarr, but I think it would be great if we could find some points for collaboration. A key difference is that Caterva stores everything in a single file, so consequently it is aimed at "not-so-big data". By combining Zarr with Caterva, we may get the best of both worlds.

The specific idea would be to encode a Zarr chunk as a Caterva array. This would allow us to leverage Caterva's efficient sub-slicing for partial chunk reads.

Does this make sense? I think so. @FrancescAlted suggests this explicitly in these slides https://www.blosc.org/docs/Caterva-HDF5-Workshop.pdf.

The path forward would be to create a numcodecs codec for Caterva.
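To make the idea concrete, here is a hedged sketch of the shape such a codec would take. The `Codec` base class below is a minimal stand-in for `numcodecs.abc.Codec` (so the snippet is self-contained), and `zlib` stands in for the actual Caterva compression calls, which are hypothetical here; the `codec_id` and constructor parameters are illustrative only.

```python
# Sketch of the interface a Caterva codec would need to satisfy for numcodecs.
# zlib is a placeholder for the real Caterva/C-Blosc2 encode/decode calls.
import zlib
from abc import ABC, abstractmethod


class Codec(ABC):
    """Minimal stand-in for numcodecs.abc.Codec."""

    codec_id = None

    @abstractmethod
    def encode(self, buf):
        ...

    @abstractmethod
    def decode(self, buf, out=None):
        ...


class CatervaCodec(Codec):
    codec_id = "caterva"  # hypothetical id, not registered anywhere

    def __init__(self, chunks=None, blocks=None):
        # chunks/blocks would map onto Caterva's two-level partitioning
        self.chunks = chunks
        self.blocks = blocks

    def encode(self, buf):
        # A real implementation would hand the buffer to caterva here.
        return zlib.compress(bytes(buf))

    def decode(self, buf, out=None):
        data = zlib.decompress(bytes(buf))
        if out is not None:
            out[: len(data)] = data
            return out
        return data


codec = CatervaCodec(chunks=(1000, 10000), blocks=(100, 100))
payload = b"example payload" * 100
assert codec.decode(codec.encode(payload)) == payload
```

The key constraint this sketch surfaces is that numcodecs expects `encode` to return plain bytes, which is exactly the accessor discussed below.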

FrancescAlted commented 3 years ago

Definitely. We have designed Caterva as a multidimensional building block with the intention that other libraries can leverage it; so I think it makes total sense (and we would be very happy) if Zarr can do so. Just a couple of remarks:

1) Caterva supports persistence either as a single file or as a directory (i.e. à la Zarr). This is a consequence of the recent implementation of sparse frames in the C-Blosc2 library (we actually blogged about it: https://www.blosc.org/posts/introducing-sparse-frames/)

2) Caterva brings way more features than filters and codecs. It is meant to become a full-fledged container for binary data, and in particular, it implements a two-level chunking that allows for finer granularity when slicing (https://github.com/Blosc/cat4py/blob/master/notebooks/slicing-performance.ipynb).
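The payoff of that two-level chunking can be illustrated with back-of-envelope arithmetic (the shapes below are made up for illustration, not taken from the Caterva docs): to read a single row, only the blocks intersecting that row need decompressing, not the whole chunk.

```python
# Illustrative arithmetic: reading one row of a (1000, 10000) float64 chunk
# partitioned into (100, 100) blocks, one-level vs two-level chunking.
import math

chunk = (1000, 10000)
block = (100, 100)
itemsize = 8  # float64

# One-level chunking: the whole chunk must be decompressed.
chunk_bytes = math.prod(chunk) * itemsize          # 80_000_000 bytes

# Two-level chunking: only the blocks that intersect row 0.
blocks_hit = math.ceil(chunk[1] / block[1])        # 100 blocks along the row
two_level_bytes = blocks_hit * math.prod(block) * itemsize  # 8_000_000 bytes

print(chunk_bytes, two_level_bytes)
```

Under these assumed shapes the two-level layout decompresses an order of magnitude less data for a thin slice, which is the effect demonstrated in the linked notebook.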

Finally, Caterva has a well-established roadmap that we will be trying to follow: https://github.com/Blosc/Caterva/blob/master/ROADMAP.rst. If you think that Zarr can benefit from any of these planned features, we will be glad to accept contributions (in the form of suggestions/code/grants).

jakirkham commented 3 years ago

cc @joshmoore @shoyer (in case you find this interesting ;)

rabernat commented 3 years ago

I started playing with this today. As a first step, I am just trying to implement encoding / decoding of numpy data into caterva, as needed by numcodecs.

But immediately I hit a roadblock. I can't figure out how to get the encoded bytes / buffer out of caterva. For example, to encode an array, I am doing:

```python
import caterva as cat
import numpy as np

data = np.random.rand(10000, 10000)
c = cat.from_buffer(
    data.tobytes(),
    shape=data.shape,
    itemsize=data.dtype.itemsize,
    chunks=(1000, 10000),
    blocks=(100, 100)
)

# encoded_data = ?
```

The c.to_buffer() method returns the uncompressed data. I could persist the caterva data to disk, e.g. by passing filename='some/string/path', but this is not what numcodecs needs. It just wants the encoded bytes. As far as I can tell, caterva does not expose this in its public API.

Am I missing something?

FrancescAlted commented 3 years ago

AFAIK we have not yet implemented an accessor to the compressed data in python-caterva, but even if we had, I am afraid that you couldn't immediately leverage it, because Caterva uses C-Blosc2 frames to store the compressed data, plus the metalayer for dimensionality. The frames then contain the C-Blosc2 chunks. It goes like this:

[image: layout of a Caterva frame containing C-Blosc2 chunks]

You can find more info on the Caterva metalayer here: https://caterva.readthedocs.io/en/latest/getting_started/overview.html.

In case you still want to access raw Caterva buffers, you can do that using the C API. First, in order to avoid copies, you need to create a contiguous buffer by setting caterva_storage_properties_blosc_t.sequencial to true; then you can access that buffer with blosc2_schunk_to_buffer(cat_array->sc, ...).

rabernat commented 3 years ago

Thanks for the tips Francesc. It sounds like we will probably have to create a cython wrapper for Caterva in numcodecs, similar to what we currently do with Blosc.

Understanding how to best leverage Caterva for Zarr is going to be a bit trickier than I hoped, because the Numcodecs API only defines decompress_partial for a single contiguous byte range:

https://github.com/zarr-developers/numcodecs/blob/98c9e08fc7895dae4d5f9d2abf7b3e405f407402/numcodecs/blosc.pyx#L566-L569

Which we use in Zarr python here:

https://github.com/zarr-developers/zarr-python/blob/adc430a75918520301798946caf4cefc86dd3a3b/zarr/core.py#L1961-L1965

The implementation is basically hard-coded to Blosc:

https://github.com/zarr-developers/zarr-python/blob/adc430a75918520301798946caf4cefc86dd3a3b/zarr/indexing.py#L874-L884

In order to leverage the ND-slicing capabilities of Caterva, we would need to further refactor the interface between Numcodecs and Zarr.
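One way to see why the current single-range interface falls short: an ND selection within a chunk generally maps to many non-contiguous regions of the compressed stream, one per intersected block. The helper below is a hedged sketch of that mapping (the function name and the assumption that selections are simple slices are mine, not from numcodecs or Caterva).

```python
# Sketch: count which blocks of a chunk an ND slice selection intersects.
# A single (start, nitems) byte range, as decompress_partial takes today,
# cannot describe all of these regions at once.
from itertools import product


def touched_blocks(selection, blocks):
    """Coordinates of the blocks intersected by a tuple of slices."""
    ranges = []
    for sl, b in zip(selection, blocks):
        first, last = sl.start // b, (sl.stop - 1) // b
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# A 1-row slice of a (1000, 10000) chunk with (100, 100) blocks touches
# 100 distinct blocks, i.e. 100 separate regions of the compressed stream.
sel = (slice(0, 1), slice(0, 10000))
print(len(touched_blocks(sel, (100, 100))))  # 100
```

So a refactored interface would likely need to accept an ND selection (or a list of ranges) rather than one contiguous byte span.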

jakirkham commented 3 years ago

Raised an issue ( https://github.com/intake/filesystem_spec/issues/766 ) about supporting range queries in fsspec. That seems relevant here, but I'm still thinking through exactly how we would use it.