zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License

Add Apache Arrow codec #227

Open vdwees opened 4 years ago

vdwees commented 4 years ago

Adding an Apache Arrow codec for efficient data loading.

For data stored on the filesystem, Apache Parquet might be added as well.

Also mentioned here: https://github.com/zarr-developers/zarr-python/issues/515

jakirkham commented 4 years ago

How do you imagine this working?

If people are using Parquet, do they actually need Zarr?

vdwees commented 4 years ago

I guess one of the selling features of Zarr for me is being able to load only the chunks I need from a remote server. Arrow is only an in-memory representation, so I guess it is conceivable that a chunk is larger than is reasonable for memory, and it gets spooled to a local disk as a Parquet file? 🤔

I’ll experiment a bit with Arrow, and if I can get the behavior I’m hoping for I’ll submit a PR.

jakirkham commented 4 years ago

Before getting to a PR, it would be good to get a clearer idea of the usage pattern and how well it generalizes (though it sounds like we are still working on those questions 😉).

alimanfoo commented 4 years ago

Hi @vdwees, just to second @jakirkham's comment, it would be helpful to clarify goals and usage patterns here.

IIUC Arrow provides a standard way to share memory buffers between processes. So, e.g., you could imagine loading data from one or more chunks of any Zarr array into a PyArrow array, rather than into a NumPy array as is currently done. That is completely independent of codecs; it's more about how to lay out memory buffers and expose them to applications.

Parquet is a file format for columnar data, i.e., serialisation of multiple 1D arrays.

Codecs in Zarr are things like compressors, which transform arrays during serialisation or deserialisation. Some of the current codecs in the numcodecs package do borrow ideas from the Parquet format, but only in very specific respects, e.g., how to serialise strings.
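
To make the notion of a codec concrete, here is a minimal sketch of an encode/decode pair, loosely mirroring the `encode`/`decode` methods that numcodecs codecs expose. It uses only the Python standard library so that it runs anywhere; the class name `DeltaZlibCodec`, its `codec_id` value, and the choice of delta encoding followed by zlib are assumptions made for illustration and are not part of numcodecs.

```python
import zlib
from array import array


class DeltaZlibCodec:
    """Toy codec: delta-encode 64-bit signed ints, then zlib-compress the bytes."""

    # Hypothetical identifier, not a codec registered with numcodecs.
    codec_id = "delta-zlib-sketch"

    def __init__(self, level=1):
        self.level = level

    def encode(self, values):
        # Copy the input so the caller's data is left untouched.
        deltas = array("q", values)
        # Walk backwards so each difference uses the original neighbour.
        for i in range(len(deltas) - 1, 0, -1):
            deltas[i] -= deltas[i - 1]
        # Bytes are written in native byte order (fine for a local sketch).
        return zlib.compress(deltas.tobytes(), self.level)

    def decode(self, buf):
        deltas = array("q")
        deltas.frombytes(zlib.decompress(buf))
        # Undo the delta step with a running sum.
        for i in range(1, len(deltas)):
            deltas[i] += deltas[i - 1]
        return deltas


# Round-trip one "chunk" of slowly varying integers.
chunk = array("q", range(1000, 2000))
codec = DeltaZlibCodec()
encoded = codec.encode(chunk)
decoded = codec.decode(encoded)
```

The point of the sketch is that a codec only transforms a chunk's bytes on the way to and from storage. Which in-memory container the decoded values end up in (NumPy, Arrow, or the plain `array` used here) is a separate question.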

Hth.