zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Sparse chunk memory layout #48

Open alimanfoo opened 5 years ago

alimanfoo commented 5 years ago

This is a placeholder for a potential protocol extension to define sparse memory layouts for chunks.

The idea is to enable use of a sparse memory layout (e.g., CSR, CSC or COO) within each chunk of a Zarr array. That is, a Zarr array would keep its regular chunk grid as normal, but the data within each chunk would use a sparse memory layout instead of a dense C-contiguous or F-contiguous layout.

E.g., in the case of COO the memory layout would comprise two memory blocks, one storing the coordinates, the other storing the data values. For the purposes of encoding and storage, these two memory blocks could be concatenated into a single memory block, which could then be passed down through filter and compressor codecs and stored as normal. When retrieving and decoding the chunk, the coordinates and the data values could be presented as views of different regions of the memory block, to avoid extra memory copies.
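As a rough illustration of the COO case described above, here is a minimal sketch using NumPy. The function names, the 8-byte header holding the number of stored values, and the fixed little-endian int64 coordinate type are all assumptions made for this example; none of them is part of any Zarr spec.

```python
import numpy as np

COORD_DTYPE = "<i8"  # assumed coordinate dtype, chosen only for this sketch
HEADER_BYTES = 8     # assumed header: one int64 holding the number of stored values


def encode_coo_chunk(coords, data):
    """Concatenate header, coordinates and values into one memory block."""
    coords = np.ascontiguousarray(coords, dtype=COORD_DTYPE)  # shape (ndim, nnz)
    data = np.ascontiguousarray(data)                         # shape (nnz,)
    header = np.array([data.size], dtype="<i8")
    return header.tobytes() + coords.tobytes() + data.tobytes()


def decode_coo_chunk(buf, ndim, dtype):
    """Return (coords, data) as views over regions of buf, without copying."""
    nnz = int(np.frombuffer(buf, dtype="<i8", count=1)[0])
    coords = np.frombuffer(
        buf, dtype=COORD_DTYPE, count=ndim * nnz, offset=HEADER_BYTES
    ).reshape(ndim, nnz)
    data = np.frombuffer(
        buf, dtype=dtype, count=nnz, offset=HEADER_BYTES + coords.nbytes
    )
    return coords, data


# Three stored values in a 2D chunk.
coords = np.array([[0, 1, 3], [2, 0, 1]])
data = np.array([1.5, 2.5, 3.5], dtype="<f8")

buf = encode_coo_chunk(coords, data)  # this block would go to filters/compressor
decoded_coords, decoded_data = decode_coo_chunk(buf, ndim=2, dtype="<f8")
```

Because `np.frombuffer` returns arrays backed by the original buffer, the decoded coordinates and values are views of different regions of the same memory block, which is the copy-avoiding behaviour suggested above.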

In terms of the Zarr v3 core protocol, this could be specified as a protocol extension, defining new memory layouts that could be used within the chunk_memory_layout array metadata property.

An implementation in Python could be relatively straightforward, by using an existing sparse array library such as SciPy (scipy.sparse, for 2D chunks) or sparse (for ND chunks) to manage the chunks, instead of NumPy.
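To make the 2D case a bit more concrete, here is a small sketch of holding a single chunk as a `scipy.sparse` CSR matrix and rebuilding it from its underlying memory blocks. The chunk shape and values are made up for the example, and how the three blocks would actually be serialised is left open.

```python
import numpy as np
import scipy.sparse

chunk_shape = (4, 4)  # hypothetical chunk shape

# A mostly empty chunk, built densely here only for illustration.
dense_chunk = np.zeros(chunk_shape)
dense_chunk[0, 2] = 1.5
dense_chunk[3, 1] = 2.5

# Hold the chunk in CSR form instead of as a dense NumPy array.
sparse_chunk = scipy.sparse.csr_matrix(dense_chunk)

# CSR is described by three memory blocks that an encoder could write out.
blocks = (sparse_chunk.data, sparse_chunk.indices, sparse_chunk.indptr)

# On retrieval, the chunk can be rebuilt directly from those blocks.
restored = scipy.sparse.csr_matrix(blocks, shape=chunk_shape)
```

For ND chunks the same pattern would presumably apply with the sparse library's COO arrays in place of `scipy.sparse`.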

This could also integrate nicely with blocked parallel computing frameworks like Dask, because each chunk would be presented as a sparse array, and so any computational steps within the task graph that could operate directly on the sparse representation could do so, rather than forcing data into a dense representation.

Note that this is different from discussions about defining conventions for storing sparse arrays in Zarr, where a collection of two or more Zarr arrays is used to store a single sparse array. (E.g., for a COO array, the coords would be stored in one Zarr array, and the data in a second Zarr array.) That may be equally worthwhile to pursue, but it is a different concept and probably serves slightly different use cases.

alimanfoo commented 5 years ago

cc @ryan-williams