elyall opened this issue 1 year ago
I also meant to include @joshua-gould
I think this is both a great idea and pretty straightforward to implement. Before thinking about any spec changes or extension, I would work on a prototype implementation.
I think this can be implemented in two steps in zarr-python relatively simply: a codec for the sparse encoding, plus reading the data into a sparse array type using the newly introduced meta_array parameter (see https://github.com/zarr-developers/zarr-python/pull/1131). Based on what is learned from this exercise, we can think about what, if any, spec changes or extensions would be required to "formally" support sparse.
p.s. I actually think that the sparse GCXS format is an excellent sparse encoding, superseding both CSR and CSC.
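For illustration, here is a minimal sketch of the meta_array idea above, assuming zarr-python accepts a pydata/sparse array as the prototype the same way it accepts a CuPy array in #1131; whether chunk assembly actually works with sparse arrays end-to-end is exactly what a prototype would need to establish. The store path is hypothetical.

```python
# Hypothetical sketch only: meta_array is a real zarr-python parameter (PR #1131),
# but passing a pydata/sparse prototype is untested here.
import sparse
import zarr

z = zarr.open_array(
    "example.zarr",                 # hypothetical local store path
    mode="r",
    meta_array=sparse.zeros((0,)),  # prototype array type for in-memory chunks
)
# If this path worked end-to-end, z[...] would come back as a sparse array
# rather than a dense numpy.ndarray.
```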
Hey! I will try and write up a longer response later, but am at a hackathon. Just wanted to add a few points:
(with the current caveat seemingly that the central matrix is a dense np.ndarray due to zarr's limitation).
This is not the case, anndata sparse storage works well with zarr. The limitation here has been dask support for sparse arrays, which is also something that is being worked on.
However, some promising developments: https://scverse.zulipchat.com/#narrow/stream/315789-data-structures/topic/dask.20in.20anndata/near/355745689
@rabernat, I would be interested in talking about this in more detail, but don't think sparse arrays will work well with a codec. A lot of this has previously been discussed in https://github.com/zarr-developers/zarr-specs/issues/62. My current opinion is that structures like sparse, ragged, or dataframes make much more sense as conventions than in the core zarr spec.
I'd also note that anndata is collaborating with the GraphBLAS team (and others) to generate a more general binary sparse representation built on top of things like zarr and hdf5. You can check that out here. v1 won't contain n-dimensional CSF/GCXS support, but a spec for those kinds of arrays is looking pretty mature. You can find this here.
Ideally the codec could be pipelined with other codecs such as zstd for compression.
I think this is one of the key problems with having sparse as a codec. Codecs make a lot of sense for arrays that are one contiguous chunk of memory that needs to be read with a single dtype. A CSR array is really 3 distinct chunks of memory, each of which may need to be interpreted with a different dtype. I think this will be a problem for any codec more complicated than "decompress arbitrary buffer", which seems like the direction codecs are headed in v3, e.g. endianness, transpose, delta...
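To make that concrete, a scipy CSR matrix already shows the "three buffers, potentially three dtypes" point (the exact index dtypes depend on the matrix size):

```python
import numpy as np
import scipy.sparse

m = scipy.sparse.random(1000, 1000, density=0.01, format="csr", dtype=np.float32)
print(m.data.dtype)     # float32 -- the stored values
print(m.indices.dtype)  # int32   -- column indices
print(m.indptr.dtype)   # int32   -- row pointers (one entry per row, plus one)
```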
I see your point @ivirshup and I recognize you have thought much more deeply about this than me! 😀
I think this will be a problem for any codec more complicated than "decompress arbitrary buffer"
In the ratified version of the V3 spec there are three types of codecs (see #222): "array -> array", "array -> bytes", and "bytes -> bytes".
The spec does not explicitly say that array has to be a dense numpy-style array. So in principle we can have a codec that takes an arbitrary buffer (however you want to serialize the sparse array) and returns a sparse array (e.g. sparse.COO or sparse.GCXS). This seems like an implementation detail which would be language specific. In my mind, using such a sparse codec, together with meta_array, should be enough to trigger this sort of path in zarr-python.
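To make this concrete, here is a rough, untested sketch of such a codec written against the numcodecs Codec interface. The codec_id and the npz-based serialization are arbitrary choices for illustration, not anything specified by zarr, and whether zarr-python would actually hand a sparse array to encode (or accept one back from decode) is the open question discussed below.

```python
import io

import numpy as np
import sparse
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class SparseCOOCodec(Codec):
    """Serialize a sparse.COO array to a flat byte buffer and back."""

    codec_id = "sparse_coo"  # hypothetical identifier

    def encode(self, buf):
        arr = buf if isinstance(buf, sparse.COO) else sparse.as_coo(np.asarray(buf))
        out = io.BytesIO()
        # Store the coordinates, values, and shape as named dense arrays.
        np.savez(out, coords=arr.coords, data=arr.data, shape=np.asarray(arr.shape))
        return out.getvalue()

    def decode(self, buf, out=None):
        with np.load(io.BytesIO(bytes(buf))) as f:
            arr = sparse.COO(f["coords"], f["data"], shape=tuple(f["shape"]))
        if out is not None:
            # Fall back to densifying if the caller insists on a preallocated buffer.
            out[...] = arr.todense()
            return out
        return arr


register_codec(SparseCOOCodec)
```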
Yes, I think the sparse codec could be handled as an "array -> bytes" codec under the current v3 spec with no changes required. The in-memory representation of arrays is an implementation detail and not part of the spec.
Exactly how easily this can be implemented in zarr-python or zarrita remains to be seen, though.
One potential challenge with simply using meta_array is that if the read or write request does not exactly correspond to a single chunk, then zarr-python needs to manipulate the chunks with array indexing operations. For example, if a read request intersects multiple chunks, zarr-python creates an empty array for the result, and then copies in the relevant portion of each chunk. As far as I can see, of the sparse formats supported by the sparse library, only "dictionary of keys" (DOK) supports writes. I expect, therefore, that zarr-python would need some custom logic to assemble multiple sparse arrays together in order to support in-memory formats other than DOK.
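A quick illustration of that write limitation in pydata/sparse as it stands today:

```python
import numpy as np
import sparse

coo = sparse.COO.from_numpy(np.eye(3))
try:
    coo[0, 1] = 5.0        # COO (and GCXS) do not support item assignment
except TypeError as err:
    print(err)

dok = sparse.DOK((3, 3))   # dictionary-of-keys: the writable format
dok[0, 1] = 5.0            # incremental writes are fine here
```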
@ivirshup you are indeed the expert on this subject. While I'm very new to this, I'm happy to help where I can.
anndata sparse storage works well with zarr. The limitation here has been dask support for sparse arrays, which is also something that is being worked on.
News to me. The website really makes it sound like sparse implementing a subset of the NumPy API solved this problem. I will look into this more.
My current opinion is that structures like sparse, ragged, or dataframes make much more sense as conventions than in the core zarr spec.
I agree completely, though that's a much taller task; I posted here thinking this was a good place to start. binsparse-specification looks promising. Is there a repo for writing/reading arrays to/from zarr or hdf5 files with this spec yet?
@rabernat, naively writing a codec seems approachable to me, as the process seems to have good documentation. There are questions on implementation, though, that raise the question of whether it is actually doable. Also @ivirshup, I'm still processing "anndata sparse storage works well with zarr". I need to play with it more and dive deeper into the code.
- As @ivirshup said, the pointers, indices, and values often have different dtypes and therefore should be stored as independent (chunked) objects. This does seem similar to an xarray DataArray with multiple indexes along an axis, and I know their Dataset has a to_zarr method...
In zarr v3, a sparse "array -> bytes" codec would be responsible for generating a sequence of bytes to represent the sparse array. It might do this by first converting the sparse array into a collection of dense arrays using one of the standard sparse array representations, then encoding each of these dense arrays, and then concatenating them together. It may make sense for this sparse array codec to have as configuration options separate sequences of codecs that can be applied to each of the dense arrays. Zarr v3 also allows you to specify additional "bytes -> bytes" codecs at the top-level that would apply to the concatenated result. For example:
"codecs":
[{"name": "sparse_gcxs",
"configuration": {
"ro_codecs": [{"name": "endian", "endian": "little"}, {"name": "blosc"}],
"co_codecs": [...],
"vl_codecs": [...]
}
}]
Normally when working with a sparse array you will need all of the dense arrays, so storing them all together in a single chunk makes sense. However, I suppose in some cases you might only need to access e.g. the values array and not the coordinates array, for example if applying the same operation to every value independently. That would not be possible if they are all stored in a single chunk. To be able to split the chunk over multiple keys in the underlying storage would require some changes to the zarr v3 codec model, though.
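As a rough, purely illustrative sketch of the "encode each constituent dense array and concatenate" idea above (not part of any spec; the length-prefixed header layout and the mapping of GCXS's data/indices/indptr arrays onto the ro/co/vl naming are my own guesses):

```python
import struct

import numpy as np
import sparse


def encode_gcxs_chunk(arr: sparse.GCXS) -> bytes:
    """Flatten a GCXS chunk into one byte string: length header + three sections."""
    parts = [
        np.ascontiguousarray(arr.data).tobytes(),     # values ("vl")
        np.ascontiguousarray(arr.indices).tobytes(),  # compressed-axis indices ("co")
        np.ascontiguousarray(arr.indptr).tobytes(),   # offsets ("ro")
    ]
    # A real codec would also need to record dtypes, shape, and compressed_axes
    # somewhere in order to decode; here we only record the section lengths.
    header = struct.pack("<3Q", *(len(p) for p in parts))
    return header + b"".join(parts)
```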
- Since the array won't be uniformly sparse, you either have to let the chunks have variable length on disk or save a separate object of pointers that gets read in first and specifies the starting index of each chunk, so that the slice operation knows which chunk(s) to load in.
We already have chunks taking up a variable amount of space on disk when using compression with regular dense arrays.
- As @jbms points out, writing to sparse arrays is not trivial, considering the whole point is to not allocate memory to values equal to the fill value. If a value needs to be added to memory, I imagine it would cause a number of issues for disk memory allocation unless some extra space is included in each chunk (though this does have parallels to expanding an array along a non-contiguous dimension in memory).
To be clear, there is a difference between in-memory representation and on-disk representation. The existing sparse array formats are well-suited for on-disk representation, since even with dense arrays, zarr has to re-write chunks in their entirety when they are modified (assuming any sort of compression is used, and in practice likely will even if no compression is being used).
The in-memory representation that makes sense would depend on specific details of the implementation. I don't think there are any fundamental challenges there, but it may be necessary to make changes to the zarr-python code to support it.
I expect, therefore, that zarr-python would need some custom logic to assemble multiple sparse arrays together in order to support in-memory formats other than DOK
Just a [slightly off topic] thought... it would be very cool if pydata/sparse could implement a compressed-sparse-blocks format. There are several examples of this out there, although it looks like they are always assuming 2D matrices.
This would map very well to Zarr, with each chunk forming a block.
@ivirshup can say more but my understanding from SciPy was that https://github.com/GraphBLAS/binsparse-specification should be ready for investigations in Zarr-land soon.
Has there been any progress made towards this in the past several months? @rabernat @joshua-gould @elyall @jbms
Based on the above links to https://github.com/ivirshup/binsparse-python, there hasn't been a binsparse-python commit in ~6 months, so it may be wise to proceed without it. As has already been mentioned above, there is good dask support for sparse arrays, which is a starting point.
While there are complications with the variable sizing of some sparse arrays (CSR, for example), it makes sense to at least start with an implementation that supports sparse.COO, since it's relatively straightforward.
I'm hoping to take a stab at this over the coming weeks.
You should check out https://github.com/graphblas/binsparse-specification, which is more up to date. I haven't had time recently to update my initial python code, but hopefully will get to it in March.
Currently we are working more on doing stuff with sparse in dask, but I would not call the level of support "good".
As an update to this thread, I've been working with @ivirshup on updating the initial binsparse-py code for this use case
Has there been an update on this issue? Or is there a current workaround for saving sparse matrices to zarr stores or using them with Dask?
The current workaround is to use https://github.com/ivirshup/binsparse-python for reading/writing sparse arrays in scipy sparse CSR, CSC, and COO formats.
It would be nice if there was also support for pydata sparse arrays, as they are compatible with xarray and therefore Dask, which is not directly possible with scipy sparse arrays. Every time I try to use Dask to convert the written scipy arrays, I have to densify them and then make them pydata sparse, which kind of defeats the purpose. Is this something that is being worked on or planned on the horizon?
Perhaps there is a better way to do this, but I don't know how.
Zarr would benefit from formal support for storing sparse arrays, though I'm not sure what this would look like or what the process for getting there would be (e.g. placing the functionality in the core spec, creating a numcodecs codec, and/or creating a ZEP). But from some of the linked comments below it does seem very achievable.
If I'm wrong anywhere, please correct me. I'm not an expert on these packages.
Example use case: parallel processing of single cell RNA sequencing (scRNA-seq) data across a cluster.
Single cell sequencing data produces large arrays, often sparse, and often too large to store in memory. They would benefit from chunked, sparse, on-disk storage to empower chunked parallel processing on a Dask or Ray cluster in an efficient manner.
The current scRNA-seq analysis stack utilizes anndata to wrap the data and scanpy to analyze it. anndata was built upon HDF5 files, but there has been an effort to utilize zarr and dask (with the current caveat seemingly that the central matrix is a dense np.ndarray due to zarr's limitation). scanpy doesn't fully support dask arrays at the moment (e.g. https://github.com/scverse/scanpy/issues/2491), but since dask arrays follow the numpy API that shouldn't be hard to fix. Dask supports sparse arrays, and Ray can be used as a plug-in scheduler for dask. In practice I've found anndata is able to load in a scipy sparse CSR matrix from an HDF5 file.
Some previous efforts and discussions I've found:
- meta_array argument to specify array type

Proposed test code to set goalpost
We can try to implement all five of the array formats I've come across (included below), or stick to sparse.COO, which follows the numpy API most completely, is more powerful by being n-dimensional, and is easy to chunk as the coordinates are exact (though less space efficient than CSR or CSC). Also, sparse.COO is easily converted in memory to other sparse array types.
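For example, conversions between the in-memory formats are one-liners in pydata/sparse:

```python
import numpy as np
import sparse

coo = sparse.COO.from_numpy(np.eye(4))  # n-dimensional COO
gcxs = coo.asformat("gcxs")             # compressed GCXS form
dok = coo.asformat("dok")               # writable dictionary-of-keys form
csr = coo.to_scipy_sparse().tocsr()     # 2-D only: scipy.sparse CSR matrix
```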
My ultimate use case

I'd like to analyze this dataset on a (local/remote) Ray cluster using scanpy's default workflow at the start.
Some relevant stakeholders
@ivirshup, @alimanfoo, @rabernat, @jakirkham, @daletovar, @MSanKeys963