Users come to Zarr with a variety of array-like objects -- numpy arrays, dask arrays, xarray DataArrays, zarr v2 arrays, zarr v3 arrays, etc. Imagine a Venn diagram of attributes / methods for these objects: `shape`, `__getitem__`, `dtype` would be in the shared middle, and `chunks`, `chunksize`, `attrs`, `dims`, `codecs`, `filters` in the disjoint periphery. How can we conveniently model an arbitrary array-like object as a Zarr array? In particular, how can we ensure that you can create a complete Zarr array from an existing array-like object (which might already be a zarr array) with a single function call?
If we agree on that objective, then here is a rough outline of what that function could look like:
We should have a top-level `from_array` function that creates a Zarr array from an existing array-like object.
import numpy as np
import dask.array as da
import xarray
import zarr

# numpy
np_arr = np.zeros(10)
zarr.from_array(np_arr) # memorystore-backed zarr v3 array with shape 10 and dtype float64, and default parameters for everything else
zarr.from_array(np_arr, zarr_format=2, compressor=Gzip(), attributes={'foo': 10}) # same as above, but v2, with gzip, and attributes
# dask
da_arr = da.zeros((10,), chunks=(1,))
zarr.from_array(da_arr) # inherits the chunks attribute from the array
zarr.from_array(da_arr, chunking_bikeshed=(2,)) # overrides the chunks attribute, kwarg name tbd 🙃
# xarray
xr_arr = xarray.DataArray(np.zeros(10), attrs={'foo': 10}, dims=('dim_0',))
zarr.from_array(xr_arr) # zarr v3 array with dimension names inherited from xr_arr.dims and attrs from xr_arr.attrs
# zarr
zarr.from_array(zarr.zeros(10)) # makes a copy of the array
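
To make the Venn-diagram idea a bit more concrete, here is a minimal sketch of the metadata-inference step such a function might perform. Everything in it is an assumption for illustration: `ArraySpec`, `infer_array_spec` and `_normalize_chunks` are hypothetical names, not existing zarr API. It only duck-types the attributes listed above (`shape`, `dtype`, `chunks`, `attrs`, `dims`), lets keyword arguments override what was inferred, and never touches the data itself. It also assumes that dask-style chunks (a tuple of per-axis block sizes) should be reduced to the largest block along each axis.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ArraySpec:
    """Hypothetical container for the metadata a from_array call would need."""

    shape: tuple
    dtype: str
    chunks: tuple
    attributes: dict = field(default_factory=dict)
    dimension_names: Optional[tuple] = None


def _normalize_chunks(chunks: Any, shape: tuple) -> tuple:
    """Turn whatever `chunks` looks like into one int per axis."""
    if chunks is None:
        # Assumption: an unchunked source maps to a single chunk.
        return tuple(shape)
    # dask reports per-axis tuples of block sizes, e.g. ((1, 1, 1),);
    # zarr reports one int per axis, e.g. (1,).
    return tuple(
        max(c) if isinstance(c, (tuple, list)) else int(c) for c in chunks
    )


def infer_array_spec(obj: Any, **overrides: Any) -> ArraySpec:
    """Collect array metadata from an array-like object by duck typing."""
    shape = tuple(obj.shape)
    dims = getattr(obj, "dims", None)
    inferred = {
        "shape": shape,
        "dtype": str(obj.dtype),
        "chunks": _normalize_chunks(getattr(obj, "chunks", None), shape),
        "attributes": dict(getattr(obj, "attrs", None) or {}),
        "dimension_names": tuple(dims) if dims is not None else None,
    }
    # Explicit keyword arguments win over anything inferred from the object.
    inferred.update(overrides)
    return ArraySpec(**inferred)
```

With something like this, `infer_array_spec(da_arr)` would pick up the dask chunking, while `infer_array_spec(da_arr, chunks=(2,))` would override it, mirroring the two dask calls above.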
Some open questions:
- Should we copy data? Over in `pydantic-zarr` I implemented a [`from_array`](https://github.com/janelia-cellmap/pydantic-zarr/blob/main/src/pydantic_zarr/v2.py#L181) function that only creates array metadata, because users might not want to eagerly move 10 TB of data at array definition time. Perhaps this could be controlled with a keyword argument.
- Should we support creating v2 arrays through this API, or use a `v2.from_array` function for that? I'm fine either way.
- How much work is required to implicitly model the different array-like libraries enough for the above functionality to be useful?
- There is a similar question about zarr groups, but the set of "zarr-group-like objects" is a bit narrower than array-likes.
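
On the copy-data question, one possible shape for the keyword argument is sketched below. The names `chunk_slices`, `fill_from` and `write_data` are hypothetical, and the sketch assumes that both source and target accept a tuple of slices as an index and that chunk sizes are positive. When copying is enabled, it moves one chunk-aligned region at a time, so the whole source never has to be held in memory at once.

```python
import itertools


def chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk-aligned region of an array."""
    per_axis = [
        [slice(start, min(start + size, length)) for start in range(0, length, size)]
        for length, size in zip(shape, chunks)
    ]
    yield from itertools.product(*per_axis)


def fill_from(source, target, shape, chunks, write_data=True):
    """Copy `source` into `target` region by region; return regions written.

    With write_data=False nothing is read or written, which corresponds to
    the metadata-only behaviour described above.
    """
    if not write_data:
        return 0
    written = 0
    for region in chunk_slices(shape, chunks):
        target[region] = source[region]
        written += 1
    return written
```

Whether the default should be `write_data=True` or `False` is exactly the open question; the sketch only shows that both behaviours can live behind one flag.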
Thoughts?