zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
http://zarr.readthedocs.io/
MIT License

Supporting links and references #389

Open jakirkham opened 5 years ago

jakirkham commented 5 years ago

In HDF5, there are a few different mechanisms used to refer to other data in an HDF5 file from a location other than where they are stored.

For example, HDF5 supports different types of links like hard links, soft links, and external links. These are analogous to links on the filesystem with the exception of external links, which constitute a soft link to a different HDF5 file. These can refer to groups or datasets in HDF5.

HDF5 also supports a couple of kinds of references, namely object references and region references. These refer to whole datasets or to some subselection of a dataset, respectively.

It would be useful to support these in Zarr, as there appear to be some use cases for them (https://github.com/zarr-developers/zarr/issues/297, https://github.com/zarr-developers/zarr/issues/298, https://github.com/zarr-developers/zarr/issues/333). Downstream libraries like pynwb would also like to use them (https://github.com/NeurodataWithoutBorders/pynwb/issues/230).

I am raising this here to discuss how we might support these features in Zarr across various stores.

amkigit commented 4 years ago

I like and use HDF5's object references. Instead of using symbolic links, would it be possible to create a "virtual" group (a directory or other data storage) that contains references (paths) to the actual group members? Instead of relying on the file system, the references would live in, let's say, a .vgroup file. That might be an easy and flexible approach.
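To make the idea concrete, here is a minimal sketch of how such a virtual group could be resolved. It assumes a plain key-value store (a `dict` stands in for one here) and a hypothetical `.vgroup` key holding a JSON mapping from member names to target paths. Neither the key name nor the helpers `write_vgroup` and `resolve_member` exist in Zarr; they are only for illustration.

```python
import json

VGROUP_KEY = ".vgroup"  # hypothetical metadata key, not part of the Zarr spec


def write_vgroup(store, group_path, members):
    """Store a name -> target-path mapping under <group_path>/.vgroup."""
    key = f"{group_path.strip('/')}/{VGROUP_KEY}"
    store[key] = json.dumps({"members": members}).encode("utf-8")


def resolve_member(store, group_path, name):
    """Return the real path that a virtual group member points to."""
    key = f"{group_path.strip('/')}/{VGROUP_KEY}"
    members = json.loads(store[key].decode("utf-8"))["members"]
    return members[name].strip("/")


# A plain dict stands in for a key-value store in this sketch.
store = {}
write_vgroup(
    store,
    "views/session1",
    {"time": "/shared/time", "stimulus": "/shared/stim_a"},
)
target = resolve_member(store, "views/session1", "time")
```

Because the mapping is stored as an ordinary key in the store rather than as a file system link, the same approach could in principle work for stores that have no notion of symbolic links.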

NumesSanguis commented 4 years ago

Someone recently made an issue over at ASDF (Advanced Scientific Data Format) to integrate with Zarr: https://github.com/spacetelescope/asdf/issues/718 ASDF supports:

However, ASDF currently doesn't support chunking, which would make Zarr a good addition to ASDF. The original issue points out that integrating with Zarr would benefit parallel computation with ASDF (e.g. with Dask).

alimanfoo commented 4 years ago

Thanks @NumesSanguis for making the connection, very interesting.

perrygreenfield commented 4 years ago

Thanks @NumesSanguis from the ASDF side as well. This is something we want to take a serious look at.

Cadair commented 4 years ago

Thanks for making the link, @NumesSanguis. Once we have wrapped our heads around things a little more, I might open some issues here :grinning:

NumesSanguis commented 4 years ago

Is there any update on this issue? It would be great to be able to refer to parts of a larger array as a new dataset, without having to copy the data or having to create your own parsing code.
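As a rough illustration of what I mean, a region reference could be as small as a target path plus a selection that is only applied at read time. The `RegionRef` class below is hypothetical (it is not a Zarr API), is limited to 1-D selections, and uses a dict of Python lists in place of real arrays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionRef:
    """Hypothetical reference: a target array path plus a 1-D selection."""

    path: str
    start: int
    stop: int

    def resolve(self, arrays):
        # Slice the target at read time; defining the reference copies nothing.
        return arrays[self.path][self.start:self.stop]


# A dict of lists stands in for a hierarchy of stored arrays.
arrays = {"recordings/full": list(range(10))}
baseline = RegionRef("recordings/full", 0, 3)
first_read = baseline.resolve(arrays)

# The reference follows later writes to the target array.
arrays["recordings/full"][0] = 99
second_read = baseline.resolve(arrays)
```

An N-dimensional version would presumably store a tuple of slices instead of a single start and stop.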

perrygreenfield commented 4 years ago

From our side, we would like to start working on this for ASDF in a couple of months (after a new staff member starts). If someone is willing to help, even better.

marcel-goldschen-ohm commented 1 year ago

Any news on this? I'm currently trying to plan out how to use Zarr for electrophysiology datasets. These data can have repeated stimulus patterns and identical time arrays, for which shared arrays (see #690) and/or groups seem like they could be really useful.