zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

RFC: a solution for versioned Zarrs based on versioned S3 bucket #314

Open yarikoptic opened 2 months ago

yarikoptic commented 2 months ago

Inspired by

I've decided to share the design we are currently pursuing, and to ask for feedback and possibly guidance and/or collaboration.

In the DANDI archive (https://dandiarchive.org/), where we use a versioned S3 bucket for the actual data storage, we are also working on versioning of Zarr filesets. Notes on the final design can be found in

but in a nutshell it is centered around a few simple pieces: the native versioning of the S3 bucket, a checksum over the files in a Zarr, and a "manifest" file that collects the keys/versionIds for a given version of the Zarr (so the ideas are similar to git itself). In more detail:

To show the feasibility of such an approach we provide

But I wondered: is there a way, or a need, to formalize some "zarr manifest" listing which could then be reused across solutions? I am not quite sure that it belongs at the level of storage transformers, since IMHO it should be a specification on top of a Zarr instance rather than a specification within Zarr. WDYT?
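
To make the manifest idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not the actual DANDI format: the `Entry` fields, the `build_manifest` helper, the JSON layout, and the choice of SHA-256 over sorted `key:md5:size` lines as the Zarr-level checksum are all hypothetical. In practice the keys and versionIds would come from listing object versions in the bucket; here they are passed in as plain values so the sketch stays self-contained.

```python
import hashlib
import json
from typing import Iterable, NamedTuple


class Entry(NamedTuple):
    """One file of a Zarr as stored in a versioned bucket (hypothetical layout)."""
    key: str         # object key relative to the Zarr root, e.g. "0/0/0"
    version_id: str  # S3 versionId of the object for this Zarr version
    md5: str         # per-file checksum
    size: int        # object size in bytes


def build_manifest(entries: Iterable[Entry]) -> dict:
    """Collect entries into a manifest with a checksum over the whole Zarr.

    The checksum covers key, md5 and size of every file, in sorted key order,
    so it depends on content only and not on listing order or versionIds.
    """
    ordered = sorted(entries, key=lambda e: e.key)
    digest = hashlib.sha256()
    for e in ordered:
        digest.update(f"{e.key}:{e.md5}:{e.size}\n".encode("utf-8"))
    return {
        "checksum": digest.hexdigest(),
        "entries": {
            e.key: {"versionId": e.version_id, "md5": e.md5, "size": e.size}
            for e in ordered
        },
    }


if __name__ == "__main__":
    # Made-up keys, versionIds and checksums, for demonstration only.
    manifest = build_manifest([
        Entry(".zarray", "v-aaa", "0123456789abcdef0123456789abcdef", 312),
        Entry("0/0", "v-bbb", "fedcba9876543210fedcba9876543210", 4096),
    ])
    print(json.dumps(manifest, indent=2))
```

With something like this, a reader who has the manifest could fetch each key at its recorded versionId to reconstruct exactly that version of the Zarr, and compare the top-level checksum to verify it.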

rabernat commented 2 months ago

Hi @yarikoptic, thanks for sharing this! Looks cool!

FYI, we are planning on open sourcing the solution we have built at Earthmover later this fall.

rabernat commented 1 month ago

Hi folks! We released our project! You can read all about it here: https://icechunk.io/