zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License
1.45k stars 273 forks source link

feature(store): V3 ZipStore #2078

Open jhamman opened 1 month ago

jhamman commented 1 month ago

I threw together a super basic implementation of a v3 zip store. I am still testing this but it seems to be working for basic things.

closes #2010

try it!

In [1]: import zarr
   ...: 
   ...: store = zarr.store.ZipStore('data/example.zip', mode='w')
   ...: root = zarr.group(store=store)
   ...: z = root.create_array(shape=(100, 100), chunks=(10, 10), name="foo")

In [2]: z[:] = 42

In [3]: store.close()

# unzip -l data/example.zip

TODO:

d-v-b commented 1 month ago

I played around with this a bit locally.

jhamman commented 4 weeks ago

I played around with this a bit locally.

  • the zip archive gets completely broken if you forget to call store.close after writing (or, presumably, if the program exits before reaching the line where store.close() is invoked). I think this is severe enough that this PR should have the ZipStore implement ContextManager to defend against data corruption issues.

There is now a context manager interface here. I'll note that I'm mildly uncomfortable with this because its not clear if this should be a async context manager or not. While I think this is what you were asking for, its not really the way classes that do async things should work. I've got some ideas for how we can improve this but I think we should hold them for another PR.

Two asides:

  1. not closing the v2 ZipStore also led to bad things. So this is "expected behavior" I guess.
  2. we could consider an atexit cleanup of the file handle but I think we should first decide on the context api.
  • attempting to write twice to the array fails, because the zip store doesn't allow deletion. In such a case, the user will see a NotImplementedError. We should probably make something more informative for this case. But I don't think this is a blocker.

Noting that this behavior is also present in v2: https://github.com/zarr-developers/zarr-python/blob/11fd8db8f06372ffc1f6f7e22315d43e4702ec60/zarr/storage.py#L1884-L1885

  • We will at some point need to address the possibility of zip file stored on s3. Our current store hierarchy can't express this easily I think. but that's a future problem.

I think fsspec can handle most of this for us. Some work is likely needed there but it should be possible.

joshmoore commented 1 week ago

A sidenote that similar to the situation with consolidated metadata, the zip store will bring v3 to parity with v2 (good thing :tm: 👍🏽) but will leave other implementations in the same situation of not having a spec. If there's a chance that the discussion around that spec will lead to changes in the on-disk format, it would be in our best interest to make sure we have a specification to go along with these change so that there's less likelihood of multiple versions we need to support.

d-v-b commented 1 week ago

A sidenote that similar to the situation with consolidated metadata, the zip store will bring v3 to parity with v2 (good thing ™️ 👍🏽) but will leave other implementations in the same situation of not having a spec. If there's a chance that the discussion around that spec will lead to changes in the on-disk format, it would be in our best interest to make sure we have a specification to go along with these change so that there's less likelihood of multiple versions we need to support.

+1, it would be great to define a spec for this so other implementations can safely support it

jhamman commented 1 day ago

Draft spec to go along with this feature is now available for review: https://github.com/zarr-developers/zarr-specs/pull/311

LDeakin commented 3 hours ago

I tried this, and it produces zip files that are compatible with the zip store implementation in zarrs. However, I noticed that this fails because it tries to call delete:

z = root.create_array(shape=(100, 100), chunks=(10, 10), name="foo", dtype=np.uint8, fill_value=42)
z[:] = 42

I would expect it to succeed and just not write empty chunks.

We will at some point need to address the possibility of zip file stored on s3. Our current store hierarchy can't express this easily I think. but that's a future problem.

For reference, in zarrs I opted to implement zip support as a "storage adapter" that can be layered on other stores. See example. But zip support is read-only, regardless of the backing store.