zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License
1.47k stars 274 forks source link

`KeyError` in `Group.__getitem__` for `zarr_format=2` #2275

Open TomAugspurger opened 3 hours ago

TomAugspurger commented 3 hours ago

Zarr version

v3

Numcodecs version

na

Python Version

na

Operating System

na

Installation

na

Description

In zarr-v2, you could create an array at a path and access it via a group:

In [4]: import zarr  # 2.18.3

In [5]: store = {}

In [6]: zarr.open_array(store=store, shape=(4,), path="foo")
Out[6]: <zarr.core.Array '/foo' (4,) float64>

In [7]: zarr.open_group(store)["foo"]
Out[7]: <zarr.core.Array '/foo' (4,) float64>

Steps to reproduce

In zarr v3, this raises a KeyError:

import zarr

store = zarr.store.MemoryStore(store_dict={}, mode="w")
zarr.open_array(store=store, shape=(4,), path="foo", zarr_format=2)
zarr.open_group(store)["foo"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[1], line 5
      3 store = zarr.store.MemoryStore(store_dict={}, mode="w")
      4 zarr.open_array(store=store, shape=(4,), path="foo", zarr_format=2)
----> 5 zarr.open_group(store)["foo"]

File ~/gh/zarr-developers/zarr-python/src/zarr/core/group.py:814, in Group.__getitem__(self, path)
    813 def __getitem__(self, path: str) -> Array | Group:
--> 814     obj = self._sync(self._async_group.getitem(path))
    815     if isinstance(obj, AsyncArray):
    816         return Array(obj)

File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:135, in SyncMixin._sync(self, coroutine)
    132 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
    133     # TODO: refactor this to to take *args and **kwargs and pass those to the method
    134     # this should allow us to better type the sync wrapper
--> 135     return sync(
    136         coroutine,
    137         timeout=config.get("async.timeout"),
    138     )

File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:91, in sync(coro, loop, timeout)
     88 return_result = next(iter(finished)).result()
     90 if isinstance(return_result, BaseException):
---> 91     raise return_result
     92 else:
     93     return return_result

File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:50, in _runner(coro)
     45 """
     46 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
     47 exception, the exception will be returned.
     48 """
     49 try:
---> 50     return await coro
     51 except Exception as ex:
     52     return ex

File ~/gh/zarr-developers/zarr-python/src/zarr/core/group.py:228, in AsyncGroup.getitem(self, key)
    226 zarr_json_bytes = await (store_path / ZARR_JSON).get()
    227 if zarr_json_bytes is None:
--> 228     raise KeyError(key)
    229 else:
    230     zarr_json = json.loads(zarr_json_bytes.to_bytes())

KeyError: 'foo'

Additional output

No response

TomAugspurger commented 3 hours ago

Ah I see the potential issue: the Group is opened with zarr_format=None which defaults to 3. So we have a Group with v3 metadata, which assumes that all its children will be v3 arrays, which is probably sensible. The fix here is to set zarr_format in both places:

zarr.open_group(store, zarr_format=2)["foo"]

So probably not a bug.

d-v-b commented 2 hours ago

For top level stuff like zarr.open, what do you think about automatically discovering the zarr format? We would need to handle the case where both a v2 and v3 node are under the same prefix, but I think this would be a pretty user-friendly and intuitive feature.

TomAugspurger commented 2 hours ago

I think the current implementation does a good job with auto-discovery when reading. The problem with my snippet is that the zarr.open_group(store) automatically created a Group, which falls back to a v3 default.

If the store had already had a (v2) Group then this works:

In [1]: import zarr

In [2]: store = zarr.store.MemoryStore(store_dict={}, mode="w")

In [3]: zarr.open_array(store=store, shape=(4,), path="foo", zarr_format=2)
Out[3]: <Array memory://4537150336/foo shape=(4,) dtype=float64>

In [4]: zarr.open_group(store, zarr_format=2)  # create the v2 Group
Out[4]: Group(_async_group=<AsyncGroup memory://4537150336>)

In [5]: zarr.open_group(store)["foo"]  # read the v2 Group (relying on zarr_format inference)
Out[5]: <Array memory://4537150336/foo shape=(4,) dtype=float64>

So maybe this question here is to what extent we want to support hierarchies with a mixture of Zarr V2 and Zarr V3 nodes. That makes my head hurt a little bit, but I guess it's doable (don't assume that Group.zarr_format will match the child zarr format, so don't pass that through in the __getitem__ call and rely on zarr's auto discovery?).

d-v-b commented 2 hours ago

So maybe this question here is to what extent we want to support hierarchies with a mixture of Zarr V2 and Zarr V3 nodes.

I think we want no support for this. IMO, a v3 group with a v2 array as a member should not be expressible in zarr-python. But I don't think it should be an error to create a v3 and v2 group in the same place. Maybe part of the problem is the open_group function itself, which by name is ambiguous about whether it creates or reads. I would suggest adding a read_group function that uses mode=r under the hood, thereby avoiding surprise group creation. Similarly for read_array, and an array-or-group read function. I can open a PR for this when i get some time this week.

TomAugspurger commented 2 hours ago

Explicit read_* and create_* would be fantastic :)