zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

[v3] Batch array / group access #1805

Open d-v-b opened 4 months ago

d-v-b commented 4 months ago

In v3, since the storage API is asynchronous, we can open multiple arrays or groups concurrently. This would be broadly useful, but we don't have a good template from zarr-python v2 to extrapolate from, so we have to invent something new here (new, relative to zarr-python, that is).

Over in #1804 @martindurant brought this up, and I suggested something like this:

def open_nodes(store: Store, paths: tuple[str, ...], options: dict[Literal["array", "group"], dict[str, Any]]) -> tuple[Array | Group, ...]:
    ...

def open_arrays(store: Store, paths: tuple[str, ...], options: dict[str, Any]) -> tuple[Array, ...]:
    ...

def open_groups(store: Store, paths: tuple[str, ...], options: dict[str, Any]) -> tuple[Group, ...]:
    ...

I was imagining that the arguments to these functions would be the paths of arrays / groups anywhere in a Zarr hierarchy; we could also have a group.open_groups() method which can only "see" sub-groups, and similarly for group.open_arrays().
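For illustration, a minimal sketch of how such a batch opener could fan out over an asynchronous store with `asyncio.gather`. The dict-backed store and the `_open_node` coroutine here are stand-ins, not zarr-python API; in the real implementation `_open_node` would hit the async storage layer to fetch node metadata:

```python
import asyncio
from typing import Any

# Hypothetical async opener; a real version would use the asynchronous
# storage API to read array/group metadata for `path`.
async def _open_node(store: dict[str, Any], path: str) -> Any:
    await asyncio.sleep(0)  # stand-in for real async I/O
    return store[path]

def open_nodes(store: dict[str, Any], paths: tuple[str, ...]) -> tuple[Any, ...]:
    """Open several nodes concurrently in a single event-loop run."""
    async def _gather() -> tuple[Any, ...]:
        # gather preserves input order, so results line up with `paths`
        return tuple(await asyncio.gather(*(_open_node(store, p) for p in paths)))
    return asyncio.run(_gather())

# usage: both opens are dispatched concurrently
store = {"a": "array-a", "g/b": "array-b"}
print(open_nodes(store, ("a", "g/b")))  # ('array-a', 'array-b')
```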

An alternative would be to use a more general transactional context manager:

with transaction(store) as tx:
    a1_maybe = tx.open_array(...)
    a2_maybe = tx.open_array(...)
    # IO gets run concurrently in `__exit__`

a1 = a1_maybe.result()
a2 = a2_maybe.result()

I'm a lot less sure of this second design, since I have never implemented anything like it. For example, should we use futures for the results of tx.open_array()?
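To make the futures question concrete, here is one possible sketch, assuming the transaction merely records work and resolves `concurrent.futures.Future` objects when the block exits. The `Transaction` class and its `open_array` signature are invented for illustration; a real version would take a path and options and call the async store:

```python
import asyncio
import concurrent.futures
from typing import Any, Callable

class Transaction:
    """Hypothetical batch context: open_* calls only enqueue work and
    return a Future; all I/O runs concurrently on block exit."""

    def __init__(self) -> None:
        self._pending: list[tuple[concurrent.futures.Future, Callable[[], Any]]] = []

    def open_array(self, opener: Callable[[], Any]) -> concurrent.futures.Future:
        fut: concurrent.futures.Future = concurrent.futures.Future()
        self._pending.append((fut, opener))
        return fut  # not resolved until __exit__

    def __enter__(self) -> "Transaction":
        return self

    def __exit__(self, *exc: object) -> None:
        async def _resolve(fut: concurrent.futures.Future, opener: Callable[[], Any]) -> None:
            await asyncio.sleep(0)  # stand-in for async store I/O
            fut.set_result(opener())

        async def _run_all() -> None:
            await asyncio.gather(*(_resolve(f, op) for f, op in self._pending))

        asyncio.run(_run_all())

# usage: futures are pending inside the block, resolved after it
with Transaction() as tx:
    a1_maybe = tx.open_array(lambda: "array-1")
    a2_maybe = tx.open_array(lambda: "array-2")

print(a1_maybe.result(), a2_maybe.result())  # array-1 array-2
```

One design consequence: calling `.result()` inside the block would deadlock or raise, since nothing resolves the futures until `__exit__` runs, which is a real usability hazard of this style.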

Are there other ideas, or examples from other implementations / domains we could draw from?

martindurant commented 4 months ago

I would also add a data getter, so in the first model something like

get_data({arr_obj1: (0, [1, 2, 3], slice(None, None, None)), arr_obj2: (Ellipsis, 0), ...})

or similar.
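A minimal sketch of what such a batched getter might look like, assuming a mapping from array-like objects to index selections, with all reads dispatched concurrently. The `get_data` name comes from the comment above; the `_read` coroutine is a stand-in for real async chunk I/O (plain tuples play the role of arrays here so the example is self-contained):

```python
import asyncio
from typing import Any

# Stand-in for an async chunk read; a real version would fetch and
# decode the chunks covering `selection` from the store.
async def _read(arr: Any, selection: Any) -> Any:
    await asyncio.sleep(0)  # stand-in for async I/O
    return arr[selection]

def get_data(requests: dict) -> dict:
    """Resolve many (array -> selection) reads in one concurrent batch."""
    async def _gather() -> dict:
        keys = list(requests)
        # gather preserves order, so results pair back up with keys
        results = await asyncio.gather(*(_read(k, requests[k]) for k in keys))
        return dict(zip(keys, results))
    return asyncio.run(_gather())

# usage: one call, one round of concurrent I/O
a1 = (10, 20, 30)
a2 = (1, 2, 3, 4)
print(get_data({a1: 0, a2: slice(0, 2)}))
```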

In addition, a convenience method layered on the functions above:

group.descent_to_path(path1, path2, ...)

(where store and options are fixed).

I am not suggesting these are the right signatures, but this is the functionality I would want. After all, "open many nodes" is already partly covered in v2 by the special case of consolidated metadata (one call with no further latency, rather than many concurrent calls).