Zarr-v3 Consolidated Metadata

TomAugspurger commented 2 weeks ago

Implements the optional Consolidated Metadata feature of zarr-v3: https://github.com/zarr-developers/zarr-specs/pull/309

This defines a new dataclass: ConsolidatedMetadata. It's an optional field on the existing GroupMetadata object. Opening as a draft until the PR to the zarr-specs repo finishes up.

(currently rebasing this on top of https://github.com/zarr-developers/zarr-python/pull/2117 and https://github.com/zarr-developers/zarr-python/pull/2118)

TODO:

[ ] Add unit tests and/or doctests in docstrings
[ ] Add docstrings and API docs for any new/modified user-facing classes and functions
[ ] New/modified features documented in docs/tutorial.rst
[ ] Changes documented in docs/release.rst
[ ] GitHub Actions have all passed
[ ] Test coverage is 100% (Codecov passes)

TomAugspurger commented 2 weeks ago

This isn't handling nested groups correctly yet. My understanding (and I'll clarify this in the spec) is that given a structure like

/root
  /arr-1
  /arr-2
  /child-group
    /child-arr-1
    /child-arr-2

we'd expect that the metadata from child-group, child-arr-1, and child-arr-2 should end up in the consolidated metadata.
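
To make that expectation concrete, here's a minimal sketch that walks a hierarchy and collects every descendant's metadata under its path relative to the root. The nested-dict stand-in for the hierarchy and the `consolidate` helper are assumptions for illustration, not zarr-python APIs:

```python
# Hypothetical stand-in for the hierarchy above:
# a dict is a group, the string "array" marks an array.
hierarchy = {
    "arr-1": "array",
    "arr-2": "array",
    "child-group": {
        "child-arr-1": "array",
        "child-arr-2": "array",
    },
}


def consolidate(group: dict, prefix: str = "") -> dict:
    """Collect metadata for every descendant of `group`, keyed by relative path."""
    out = {}
    for name, node in group.items():
        path = f"{prefix}{name}"
        if isinstance(node, dict):
            # Record the group itself, then recurse into its members.
            out[path] = {"node_type": "group"}
            out.update(consolidate(node, prefix=f"{path}/"))
        else:
            out[path] = {"node_type": "array"}
    return out


consolidated = consolidate(hierarchy)
for key in sorted(consolidated):
    print(key, consolidated[key]["node_type"])
```

The point is only that child-group and both of its arrays show up alongside arr-1 and arr-2, not just the immediate children of the root.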

TomAugspurger commented 1 day ago

Quick update here:

AsyncGroup now uses consolidated metadata in some operations (primarily getitem; I'm going through some more spots now). This means you can do AsyncGroup.getitem(key) to get a child node without any additional I/O.
https://github.com/zarr-developers/zarr-python/pull/2113/commits/fc901eb5be0e4aa113cd913feb6a135223ba9bb0 added support for reading zarr V2 consolidated metadata. Needs some cleanup and probably more testing, but the basics seem to work.
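
As a rough illustration of the "no additional I/O" point, here's a toy model (not the actual AsyncGroup code). `CountingStore` and `ToyGroup` are hypothetical names; the only thing borrowed from zarr v3 is that a node's metadata document is named zarr.json:

```python
from typing import Optional


class CountingStore:
    """Hypothetical in-memory store that counts how many documents are read."""

    def __init__(self, documents: dict) -> None:
        self.documents = documents
        self.reads = 0

    def get(self, key: str) -> dict:
        self.reads += 1
        return self.documents[key]


class ToyGroup:
    """Toy group; `consolidated` maps child names to already-loaded metadata."""

    def __init__(self, store: CountingStore, consolidated: Optional[dict] = None) -> None:
        self.store = store
        self.consolidated = consolidated

    def getitem(self, key: str) -> dict:
        if self.consolidated is not None:
            # Served from memory: the store is never touched.
            return self.consolidated[key]
        # Fallback: one metadata read per child.
        return self.store.get(f"{key}/zarr.json")


store = CountingStore({"arr/zarr.json": {"node_type": "array"}})

plain = ToyGroup(store)
plain.getitem("arr")
reads_without_consolidated = store.reads

fast = ToyGroup(store, consolidated={"arr": {"node_type": "array"}})
fast.getitem("arr")
reads_with_consolidated = store.reads - reads_without_consolidated
```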

TomAugspurger commented 11 hours ago

As I start to use this a bit, I'm rethinking the in-memory representation of consolidated metadata for a nested hierarchy, specifically the consolidated_metadata.metadata dictionary, which maps keys to ArrayMetadata | GroupMetadata objects. Our options are:

1. A flat structure, where the keys include all path segments from the root group and the length of the root ConsolidatedMetadata.metadata dict is equal to the number of all child nodes (not just immediate children). This matches the representation on disk.
2. A nested structure, where the keys include just the name (so not the leading segments of the path). The root ConsolidatedMetadata.metadata dict holds just the immediate child nodes. Metadata for nested groups can still be accessed, but through the .consolidated_metadata on the children. This matches the logical representation.

The motivation for this rethink is that the current implementation has to be careful about using consolidated_metadata.metadata directly. If you want to propagate consolidated metadata, you need to use the Group.getitem method. See Group.members where this becomes relevant.

Here's an example:

Given a hierarchy like

root/
  g0/
   c0/
    array0
    array1
   c1/
     array0
     array1
  g1/
    c0/
      array0
      array1
    c1/
      array0
      array1

On disk, we'll represent the consolidated metadata as a flat mapping of (store) keys to values.

"g0": {"attributes": ..., "node_type": "group"},
"g1": {"attributes": ..., "node_type": "group"},
"g0/c0": {"attributes": ..., "node_type": "group"},
...
"g0/c0/array0": {"shape": [...], "node_type": "array"},
...
"g1/c1/array1": {"shape": [...], "node_type": "array"}

But in memory, what is the Group.metadata.consolidated_metadata for each of these groups?

Should it match the flat structure on disk, where the keys imply the structure?

# g0
GroupMetadata(
    attributes=...,
    consolidated_metadata=ConsolidatedMetadata(
        metadata={
            "c0": GroupMetadata(...),
            "c0/array0": ArrayMetadata(...),
            ...,
            "c1/array1": ArrayMetadata(...),
        },
    ),
)
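
With the flat option, a child group's view could be derived from its parent's mapping by filtering on the path prefix and stripping it from the keys. A minimal sketch on plain dicts; `flat_view` is a hypothetical helper, not something in this PR:

```python
def flat_view(flat: dict, group_path: str) -> dict:
    """Return the entries below `group_path`, with keys made relative to it."""
    prefix = group_path.rstrip("/") + "/"
    return {
        key[len(prefix):]: value
        for key, value in flat.items()
        if key.startswith(prefix)
    }


# A cut-down version of the root mapping from the example hierarchy.
root_flat = {
    "g0": {"node_type": "group"},
    "g1": {"node_type": "group"},
    "g0/c0": {"node_type": "group"},
    "g0/c0/array0": {"node_type": "array"},
    "g0/c1": {"node_type": "group"},
    "g0/c1/array1": {"node_type": "array"},
    "g1/c0": {"node_type": "group"},
}

g0_flat = flat_view(root_flat, "g0")
print(sorted(g0_flat))
```

This is the kind of bookkeeping that has to happen on every getitem when the flat mapping is what's stored on the group.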

Or should it have a nested / tree-like structure, where just immediate children appear in group.metadata.consolidated_metadata.metadata, and nested members can be accessed through that dictionary?

# g0
GroupMetadata(
    attributes=...,
    consolidated_metadata=ConsolidatedMetadata(
        metadata={
            "c0": GroupMetadata(
                attributes=...,
                consolidated_metadata=ConsolidatedMetadata(
                    metadata={
                        "array0": ArrayMetadata(...),
                        "array1": ArrayMetadata(...),
                    }
                )
            ),
            "c1": GroupMetadata(
                attributes=...,
                consolidated_metadata=ConsolidatedMetadata(
                    metadata={
                        "array0": ArrayMetadata(...),
                        "array1": ArrayMetadata(...),
                    }
                )
            )
        },
    ),
)

Right now I've implemented option 1. I'll try out option 2 today.

d-v-b commented 11 hours ago

do we have to pick just 1 in-memory representation? Over in pydantic-zarr I do something similar to metadata consolidation, and I use a tree representation or a flat dict[str, Array | Group] representation situationally (see the to_flat and from_flat functions).

TomAugspurger commented 10 hours ago

I think we'll be able to translate between the two pretty easily. I think we'll want methods like you shared to convert between the two.
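
A minimal sketch of what such a pair of helpers could look like. It uses toy dataclasses, not the real GroupMetadata / ArrayMetadata classes or pydantic-zarr's implementation; `ToyArray`, `ToyGroup`, and the bodies of `to_flat` / `from_flat` here are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class ToyArray:
    shape: tuple = ()


@dataclass
class ToyGroup:
    # Immediate children only, keyed by name (the "nested" representation).
    children: dict = field(default_factory=dict)


Node = Union[ToyArray, ToyGroup]


def to_flat(group: ToyGroup, prefix: str = "") -> dict:
    """Flatten a nested group into {relative/path: node}."""
    flat = {}
    for name, node in group.children.items():
        path = f"{prefix}{name}"
        if isinstance(node, ToyGroup):
            # Flat entries for groups carry no children; the keys imply them.
            flat[path] = ToyGroup()
            flat.update(to_flat(node, prefix=f"{path}/"))
        else:
            flat[path] = node
    return flat


def from_flat(flat: dict) -> ToyGroup:
    """Rebuild the nested tree from path keys."""
    root = ToyGroup()
    for path in sorted(flat):  # a parent path always sorts before its children
        *parents, name = path.split("/")
        parent = root
        for segment in parents:
            parent = parent.children.setdefault(segment, ToyGroup())
        node = flat[path]
        parent.children[name] = ToyGroup() if isinstance(node, ToyGroup) else node
    return root


nested = ToyGroup(
    children={
        "c0": ToyGroup(children={"array0": ToyArray((4,)), "array1": ToyArray((4,))}),
        "c1": ToyGroup(children={"array0": ToyArray((8,)), "array1": ToyArray((8,))}),
    }
)
flat = to_flat(nested)
round_tripped = from_flat(flat)
```

In this toy version the round trip is lossless, so either form could serve as the stored default with the other derived on demand.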

I think the main choice is what to use if we want a "default" representation that's exposed through GroupMetadata.consolidated_metadata.metadata. I think we do, and I'm leaning towards nested being the default.

zarr-developers / zarr-python

Zarr-v3 Consolidated Metadata #2113