zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

Zarr-v3 Consolidated Metadata #2113

Open TomAugspurger opened 2 weeks ago

TomAugspurger commented 2 weeks ago

Implements the optional Consolidated Metadata feature of zarr-v3: https://github.com/zarr-developers/zarr-specs/pull/309

This defines a new dataclass: ConsolidatedMetadata. It's an optional field on the existing GroupMetadata object. Opening as a draft until the PR to the zarr-specs repo finishes up.

(currently rebasing this on top of https://github.com/zarr-developers/zarr-python/pull/2117 and https://github.com/zarr-developers/zarr-python/pull/2118)

TODO:

TomAugspurger commented 2 weeks ago

This isn't handling nested groups correctly yet. My understanding (and I'll clarify this in the spec) is that given a structure like

/root
  /arr-1
  /arr-2
  /child-group
    /child-arr-1
    /child-arr-2

we'd expect that the metadata from child-group, child-arr-1, and child-arr-2 should end up in the consolidated metadata.
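A minimal sketch of that expectation, using plain dicts as a hypothetical stand-in for the hierarchy (the `members` key and the `iter_descendants` helper are made up for illustration and are not part of zarr-python): consolidating at the root should visit every descendant, not only the immediate children.

```python
from typing import Any, Iterator

# Hypothetical stand-in for a hierarchy: groups are dicts with a "members"
# mapping, arrays are dicts without one. This is NOT the zarr-python API.
Node = dict[str, Any]


def iter_descendants(group: Node, prefix: str = "") -> Iterator[tuple[str, Node]]:
    """Yield (path, metadata) for every node below `group`, at any depth."""
    for name, child in group.get("members", {}).items():
        path = f"{prefix}/{name}" if prefix else name
        # Emit the child's own metadata, without its nested members.
        yield path, {k: v for k, v in child.items() if k != "members"}
        if child["node_type"] == "group":
            # Recurse so that members of child groups are included too.
            yield from iter_descendants(child, path)


root = {
    "node_type": "group",
    "members": {
        "arr-1": {"node_type": "array"},
        "arr-2": {"node_type": "array"},
        "child-group": {
            "node_type": "group",
            "members": {
                "child-arr-1": {"node_type": "array"},
                "child-arr-2": {"node_type": "array"},
            },
        },
    },
}

consolidated = dict(iter_descendants(root))
for path in sorted(consolidated):
    print(path, consolidated[path]["node_type"])
```

Here `consolidated` ends up with five entries: the two root arrays, `child-group` itself, and the two arrays inside it.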

TomAugspurger commented 1 day ago

Quick update here:

TomAugspurger commented 11 hours ago

As I start to use this a bit, I'm rethinking the in-memory representation of consolidated metadata for a nested hierarchy. Specifically the consolidated_metadata.metadata dictionary which maps keys to ArrayMetadata | GroupMetadata objects. Our options are:

  1. A flat structure, where the keys include all path segments from the root group, and the root ConsolidatedMetadata.metadata dict has one entry per descendant node (not just immediate children). This matches the representation on disk.
  2. A nested structure, where the keys include just the name (so not the leading segments of the path). The root ConsolidatedMetadata.metadata dict holds just the immediate child nodes. Metadata for nested groups can still be accessed, but through the .consolidated_metadata on the children. This matches the logical representation.

The motivation for this rethink is that the current implementation has to be careful about using consolidated_metadata.metadata directly. If you want to propagate consolidated metadata, you need to use the Group.getitem method. See Group.members where this becomes relevant.

Here's an example:

Given a hierarchy like

root/
  g0/
    c0/
      array0
      array1
    c1/
      array0
      array1
  g1/
    c0/
      array0
      array1
    c1/
      array0
      array1

On disk, we'll represent the consolidated metadata as a flat mapping of (store) keys to values.

"g0": {"attributes": ..., "node_type": "group"},
"g1": {"attributes": ..., "node_type": "group"},
"g0/c0": {"attributes": ..., "node_type": "group"},
...
"g0/c0/array0": {"shape": [...], "node_type": "array"},
...
"g1/c1/array1": {"shape": [...], "node_type": "array"}

But in memory, what is the Group.metadata.consolidated_metadata for each of these groups?

Should it match the flat structure on disk, where the keys imply the structure?

# g0
GroupMetadata(
    attributes=...,
    consolidated_metadata=ConsolidatedMetadata(
        metadata={
            "c0": GroupMetadata(...),
            "c0/array0": ArrayMetadata(...),
            ...,
            "c1/array1": ArrayMetadata(...),
        },
    ),
)

Or should it have a nested / tree-like structure, where just immediate children appear in group.metadata.consolidated_metadata.metadata, and nested members can be accessed through that dictionary?

# g0
GroupMetadata(
    attributes=...,
    consolidated_metadata=ConsolidatedMetadata(
        metadata={
            "c0": GroupMetadata(
                attributes=...,
                consolidated_metadata=ConsolidatedMetadata(
                    metadata={
                        "array0": ArrayMetadata(...),
                        "array1": ArrayMetadata(...),
                    }
                )
            ),
            "c1": GroupMetadata(
                attributes=...,
                consolidated_metadata=ConsolidatedMetadata(
                    metadata={
                        "array0": ArrayMetadata(...),
                        "array1": ArrayMetadata(...),
                    }
                )
            )
        },
    ),
)

Right now I've implemented option 1. I'll try out option 2 today.
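To make the "being careful" part of option 1 concrete, here is a rough sketch with plain dicts standing in for the metadata objects. The `child_consolidated` helper is hypothetical and is not what the PR implements; it only shows that, with flat keys, handing consolidated metadata down to a child group means slicing the parent's mapping by prefix and re-rooting the keys.

```python
def child_consolidated(flat: dict[str, dict], child: str) -> dict[str, dict]:
    """Return the entries of a flat mapping that live under `child`, re-rooted."""
    prefix = child + "/"
    return {
        key[len(prefix):]: value
        for key, value in flat.items()
        if key.startswith(prefix)
    }


# Flat consolidated metadata as seen from g0 (option 1).
g0_flat = {
    "c0": {"node_type": "group"},
    "c0/array0": {"node_type": "array"},
    "c0/array1": {"node_type": "array"},
    "c1": {"node_type": "group"},
    "c1/array0": {"node_type": "array"},
    "c1/array1": {"node_type": "array"},
}

# g0_flat["c0"] alone is only the group's own metadata; its members have to
# be sliced out of the parent's mapping before they can be propagated.
c0_flat = child_consolidated(g0_flat, "c0")
print(sorted(c0_flat))
```

With the nested layout of option 2, the same information would already sit on the child's own consolidated metadata, so no slicing step is needed.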

d-v-b commented 11 hours ago

Do we have to pick just 1 in-memory representation? Over in pydantic-zarr I do something similar to metadata consolidation, and I use a tree representation or a flat dict[str, Array | Group] representation situationally (see the to_flat and from_flat functions).

TomAugspurger commented 10 hours ago

I think we'll be able to translate between the two pretty easily. I think we'll want methods like you shared to convert between the two.
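As a rough illustration of what such a pair of converters could look like, here is a sketch using a small hypothetical `Node` dataclass in place of ArrayMetadata | GroupMetadata. The names `flatten` and `unflatten` are made up for this example; this is neither the pydantic-zarr code nor what this PR implements.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for ArrayMetadata | GroupMetadata."""

    node_type: str
    children: dict[str, "Node"] = field(default_factory=dict)


def flatten(children: dict[str, Node], prefix: str = "") -> dict[str, Node]:
    """Nested -> flat: keys become full paths relative to the starting group."""
    flat: dict[str, Node] = {}
    for name, node in children.items():
        path = f"{prefix}/{name}" if prefix else name
        flat[path] = Node(node.node_type)  # copy without children
        flat.update(flatten(node.children, path))
    return flat


def unflatten(flat: dict[str, Node]) -> dict[str, Node]:
    """Flat -> nested. Assumes every parent group has its own key in `flat`."""
    nested: dict[str, Node] = {}
    # Sorting places each parent key before its descendants ("c0" < "c0/array0").
    for path in sorted(flat):
        *parents, name = path.split("/")
        level = nested
        for segment in parents:
            level = level[segment].children
        level[name] = Node(flat[path].node_type)
    return nested


nested = {
    "c0": Node("group", {"array0": Node("array"), "array1": Node("array")}),
    "c1": Node("group", {"array0": Node("array"), "array1": Node("array")}),
}
flat = flatten(nested)
round_tripped = unflatten(flat)
```

In this sketch the round trip is lossless because the flat keys carry the full structure, which is why either form could serve as the default with the other derived on demand.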

I think the main choice is whether we want a "default" representation that's exposed through GroupMetadata.consolidated_metadata.metadata, and if so, which one. I think we do, and I'm leaning towards nested being the default.