Open TomAugspurger opened 2 weeks ago
This isn't handling nested groups correctly yet. My understanding (and I'll clarify this in the spec) is that given a structure like
/root
    /arr-1
    /arr-2
    /child-group
        /child-arr-1
        /child-arr-2
we'd expect that the metadata from child-group, child-arr-1, and child-arr-2 should end up in the consolidated metadata.
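To make that expectation concrete, here is a minimal sketch that walks a hierarchy and collects metadata for every descendant, not just immediate children. The nested dicts and the `consolidate` helper are hypothetical stand-ins for illustration; they are not zarr-python's API.

```python
# Hypothetical stand-in for a hierarchy: groups are dicts with "members",
# arrays are dicts with "node_type": "array". This is not zarr-python's API.
root = {
    "node_type": "group",
    "members": {
        "arr-1": {"node_type": "array"},
        "arr-2": {"node_type": "array"},
        "child-group": {
            "node_type": "group",
            "members": {
                "child-arr-1": {"node_type": "array"},
                "child-arr-2": {"node_type": "array"},
            },
        },
    },
}


def consolidate(group, prefix=""):
    """Collect metadata for every descendant of `group`, keyed by relative path."""
    out = {}
    for name, node in group["members"].items():
        path = f"{prefix}{name}"
        # Record the node's own metadata, without its members.
        out[path] = {k: v for k, v in node.items() if k != "members"}
        if node["node_type"] == "group":
            # Recurse so that nested groups and their arrays are included too.
            out.update(consolidate(node, prefix=f"{path}/"))
    return out


consolidated = consolidate(root)
for key in sorted(consolidated):
    print(key, consolidated[key]["node_type"])
```

The result contains child-group as well as child-group/child-arr-1 and child-group/child-arr-2, alongside arr-1 and arr-2.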
Quick update here: AsyncGroup now uses consolidated metadata in some operations (primarily getitem; I'm going through some more spots now). This means you can do AsyncGroup.getitem(key) to get a child node without any additional I/O.

As I start to use this a bit, I'm rethinking the in-memory representation of consolidated metadata for a nested hierarchy, specifically the consolidated_metadata.metadata dictionary, which maps keys to ArrayMetadata | GroupMetadata objects. Our options are:

1. The ConsolidatedMetadata.metadata dict holds all child nodes (not just immediate children), so its length equals the total number of descendants. This matches the representation on disk.
2. The ConsolidatedMetadata.metadata dict holds just the immediate child nodes. Metadata for nested groups can still be accessed, but through the .consolidated_metadata on the children. This matches the logical representation.

The motivation for this rethink is that the current implementation has to be careful about using consolidated_metadata.metadata directly. If you want to propagate consolidated metadata, you need to use the Group.getitem method. See Group.members, where this becomes relevant.
Here's an example:
Given a hierarchy like
root/
    g0/
        c0/
            array0
            array1
        c1/
            array0
            array1
    g1/
        c0/
            array0
            array1
        c1/
            array0
            array1
We'll represent the consolidated metadata as a flat mapping of (store) keys to values.
"g0": {"attributes": ..., "node_type": "group"},
"g1": {"attributes": ..., "node_type": "group"},
"g0/c0": {"attributes": ..., "node_type": "group"},
...
"g0/c0/array0": {"shape": [...], "node_type": "array"},
...
"g1/c1/array1": {"shape": [...], "node_type": "array"}
But in memory, what is the Group.metadata.consolidated_metadata for each of these groups?
Should it match the flat structure on disk, where the keys imply the structure?
# g0
GroupMetadata(
    attributes=...,
    consolidated_metadata=ConsolidatedMetadata(
        metadata={
            "c0": GroupMetadata(...),
            "c0/array0": ArrayMetadata(...),
            # ...
            "c1/array1": ArrayMetadata(...),
        },
    ),
)
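For what it's worth, a flat view like this could be derived from the on-disk mapping by filtering on the key prefix and re-keying relative to the group. A minimal sketch, using plain dicts in place of the metadata objects; `flat_view` is a hypothetical helper, not something in zarr-python:

```python
def flat_view(flat, group_key):
    """Return the descendants of `group_key`, re-keyed relative to that group."""
    prefix = group_key + "/"
    return {
        key[len(prefix):]: value
        for key, value in flat.items()
        if key.startswith(prefix)
    }


# Stand-in for the flat, store-keyed mapping from the example hierarchy.
on_disk = {
    "g0": {"node_type": "group"},
    "g1": {"node_type": "group"},
    "g0/c0": {"node_type": "group"},
    "g0/c0/array0": {"node_type": "array"},
    "g0/c0/array1": {"node_type": "array"},
    "g0/c1": {"node_type": "group"},
    "g0/c1/array0": {"node_type": "array"},
    "g0/c1/array1": {"node_type": "array"},
    "g1/c0": {"node_type": "group"},
}

g0_view = flat_view(on_disk, "g0")
print(sorted(g0_view))
```

Here the keys for g0 are c0, c0/array0, c0/array1, c1, c1/array0, and c1/array1: the structure is still implied by the keys, just relative to g0.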
Or should it have a nested / tree-like structure, where just immediate children appear in group.metadata.consolidated_metadata.metadata, and nested members can be accessed through that dictionary?
# g0
GroupMetadata(
    attributes=...,
    consolidated_metadata=ConsolidatedMetadata(
        metadata={
            "c0": GroupMetadata(
                attributes=...,
                consolidated_metadata=ConsolidatedMetadata(
                    metadata={
                        "array0": ArrayMetadata(...),
                        "array1": ArrayMetadata(...),
                    }
                )
            ),
            "c1": GroupMetadata(
                attributes=...,
                consolidated_metadata=ConsolidatedMetadata(
                    metadata={
                        "array0": ArrayMetadata(...),
                        "array1": ArrayMetadata(...),
                    }
                )
            )
        },
    ),
)
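With the nested layout, reaching a deeper member means walking one level at a time. A rough sketch of that lookup, where `GroupNode`, `ArrayNode`, and `lookup` are simplified, hypothetical stand-ins rather than the real GroupMetadata / ArrayMetadata classes:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ArrayNode:
    # Stand-in for array metadata; only the shape is modelled here.
    shape: tuple[int, ...] = ()


@dataclass
class GroupNode:
    # Maps immediate child names to their nodes (the nested representation).
    children: dict[str, ArrayNode | GroupNode] = field(default_factory=dict)


def lookup(group, path):
    """Resolve a relative path like "c0/array1" by descending one level per part."""
    node = group
    for part in path.split("/"):
        if not isinstance(node, GroupNode):
            # Tried to descend into an array.
            raise KeyError(path)
        node = node.children[part]
    return node


g0 = GroupNode(
    {
        "c0": GroupNode({"array0": ArrayNode((10,)), "array1": ArrayNode((20,))}),
        "c1": GroupNode({"array0": ArrayNode((30,)), "array1": ArrayNode((40,))}),
    }
)

print(lookup(g0, "c0/array1"))
```

Only immediate children appear at each level, so len(g0.children) is 2 here, while the flat layout for the same group would have 6 entries.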
Right now I've implemented option 1. I'll try out option 2 today.
I think we'll be able to translate between the two pretty easily. I think we'll want methods like you shared to convert between the two.
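As a rough illustration of how that translation might look (not the methods referred to above), here is a sketch using plain dicts. The function names and the "children" key are assumptions made for this example, and it assumes every parent group has its own entry in the flat mapping:

```python
def flat_to_nested(flat):
    """Turn {"a/b": meta} into a tree where each group holds only its immediate children."""
    root = {}
    # Process shallow keys first so parents exist before their children.
    for key in sorted(flat, key=lambda k: k.count("/")):
        *parents, name = key.split("/")
        children = root
        for parent in parents:
            children = children[parent]["children"]
        node = dict(flat[key])  # copy, so the input mapping is not mutated
        if node.get("node_type") == "group":
            node["children"] = {}
        children[name] = node
    return root


def nested_to_flat(nested, prefix=""):
    """Inverse of flat_to_nested: rebuild the flat, path-keyed mapping."""
    flat = {}
    for name, node in nested.items():
        key = f"{prefix}{name}"
        flat[key] = {k: v for k, v in node.items() if k != "children"}
        if "children" in node:
            flat.update(nested_to_flat(node["children"], prefix=f"{key}/"))
    return flat


flat = {
    "c0": {"node_type": "group"},
    "c0/array0": {"node_type": "array", "shape": [10]},
    "c0/array1": {"node_type": "array", "shape": [20]},
    "c1": {"node_type": "group"},
    "c1/array0": {"node_type": "array", "shape": [30]},
}

nested = flat_to_nested(flat)
assert nested_to_flat(nested) == flat  # the round trip preserves the mapping
print(sorted(nested), sorted(nested["c0"]["children"]))
```

Since the round trip is lossless under those assumptions, either form could be the stored one and the other computed on demand.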
I think the main choice is whether we want a "default" representation that's exposed through GroupMetadata.consolidated_metadata.metadata. I think we do, and I'm leaning towards nested being the default.
Implements the optional Consolidated Metadata feature of zarr-v3: https://github.com/zarr-developers/zarr-specs/pull/309
This defines a new dataclass: ConsolidatedMetadata. It's an optional field on the existing GroupMetadata object.
Opening as a draft until the PR to the zarr-specs repo finishes up.
(Currently rebasing this on top of https://github.com/zarr-developers/zarr-python/pull/2117 and https://github.com/zarr-developers/zarr-python/pull/2118)
TODO: