zarr-developers / zeps

Zarr Enhancement Proposals
https://zarr.dev/zeps
Creative Commons Zero v1.0 Universal

[draft] zarr object models #46

Open d-v-b opened 1 year ago

d-v-b commented 1 year ago

This ZEP defines a representation of a Zarr hierarchy, called a Zarr Object Model (ZOM). The purpose of this ZEP is to standardize an abstract representation of Zarr hierarchies to support declarative Zarr APIs, and to give type systems access to the structure of Zarr hierarchies. A side effect of this ZEP is a standardization of consolidated metadata, which can be defined as a flattening transformation applied to a ZOM representation of a Zarr hierarchy.
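To make the "flattening" idea concrete, here is a minimal Python sketch; it is not the ZEP's definition. It assumes a ZOM node is a plain dict with `node_type`, `attributes`, and (for groups) `members`. The `flatten` helper and the path-keyed output layout are illustrative assumptions.

```python
from typing import Any


def flatten(node: dict[str, Any], path: str = "") -> dict[str, dict[str, Any]]:
    """Map each node's path to its own metadata, dropping the nested `members`."""
    own = {key: value for key, value in node.items() if key != "members"}
    flat = {path or "/": own}
    # `members` may be absent (arrays) or empty (groups without children).
    for name, child in (node.get("members") or {}).items():
        flat.update(flatten(child, f"{path}/{name}"))
    return flat


# Hypothetical hierarchy: a root group holding one array and one empty group.
tree = {
    "node_type": "group",
    "attributes": {},
    "members": {
        "a": {"node_type": "array", "attributes": {}, "shape": [10]},
        "g": {"node_type": "group", "attributes": {}, "members": {}},
    },
}

flat = flatten(tree)
```

Under these assumptions, `flat` has one entry per node, keyed by path, which is roughly the shape a consolidated-metadata document would take.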

I didn't use the template structure for this ZEP because it felt limiting, but if that's a big problem I can bring more of that structure back in.

In terms of what needs to be done:

cc @jhamman

jbms commented 1 year ago

For zarr v3, the user-defined attributes are stored within the main metadata file under the attributes member name. Renaming that to attrs in the representation proposed here may be confusing.

d-v-b commented 1 year ago

@jbms That's a great point; I'm happy to expand attrs to attributes. On the topic of naming, are there any objections to members?

normanrz commented 1 year ago

I am not sure I got the motivation for this. I understand the motivation of consolidated metadata. Maybe this ZEP should morph into that? Then, it should be an extension to the zarr.json for persistence imo. Also, I would focus the ZEP on v3.

ivirshup commented 1 year ago

Agree with @normanrz, a section for motivation and usage would be helpful.

d-v-b commented 1 year ago

I totally agree that the motivation needs to be expanded. At the moment, all we say is

> Such a data structure or interface would facilitate operations like evaluating whether two Zarr hierarchies are identically structured, evaluating whether a given Zarr hierarchy has a specific structure, or creating a Zarr hierarchy with a desired structure.

@normanrz do you agree that these are valid motivations, and worthy of expanding on, or should we provide additional motivations?

> Also, I would focus the ZEP on v3.

The basic ZOM applies to v3 and v2 equally, and I think it's important to emphasize this, because the ZOM representation would be useful for converting from v2 hierarchies to v3 (and from v3 to v4, if v4 ever exists). Would it help if I made this point clearly in the ZEP?

keller-mark commented 1 year ago

To help with the motivation, I think this point

> give type systems access to the structure of zarr hierarchies

could be emphasized further, perhaps with an example of how it could eventually work.
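As a rough sketch of what "type systems having access to the structure" could mean in Python, the following uses `TypedDict`. The class names `ArrayNode` and `GroupNode`, and the choice of fields beyond `attributes` and `members`, are assumptions for illustration rather than anything defined in the ZEP.

```python
from typing import Any, Literal, TypedDict, Union


class ArrayNode(TypedDict):
    node_type: Literal["array"]
    attributes: dict[str, Any]
    shape: list[int]
    data_type: str


class GroupNode(TypedDict):
    node_type: Literal["group"]
    attributes: dict[str, Any]
    # Recursive reference: a group may contain groups or arrays.
    members: dict[str, Union["GroupNode", ArrayNode]]


def array_names(group: GroupNode) -> list[str]:
    """Return the names of the direct members that are arrays."""
    return [
        name
        for name, child in group["members"].items()
        if child["node_type"] == "array"
    ]


# A static type checker can verify this literal against GroupNode.
root: GroupNode = {
    "node_type": "group",
    "attributes": {},
    "members": {
        "image": {
            "node_type": "array",
            "attributes": {},
            "shape": [64, 64],
            "data_type": "uint8",
        },
        "labels": {"node_type": "group", "attributes": {}, "members": {}},
    },
}

names = array_names(root)
```

With definitions like these, a checker such as mypy could flag a hierarchy literal that is missing `members` or uses an unknown `node_type` before any store is touched.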

It seems like this is a kind of "extended consolidated metadata" and could be framed more in that way. Beyond the "base consolidated metadata", my understanding is that this would also include the contents of the .zattrs/.zarray/.zgroup files (and the v3 equivalent) which would be used to implement the support for typing / validation / comparison.

> the ZOM representation would be useful for converting from v2 hierarchies to v3 (and from v3 to v4, if v4 ever exists)

Perhaps it would be simpler to use the metadata file names directly in the flattened representation rather than abstracting / trying to unify across versions. Using $refs in JSON schemas could enable supporting both v2 and v3 despite possibly using different/split metadata file names.

It could be useful to define the name of an optional property that could be used to specify the URL of a JSON schema to use for validation of ZOM-structured stores such as OME-NGFF or AnnData. For example, Vega-Lite uses a $schema property for this.

Perhaps outside the scope of this proposal, but related, it might be useful for Zarr to make a distinction between ZOM-structured stores and non-ZOM-structured stores in the name of the root file/directory (as a convention, not a requirement). For example, as a human, if I look in my file explorer and see something.zarr, it would be nice to know at first glance whether it contains a particular ZOM-structured store or not (without context or looking at the contents). For example, a convention like something.anndata.zarr could be used for this. This would be analogous to using .h5ad rather than .h5 to store AnnData-structured HDF5. I have started doing this in personal work with Zarr stores, but it does not seem to be a convention for Zarr stores in the wild.

d-v-b commented 1 year ago

Just a clarification about the timeline: I'm working on a transatlantic move, so I can't promise a huge investment in this until mid-October. Thanks for your patience!

d-v-b commented 1 year ago

As per feedback, the field used for user metadata has been renamed from attrs to attributes. I also added a motivating example (comparing whether two Zarr hierarchies have the same structure), and I moved the Zarr v2 material behind the Zarr v3 examples (although those need fleshing out).
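A minimal sketch of the structure comparison, assuming ZOM nodes are plain dicts: which fields count as "structure" (here `node_type`, array `shape` and `data_type`, and member names) is an assumption for illustration, as are the helper names `structure`, `same_structure`, and `make`.

```python
from typing import Any


def structure(node: dict[str, Any]) -> dict[str, Any]:
    """Keep only the structural parts of a node, discarding user attributes."""
    if node["node_type"] == "array":
        return {
            "node_type": "array",
            "shape": node["shape"],
            "data_type": node["data_type"],
        }
    return {
        "node_type": "group",
        "members": {
            name: structure(child) for name, child in node["members"].items()
        },
    }


def same_structure(a: dict[str, Any], b: dict[str, Any]) -> bool:
    """True if both hierarchies have the same layout, ignoring attributes."""
    return structure(a) == structure(b)


def make(units: str, shape: list[int]) -> dict[str, Any]:
    """Build a hypothetical one-array hierarchy for the comparison below."""
    return {
        "node_type": "group",
        "attributes": {},
        "members": {
            "data": {
                "node_type": "array",
                "attributes": {"units": units},
                "shape": shape,
                "data_type": "float32",
            }
        },
    }


same = same_structure(make("nm", [4, 4]), make("um", [4, 4]))
different = same_structure(make("nm", [4, 4]), make("nm", [8, 4]))
```

Here `same` is true because only the attributes differ, while `different` is false because the array shapes disagree.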

d-v-b commented 1 year ago

see also: https://github.com/google/tensorstore/blob/master/tensorstore/driver/zarr/schema.yml

MSanKeys963 commented 1 year ago

Hi @d-v-b. I've fixed the RTD build issue in #51.

The current PR can be viewed here: https://zeps--46.org.readthedocs.build/en/46/draft/ZEP0006.html

d-v-b commented 12 months ago

I threw in a JSON schema for ZOM[v3], a ZOM[v3] example hierarchy serialized to JSON, and some Python / TypeScript static typing examples.

I am wondering if it would make more sense to push these hierarchy definitions into the v2 and v3 specs as addenda? This ZEP could exist for posterity, but this would be an easier way to formally associate a particular ZOM with a specific Zarr version. Since the change would be purely additive, it seems safe to do retrospectively (in the case of Zarr v2).

Thoughts?

bogovicj commented 11 months ago

What should the value of attributes be when there are no user-specified attributes? Say, for a zarr v2 group with no .zattrs.

Since the attributes field is required, null and {} both seem reasonable to me. The thing about {} is that it won't be possible to distinguish between non-existent attributes, and attributes containing an empty object. So perhaps null is preferable.

If we're willing not to require the attributes field, then a node without the field could work too.

I haven't formed a preference.
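The distinction can be illustrated with a small sketch, assuming a Zarr v2 store is exposed as a mapping from keys to bytes. The helper name `read_v2_attributes` and the in-memory `store` are hypothetical; the sketch shows the nullable option, not a decision.

```python
import json
from typing import Any, Mapping, Optional


def read_v2_attributes(
    store: Mapping[str, bytes], prefix: str
) -> Optional[dict[str, Any]]:
    """Return None when .zattrs is absent, or its parsed contents otherwise."""
    key = f"{prefix}/.zattrs" if prefix else ".zattrs"
    raw = store.get(key)
    if raw is None:
        return None  # no attributes document exists for this node
    return json.loads(raw)  # may legitimately be an empty object


# Hypothetical store: the root group has no .zattrs, group "a" has an empty one.
store = {
    ".zgroup": b'{"zarr_format": 2}',
    "a/.zgroup": b'{"zarr_format": 2}',
    "a/.zattrs": b"{}",
}

root_attrs = read_v2_attributes(store, "")
a_attrs = read_v2_attributes(store, "a")
```

If `attributes` were always `{}` for missing documents, these two cases would become indistinguishable in the ZOM.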

d-v-b commented 11 months ago

@bogovicj So far I've been thinking about exclusively using {} for "empty attributes", but that's just because I have never been in a situation where "no attributes at all" vs "attributes are there, but empty" was an important distinction. I think this experience stems from attributes historically being "easy" to access, I guess because they are small in size and always have the same name.

By contrast, I think there's an argument for making the .members property of a group nullable to distinguish "there are no members" from "have not checked for members". This accommodates two scenarios: first, some storage backends (HTTP) don't allow discovering subgroups / arrays, so members: null would signify that it wasn't possible to check for members. Second, some hierarchy parsers might want to limit how deep into a Zarr hierarchy they go, so members: null would signify that no check for members was done for a given group. See https://github.com/janelia-cellmap/pydantic-zarr/issues/2 for a discussion of this latter point.
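The depth-limit scenario could look like the following sketch, assuming nodes are plain dicts and that `None` stands in for JSON `null`. The `truncate` helper and its `depth` parameter are hypothetical names used only for illustration.

```python
from typing import Any


def truncate(node: dict[str, Any], depth: int) -> dict[str, Any]:
    """Copy a hierarchy, replacing `members` with None below `depth` levels."""
    if node["node_type"] != "group":
        return dict(node)
    if depth == 0 or node["members"] is None:
        # None means "members were not checked", not "there are no members".
        return {**node, "members": None}
    return {
        **node,
        "members": {
            name: truncate(child, depth - 1)
            for name, child in node["members"].items()
        },
    }


# Hypothetical hierarchy: the root group contains one genuinely empty group.
tree = {
    "node_type": "group",
    "attributes": {},
    "members": {
        "g": {"node_type": "group", "attributes": {}, "members": {}},
    },
}

shallow = truncate(tree, 1)  # "g" was reached, but its members were not inspected
full = truncate(tree, 2)     # "g" was inspected and found to be empty
```

In `shallow`, the members of `g` are `None`; in `full`, they are `{}`, which is the distinction a nullable `members` field would preserve.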

That being said, if the attributes: null vs attributes: {} distinction can do some work for someone, then I'd be fine considering making attributes nullable.