Open TomAugspurger opened 1 month ago
The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open Zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.
Worth noting that the first and third sentences are blatantly contradictory! :upside_down_face:
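To make the quoted rule concrete, here is a minimal sketch of how a reader might apply it. This is not how any Zarr implementation actually does it; the KNOWN_FIELDS set is abbreviated and illustrative, and check_unknown_fields is a hypothetical helper name.

```python
# Abbreviated, illustrative set of core field names (not the full spec list).
KNOWN_FIELDS = {"zarr_format", "node_type", "shape", "attributes"}


def check_unknown_fields(metadata, known=KNOWN_FIELDS):
    """Return the unknown field names that may be ignored.

    Raises ValueError for any unknown field that is not an object
    carrying "must_understand": false.
    """
    ignorable = []
    for name, value in metadata.items():
        if name in known:
            continue
        # Only an object explicitly marked must_understand=false may be skipped.
        if isinstance(value, dict) and value.get("must_understand") is False:
            ignorable.append(name)
        else:
            raise ValueError(f"unknown metadata field: {name!r}")
    return ignorable
```

Read this way, the first sentence ("must not contain any other names") only holds if the must_understand exception in the third sentence is treated as overriding it.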
Having a central place to advertise extensions is great. But to me having to write a ZEP feels like a pretty high bar. STAC extensions are quick and easy to create, and that's led to a lot of experimentation and eventual stabilization in STAC core. And some institutions will have private STAC extensions that they never intend to publish. IMO the extension story should lead with that and offer a zarr-extensions repository / organization for commonly used extensions / shared maintenance.
:100: this sounds like a great idea. I think requiring a ZEP for every extension is a headache and the end result will be that nobody does it. I'd be happy adjusting #312 along the lines of a separate zarr-extensions repo if people generally think that's a good idea.
Thanks for sharing this Tom. It has been great to have you spending time on Zarr recently and bringing a fresh perspective to long-standing discussions. FWIW, I'm on record in multiple conversations as citing STAC as a good example for Zarr to emulate.
I do think that Zarr, as an actual file format (as opposed to a catalog format) may need a somewhat more conservative attitude than STAC regarding backwards compatibility, interoperability etc. It must be very clear to data producers, for example, how to create data that will be widely readable for a long period of time without any need to update the metadata.
However, I agree that our current approach to extensions basically doesn't work and is effectively preventing development. It's not even possible for Zarr Python to reach feature parity with Zarr V2 without multiple non-existent extensions (e.g. strings)--let alone innovating in new directions. So I am fully in favor of what is proposed here.
One concept that may be very useful for Zarr is the notion of extension maturity: https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md#extension-maturity. This would guide data providers on how "risky" it would be to adopt a specific extension. This could be seen as a more nuanced version of "must understand" True / False.
I think this concept would also make obsolete my stalled proposal for Zarr "conventions": #262.
I'm also strongly in favor of adopting JSON schema for metadata conformance validation.
What do we need to do to move this forward? I suppose we need a ZEP to propose an update to the spec to redefine how extensions work. I'd be happy to lead that effort if it would be helpful.
I suppose we need a ZEP to propose an update to the spec to redefine how extensions work
Yeah, that's the sticking point. We need some way to break the current logjam.
Thinking a bit more, I guess the addition of a zarr_extensions array is only necessary if we also intend to use jsonschema for validation for both the core metadata and extensions. I think the main thing to figure out is how the different fields that make up the final object are versioned (and potentially validated against a schema).
Take consolidated metadata as an example: regardless of whether zarr_extensions is used, you'll end up with a similar metadata document for a Group. For example, with zarr_extensions:
{
"zarr_format": 3,
// ...
"consolidated_metadata": {
"must_understand": false,
"name": ...,
...
},
"zarr_extensions": ["https://github.com/zarr-extensions/consolidated-metadata/v1.0.0/schema.json"]
}
Or without zarr_extensions, with the version of the consolidated metadata extension inlined:
{
"zarr_format": 3,
// ...
"consolidated_metadata": {
"must_understand": false,
"version": "1.0.0",
...
}
}
The advantage of zarr_extensions is a uniform way for tools to validate the contents of core and extension metadata. Whether it's worth trying to introduce something like that at this stage of Zarr v3, I'm not sure.
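As a rough sketch of what "uniform" could mean for tooling: a tool walks the URLs in zarr_extensions and dispatches to whatever it knows about each one. Everything here is an assumption for illustration. REGISTRY, check_consolidated, validate_extensions and uses_extension are hypothetical names, and the checker is a stand-in for loading the JSON schema at each URL and running a real JSON Schema validator.

```python
CONSOLIDATED_V1 = (
    "https://github.com/zarr-extensions/consolidated-metadata/v1.0.0/schema.json"
)


def check_consolidated(doc):
    """Hypothetical stand-in for validating against the extension's schema."""
    return isinstance(doc.get("consolidated_metadata"), dict)


# Hypothetical local registry: schema URL -> checker callable.
REGISTRY = {CONSOLIDATED_V1: check_consolidated}


def validate_extensions(doc, registry=REGISTRY):
    """Map each declared extension URL to True/False, or None if unrecognized."""
    results = {}
    for url in doc.get("zarr_extensions", []):
        checker = registry.get(url)
        results[url] = checker(doc) if checker else None
    return results


def uses_extension(doc, url):
    """Feature detection: is this extension declared on the document?"""
    return url in doc.get("zarr_extensions", [])
```

With the inlined-version layout, the same tool would instead need per-extension knowledge of where each version field lives.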
Thanks for sharing this, @TomAugspurger. I went through STAC's extension README, and I like how they've decoupled the extensions from the core. The ability to work on extensions without the involvement of the core specification authors or, in our case, the ZSC/ZIC could prove useful.
Going back to conversations I had with @alimanfoo in 2022, I think Alistair envisioned something similar for extensions: the community working on their extensions without restriction.
I also like how the STAC extensions webpage neatly lists the extensions. We could work on a similar repository/organisation for authors who would like to host their extensions under zarr-developers while also having the option to host their extensions outside of zarr-developers GitHub.
We worked on the ZEP process when the Zarr community needed a mechanism to solicit feedback and move forward in a structured manner. It worked well and helped us to finalise two proposals (ZEP1 and ZEP2), but if it's proving to be a roadblock for further development, then we should make changes to it.
I'm curious to hear @joshmoore and @jakirkham's thoughts.
My thoughts on moving this forward: I have a PR, https://github.com/zarr-developers/zeps/pull/59, which will revise the existing ZEP process. Among other changes, my PR removes the requirement of a ZEP proposal for extensions. Please check and review.
I'm also happy to write or collaborate with @rabernat on a ZEP proposal outlining the new process for extensions.
regardless of whether zarr_extensions is used, you'll end up with a similar metadata document for a Group.
I'm not particularly (at all) familiar with the design decisions of STAC so a question: what are the trade-offs of having the new JSON object (here: consolidated_metadata) at the top-level and not within the extensions object itself?
Assuming embedding it under something like "extensions" is viable, I wonder if we could resurrect that field (which was previously in v3) by making use of must_understand recursively. The field "extensions" would itself make use of the extension (no quotes) mechanism. Further extensions (if that's too confusing, then another name like plugins, etc.) could be embedded in that object. Each of them in turn has a "must_understand" field, and if ANY of those is true, then the top-level one is true as well.
Tom's example from above might look like this:
{
"zarr_format": 3,
"extensions": {
"must_understand": true,
"https://github.com/zarr-extensions/consolidated-metadata/v1.0.0/schema.json": {
"must_understand": false,
"name": "..."
},
"https://github.com/zarr-extensions/something-else/schema.json": {
"must_understand": true
}
}
}
(If multiple objects of the same extension are needed, then this could be a list of dicts rather than a dict)
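A minimal sketch of the "ANY nested must_understand makes the top level true" rule, assuming the dict layout from the example above. The function names are hypothetical, and treating a missing must_understand as true is my assumption, not something settled in this thread.

```python
def effective_must_understand(extensions):
    """True if any nested extension object requires understanding."""
    return any(
        # Assumption: a nested object without the key counts as must-understand.
        ext.get("must_understand", True)
        for ext in extensions.values()
        # Skip the top-level "must_understand" flag itself (a bool, not a dict).
        if isinstance(ext, dict)
    )


def flag_is_consistent(metadata):
    """Check that the declared top-level flag matches the derived one."""
    extensions = metadata.get("extensions", {})
    derived = effective_must_understand(extensions)
    declared = extensions.get("must_understand", derived)
    return declared == derived
```

A reader that doesn't recognize any of the nested URLs could then decide from the single top-level flag whether it is safe to skip the whole "extensions" object.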
The benefits would be:
what are the trade-offs of having the new JSON object (here: consolidated_metadata) at the top-level and not within the extensions object itself?
In STAC, stac_extensions is an array (of URLs to jsonschema definitions), not an object.
Where in the document the fields defined by an extension go (top level or under extensions) doesn't matter from the point of view of json schema: you just need to ensure that the definition matches the usage.
Requiring that extensions place their additional fields under extensions only helps with namespace collisions between an extension's field and the core spec (including future versions of the spec). It doesn't help with collisions between extensions, at least not at the json schema level. You could require by convention that all extensions use a namespace, but that's just a convention.
I agree that a separate extensions object doesn't necessarily help --- I argued against that previously because I don't see a strong benefit in distinguishing between what was in the first version of the core spec and what is added in subsequent versions.
I do think it is valuable to avoid name collisions --- but I think we can accomplish that by using suitable unambiguous names in the top level equally as well as using such names within a nested extensions object.
If the goal is to define and implement extensions without any central review, then we should use a naming scheme for any top-level metadata fields added by extensions that avoids the possibility of collisions without relying on central review. The simplest solution is to use a domain name / URL prefix under the control of the extension author. For example, you could use:
{
"zarr_format": 3,
"https://github.com/TomAugspurger/consolidated-metadata": {
"must_understand": false,
...
}
}
or
{
"zarr_format": 3,
"github.com/TomAugspurger/consolidated-metadata": {
"must_understand": false,
...
}
}
Using https://github.com/zarr-extensions/... would imply at least the approval of whoever is managing that github organization. Maybe the barrier for that could be extremely low, e.g. first come, first served. But it is probably simpler to avoid even that level of central review for extensions intended not to be centrally reviewed.
FWIW, name collisions haven't been a problem in STAC. The convention to include a prefix in your newly defined keys (proj:shape, for the shape field defined by the projection extension) is widely followed.
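For anyone unfamiliar with the prefix convention, a small sketch of how a tool might group keys by their prefix. group_by_prefix is a hypothetical helper, not part of any STAC library, and it only makes sense for prefix:name style keys (a URL-style key would be split at its scheme).

```python
from collections import defaultdict


def group_by_prefix(fields):
    """Group "prefix:name" keys by prefix; unprefixed keys go under ""."""
    grouped = defaultdict(dict)
    for key, value in fields.items():
        prefix, sep, name = key.partition(":")
        if sep:
            grouped[prefix][name] = value
        else:
            # No colon in the key: treat it as an unprefixed (core) field.
            grouped[""][key] = value
    return dict(grouped)
```

Since the prefix is only a convention, nothing stops two extensions from picking the same one; in practice that just hasn't happened.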
As mentioned in https://github.com/zarr-developers/zarr-specs/pull/309, I ran across some challenges with how the Zarr v3 spec does extensions. I think that we might be able to learn some lessons from how STAC handles extensions.
tl/dr: I think Zarr would benefit from a better extension story that removed the need to have any involvement from anyone other than the extension author and any tooling wishing to use that extension. JSON schema + a zarr_extensions field on Group and Array would get us most of the way there. The current requirements of must_understand: false and name: URL in the extension objects feel like a weaker version of this.
How STAC does extensibility
STAC is a JSON-based format for cataloging geospatial assets. https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md#overview lays out how STAC allows itself to be extended, but there are a few key components:
STAC objects (Collection, Item, etc.) include a stac_version field.
STAC objects (Collection, Item) include a stac_extensions array with a list of URLs to JSON Schema definitions that can be used for validation.
Together, these are sufficient to allow extensions to extend basically any part of STAC without any involvement from the core of STAC. Tooling built around STAC coordinates through stac_extensions. For example, a validator can load the JSON schema definitions for the core metadata (using the stac_version field) and all extensions (using the URLs in stac_extensions) and validate a document against those schemas. Libraries wishing to use some feature can check for the presence of a specific stac_extension URL.
You also get the ability to version things separately. The core metadata can be at 1.0.0, while the proj extension is at 2.0.0 without issue.
How that might apply to Zarr
Two immediate reactions to the thought of applying that to Zarr:
Group and Array definitions (and possibly other fields within; STAC does this as well for, e.g., Assets, which live inside an Item).
How does this relate to what zarr has today?
I'm not sure. I was confused about some things reading https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points. The spec seems overly prescriptive about putting keys in the top level of the metadata:
STAC / JSON schema takes the opposite approach to their metadata documents. Any extra fields are allowed and ignored by default, but schemas (core or extension) can define required fields.
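A toy illustration of that permissive model, covering only JSON Schema's "required" keyword (extra properties are allowed by default in JSON Schema). Both schemas are made up for the example and are not the real Zarr or STAC schemas; missing_required and validate are hypothetical helpers.

```python
# Illustrative schemas only: not the real core or extension schemas.
CORE_SCHEMA = {"required": ["zarr_format", "node_type"]}
EXT_SCHEMA = {"required": ["consolidated_metadata"]}


def missing_required(doc, schema):
    """Names listed under the schema's "required" key that the document lacks."""
    return [name for name in schema.get("required", []) if name not in doc]


def validate(doc, schemas):
    """Collect missing required fields across all schemas; extras are ignored."""
    errors = []
    for schema in schemas:
        errors.extend(missing_required(doc, schema))
    return errors
```

Under this model an unknown top-level key is never an error by itself; a document only fails when a schema it claims to follow requires something it doesn't have.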
Having a central place to advertise extensions is great. But to me having to write a ZEP feels like a pretty high bar. STAC extensions are quick and easy to create, and that's led to a lot of experimentation and eventual stabilization in STAC core. And some institutions will have private STAC extensions that they never intend to publish. IMO the extension story should lead with that and offer a zarr-extensions repository / organization for commonly used extensions / shared maintenance.