Closed rabernat closed 1 year ago
cc @briannapagan (in case this is of interest to you)
This is directly relevant to the forthcoming geozarr work, so that's why I wanted to push it out in draft form
Awesome, thanks a lot! As mentioned in https://github.com/zarr-developers/zarr-specs/issues/169, this is also relevant for the issues https://github.com/zarr-developers/zarr-specs/issues/139 and https://github.com/zarr-developers/zarr-specs/pull/144.
I like that this is becoming a separate ZEP, it never occurred to me to separate this from ZEP 1.
The current proposal just allows a group/array to have a single convention. Perhaps for some use cases that makes sense. But the example in the proposal is "units", which could easily interoperate with numerous other possible conventions. Instead the naming of attributes could be done to allow multiple conventions to be used at once, for example:
{"zarr_convention": {"units-v1": {"units": "m^2"}, ...}}
or
{"units-v1": {"units": "m^2"}, ...}
or
{"units-v1": "m^2", ...}
Very good point Jeremy.
TBH, I'm on the fence about whether the convention even needs to be explicitly identified. Like, maybe it could be enough to say
> Arrays with the `units` attribute set are assumed to be using this convention
Of the proposals above, I definitely favor the first one because it doesn't touch the name of the actual attribute. I could also imagine
{
"zarr_convention_units-v1": True,
"zarr_convention_foobar-v2": True
}
Strongly support this concept.
Question: you mention the currently uncodified (by zarr) conventions in the wild. Is there something to be done about conventions that arise organically and are not described in the zarr docs?
I do believe it's useful that, once a convention is listed and given a name, it is explicitly mentioned in the attributes of the data that uses it.
> Is there something to be done about conventions that arise organically and are not described in the zarr docs?
I think that it's natural for conventions to arise organically. Once there is sufficient alignment and adoption, they can be proposed as conventions.
For conventions created in the wild, or borrowed from other formats (e.g. CF Conventions), it could be hard to require the presence of the `zarr_conventions` attribute. (I'm thinking about e.g. converting from NetCDF to Zarr.) There needs to be a way to simply document existing conventions, without prescribing new attributes to be present.
> it could be hard to require the presence
Recommended, but not required?
I'd propose to remove the `zarr_convention` key, and simply have a document which defines metadata conventions (+ the process to add them as laid out in this ZEP :heart:). The user attributes could follow the metadata conventions (or not^^), e.g.:
{
"units": "m^2",
"writer": "zarr-python",
"origin": [12300, 45600],
"convention-key": "valuenotfollowingtheconvention",
"some-other-key": "foo",
…
}
I see the conventions as a good place to discuss & establish standards between implementations, not as a strict mechanism that must be enforced. Also, IMO it can't be enforced: having the `zarr_convention` key doesn't prevent misuse of such conventions (e.g. other interpretations, using inches when only metric units are defined, bugs, …).
About specifying the convention: I think it's really quite useful to know which specific convention is being used and to version them.
For example, if two groups want to use the `units` field, how should I know how to interpret that? What if you want to update the convention?
In `anndata`, we specify conventions for our data with an `encoding-type` and `encoding-version` field in `.attrs` (both in hdf5 and zarr). We used to not, and it kinda sucked to figure out people's IO errors or make any updates.
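As a rough illustration of why an explicit type and version help, here is a minimal sketch of a reader that dispatches on those two attributes. Only the attribute names `encoding-type` and `encoding-version` come from the comment above; the registry, the reader functions, and the `example-table` values are hypothetical and are not anndata's actual implementation.

```python
# Hypothetical sketch: choose a reader from the declared encoding.
# Only the attribute names "encoding-type" and "encoding-version" come
# from the comment above; everything else is made up for illustration.

READERS = {}  # maps (encoding_type, encoding_version) -> reader function


def register(encoding_type, encoding_version):
    """Decorator that records a reader for one (type, version) pair."""
    def decorator(func):
        READERS[(encoding_type, encoding_version)] = func
        return func
    return decorator


@register("example-table", "0.1.0")
def read_table_v010(attrs):
    return "table, read with the 0.1.0 rules"


@register("example-table", "0.2.0")
def read_table_v020(attrs):
    return "table, read with the 0.2.0 rules"


def read_element(attrs):
    """Look up the reader for the encoding declared in attrs."""
    key = (attrs.get("encoding-type"), attrs.get("encoding-version"))
    if key not in READERS:
        raise ValueError(f"no reader registered for encoding {key!r}")
    return READERS[key](attrs)
```

With something like this, data that declares no encoding, or an unknown version, fails with a clear message instead of being silently misread, and a convention can be updated by registering a reader for the new version.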
I'd also agree with the point made above that only allowing a single convention may be limiting. Maybe instead conventions could be stored like:
z.attrs["conventions"] = {"convention": "version", ...}
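A minimal sketch of how a reader might consume such a mapping, assuming the attributes have already been loaded into a plain Python dict. The helper name `uses_convention`, the convention names, and the version strings are all hypothetical.

```python
# Hypothetical sketch: check a {"convention": "version"} mapping stored
# under the "conventions" attribute, as suggested above.

def uses_convention(attrs, name, supported_versions):
    """Return True if attrs declares `name` at a version we support."""
    declared = attrs.get("conventions", {})
    return declared.get(name) in supported_versions


# Example attributes; names and versions are made up for illustration.
attrs = {
    "conventions": {"units": "1", "foobar": "2"},
    "units": "m^2",
}

if uses_convention(attrs, "units", supported_versions={"1"}):
    print("interpreting units as:", attrs["units"])
```

Because several conventions can be declared side by side, a tool can pick out the ones it understands and ignore the rest.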
I tend to agree with Isaac. Maybe it helps to think through (and/or specify) what processing looks like. I've been debating whether to bring up my favorite soapbox (JSON-LD) as a way of specifying such metadata that already exists, has processing rules, etc. e.g.:
| Context                                          | Field    | Interpretation                |
|--------------------------------------------------|----------|-------------------------------|
| N/A                                              | units    | this-file#units               |
| {@context: {units: example.com/}}                | units    | example.com/units             |
| {@context: http://some-file.jsonld}              | units    | whatever-some-file-says/units |
| {@context: {ns: https://some-other-file.jsonld}} | ns:units | some-other-namespace/units    |
Obviously, there are a lot of different edge cases there, but I do like the idea of not building our own.
From a practical point of view, it may be simply impossible to impose a hard requirement for convention identifiers. A big use case for us is transcoding NetCDF / HDF5 data that already exists into Zarr. This data was written 10+ years ago and the metadata is what it is.
I think the way forward is for me to put together the template referred to above. This should have a section on "How to identify this convention".
> From a practical point of view, it may be simply impossible to impose a hard requirement for convention identifiers. A big use case for us is transcoding NetCDF / HDF5 data that already exists into Zarr. This data was written 10+ years ago and the metadata is what it is.
Is it not an option to convert the metadata at the same time as the data conversion happens? It seems that during this data conversion is when you would have the most context for decoding any metadata.
> From a practical point of view, it may be simply impossible to impose a hard requirement for convention identifiers.
Totally fair. Without any sort of identifier or format requirement this seems to me like a listing of conventions used with zarr.
If this is the direction, I wonder if even the "consensus" requirement for new conventions could be softened to "noteworthy" or removed. If no namespace is being reserved, then I don't think the zarr team needs to take on the responsibility of figuring out if a field has consensus on a file format. Especially since it's so easy to break consensus. For instance, of the given examples, aren't Xarray Zarr, GDAL, and GeoZarr competing?
I think there's definitely value in collecting lists of conventions building on top of zarr, and making that visible. However, I wonder if doing more than that (like establishing credentials based on consensus/use) is something better left to standards repositories like fairsharing.org?
Really good points @ivirshup. Yes, we definitely don't want to give ourselves (zarr developers) the job of mediating standards in different scientific domains. The only intention here is to provide a means to document an existing convention, not do any sort of evaluation or approval. I'll modify to reflect that.
On my end, I'm most interested in attributes that are relevant to general purpose tools like Neuroglancer, e.g. things like units, different types of labels. If there is no way to unambiguously identify the metadata then it is much more complicated to make use of it.
I think namespacing would be a good idea. In the OME-Zarr context, we are thinking about wrapping all the OME-specific metadata under the `ome` key in the attributes (https://github.com/ome/ngff/issues/182). I think that would be useful to allow metadata from different metadata conventions to exist in the same group/array.
Is there any existing work on that? I don't see item ZEP4 in https://github.com/zarr-developers/zeps/tree/main/draft
I didn't realise it was a pull request, sorry :)
My main concern is the lack of a concept for grouping conventions for a specific purpose, profile, or topic. This would provide greater flexibility for client applications to selectively support conventions, while still enabling interoperability with other Zarr implementations.
In some domains (e.g. Earth Observation), there may be hundreds of conventions, and client applications may only address subsets of those conventions. Similar to how OGC APIs Implementation Standards are written, I suggest introducing "requirement-classes" (or "convention-classes") as a means of decoupling a set of domain conventions into groups that can be advertised in Zarr as supported or not. For example:
conventions-classes: ["eo-core", "eo-multispectral", "eo-multiscale", "eo-quicklook", "eo-symbology"]
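To make the idea concrete, a client could compare the advertised classes against the ones it implements. This is only a sketch: the class names are taken from the example above, while the `SUPPORTED_CLASSES` set and the `split_supported` helper are hypothetical.

```python
# Hypothetical sketch: a client that supports only some convention classes.

SUPPORTED_CLASSES = {"eo-core", "eo-multiscale"}  # assumed client capabilities


def split_supported(declared, supported):
    """Split declared classes into (usable, ignored), preserving order."""
    declared = list(declared)
    usable = [c for c in declared if c in supported]
    ignored = [c for c in declared if c not in supported]
    return usable, ignored


declared = ["eo-core", "eo-multispectral", "eo-multiscale",
            "eo-quicklook", "eo-symbology"]
usable, ignored = split_supported(declared, SUPPORTED_CLASSES)
print("usable:", usable)
print("ignored:", ignored)
```

The client can then apply only the conventions in the classes it recognises and still read the rest of the data as plain Zarr.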
Furthermore, I do not see a clear indication in the process of how the conventions become listed on https://zarr-specs.readthedocs.io/ under a specific domain section.
Regards,
Thanks for everyone's patience, and apologies for being slow to finish up this draft.
I plan to prioritize this over the next few weeks. I'll respond to the comments above and push a new draft that incorporates the feedback.
I have finally updated this ZEP. Thanks everyone for the patience. In my update, I incorporated the following changes:
`zarr_conventions` should be an array of strings, allowing multiple conventions to be composed together.
My goal here is to include the very good ideas that have been proposed in the discussion above as recommended best practices while retaining the ability to support legacy conventions and practices already in use in the community.
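A minimal sketch of what reading such an attribute could look like, assuming the attributes are available as a plain Python dict. The helper `declared_conventions` and the identifiers `units-v1` and `foobar-v2` (borrowed from earlier comments in this thread) are illustrative only and are not defined by the ZEP.

```python
# Hypothetical sketch: read "zarr_conventions" as an array of strings.
# A missing attribute is treated as "no conventions declared", so that
# legacy data without identifiers is still accepted.

def declared_conventions(attrs):
    """Return the declared convention identifiers as a list of strings."""
    value = attrs.get("zarr_conventions", [])
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise TypeError("zarr_conventions must be a list of strings")
    return value


new_attrs = {"zarr_conventions": ["units-v1", "foobar-v2"], "units": "m^2"}
legacy_attrs = {"units": "m^2"}  # e.g. transcoded from NetCDF, no identifier

print(declared_conventions(new_attrs))
print(declared_conventions(legacy_attrs))
```

Treating the attribute as optional matches the "recommended, but not required" position taken in the discussion.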
Thanks for completing it, @rabernat, and everyone for reviewing this. Merging this as discussed in the ZEP Meeting today.
ZEP0004 is live here: https://zarr.dev/zeps/draft/ZEP0004.html.
@MSanKeys963, did a discussion get opened for this?
Hi Isaac! I believe it's on me to open a PR where the discussion will happen. I will try to do that today.
👍
Is it not meant to be a Discussion (as opposed to a PR/Issue)? I think this kind of discussion heavily benefits from threading.
Maybe this changed: https://github.com/zarr-developers/zeps/pull/27 ?
TBH I think the ZEP process still has a lot of details to be ironed out. I agree a discussion makes sense.
This idea is excellent; I would love to help push this forward. What is the best way to collaborate?
@tasansal - the discussion is continuing in https://github.com/zarr-developers/zarr-specs/pull/262
The best way to collaborate would be to share your use cases there.
This ZEP describes how communities can standardize conventions around metadata and layout of Zarr data using user-defined attributes in order to meet domain-specific application needs without changes to the core data model and specification, and without specification extensions.