zarr-developers / zeps

Zarr Enhancement Proposals
https://zarr.dev/zeps
Creative Commons Zero v1.0 Universal

ZEP 4: Metadata Conventions #28

Closed rabernat closed 1 year ago

rabernat commented 1 year ago

This ZEP describes how communities can standardize conventions around metadata and layout of Zarr data using user-defined attributes in order to meet domain-specific application needs without changes to the core data model and specification, and without specification extensions.

jakirkham commented 1 year ago

cc @briannapagan (in case this is of interest to you)

rabernat commented 1 year ago

This is directly relevant to the forthcoming geozarr work, so that's why I wanted to push it out in draft form

jstriebel commented 1 year ago

Awesome, thanks a lot! As mentioned in https://github.com/zarr-developers/zarr-specs/issues/169, this is also relevant for the issues https://github.com/zarr-developers/zarr-specs/issues/139 and https://github.com/zarr-developers/zarr-specs/pull/144.

I like that this is becoming a separate ZEP, it never occurred to me to separate this from ZEP 1.

jbms commented 1 year ago

The current proposal just allows a group/array to have a single convention. Perhaps for some use cases that makes sense. But the example in the proposal is "units", which could easily interoperate with numerous other possible conventions. Instead the naming of attributes could be done to allow multiple conventions to be used at once, for example:

{"zarr_convention": {"units-v1": {"units": "m^2"}, ...}}

or

{"units-v1": {"units": "m^2"}, ...}

or

{"units-v1": "m^2", ...}
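
A minimal sketch of how a reader might consume the first layout above, where several conventions sit side by side under one wrapper key. The helper name `conventions_in` and the `foobar-v2` payload are hypothetical, and a plain dict stands in for the array's attributes.

```python
# Hypothetical helper: collect every convention declared under the
# "zarr_convention" wrapper key (the first layout proposed above).
def conventions_in(attrs: dict) -> dict:
    """Return {convention_id: payload}; empty if no wrapper key is present."""
    found = attrs.get("zarr_convention", {})
    if not isinstance(found, dict):
        raise TypeError("zarr_convention must map convention ids to payloads")
    return dict(found)


# Stand-in for the user attributes of an array or group.
attrs = {
    "zarr_convention": {
        "units-v1": {"units": "m^2"},
        "foobar-v2": {"answer": 42},  # made-up second convention
    },
    "writer": "zarr-python",  # unrelated attribute, ignored by the helper
}

active = conventions_in(attrs)
units = active.get("units-v1", {}).get("units")
print(sorted(active), units)
```

Because each convention owns its own sub-dictionary, adding a second convention does not change how the first one is read.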

rabernat commented 1 year ago

Very good point Jeremy.

TBH, I'm on the fence about whether the convention even needs to be explicitly identified. Like, maybe it could be enough to say

> Arrays with the units attribute set are assumed to be using this convention

Of the proposals above, I definitely favor the first one because it doesn't touch the name of the actual attribute. I could also imagine

{
    "zarr_convention_units-v1": True,
    "zarr_convention_foobar-v2": True
}

martindurant commented 1 year ago

Strongly support this concept.

Question: you mention the currently uncodified (by zarr) conventions in the wild. Is there something to be done about conventions that arise organically and are not described in the zarr docs?

martindurant commented 1 year ago

I do believe it's useful that, once a convention is listed and given a name, it is explicitly mentioned in the attributes of the data that uses it.

rabernat commented 1 year ago

> Is there something to be done about conventions that arise organically and are not described in the zarr docs?

I think that it's natural for conventions to arise organically. Once there is sufficient alignment and adoption, they can be proposed as conventions.

For conventions created in the wild, or borrowed from other formats (e.g. CF Conventions), it could be hard to require the presence of the zarr_conventions attribute. (I'm thinking about e.g. converting from NetCDF to Zarr.) There needs to be a way to simply document existing conventions, without prescribing new attributes to be present.

martindurant commented 1 year ago

> it could be hard to require the presence

Recommended, but not required?

jstriebel commented 1 year ago

I'd propose to remove the zarr_convention key, and simply have a document which defines metadata conventions (+ the process to add them as laid out in this ZEP :heart:). The user attributes could follow the metadata conventions (or not^^), e.g.:

{
  "units": "m^2",
  "writer": "zarr-python",
  "origin": [12300, 45600],
  "convention-key": "valuenotfollowingtheconvention",
  "some-other-key": "foo",
  …
}

I see the conventions as a good place to discuss & establish standards between implementations, not as a strict mechanism that must be enforced. Also, IMO it can't be enforced: having the zarr_convention key doesn't prevent misuse of such conventions (e.g. other interpretations, using inches when only metric units are defined, bugs, …).

ivirshup commented 1 year ago

About specifying the convention: I think it's really quite useful to know which specific convention is being used and to version them.

For example, if two groups want to use the units field, how should I know how to interpret that? What if you want to update the convention?

In anndata, we specify conventions for our data with an encoding-type and encoding-version field in .attrs (both in hdf5 and zarr). We used to not, and it kinda sucked to figure out people's IO errors or make any updates.

I'd also agree with the point made above that only allowing a single convention may be limiting. Maybe instead conventions could be stored like:

z.attrs["conventions"] = {"convention": "version", ...}
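
A rough sketch of what a reader could do with such a mapping: compare each declared convention and version against what it understands, and set aside the rest. The `SUPPORTED` table, the `classify` function, and the `foobar` convention are assumptions for illustration, and a plain dict stands in for `z.attrs`.

```python
# Hypothetical capabilities of one reader: convention name -> known versions.
SUPPORTED = {"units": {"1", "2"}, "foobar": {"2"}}


def classify(declared: dict) -> tuple:
    """Split declared conventions into (usable, unknown) name lists."""
    usable, unknown = [], []
    for name, version in declared.items():
        if version in SUPPORTED.get(name, set()):
            usable.append(name)
        else:
            unknown.append(name)  # unknown convention or unsupported version
    return usable, unknown


# Stand-in for z.attrs, following the {"convention": "version"} layout.
attrs = {"conventions": {"units": "1", "foobar": "3"}}

usable, unknown = classify(attrs.get("conventions", {}))
print("usable:", usable, "unknown:", unknown)
```

Data with no `conventions` key at all yields two empty lists, so legacy files still open and are simply treated as declaring nothing.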

joshmoore commented 1 year ago

I tend to agree with Isaac. Maybe it helps to think through (and/or specify) what processing looks like. I've been debating whether to bring up my favorite soapbox (JSON-LD) as a way of specifying such metadata that already exists, has processing rules, etc. e.g.:

| Context                                          | Field    | Interpretation                |
|--------------------------------------------------|----------|-------------------------------|
| N/A                                              | units    | this-file#units               |
| {@context: {units: example.com/}}                | units    | example.com/units             |
| {@context: http://some-file.jsonld}              | units    | whatever-some-file-says/units |
| {@context: {ns: https://some-other-file.jsonld}} | ns:units | some-other-namespace/units    |

Obviously, there are a lot of different edge cases there, but I do like the idea of not building our own.

rabernat commented 1 year ago

From a practical point of view, it may be simply impossible to impose a hard requirement for convention identifiers. A big use case for us is transcoding NetCDF / HDF5 data that already exists into Zarr. This data was written 10+ years ago and the metadata is what it is.

I think the way forward is for me to put together the template referred to above. This should have a section on "How to identify this convention".

jbms commented 1 year ago

> From a practical point of view, it may be simply impossible to impose a hard requirement for convention identifiers. A big use case for us is transcoding NetCDF / HDF5 data that already exists into Zarr. This data was written 10+ years ago and the metadata is what it is.

Is it not an option to convert the metadata at the same time as the data conversion happens? It seems that during this data conversion is when you would have the most context for decoding any metadata.

ivirshup commented 1 year ago

> From a practical point of view, it may be simply impossible to impose a hard requirement for convention identifiers.

Totally fair. Without any sort of identifier or format requirement this seems to me like a listing of conventions used with zarr.

If this is the direction, I wonder if even the "consensus" requirement for new conventions could be softened to "noteworthy" or removed. If no namespace is being reserved, then I don't think the zarr team needs to take on the responsibility of figuring out if a field has consensus on a file format. Especially since it's so easy to break consensus. For instance, of the given examples, aren't Xarray Zarr, GDAL, and GeoZarr competing?

I think there's definitely value in collecting lists of conventions building on top of zarr, and making that visible. However, I wonder if doing more than that (like establishing credentials based on consensus/ use) is something better left to standards repositories like fairsharing.org?

rabernat commented 1 year ago

Really good points @ivirshup. Yes, we definitely don't want to give ourselves (zarr developers) the job of mediating standards in different scientific domains. The only intention here is to provide a means to document an existing convention, not do any sort of evaluation or approval. I'll modify to reflect that.

jbms commented 1 year ago

On my end, I'm most interested in attributes that are relevant to general purpose tools like Neuroglancer, e.g. things like units, different types of labels. If there is no way to unambiguously identify the metadata then it is much more complicated to make use of it.

normanrz commented 1 year ago

I think namespacing would be a good idea. In the OME-Zarr context, we are thinking about wrapping all the OME-specific metadata under the ome key in the attributes https://github.com/ome/ngff/issues/182. I think that would be useful to allow metadata from different metadata conventions to exist in the same group/array.
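
A small sketch of why a per-convention key helps, assuming two conventions that both happen to define a `version` field. The values are placeholders rather than the actual OME-Zarr schema, and the `other` convention is made up.

```python
# Placeholder metadata from two conventions that share a field name.
ome_meta = {"version": "0.5"}
other_meta = {"version": "2"}  # hypothetical second convention

# Flat layout: both write into the same attribute namespace and collide.
flat = {**ome_meta, **other_meta}

# Namespaced layout: each convention keeps its own sub-dictionary.
namespaced = {"ome": ome_meta, "other": other_meta}

print(flat["version"])               # only the last writer's value survives
print(namespaced["ome"]["version"])  # each convention's value is preserved
```

With the flat layout, a reader cannot tell which convention the surviving `version` belongs to; with the namespaced layout, both values remain addressable.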

christophenoel commented 1 year ago

Is there any existing work on that? I don't see item ZEP4 in https://github.com/zarr-developers/zeps/tree/main/draft

I didn't realise it was a pull request, sorry :)

christophenoel commented 1 year ago

My main concern is the lack of a concept for grouping conventions for a specific purpose, profile, or topic. This would provide greater flexibility for client applications to selectively support conventions, while still enabling interoperability with other Zarr implementations.

In some domains (e.g. Earth Observation), there may be hundreds of conventions, and client applications may only address subsets of those conventions. Similar to how OGC APIs Implementation Standards are written, I suggest introducing "requirement-classes" (or "convention-classes") as a means of decoupling a set of domain conventions into groups that can be advertised in Zarr as supported or not. For example:

conventions-classes: ["eo-core", "eo-multispectral","eo-multiscale", "eo-quicklook", "eo-symbology"]
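
As a sketch of how a client might use such a list, assuming it is stored as an attribute under the key shown above: the client intersects the advertised classes with the ones it implements and ignores the rest. `CLIENT_SUPPORTS` and `split_classes` are hypothetical names.

```python
# Hypothetical set of convention classes this client implements.
CLIENT_SUPPORTS = {"eo-core", "eo-multiscale"}


def split_classes(attrs: dict) -> tuple:
    """Return (supported, ignored) classes, preserving the declared order."""
    declared = attrs.get("conventions-classes", [])
    supported = [c for c in declared if c in CLIENT_SUPPORTS]
    ignored = [c for c in declared if c not in CLIENT_SUPPORTS]
    return supported, ignored


# Stand-in for the attributes of a Zarr group, using the example list above.
attrs = {
    "conventions-classes": [
        "eo-core", "eo-multispectral", "eo-multiscale",
        "eo-quicklook", "eo-symbology",
    ]
}

supported, ignored = split_classes(attrs)
print("supported:", supported)
print("ignored:", ignored)
```

The client can still read the underlying arrays through any Zarr implementation; the class list only tells it which domain semantics it can safely interpret.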

Furthermore, I do not see a clear indication in the process of how the conventions become listed on https://zarr-specs.readthedocs.io/ under a specific domain section.

Regards,

rabernat commented 1 year ago

Thanks for everyone's patience, and apologies for being slow to finish up this draft.

I plan to prioritize this over the next few weeks. I'll respond to the comments above and push a new draft that incorporates the feedback.

rabernat commented 1 year ago

I have finally updated this ZEP. Thanks everyone for the patience. In my update, I incorporated the following changes

My goal here is to include the very good ideas that have been proposed in the discussion above as recommended best practices while retaining the ability to support legacy conventions and practices already in use in the community.

MSanKeys963 commented 1 year ago

Thanks for completing it, @rabernat, and everyone for reviewing this. Merging this as discussed in the ZEP Meeting today.

ZEP0004 is live here: https://zarr.dev/zeps/draft/ZEP0004.html.

ivirshup commented 1 year ago

@MSanKeys963, did a discussion get opened for this?

rabernat commented 1 year ago

Hi Isaac! I believe it's on me to open a PR where the discussion will happen. I will try to do that today.

ivirshup commented 1 year ago

👍

Is it not meant to be a Discussion (as opposed to a PR/ Issue)? I think this kind of discussion heavily benefits from threading.

Maybe this changed: https://github.com/zarr-developers/zeps/pull/27 ?

rabernat commented 1 year ago

TBH I think the ZEP process still has a lot of details to be ironed out. I agree a discussion makes sense.

tasansal commented 1 year ago

This idea is excellent; I would love to help push this forward. What is the best way to collaborate?

rabernat commented 1 year ago

@tasansal - the discussion is continuing in https://github.com/zarr-developers/zarr-specs/pull/262

The best way to collaborate would be to share your use cases there.