zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Extension for separate user attributes #72

Open alimanfoo opened 4 years ago

alimanfoo commented 4 years ago

In the current v3 protocol spec draft, an array metadata document contains both core metadata (like array data type, shape, chunk shape, etc.) and user attributes. E.g.:

{
    "shape": [10000, 1000],
    "data_type": "<f8",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [1000, 100]
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    },
    "fill_value": "NaN",
    "extensions": [],
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The same is true for groups, i.e., both core metadata and user attributes are stored together in a single group metadata document, e.g.:

{
    "extensions": [],
    "attributes": {
        "spam": "ham",
        "eggs": 42
    }
}

The zarr v3 approach of storing core metadata and user attributes together is similar to N5.

Note that this is different from the zarr v2 protocol, where the core metadata and user attributes are stored in separate documents.
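To make the v2 contrast concrete, here is a sketch (using a plain dict as a hypothetical in-memory store) of the zarr v2 layout, where core metadata lives in a `.zarray` document and user attributes in a separate `.zattrs` document under the same prefix:

```python
import json

# Hypothetical in-memory store sketching the zarr v2 layout: core metadata
# (".zarray") and user attributes (".zattrs") are separate documents.
store = {
    "myarray/.zarray": json.dumps({
        "zarr_format": 2,
        "shape": [10000, 1000],
        "chunks": [1000, 100],
        "dtype": "<f8",
        "compressor": {"id": "gzip", "level": 1},
        "order": "C",
        "fill_value": "NaN",
        "filters": None,
    }),
    "myarray/.zattrs": json.dumps({"foo": 42, "bar": "apples"}),
}

# User attributes can be read (or rewritten) without touching core metadata:
attrs = json.loads(store["myarray/.zattrs"])
```

Either document can be fetched, cached, or updated independently, which is the property the separate-documents approach trades against the extra round trip.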

Raising this issue to surface any comments or discussion on this approach.

Some possible reasons for bringing core metadata and user attributes together into a single document:

Some possible reasons for separating core metadata and user attributes into different documents:

joshmoore commented 4 years ago

There was some discussion around this at today's community meeting. Further discussion at the spec meeting is needed.

DennisHeimbigner commented 2 years ago

I want to put in a word for keeping the attributes in a separate document. One persistent problem is that the spec developers do not seem to consider that datasets exist with very large amounts of metadata, including many groups, many variables, and many attributes. The netcdf group at Unidata has been grappling with this for some time; we currently support lazy reading of groups, variables, and attributes precisely because of this. The more you combine the metadata, the slower it becomes to open a dataset. This issue has also come up in the context of version 2, where there is informal support for the ultimate case of putting all the metadata into a single object.

Bottom line: when building this spec, always consider the case where the object you create is very large. Merging all the metadata only makes that case worse.
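The lazy-reading pattern described above can be sketched in a few lines. This assumes (hypothetically) that attributes live in their own store document, so opening a node never pays for the attributes object until one is actually requested:

```python
import json

class LazyAttributes:
    """Sketch of lazy attribute loading: the (possibly huge) attributes
    document is fetched from the store only on first access, not when the
    node is opened. The store key used here is hypothetical."""

    def __init__(self, store, key):
        self._store = store
        self._key = key
        self._cache = None  # nothing is read yet

    def _load(self):
        if self._cache is None:  # fetch and parse on first use only
            self._cache = json.loads(self._store[self._key])
        return self._cache

    def __getitem__(self, name):
        return self._load()[name]

store = {"mygroup/attrs.json": json.dumps({"spam": "ham", "eggs": 42})}
attrs = LazyAttributes(store, "mygroup/attrs.json")  # no store read here
value = attrs["eggs"]  # the document is fetched only now
```

If attributes were inlined into the core metadata document, this deferral would be impossible: parsing the core metadata would always pull in all attributes.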

jstriebel commented 1 year ago

I'm wondering if it should be a core feature to have the attributes separate, or if it could be an extension. Any opinions on that? If it's supposed to be a core feature, should it always be in a separate document, or can it be inlined into the other metadata as it currently is, too?

jstriebel commented 1 year ago

If there are no objections, I'd leave this as an extension (TBD) for now.

rabernat commented 1 year ago

@DennisHeimbigner - do you think that your issues could be addressed via an extension?

DennisHeimbigner commented 12 months ago

The meaning of "extension" is completely undefined, so I have no idea if it is possible or not. My guess is that attributes as a separate document is not doable within any reasonable and limited notion of "extension".

rabernat commented 12 months ago

Thanks for your feedback Dennis. Are you saying that the extension points in the V3 spec are not well defined enough? Any suggestions on how to make that better?

DennisHeimbigner commented 12 months ago

Short answer: no, they are not well defined enough. But let's try to think about an extension for moving attributes to a separate object. As near as I can tell, none of the listed extension points try to create a new, separate object. My first thoughts are that such an extension would need, at least, something like this:

  1. an indication of where the object can be situated in the meta-data JSON tree.
  2. a specification of the contents of that object
  3. If, as in this case, the new object is some combination of information from elsewhere in the metadata tree, then a mapping of some sort between the new object and the original metadata, and an indication of whether or not the original metadata is to be deleted.

In the worst case, we are talking about an arbitrary rewrite of the JSON metadata, which is not desirable, so we also need to somehow limit what kind of transformations are needed. Perhaps we need something similar to XSLT.

Needs more thought.

DennisHeimbigner commented 12 months ago

Another possibility to address this: we define a notion of a "pointer" and allow dict key values to be converted to a separate object in the zarr file tree. We then allow a new kind of JSON object, more or less a "pointer" to another object in the zarr tree. The semantics would be that the contents of the pointed-to object are, in effect, inserted into the JSON tree in place of the pointer object.
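A minimal sketch of that pointer semantics, assuming a hypothetical `{"pointer": <store key>}` syntax (the key name and the store layout are invented for illustration):

```python
import json

def resolve(node, store):
    """Recursively replace {"pointer": <store key>} objects with the parsed
    contents of the referenced store object, as if that content had been
    written inline. The "pointer" syntax here is hypothetical."""
    if isinstance(node, dict):
        if set(node) == {"pointer"}:
            # Substitute the pointed-to document (which may itself contain
            # further pointers) in place of the pointer object.
            return resolve(json.loads(store[node["pointer"]]), store)
        return {k: resolve(v, store) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(v, store) for v in node]
    return node

store = {"myarray/attributes.json": json.dumps({"foo": 42, "bar": "apples"})}
meta = {
    "shape": [10000, 1000],
    "attributes": {"pointer": "myarray/attributes.json"},
}
resolved = resolve(meta, store)
```

A reader that understands pointers sees the same logical tree as the inline form; whether readers that do not understand them should fail or skip is exactly the `must_understand` question discussed below.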

DennisHeimbigner commented 12 months ago

Another thought about extensions. It seems like all of the current set that address metadata could be subsumed by specifying an XSLT(?)-like path template to a dictionary, plus a template specifying a new key and value that can be inserted at all points in the metadata that match the path template. So we could write a template that covers all zarr.json objects with node_type == array, and then insert a new key into each such zarr.json object along with some JSON value describing the specific extension. This approach would also allow extensions to build on other extensions.
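The template idea above might look something like the following sketch, with a plain callable standing in for the XSLT-like path template (both the function and the extension key are invented for illustration):

```python
def apply_to_matching(documents, predicate, key, value):
    """Sketch of the path-template idea: insert `key: value` into every
    metadata document matched by `predicate`. A real mechanism would use
    a declarative path template rather than a Python callable."""
    for doc in documents:
        if predicate(doc):
            doc[key] = value
    return documents

metas = [
    {"node_type": "array", "shape": [10, 10]},
    {"node_type": "group"},
]
# "Cover all zarr.json objects with node_type == array" and insert a
# hypothetical extension key into each match:
apply_to_matching(
    metas,
    lambda d: d.get("node_type") == "array",
    "my_extension",
    {"configuration": {"level": 1}},
)
```

Because the inserted value is itself ordinary metadata, a second extension's template could match on keys introduced by the first, which is what allows extensions to build on other extensions.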

DennisHeimbigner commented 12 months ago

This might be relevant: https://softwareengineering.stackexchange.com/questions/264706/xslt-equivalent-for-json

jbms commented 12 months ago

Agreed that none of the existing "extension points" are applicable for this purpose. However, at least as I see it, an "extension" more generally is really any future change to the spec that maintains some amount of compatibility with the existing spec (as otherwise we would just consider it an independent spec rather than an extension), and that would certainly be able to support this feature.

It sounds like you are proposing a more general mechanism for moving arbitrary subtrees of the JSON to separate keys in the underlying store; that is an interesting possibility. I can also imagine a more limited feature specifically to allow user-defined attributes to be specified separately.

DennisHeimbigner commented 12 months ago

My worry is that the notion of "extension" has no current definition. It needs at least a semi-formal (i.e. prose) definition. Perhaps we can start by saying who can implement new extensions. Do they have to be approved by the ZEP committee?

jbms commented 12 months ago

Yes, perhaps it should be clarified.

It seems to me that a typical approach for creating an extension would be to first create an (experimental) implementation, then create a ZEP proposal to standardize it.

For extensions outside the realm of the indicated "extension points", any changes to the zarr.json metadata file should take into account interoperability issues:

DennisHeimbigner commented 12 months ago

Actually, this raises a question. Suppose I want to add a non-standard key to, say, a configuration dictionary. Can I insert "must_understand": false in that configuration dictionary, or must must_understand occur at the top level of the zarr.json object?

jbms commented 12 months ago

That is a good point. Currently the spec does not specify this, so it is left up to the definition of the specific configuration, and those definitions aren't currently very explicit about it. Tensorstore currently only supports must_understand at the top level, but it would probably be better to support it at all levels. That would be an easy, backwards-compatible change.
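Supporting the flag at all levels could be sketched as a recursive walk: an unrecognized key is an error unless the object it introduces carries "must_understand": false (here assuming, as v3 does for top-level members, that unknown metadata must be understood by default; the `known` set is invented for the example):

```python
def unknown_required_keys(node, known, path=""):
    """Sketch: walk parsed zarr.json metadata and report the path of every
    unrecognized key whose value does not opt out via
    "must_understand": false, checking at every nesting level."""
    problems = []
    if isinstance(node, dict):
        for k, v in node.items():
            child = f"{path}/{k}"
            if k not in known and isinstance(v, dict) and v.get("must_understand", True):
                problems.append(child)  # unknown and must be understood
            problems.extend(unknown_required_keys(v, known, child))
    elif isinstance(node, list):
        for i, v in enumerate(node):
            problems.extend(unknown_required_keys(v, known, f"{path}/{i}"))
    return problems

meta = {
    "shape": [10, 10],
    "my_ext": {"must_understand": False, "configuration": {}},  # safe to skip
    "other_ext": {"configuration": {}},  # unknown, must be understood
}
known = {"shape", "must_understand", "configuration"}
```

With this reading, a reader can safely skip "my_ext" but must refuse to open metadata containing "other_ext", regardless of how deeply either key is nested.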