zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Extension for attribute datatypes? #229

Open ajelenak opened 1 year ago

ajelenak commented 1 year ago

Hello!

Hope this is a good place for my question: has there been any interest before in more explicit Zarr attribute datatypes? My understanding of the Zarr v3 draft specification is that attribute values can be of any valid JSON datatype. Is the expectation that casting, for example, an attribute scalar value of 1 to a specific software datatype like int32 or uint64 will be done by Zarr readers depending on the context?
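To make the question concrete, here is a minimal sketch using only the Python standard library. The attribute names and values are made up for illustration. It shows that a JSON number arrives without any width or signedness, so the reader has to pick one:

```python
import json
import struct

# Hypothetical attributes, as they might appear in a JSON metadata document.
raw = '{"scale": 1, "count": 42}'
attrs = json.loads(raw)

# JSON only has "number"; Python's json module returns a plain int here.
value = attrs["scale"]

# The reader decides how to interpret it:
# "<i" packs a little-endian int32, "<Q" packs a little-endian uint64.
as_int32 = struct.pack("<i", value)
as_uint64 = struct.pack("<Q", value)

print(type(value).__name__, len(as_int32), len(as_uint64))
```

Both interpretations succeed for the same stored value, and nothing in the JSON says which one the writer intended.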

Thanks!

rabernat commented 1 year ago

Hi Alex! There has been some discussion on this, yes. I'm going to transfer your issue to zarr-specs just because that's where most of the other discussion is. (Yes, I agree we have too many repos.)

Related

rabernat commented 1 year ago

Here's my personal idea for the best way to implement this.

We define a zarr v3 extension for an external attributes file (the V3 spec puts attributes in "zarr.json") and allow this to be a binary storage format like CBOR, msgpack, etc. That would solve both the "large amount of metadata" and the "explicit attribute datatypes" issues in one go.

jbms commented 1 year ago

As Ryan noted, we've had a lot of discussions about attributes and a number of solutions have been proposed but no consensus has yet been reached.

I think it may be helpful to distinguish between a number of different issues:

  1. Ability to represent a richer set of data values as attributes than is supported by JSON, e.g. byte strings, non-finite floating point numbers, timestamps, potentially any valid zarr v3 data type, potentially a multi-dimensional array
  2. Explicitly associating a data type, such as int32, uint64 (in some form) with attributes, making it possible to distinguish e.g. 42 of type int32 and 42 of type uint64. There is also the question of whether this typing should apply only to scalars, or if it should be supported for containers as well, e.g. is there just a generic array type or is there array<uint32_t> and array<map<unicode_string, uint32_t>>.

CBOR provides a way to encode a richer set of values (1) but does not by itself provide a way to distinguish between e.g. an int32 and a uint64. However, CBOR does provide a way to associate an arbitrary integer tag with a value, and in principle zarr could define tag values to indicate data types. I don't know how well that would work in practice, e.g. how well it would allow the zarr metadata to be read and written by other tools, but it might work reasonably well.
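As a rough illustration of the tagging idea, the sketch below hand-encodes a tagged unsigned integer following the CBOR wire format (major type 6 for tags, major type 0 for unsigned integers). The tag numbers 60001 and 60002 are invented for this example; they are not registered and not something zarr defines. The helper names are hypothetical too, and only non-negative integers are handled:

```python
import struct

# Made-up tag numbers standing in for "int32" and "uint64" (assumption, not a zarr convention).
TAG_INT32 = 60001
TAG_UINT64 = 60002

def cbor_head(major: int, value: int) -> bytes:
    """Encode a CBOR initial byte plus its argument for the given major type."""
    if value < 24:
        return bytes([(major << 5) | value])
    if value < 2**8:
        return bytes([(major << 5) | 24, value])
    if value < 2**16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", value)
    if value < 2**32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", value)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", value)

def tagged_uint(tag: int, value: int) -> bytes:
    """Wrap a non-negative integer (major type 0) in a tag (major type 6)."""
    return cbor_head(6, tag) + cbor_head(0, value)

as_int32 = tagged_uint(TAG_INT32, 42)
as_uint64 = tagged_uint(TAG_UINT64, 42)
print(as_int32.hex(), as_uint64.hex())
```

The same integer 42 yields two different byte strings that differ only in the tag, which is the distinction (2) asks for. A generic CBOR tool that does not know these tags would still see the integer, but how it surfaces or preserves the tag depends on the tool.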

msgpack similarly provides a way to encode a richer set of values but does not particularly help with (2). It provides an extension mechanism but I don't think the extension mechanism would work well for the purpose of indicating a data type.

ajelenak commented 1 year ago

Thank you @rabernat for routing my question to a more appropriate place.

@jbms I agree with your breakdown of use cases. For the container cases like array<uint32_t> the approach could be the same as for Zarr arrays: a shape and a datatype.
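A minimal sketch of what "a shape and a datatype" might look like for an attribute, staying within JSON. The entry layout (`data_type`, `shape`, `value` with a flat, row-major list) and the function name are assumptions for illustration only, not an existing or proposed convention:

```python
import json
from math import prod

# Inclusive ranges for a few integer type names (illustrative subset).
INT_RANGES = {
    "int32": (-2**31, 2**31 - 1),
    "uint32": (0, 2**32 - 1),
    "uint64": (0, 2**64 - 1),
}

def read_typed_attr(entry: dict) -> tuple:
    """Validate a hypothetical typed attribute entry and return (data_type, shape, values)."""
    data_type = entry["data_type"]
    shape = tuple(entry["shape"])
    # An empty shape denotes a scalar; otherwise "value" is a flat list.
    values = list(entry["value"]) if shape else [entry["value"]]
    if len(values) != prod(shape):  # prod(()) is 1, which covers the scalar case
        raise ValueError("number of values does not match shape")
    lo, hi = INT_RANGES[data_type]
    for v in values:
        if not isinstance(v, int) or not lo <= v <= hi:
            raise ValueError(f"{v!r} is not a valid {data_type}")
    return data_type, shape, values

doc = json.loads("""
{
  "answer": {"data_type": "uint64", "shape": [], "value": 42},
  "bins":   {"data_type": "uint32", "shape": [3], "value": [1, 2, 3]}
}
""")

scalar = read_typed_attr(doc["answer"])
vector = read_typed_attr(doc["bins"])
print(scalar, vector)
```

With this layout, 42 of type int32 and 42 of type uint64 differ only in the `data_type` field, at the cost of wrapping every attribute in an extra object.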

Can this discussion be combined with one of the mentioned issues, or is it better to keep it separate?

jbms commented 1 year ago

I think you will have to decide yourself whether your comments are closely related to one of the existing issues.

I don't think there is an existing issue specifically related to the idea of storing an explicit data type for each attribute. But I believe that is the approach taken by nczarr (netcdf zarr).

In general, while I can see that there may be some value in being explicit about data types, and that it provides better compatibility with the HDF5 data model, it also seems to me that it would introduce a lot of additional complexity and it is not clear exactly which use cases, other than HDF5 compatibility, benefit from it.

In contrast, merely extending the set of values that can be represented seems more promising.

But if you have a compelling proposal for how to add support for explicit data types, I'd certainly be interested.

joshmoore commented 1 year ago

Not finding a better issue so I'll cross-reference here an impending need for datetime from https://github.com/bluesky/tiled/issues/514

joshmoore commented 1 month ago

See also https://github.com/bluesky/tiled/pull/782