zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Raw data type fill value endianness and endian/bytes codec interaction #267

Closed LDeakin closed 1 year ago

LDeakin commented 1 year ago

In the fill value metadata, there is no mention of endianness. This is fine for numerical types, as they have a native endian representation. If a chunk is missing during retrieval, it can be populated with the native endian representation of the fill value without knowledge of the codecs. However, I think raw bytes fill values are problematic because the interpretation depends on the codecs applied. A user looking at the metadata has to understand how the codecs function to know how the fill value will be interpreted.
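As a minimal sketch of the numeric case described above: assuming a uint16 array and a hypothetical helper named `missing_chunk_uint16` (neither is taken from the spec), a missing chunk could be populated in native byte order without looking at the codec chain at all.

```python
import struct


def missing_chunk_uint16(fill_value: int, num_elements: int) -> bytes:
    """Build the in-memory bytes of a missing uint16 chunk (hypothetical helper).

    The "=" prefix asks struct for native byte order with standard sizes,
    so the result does not depend on any codec configuration.
    """
    return struct.pack(f"={num_elements}H", *([fill_value] * num_elements))


chunk = missing_chunk_uint16(258, 3)
print(struct.unpack("=3H", chunk))
```

The question raised in the next paragraph is whether the same shortcut is well defined when the fill value is a sequence of raw bytes rather than a number.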

Consider the case of decoding an array with the gzip and then the endian (soon to be bytes) codec. Since the most significant raw byte in the fill value is not standardised in the specification, the bytes of the fill value would need to be reordered depending on whether the endian codec is big or little endian to be consistent. The raw bytes fill value indicates what the bytes should be after decoding with gzip but before decoding with the endian codec to native endian. I think the user and implementation should not be concerned with what the fill value is in this intermediate representation. Furthermore, consider an array-to-bytes codec that does compression (e.g. zfp) which does not have an endianness parameter; in this case, it is not clear which of the raw bytes is most significant.

I think the text under Raw data types in https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value should explicitly state which byte is most significant. Hopefully I haven't missed something in the spec about this.

My concern is not relevant. As @jbms pointed out, the bytes codec decodes raw bytes in order, regardless of the endian configuration.

LDeakin commented 1 year ago

For a custom data type represented as r40:

struct {
   float32_t a;
   uint8_t b;
}

maybe the endian (bytes) codec needs to support a mode where it will always pass through the data unmodified. In other words, the bytes codec should permit the endianness parameter to be omitted for a raw bytes or custom multi-byte data type so that a consumer of the decoded array can perform endianness corrections as required. EDIT: In this case, the order of bytes in the fill value is intrinsic to the custom data type and not a concern for zarr.
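To make the r40 example concrete, here is a sketch of how a consumer of the decoded array might handle one 5-byte element. It assumes a packed layout with a little-endian float32 followed by a uint8; that layout and the helper names are illustrative assumptions, not something defined by zarr. The list form reflects my reading of the v3 fill-value text, where a raw fill value is written as one integer per byte.

```python
import struct

# Assumption: packed layout, little-endian float32 `a` then uint8 `b` (5 bytes = r40).
LAYOUT = "<fB"


def pack_element(a: float, b: int) -> bytes:
    """Serialise one struct element into its 5 raw bytes."""
    return struct.pack(LAYOUT, a, b)


def unpack_element(raw: bytes) -> tuple:
    """Interpret 5 raw bytes; the field byte order is fixed by the custom type."""
    return struct.unpack(LAYOUT, raw)


def as_fill_value(raw: bytes) -> list:
    """Express raw bytes as a list of integers in [0, 255], in stored order."""
    return list(raw)


raw = pack_element(1.0, 7)
print(as_fill_value(raw))
print(unpack_element(raw))
```

Because the byte order of `a` is part of the custom type's own definition, the stored bytes and the fill value bytes are written in the same order and nothing in the codec chain needs to reorder them.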

jbms commented 1 year ago

In fact it is our intention that endianness does not apply to the raw data types: the endian (soon to be bytes) codec will just decode the bytes in order without regard to the endian configuration option, and indeed that is the behavior of the https://github.com/clbarnes/zarr3-rs/ implementation of zarr v3. I am not aware of any other implementations of zarr v3 that support the raw data type.
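The behaviour described here can be sketched as follows. This is not taken from any implementation; `decode_bytes_codec` is a hypothetical function that handles only uint16 and raw types, purely to contrast the two cases.

```python
import struct


def decode_bytes_codec(encoded: bytes, data_type: str, endian: str = "little") -> list:
    """Hypothetical decode step: numeric types honour `endian`, raw types ignore it."""
    if data_type.startswith("r"):
        # Raw type r<N>: split into N/8-byte elements, bytes kept in stored order.
        size = int(data_type[1:]) // 8
        return [encoded[i:i + size] for i in range(0, len(encoded), size)]
    if data_type == "uint16":
        prefix = "<" if endian == "little" else ">"
        count = len(encoded) // 2
        return list(struct.unpack(f"{prefix}{count}H", encoded))
    raise ValueError(f"unsupported data type in this sketch: {data_type!r}")


data = b"\x01\x02\x03\x04"
print(decode_bytes_codec(data, "uint16", "little"))
print(decode_bytes_codec(data, "uint16", "big"))
print(decode_bytes_codec(data, "r16", "little") == decode_bytes_codec(data, "r16", "big"))
```

For the raw type the `endian` argument has no effect, so a raw fill value can be compared byte for byte with decoded elements.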

LDeakin commented 1 year ago

Okay, that makes sense. I see the bytes spec does indeed say that it only applies to fixed-size numeric types in the abstract. Listing r* as a supported data type added to my confusion, but does this mean custom data types are explicitly unsupported by the bytes codec?

Let's say someone adds a data type extension to a zarr implementation for use with their own datasets, which is fixed size and actually just a uint64. E.g.

"name": "datetime", "configuration": { "unit": "ns" }

It seems it is not spec-compliant to decode this with the endian/bytes codec and a custom codec would be required, even though it is actually a fixed-size numeric type. This was not an issue back when fallback data types were in the spec.

Do you think it would be reasonable to relax the restrictions on the bytes codec so that it supports any fixed-size data type and:

EDIT: To clarify, I have no present use for raw bytes or custom data types. I am just trying to make sure that zarrs, my Rust implementation of zarr V3, is spec-compliant.

jbms commented 1 year ago

I imagined that as additional data types are added to the specification, the specification for the bytes codec and any other codecs that support the data type would also need to be updated as needed.

If an implementation adds additional functionality outside of the zarr v3 specification, then it is up to that implementation to figure out what to do. But to avoid interoperability issues it would be better to try to standardize the additional functionality.

I think trying to come up with language for the specification now that will cover future extensions may be difficult and it is not clear to me exactly what benefit it would provide.

For the endian codec in particular, it is not just a matter of reversing all the bytes or not. E.g. for complex64 we need to reverse the bytes of each float32 component separately, and similarly we need to do something special for a data type like your custom struct { float32_t a; uint8_t b; }. The revised text for the bytes codec states that the endian option must be specified for data types where endianness is relevant, and is ignored for data types where it is not. Presumably for new data types added in the future, the same requirements would make sense.
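The complex64 point can be illustrated with a short sketch. The function name `swap_complex64` is hypothetical; the example only shows why reversing all eight bytes at once gives the wrong result.

```python
import struct


def swap_complex64(buf: bytes) -> bytes:
    """Swap the endianness of one complex64 element.

    Each 4-byte float32 component (real, then imaginary) is reversed on its own.
    """
    if len(buf) != 8:
        raise ValueError("a complex64 element is exactly 8 bytes")
    return buf[0:4][::-1] + buf[4:8][::-1]


little = struct.pack("<2f", 1.0, 2.0)  # real=1.0, imag=2.0, little endian
big = struct.pack(">2f", 1.0, 2.0)     # same value, big endian

print(swap_complex64(little) == big)   # per-component swap matches
print(little[::-1] == big)             # whole-element reversal does not
print(struct.unpack(">2f", little[::-1]))  # it exchanges real and imaginary parts
```

The custom struct from earlier in the thread would need the same kind of per-field treatment, which is why a generic "reverse every element" rule cannot cover arbitrary fixed-size types.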

jbms commented 1 year ago

We removed the fallback data type support precisely because, for multiple reasons, it is not really possible for an implementation to deal with unknown data types. Instead, I expect that an implementation should just return an error if an attempt is made to open an array with an unsupported data type. Therefore I don't think you really need to worry about the behavior for new data types that may be added in the future, until you actually add support for those data types.
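A minimal sketch of that "fail on open" behaviour is shown below. The set of names is my understanding of the core v3 data types and should be checked against the spec; `check_data_type` and `UnsupportedDataTypeError` are hypothetical names, not part of any zarr library.

```python
import re

# Assumption: core v3 data type names as I understand them; verify against the spec.
CORE_DATA_TYPES = {
    "bool", "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64", "complex64", "complex128",
}
RAW_PATTERN = re.compile(r"r(\d+)")


class UnsupportedDataTypeError(ValueError):
    """Raised when array metadata names a data type this sketch does not know."""


def check_data_type(data_type) -> str:
    """Return the data type name if supported, otherwise raise an error."""
    # Extension data types are written as an object with a "name" field.
    name = data_type["name"] if isinstance(data_type, dict) else data_type
    if name in CORE_DATA_TYPES:
        return name
    match = RAW_PATTERN.fullmatch(name)
    if match and int(match.group(1)) > 0 and int(match.group(1)) % 8 == 0:
        return name
    raise UnsupportedDataTypeError(f"unsupported data type: {name!r}")


print(check_data_type("r40"))
try:
    check_data_type({"name": "datetime", "configuration": {"unit": "ns"}})
except UnsupportedDataTypeError as err:
    print(err)
```

With this approach the question of how the bytes codec should treat an unknown type never arises, since the array is rejected before any codec runs.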

LDeakin commented 1 year ago

Thanks @jbms! That clarifies everything.