zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Storing quiet and signalling NaNs in Zarr #194

Open tomwhite opened 2 years ago

tomwhite commented 2 years ago

An IEEE 754 NaN is not a single value, but a set of possible values. While it's possible to store different NaN values in Zarr, there are some subtleties, particularly with fill values.

In sgkit's VCF Zarr format we use a quiet NaN to indicate that a float value is missing, and a signalling NaN for padding to encode variable length (ragged) arrays. Similarly tskit uses a quiet NaN to indicate missing.
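To make the quiet/signalling distinction concrete, here is a minimal sketch that classifies float32 bit patterns using integer arithmetic only. The two constants are illustrative patterns chosen for this example (they are not claimed to be the values sgkit or tskit actually use), and the helper names are hypothetical. It assumes the common IEEE 754-2008 convention in which the most significant mantissa bit is the "quiet" bit.

```python
# Illustrative float32 bit patterns (assumptions for this sketch,
# not the constants used by sgkit or tskit).
QUIET_NAN_BITS = 0x7FC00000       # exponent all ones, quiet bit set
SIGNALLING_NAN_BITS = 0x7FA00000  # exponent all ones, quiet bit clear, payload non-zero

EXPONENT_MASK = 0x7F800000  # the 8 exponent bits of a float32
MANTISSA_MASK = 0x007FFFFF  # the 23 mantissa bits of a float32
QUIET_BIT = 0x00400000      # most significant mantissa bit


def is_nan(bits: int) -> bool:
    """A float32 is NaN when the exponent is all ones and the mantissa is non-zero."""
    return (bits & EXPONENT_MASK) == EXPONENT_MASK and (bits & MANTISSA_MASK) != 0


def is_quiet_nan(bits: int) -> bool:
    """NaN with the quiet bit set."""
    return is_nan(bits) and (bits & QUIET_BIT) != 0


def is_signalling_nan(bits: int) -> bool:
    """NaN with the quiet bit clear (so some other mantissa bit must be set)."""
    return is_nan(bits) and (bits & QUIET_BIT) == 0
```

Working on the integer bit patterns rather than on Python floats is deliberate: converting a signalling NaN through a floating-point register may silently turn it into a quiet NaN on some platforms.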

Because the Zarr fill value encoding allows only a single NaN value, we can't specify a fill value: there is no way to say whether it should be a quiet or a signalling NaN.

In Zarr 2.11.0 there was a change where chunks whose data is entirely equal to the fill value are no longer written to disk by default. This doesn't work for applications using quiet or signalling NaNs, since Zarr doesn't distinguish between NaN values when determining whether all the elements of a chunk are equal to the fill value. So a chunk that has a mixture of regular and quiet/signalling NaNs will not be stored. The workaround is to set write_empty_chunks=True. (However, this is not possible in xarray until the next release after v2022.03.0; see https://github.com/pydata/xarray/pull/6348. Also, I'm not sure it's possible to set the fill value for floats in xarray.)
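The failure mode can be sketched without Zarr itself. The two functions below are hypothetical stand-ins, not Zarr's actual implementation: one treats every NaN as equal to a NaN fill value, the other compares raw bit patterns. The chunk is modelled as little-endian float32 bytes, and the bit patterns are illustrative assumptions.

```python
import struct


def unpack_bits(raw: bytes) -> tuple:
    """Interpret a buffer of little-endian float32 values as 32-bit integers."""
    return struct.unpack(f"<{len(raw) // 4}I", raw)


def is_nan_bits(bits: int) -> bool:
    """float32 NaN: exponent all ones, mantissa non-zero."""
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0


def empty_by_value(raw: bytes) -> bool:
    """Value-level check for a NaN fill value: any NaN counts as 'equal to fill'."""
    return all(is_nan_bits(b) for b in unpack_bits(raw))


def empty_by_bits(raw: bytes, fill_bits: int) -> bool:
    """Bit-level check: only the exact fill pattern counts as 'equal to fill'."""
    return all(b == fill_bits for b in unpack_bits(raw))


# A chunk mixing a quiet NaN pattern (0x7FC00000) with a signalling one (0x7FA00000).
chunk = struct.pack("<3I", 0x7FC00000, 0x7FA00000, 0x7FC00000)

value_level = empty_by_value(chunk)            # True: the chunk would be skipped
bit_level = empty_by_bits(chunk, 0x7FC00000)   # False: the chunk carries information
```

Under the value-level check the chunk looks empty and would be dropped, which loses the distinction between the two NaN kinds; the bit-level check keeps it.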

Are there changes that could be made to Zarr to make working with different NaN values easier?

joshmoore commented 1 year ago

Sorry for the quiet here, @tomwhite. That should very much be interpreted as "nice question!" :smile:

In discussion today during https://zarr.dev/zeps/meetings/, the decision was to transfer this to the zarr-specs repo and try to deal with it as a part of v3. That likely doesn't fix your immediate issue, but it has a much better chance of actually getting solved.

I'll defer to @jstriebel for detailing the proposed solution.

jstriebel commented 1 year ago

Hey @tomwhite,

Thanks for bringing up that question! In the current v3 spec, the fill_value can also be a binary blob. Would this suffice to be able to represent both NaNs for the fill value?

> For core data types for which fill values are not permitted in JSON or for which decimal representation could be lossy, a string representation of the binary (starting with 0b) or hexadecimal value (starting with 0x) is accepted. This string must include all leading or trailing zeroes necessary to match the given type size.

This would also allow any NaN to be defined as the fill value (even specifying the other bits in the NaN is then possible). In this case, the implementation should probably compare the fill value with provided values at the binary level. Does this suffice for your use case?
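As a rough illustration of the quoted rule, the sketch below encodes a float32 bit pattern as a fixed-width hexadecimal string and parses `0x`/`0b` strings back, rejecting strings whose length does not match the type size. The function names are hypothetical, and treating the string as the integer value of the bit pattern (and emitting lowercase hex digits) is an assumption of this sketch rather than something the quoted text spells out.

```python
def encode_fill_value(bits: int, itemsize: int = 4) -> str:
    """Render a bit pattern as a zero-padded hex string, e.g. '0x7fa00000' for float32."""
    return f"0x{bits:0{itemsize * 2}x}"


def decode_fill_value(text: str, itemsize: int = 4) -> int:
    """Parse a '0x...' or '0b...' fill value string into its integer bit pattern."""
    if text.startswith("0x"):
        digits, base, width = text[2:], 16, itemsize * 2
    elif text.startswith("0b"):
        digits, base, width = text[2:], 2, itemsize * 8
    else:
        raise ValueError("fill value string must start with '0x' or '0b'")
    # The string must carry every leading/trailing zero needed for the type size.
    if len(digits) != width:
        raise ValueError(f"expected {width} digits, got {len(digits)}")
    return int(digits, base)


quiet_fill = encode_fill_value(0x7FC00000)
signalling_fill = encode_fill_value(0x7FA00000)
round_trip = decode_fill_value(signalling_fill)
```

Because the two NaN kinds map to different strings, a fill value written this way can be compared with chunk data bit for bit, as suggested above.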

tomwhite commented 1 year ago

Thanks for picking this up @joshmoore and @jstriebel.

> In the current v3 spec, the fill_value can also be a binary blob. Would this suffice to be able to represent both NaNs for the fill value?

This sounds perfect!

joshmoore commented 1 year ago

@jstriebel: what labels are appropriate here then?