zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License

Codec metadata #572

Open juntyr opened 3 weeks ago

juntyr commented 3 weeks ago

Codecs sometimes need to save metadata that is not part of their config and not part of the array shape/type. For instance, a standardisation codec would need to save the mean and standard deviation of the data.

Byte codecs can get away with simply adding a byte header that includes such information, since the general expectation is that the bytes are opaque and should only be losslessly compressed. However, a codec that performs a transformation should ideally retain the data shape and dtype (unless it’s part of what the codec does), so adding a header becomes really awkward.
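To make this concrete, here is a rough sketch of what that header workaround could look like for a standardisation codec (a hypothetical codec, not part of numcodecs, assuming it is handed little-endian float64 array data):

```python
import struct

import numpy as np
from numcodecs.abc import Codec


class Standardize(Codec):
    """Hypothetical standardisation codec that smuggles its per-chunk
    statistics into a fixed-size byte header. The output is an opaque
    byte stream, so the original shape and dtype are lost and must be
    restored elsewhere -- which is exactly the awkward part."""

    codec_id = "standardize-sketch"

    def encode(self, buf):
        arr = np.asarray(buf, dtype="<f8")
        mean = float(arr.mean())
        std = float(arr.std()) or 1.0  # guard against zero std
        # 16-byte header: two little-endian float64 values (mean, std)
        return struct.pack("<dd", mean, std) + ((arr - mean) / std).tobytes()

    def decode(self, buf, out=None):
        mean, std = struct.unpack_from("<dd", buf, 0)
        arr = np.frombuffer(buf, dtype="<f8", offset=16) * std + mean
        if out is not None:
            out_arr = np.asarray(out)
            out_arr[...] = arr.reshape(out_arr.shape)
            return out
        return arr  # flat float64 values; shape information is gone
```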

Is there an established practice for how to handle such metadata?

If not, how feasible would a future API evolution be (one that stays compatible with usage in Zarr) where each codec can attach some JSON-serialisable metadata on encoding and receives it back on decoding (with no metadata remaining the default, for compatibility)?

d-v-b commented 3 weeks ago

Thanks for raising this issue, I think this point is pretty important. I wonder if it's helpful to first solve this problem without thinking about Zarr, then figure out what changes we would need to make to that solution, or to zarr, to make this work.

Suppose normalize_mean.encode(data_in) is a function that takes a contiguous stream of bytes, which can be interpreted as an N-dimensional array with some dtype, and returns a header (or footer) containing the mean of the input, concatenated with an N-dimensional array that has the same shape and dtype as the input but with the mean subtracted. The result contains two pieces: a header (or footer, but I will just use header from now on), and an N-dimensional array.

To reverse this process, normalize_mean.decode(data_out) must use the value in the header to reconstruct data_in, and can return just the original N-dimensional array as a bytestream.

If normalize_mean is the only transformation we are applying, then I don't think there are any composition problems -- we can serialize [header, array] to disk directly, or apply some generic compression routine like gzip to the whole bytestream first.
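As a minimal sketch of that pair (assuming little-endian float64 data and an 8-byte header holding the mean; the function names are just for illustration):

```python
import gzip
import struct

import numpy as np


def normalize_mean_encode(data_in: bytes) -> bytes:
    """Prepend an 8-byte header holding the mean, followed by the
    mean-subtracted array with unchanged shape and dtype."""
    arr = np.frombuffer(data_in, dtype="<f8")
    mean = float(arr.mean())
    return struct.pack("<d", mean) + (arr - mean).tobytes()


def normalize_mean_decode(data_out: bytes) -> bytes:
    """Read the mean back out of the header and add it onto the
    array values that follow."""
    mean = struct.unpack_from("<d", data_out, 0)[0]
    arr = np.frombuffer(data_out, dtype="<f8", offset=8)
    return (arr + mean).tobytes()


data_in = np.arange(4, dtype="<f8").tobytes()
stored = gzip.compress(normalize_mean_encode(data_in))  # [header, array] -> gzip -> disk
assert normalize_mean_decode(gzip.decompress(stored)) == data_in
```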

Now let's introduce a second function, normalize_std, which is just like normalize_mean but instead of subtracting the mean, it divides the data by its standard deviation. What if we call normalize_std.encode(normalize_mean.encode(data_in))? Each application of normalize_*.encode adds a header (and normalize_*.decode will remove that header), so calling encode twice would result in two headers. This seems fine, because it's reversible, and the array values are always accessible by indexing backwards from the end of the bytestream. I think this last point is key -- as long as a specific operation (e.g., "read the last dtype * size bytes from the end of the chunk") always returns the contiguous array frame, then codecs like normalize_* compose just fine.

But this breaks if normalize_std appends a footer, while normalize_mean appends a header, because then we can't find the array values any more.

Bringing this to zarr, I think the last point is key: we can allow codecs to return arrays + metadata, as long as every codec appends, or every codec prepends, that metadata. If we took this approach, then I think we would need to clarify the language of the spec -- instead of stating that "x -> array codecs must return an N-dimensional array", we would state that "x -> array codecs must return bytestreams that contain an N-dimensional array in the last array_size * dtype_size bytes". Someone should check my assumptions here.
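To check that assumption, here is a small sketch (with made-up names) in which both codecs prepend a header and locate the array frame purely via "the last size * itemsize bytes of the chunk", so two applications still leave the array frame at the end:

```python
import struct

import numpy as np

DTYPE = np.dtype("<f8")
SIZE = 4  # elements per chunk, known from the array metadata


def frame(buf: bytes):
    """The contiguous array frame is always the last SIZE * itemsize bytes."""
    split = len(buf) - SIZE * DTYPE.itemsize
    return buf[:split], np.frombuffer(buf[split:], dtype=DTYPE)


def make_codec(stat, forward, backward):
    """Build a header-prepending codec from a statistic and how to apply/undo it."""
    def encode(buf: bytes) -> bytes:
        headers, arr = frame(buf)
        value = float(stat(arr))
        return struct.pack("<d", value) + headers + forward(arr, value).tobytes()

    def decode(buf: bytes) -> bytes:
        value = struct.unpack_from("<d", buf, 0)[0]
        headers, arr = frame(buf[8:])
        return headers + backward(arr, value).tobytes()

    return encode, decode


mean_enc, mean_dec = make_codec(np.mean, lambda a, m: a - m, lambda a, m: a + m)
std_enc, std_dec = make_codec(np.std, lambda a, s: a / s, lambda a, s: a * s)

data_in = np.linspace(0.0, 3.0, SIZE, dtype=DTYPE).tobytes()
chunk = std_enc(mean_enc(data_in))  # two headers prepended, array frame still last
roundtrip = mean_dec(std_dec(chunk))
assert np.allclose(np.frombuffer(roundtrip, dtype=DTYPE),
                   np.frombuffer(data_in, dtype=DTYPE))
```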

Without chunk headers, I don't think there's any other place to put this information. We don't want to store O(num_chunks) metadata in zarr.json / .zarray because many arrays have a lot of chunks.

juntyr commented 3 weeks ago

Thank you for your detailed reply!

While such a byte-level embedding of optional metadata alongside the array might work, I think it should only be framed that way once you get to the storage layer. At the API level, metadata should be fully separate from the array or byte buffer data so that no codec accidentally transforms or lossily compresses it. Conceptually, each codec should have a metadata type (None by default; it should also have some binary embedding), take data and return a tuple[data, meta] on encoding (where tuple[data, None] is special-cased to equal data for compatibility), and take tuple[data, meta] on decoding and return data. A compression pipeline would then maintain a stack of metas, pushing on encoding and popping on decoding, which it could embed as byte-stream headers or footers.
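As a rough sketch of how that could look (made-up names, not a concrete proposal for the numcodecs API):

```python
import json

import numpy as np


class StandardizeWithMeta:
    """Hypothetical codec whose per-chunk statistics travel as separate,
    JSON-serialisable metadata instead of living inside the byte stream."""

    def encode(self, data):
        meta = {"mean": float(data.mean()), "std": float(data.std()) or 1.0}
        return (data - meta["mean"]) / meta["std"], meta

    def decode(self, data, meta):
        return data * meta["std"] + meta["mean"]


class Pipeline:
    """Keeps codec metadata on a stack: push on encoding, pop (in reverse
    order) on decoding. Only the storage layer would decide whether that
    stack ends up as a chunk header, footer, or somewhere else."""

    def __init__(self, codecs):
        self.codecs = codecs

    def encode(self, data):
        metas = []
        for codec in self.codecs:
            result = codec.encode(data)
            # codecs without metadata keep returning plain data (compatibility)
            data, meta = result if isinstance(result, tuple) else (result, None)
            metas.append(meta)
        return data, json.dumps(metas)

    def decode(self, data, metas_json):
        for codec, meta in zip(reversed(self.codecs), reversed(json.loads(metas_json))):
            data = codec.decode(data, meta) if meta is not None else codec.decode(data)
        return data


pipe = Pipeline([StandardizeWithMeta()])
encoded, metas = pipe.encode(np.arange(4.0))
assert np.allclose(pipe.decode(encoded, metas), np.arange(4.0))
```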

juntyr commented 3 weeks ago

Maybe, with some typing and class-wrapping shenanigans, the API could also be written so that passing tuple[data, metaA] into encode results in tuple[data, metaB, metaA], so that method calls can be chained easily (again with special handling for None so that all existing implementations keep having no metadata).
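For example, a wrapper along these lines (purely illustrative) could turn any metadata-producing encode into one that threads an incoming metadata tuple through:

```python
def chainable(codec):
    """Wrap a codec's encode so that (data, *metas) in -> (data, new_meta, *metas) out.
    Plain data in, and codecs that produce no metadata, keep working unchanged."""
    def encode(value):
        data, *metas = value if isinstance(value, tuple) else (value,)
        result = codec.encode(data)
        if isinstance(result, tuple):  # codec produced metadata: prepend it
            data, meta = result
            return (data, meta, *metas)
        return (result, *metas) if metas else result  # no metadata: pass through
    return encode
```

so that, e.g., chainable(normalize_std)(chainable(normalize_mean)(data)) would evaluate to tuple[data, std_meta, mean_meta].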