zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License
128 stars 88 forks source link

Enriching warning and handling when compression and decompression is done in the different `numcodecs` version #654

Open ehgus opened 5 days ago

ehgus commented 5 days ago

I noticed that data compressed with Zstd with numcodecs=0.14.0 cannot decompressed with numcodecs=0.12.1 even though they look compatible except for the checksum. I successfully decompress the data by manually erasing the checksum argument.

The current error message does not tell that the error is due to the use of different version of numcodecs:

  File "numcodecs\\zstd.pyx", line 210, in numcodecs.zstd.Zstd.__init__
TypeError: __init__() got an unexpected keyword argument 'checksum'

I want them to guarantee compatibility for using lower versions or show a more informative error message. My idea is to ignore such additional arguments with a warning and give a clue to the error message when decompression fails. Thank you in advance for your feedback.

jakirkham commented 4 days ago

Looks like this was introduced in PR: https://github.com/zarr-developers/numcodecs/pull/519

@normanrz do you have thoughts on this issue?

normanrz commented 4 days ago

If I understand correctly, 0.14 adds the checksum metadata which means that 0.12 cannot read it anymore.

I think it could be acceptable that data compressed with a newer version would not be compatible with an older version. Since, we are still in 0.x releases, minor versions can be breaking.

However, we could also write out the checksum metadata only if it deviates from the previous default. Are there other cases like this in numcodecs?

dstansby commented 4 days ago

So I think there's a couple of issues to disentangle here:

  1. API backwards compatibility in numcodecs

It looks like your error message is about trying to using the numcodecs 0.12 API the same way as the numcodecs 0.14 API. Since we're < version 1.0, we still allow ourselves to make breaking changes in this API every minor version. I think this is what your orignal issue is about?

  1. Decompression backwards compatibility Again, since we're < version 1.0, we don't guarentee that data compressed with a new version of numcodecs will be able to be decompressed with an older version.

  2. Decompression fowards compatibility Again (again 😄), since we're < version 1.0, we don't have to guarentee that data compressed with an old version of numcodecs will be able to be decompressed with a newer version. In practice I think we really should and have to be forwards compatible, because so much data in the wild depends on numcodecs.

So I think that unfortunately your error is not something we can fix or improve (because we cannot go back to modify older releases), and unfortunately a consequence of numcodecs not having a stable release yet.

Note that the points above are my interpretation of our compatibility guarentees, so might not be what others agree on. We should agree and write these down in the docs. And numcodecs really needs a stable release, after which we're careful about breaking changes, since it's so widely used now.

jakirkham commented 4 days ago

Thanks Norman and David for your feedback! 🙏

Am guessing the focus of the question is on reading data that was previously written consistently.

To that end, think Norman's question above is the one we should focus on here

However, we could also write out the checksum metadata only if it deviates from the previous default. Are there other cases like this in numcodecs?

It wouldn't surprise me if this issue had happened before. Though am not remembering one like this one atm

There have been issues between different libraries/languages before (like zlib/gzip compression: https://github.com/zarr-developers/numcodecs/pull/87 & https://github.com/zarr-developers/numcodecs/issues/169 ). This in turn motivated the following body of work to test reading/writing data between different implementations ( https://github.com/zarr-developers/zarr_implementations ), which in turn led to spec work

That said, think Norman is asking the right question. If we simply leave out the config flag when it is unneeded, it should remain readable with old versions. What are the tradeoffs of doing or not doing this?