zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License
125 stars 87 forks source link

Zstd: ZSTD_getDecompressedSize is obsolete and used incorrectly. #499

Open mkitti opened 9 months ago

mkitti commented 9 months ago

numcodecs currently treats a return value of 0 from ZSTD_getDecompressedSize as an input error. A value of zero could mean one of the following.

  1. empty
  2. unknown
  3. error

https://github.com/zarr-developers/numcodecs/blob/366318f3b82403fe56db5ae647f8747e7a4aaf38/numcodecs/zstd.pyx#L151-L153

Rather numcodecs should use ZSTD_getFrameContentSize which the return value can be differentiated.

  1. 0 means empty
  2. 0xffffffffffffffff, ZSTD_CONTENTSIZE_UNKNOWN, means unknown
  3. 0xfffffffffffffffe, ZSTD_CONTENTSIZE_ERROR, means error

See zstd.h or the manual for a reference. https://github.com/facebook/zstd/blob/7cf62bc274105f5332bf2d28c57cb6e5669da4d8/lib/zstd.h#L195-L203 https://facebook.github.io/zstd/zstd_manual.html

This error arose during the implementation of Zstandard in n5-zarr: https://github.com/saalfeldlab/n5-zarr/pull/35

There the compressor was producing blocks which would return ZSTD_CONTENTSIZE_UNKNOWN. ZSTD_getDecompressedSize would return 0 and numcodecs would incorrectly interpret this as an error.

Handling ZSTD_CONTENTSIZE_UNKNOWN may be difficult.

  1. If a dest buffer is provided, then perhaps that should we set as the expected decompressed size and an error should occur if the decompressed size is not that.
  2. If a dest buffer is not provided, we may need to either use a default or use the streaming API to build an growing buffer until all the data is decompressed.
mkitti commented 9 months ago

We might borrow from python-zstandard here: https://github.com/indygreg/python-zstandard/blob/1dc0596ccce5a3e84467183b35aaab96e445a047/c-ext/decompressor.c#L315-L325

d-v-b commented 9 months ago

what if we just added python-zstandard as a dependency

mkitti commented 9 months ago

We probably should. My suggestion is that we break numcodecs up into smaller packages focusing on individual upstream codecs and then combine them via some metapackage.

It would also be good if we had a mechanism to manage multiple implementations of the same codec.

martindurant commented 9 months ago

I have mentioned this before, but cramjam includes many of these standard compressors, and already compiles and builds everything in latest rust packages. Simply depending on them would simplify this repo a lot.

mkitti commented 9 months ago

cramjam does not do blosc though. I suppose we could just depend on https://github.com/Blosc/python-blosc2 .

I think we may need a new repository or Github organization for this.

Could someone sketch out an optional dependency package structure and how we make that work both in pip and conda or prefix?

martindurant commented 9 months ago

cramjam does not do blosc though

Correct, but it does deal with issues like this specific one. And indeed, I'd be happy to delegate blosc to someone else too ( or push on https://github.com/milesgranger/pyrus-cramjam/issues/110 ).

joshmoore commented 9 months ago

This is sounding a bit as if a WG like the zarr-python-refactoring WG might also be appropriate here.