zarr-developers / numcodecs

A Python package providing buffer compression and transformation codecs for use in data storage and communication applications.
http://numcodecs.readthedocs.io
MIT License
121 stars 82 forks source link

Adds checksum flag to zstd codec #519

Closed normanrz closed 1 week ago

normanrz commented 2 months ago

This PR adds the checksum flag to the zstd codec. This is necessary to support the proposed zstd codec for Zarr3. We need it for the v3 refactoring of zarr-python.

TODO:

normanrz commented 1 month ago

I would love a review on this so that we can ship this in the upcoming zarr-python 3 release.

d-v-b commented 1 month ago

@mkitti would you mind giving this a look?

martindurant commented 1 month ago

A good time to switch to cramjam? Does this zstd provide anything that that one doesn't?

normanrz commented 1 month ago

A good time to switch to cramjam? Does this zstd provide anything that that one doesn't?

cramjam also does not expose the checksum option.

mkitti commented 1 month ago

I'm looking. We should normalize the implementation here such that negative compression levels are passed on to the C library and that the default compression level is the default compression level of the underlying C library.

mkitti commented 1 month ago

The default CLEVEL here should be changed to 0. That should be passed on the C library directly without further logic.

DEFAULT_CLEVEL = 0

normanrz commented 1 month ago

The default CLEVEL here should be changed to 0. That should be passed on the C library directly without further logic.

DEFAULT_CLEVEL = 0

I made that change. This is a breaking change, though.

mkitti commented 2 weeks ago

I made that change. This is a breaking change, though.

The change is that we used to default to compression level 1 explicitly, overriding the configuration of the underlying Zstandard C library.

Now we pass 0 as the compression level, which uses the default compression level of the C library. The default compression level is usually 3.

For reference see https://facebook.github.io/zstd/zstd_manual.html#Chapter5

Do we provide access to ZSTD_defaultCLevel(void)?

normanrz commented 2 weeks ago

Do we provide access to ZSTD_defaultCLevel(void)?

We do not. For what would that be useful?

normanrz commented 2 weeks ago

Are there any objections for changing the default compression level from 1 to 3 (i.e. default of the underlying zstd c library) @zarr-developers/python-core-devs?

codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.91%. Comparing base (bef2e16) to head (696e582).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #519 +/- ## ======================================= Coverage 99.91% 99.91% ======================================= Files 59 59 Lines 2312 2319 +7 ======================================= + Hits 2310 2317 +7 Misses 2 2 ``` | [Files](https://app.codecov.io/gh/zarr-developers/numcodecs/pull/519?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zarr-developers) | Coverage Δ | | |---|---|---| | [numcodecs/tests/test\_zstd.py](https://app.codecov.io/gh/zarr-developers/numcodecs/pull/519?src=pr&el=tree&filepath=numcodecs%2Ftests%2Ftest_zstd.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=zarr-developers#diff-bnVtY29kZWNzL3Rlc3RzL3Rlc3RfenN0ZC5weQ==) | `100.00% <100.00%> (ø)` | |
mkitti commented 2 weeks ago

We do not. For what would that be useful?

That would be useful to checking the configuration of the Zstandard C Library. In other words, it would indicate what a compression level of 0 actually corresponds to. By default it is 3, but someone could have configured it differently at compile time:

https://github.com/facebook/zstd/blob/17b531501670781f37fc3e5070a29eede09bca3b/lib/zstd.h#L128-L130

normanrz commented 2 weeks ago

But that would mean that people have a modified version of numcodecs with a modified zstd library within blosc. Given that a main purpose of numcodecs is to deliver wheels, that seems pretty unlikely to me.

mkitti commented 1 week ago

Note that conda-forge builds Blosc with an independently configured zstandard library: https://github.com/conda-forge/blosc-feedstock/blob/main/recipe%2Fmeta.yaml#L27

normanrz commented 1 week ago

@mkitti I added the Zstd.{default,min,max}_level properties.

mkitti commented 1 week ago

Currently, we depend on Zstd frame content size being encoded. However, we do know what the uncompressed size of a chunk should be from array information. Perhaps we should not generate an error if the frame content size is not "known" to Zstandard because it is, in fact, known to us.

The frame content size may not be encoded if the encoder is using streaming mode and did not pledge the full size of the content in beginning.

martindurant commented 1 week ago

we do know what the uncompressed size of a chunk should be from array information

Only if this is the only stage in the decompression pipeline. Multiple byte compression/encoding stages are allowed, or indeed, not shape-preserving array operations too.

Side note: I don't believe that zstd, zstandard or cramjam.zstd need the total size in order to decompress, but can use it as an optimization. Also, at least the latter can do decompress_into where the target could be the final array buffer (if it is a contiguous block).

normanrz commented 1 week ago

Currently, we depend on Zstd frame content size being encoded. However, we do know what the uncompressed size of a chunk should be from array information. Perhaps we should not generate an error if the frame content size is not "known" to Zstandard because it is, in fact, known to us.

The frame content size may not be encoded if the encoder is using streaming mode and did not pledge the full size of the content in beginning.

I think that change would be out of scope for this PR. We should continue this discussion in a new issue.