Closed · normanrz closed this 1 week ago
I would love a review on this so that we can ship this in the upcoming zarr-python 3 release.
@mkitti would you mind giving this a look?
A good time to switch to cramjam? Does this zstd provide anything that that one doesn't?
> A good time to switch to cramjam? Does this zstd provide anything that that one doesn't?

`cramjam` also does not expose the `checksum` option.
I'm looking. We should normalize the implementation here such that negative compression levels are passed on to the C library and that the default compression level is the default compression level of the underlying C library.
The default `CLEVEL` here should be changed to `0`, which should be passed on to the C library directly without further logic:

`DEFAULT_CLEVEL = 0`
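The proposed normalization can be sketched as follows (a minimal sketch with assumed names, not the actual numcodecs source):

```python
# Sketch of the proposed behavior: the level is forwarded to the C
# library unchanged, so 0 means "use the library's own default"
# (usually 3) and negative levels select zstd's fast modes.
DEFAULT_CLEVEL = 0

def normalize_level(level: int = DEFAULT_CLEVEL) -> int:
    # No clamping or remapping: the C library interprets the value.
    return level

print(normalize_level())    # 0 -> library default
print(normalize_level(-5))  # negative levels pass through unchanged
```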
I made that change. This is a breaking change, though.
> I made that change. This is a breaking change, though.
The change is that we used to default to compression level 1 explicitly, overriding the configuration of the underlying Zstandard C library. Now we pass `0` as the compression level, which uses the default compression level of the C library, usually `3`.
For reference see https://facebook.github.io/zstd/zstd_manual.html#Chapter5
Do we provide access to `ZSTD_defaultCLevel(void)`?
We do not. For what would that be useful?
Are there any objections for changing the default compression level from 1 to 3 (i.e. default of the underlying zstd c library) @zarr-developers/python-core-devs?
All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.91%. Comparing base (`bef2e16`) to head (`696e582`).
> We do not. For what would that be useful?
That would be useful for checking the configuration of the Zstandard C library. In other words, it would indicate what a compression level of `0` actually corresponds to. By default it is `3`, but someone could have configured it differently at compile time:
https://github.com/facebook/zstd/blob/17b531501670781f37fc3e5070a29eede09bca3b/lib/zstd.h#L128-L130
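For illustration, the compiled library's configuration could be inspected from Python via `ctypes` (a sketch; it assumes a shared `libzstd` is discoverable on the system and new enough to export `ZSTD_defaultCLevel`, which was added in zstd 1.5.0):

```python
import ctypes
import ctypes.util

# Locate the system zstd shared library, if any (platform-dependent).
libname = ctypes.util.find_library("zstd")
default_level = None
if libname:
    libzstd = ctypes.CDLL(libname)
    try:
        libzstd.ZSTD_defaultCLevel.restype = ctypes.c_int
        default_level = libzstd.ZSTD_defaultCLevel()
    except AttributeError:
        pass  # symbol only exists in zstd >= 1.5.0

# Usually 3, but a custom build may have configured it differently.
print(default_level)
```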
But that would mean that people have a modified version of numcodecs with a modified zstd library within blosc. Given that a main purpose of numcodecs is to deliver wheels, that seems pretty unlikely to me.
Note that conda-forge builds Blosc with an independently configured zstandard library: https://github.com/conda-forge/blosc-feedstock/blob/main/recipe%2Fmeta.yaml#L27
@mkitti I added the `Zstd.{default,min,max}_level` properties.
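A hedged usage sketch of the new introspection properties (the names are taken from the comment above; this assumes a numcodecs build that includes this PR, so the sketch guards against older or missing installations):

```python
# Attempt to import numcodecs; it may not be installed here.
try:
    from numcodecs import Zstd
except ImportError:
    Zstd = None

codec = None
if Zstd is not None:
    codec = Zstd()
    if hasattr(codec, "default_level"):
        # Report the level bounds of the bundled Zstandard library.
        print(codec.default_level, codec.min_level, codec.max_level)
```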
Currently, we depend on Zstd frame content size being encoded. However, we do know what the uncompressed size of a chunk should be from array information. Perhaps we should not generate an error if the frame content size is not "known" to Zstandard because it is, in fact, known to us.
The frame content size may not be encoded if the encoder is using streaming mode and did not pledge the full size of the content at the beginning.
> we do know what the uncompressed size of a chunk should be from array information
Only if this is the only stage in the decompression pipeline. Multiple byte compression/encoding stages are allowed, as are non-shape-preserving array operations.

Side note: I don't believe that zstd, zstandard or cramjam.zstd need the total size in order to decompress, but they can use it as an optimization. Also, at least the latter can do `decompress_into`, where the target could be the final array buffer (if it is a contiguous block).
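The fallback being proposed could look roughly like this (a sketch with assumed helper and sentinel names, not numcodecs code; the real C API signals a missing size with `ZSTD_CONTENTSIZE_UNKNOWN`):

```python
# Stand-in for ZSTD_CONTENTSIZE_UNKNOWN (an assumed sentinel value).
CONTENTSIZE_UNKNOWN = -1

def output_size(size_from_header: int, expected_nbytes=None) -> int:
    # Prefer the size encoded in the zstd frame header; if the frame
    # was produced in streaming mode without a pledged size, fall back
    # to the size the caller already knows from chunk/array metadata.
    if size_from_header != CONTENTSIZE_UNKNOWN:
        return size_from_header
    if expected_nbytes is not None:
        return expected_nbytes
    raise ValueError("decompressed size is not known")

print(output_size(4096))        # header carries the size
print(output_size(-1, 4096))    # streaming frame, size known to us
```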
> Currently, we depend on Zstd frame content size being encoded. However, we do know what the uncompressed size of a chunk should be from array information. Perhaps we should not generate an error if the frame content size is not "known" to Zstandard because it is, in fact, known to us.
>
> The frame content size may not be encoded if the encoder is using streaming mode and did not pledge the full size of the content at the beginning.
I think that change would be out of scope for this PR. We should continue this discussion in a new issue.
This PR adds the `checksum` flag to the `zstd` codec. This is necessary to support the proposed `zstd` codec for Zarr v3. We need it for the v3 refactoring of zarr-python.

TODO: