jhamman opened 1 month ago
thoughts on defaulting to Zstandard?
Does `compressor` mean "codec" in the v3 example? If so, then I think the "default" codec is implicitly the bytes codec, since the v3 codec pipeline requires that exactly one array->bytes codec be present in the pipeline.
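To make the pipeline rule above concrete, here is an illustrative sketch (plain dicts standing in for codec configs, not actual zarr-python API) of validating that a v3 pipeline contains exactly one array->bytes codec:

```python
# Illustrative subset of array->bytes codec names from the v3 spec.
ARRAY_TO_BYTES = {"bytes", "sharding_indexed"}

def validate_pipeline(codecs: list[dict]) -> None:
    """Check the v3 invariant: exactly one array->bytes codec per pipeline."""
    n = sum(1 for c in codecs if c["name"] in ARRAY_TO_BYTES)
    if n != 1:
        raise ValueError(f"expected exactly one array->bytes codec, found {n}")

# Minimal "no compression" pipeline: just the bytes codec.
validate_pipeline([{"name": "bytes", "configuration": {"endian": "little"}}])

# bytes + zstd is still valid: zstd is a bytes->bytes codec.
validate_pipeline([
    {"name": "bytes", "configuration": {"endian": "little"}},
    {"name": "zstd", "configuration": {"level": 0, "checksum": False}},
])
```

In this framing, "no default compressor" still leaves a valid pipeline, because the bytes codec alone satisfies the invariant.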
the v3 example is creating a zarr v2 array, so the "compressor" concept still applies there.
but it's also worth considering what the default codec pipeline should be when creating zarr v3 arrays. If it were easy to specify, then I think defaulting to a sharding codec with an inner chunk size equal to the outer chunk size (i.e., a minimal shard) would actually be a good default, but we would have to see how this looks.
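To make the "minimal shard" idea concrete, here is a sketch of what such a codec configuration might look like. The codec names follow the v3 sharding extension; the shapes and the inner pipeline are placeholders, not a recommended configuration:

```python
# Hypothetical example chunk shape for the array's (outer) chunks.
outer_chunk_shape = (1024, 1024)

# "Minimal shard": the sharding codec's inner chunk shape equals the
# outer chunk shape, so each shard holds exactly one inner chunk.
sharding_codec = {
    "name": "sharding_indexed",
    "configuration": {
        "chunk_shape": outer_chunk_shape,  # inner chunks == outer chunks
        "codecs": [{"name": "bytes"}, {"name": "zstd"}],  # per-inner-chunk pipeline
        "index_codecs": [{"name": "bytes"}, {"name": "crc32c"}],
    },
}

assert sharding_codec["configuration"]["chunk_shape"] == outer_chunk_shape
```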
I think we should make the defaults exactly the same as they are in v2.
I do not think we should make sharding the default until we have spent more time optimizing and debugging it.
> I think we should make the defaults exactly the same as they are in v2.
The default compressor in v2 was `Blosc()` with no arguments, if `Blosc` is available, falling back to `Zlib`; `Blosc` itself takes parameters to determine which compression scheme to use, so I think leaving everything default in v2 is a mistake that we don't want to repeat.
I'd be fine with keeping `Blosc` as the default in zarr-python v3, but we should explicitly configure it, and if we are doing that it might be worth considering the pros / cons of using something other than the numcodecs default blosc configuration. In numcodecs, the default configuration for `Blosc` uses the `lz4` compressor, which IIRC is rather fast but doesn't compress well. I understand the urge to preserve v2 behavior, but on the other hand if lz4 has significantly worse performance than another blosc compressor (e.g., zstd), then we should consider the value of giving zarr users a good default.
cc @mkitti
Before defaulting to Blosc in Zarr v3, we should really fix the issue in https://github.com/zarr-developers/zarr-python/issues/2171 . That is, there should probably be an `ArrayBytesBloscCodec` that can actually transmit dtype / typesize information correctly to Blosc. Perhaps one could follow the zarr-java implementation and default the typesize to the dtype.
Also, continuing to encode new data with Blosc v1 is unwise given the current stage-of-life of that package. Blosc v1 is in what I will term "community maintenance mode". Unless you are actively thinking about it, I would not assume that anyone is actively maintaining the package.
My recommendations from most favored to least are:
I would be reluctant to pick Blosc, given that it isn't very actively maintained and the blosc maintainers would rather folks transitioned to Blosc2. In this context I think we have a responsibility to move away from providing Blosc (version 1) as the default compressor, and zarr-python v3 seems like a good point to do that.
I'd be 👍 to defaulting to no compression. This would then force users to learn about compression and choose a compressor that works well for them.
> I think we should make the defaults exactly the same as they are in v2.
@rabernat can you expand a bit on why you think we should keep the same defaults?
> This would then force users to learn about compression and choose a compressor that works well for them.
Instead of learning how zarr works, people might just get frustrated that their data isn't compressed and conclude that the format isn't worth the trouble. I like the idea but I don't think a "pedagogical default" beats a "good default" here. My preference would be that we pick a solid compressor that offers good all-around performance, and I think Zstd (sans blosc) clears that bar.
> @rabernat can you expand a bit on why you think we should keep the same defaults?
Just for the sake of maintaining consistency and not having to make a decision. However, if there are serious drawbacks (as it appears there are), I'm fine with zstd.
Do we have a sense of the performance implications of this choice?
Here are blosc's internal benchmarks: https://www.blosc.org/posts/zstd-has-just-landed-in-blosc/
"Not the fastest, but a nicely balanced one" is a good summary. Basically, the default settings balance compression ratio and speed. Lower or negative levels provide more speed. Higher levels provide more compression ratio.
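The level tradeoff described above is easy to demonstrate with any level-parameterized codec. Since zstd isn't in the Python standard library, here is a stdlib `zlib` stand-in showing the same speed-vs-ratio dial (the data and levels are arbitrary illustrations):

```python
import zlib

# Compressible sample data: repetitive text.
data = b"zarr zarr zarr default compressor benchmark " * 2000

fast = zlib.compress(data, level=1)   # low level: favors speed
small = zlib.compress(data, level=9)  # high level: favors compression ratio

# On compressible data, the higher level should not produce larger output,
# and both should beat the uncompressed size.
assert len(small) <= len(fast) < len(data)
assert zlib.decompress(small) == data
```

The same dial exists for zstd (including its negative "fast" levels), which is why a mid-range default reads as "not the fastest, but nicely balanced".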
One note here that's relevant for https://github.com/zarr-developers/zarr-python/pull/2036 and https://github.com/pydata/xarray/issues/9515, the default codec can depend on the dtype of the array:
```python
# zarr-python 2.18.3
>>> g = zarr.group(store={})
>>> g.create(name="b", shape=(3,), dtype=str).filters
[VLenUTF8()]
```
Good point @TomAugspurger. In that case we probably should default to a string literal like `"auto"` for compressor / filters, and then use functions like `auto_compressor(dtype) -> Codec` / `auto_filters(dtype) -> list[Codec]` to transform `"auto"` into a concrete configuration, given the dtype (and whatever else we deem relevant).
Big 👍 to that idea.
We now have automatic detection of the ArrayBytes codec based on dtype:
Next step for this issue is to just add a default BytesBytes compressor.
Based on the discussion today, we landed on:

* for most dtypes, use `Bytes` + `Zstd`
* for string dtypes, `VLenBytesCodec`

Related to this is how to set the default. In 2.x, we told folks to set `zarr.storage.default_compressor`. Going forward, I'd like us to use `zarr.config` for this.
+1 for using Zstd or Zlib potentially as default compressors. It would be ideal to use a format that is standardized, has a strongly specified stream definition and is widely used and supported.
> Based on the discussion today, we landed on:
>
> * for most dtypes, use `Bytes` + `Zstd`
> * for string dtypes, `VLenBytesCodec`
>
> Related to this is how to set the default. In 2.x, we told folks to set `zarr.storage.default_compressor`. Going forward, I'd like us to use `zarr.config` for this.
Based on this, I started working on #2470. Feel free to leave a comment.
Zarr version: 3.0.0.alpha6
Numcodecs version: N/A
Python Version: N/A
Operating System: N/A
Installation: N/A
Description
In Zarr-Python 2.x, Zarr provided default compressors for most (all?) datatypes. As of now, in 3.0, we don't provide any defaults.
Steps to reproduce
In 2.18:
In 3.0.0.alpha6
Additional output
No response