zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

[v3] default compressor / codec pipeline #2267

Open jhamman opened 1 month ago

jhamman commented 1 month ago

Zarr version

3.0.0.alpha6

Numcodecs version

N/A

Python Version

N/A

Operating System

N/A

Installation

N/A

Description

In Zarr-Python 2.x, Zarr provided default compressors for most (all?) datatypes. As of now, in 3.0, we don't provide any defaults.

Steps to reproduce

In 2.18:

In [1]: import zarr

In [2]: arr = zarr.create(shape=(10, ), path='foo', store={})

In [3]: arr.compressor
Out[3]: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)

In 3.0.0.alpha6

In [4]: arr = zarr.create(shape=(10, ), path='foo', store={}, zarr_format=2)

In [5]: arr.metadata
Out[5]: ArrayV2Metadata(shape=(10,), fill_value=0.0, chunk_grid=RegularChunkGrid(chunk_shape=(10,)), attributes={}, zarr_format=2, data_type=dtype('float64'), order='C', filters=None, dimension_separator='.', compressor=None)

Additional output

No response

d-v-b commented 1 month ago

thoughts on defaulting to Zstandard?

zoj613 commented 1 month ago

Does compressor mean "codec" in the v3 example? If so, then I think the "default" codec is implicitly the bytes codec since the codec pipeline of v3 requires that exactly one array->bytes codec must be present in the pipeline.
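The pipeline rule cited here can be sketched in a few lines. This is an illustration of the constraint only, not the real zarr-python API: codec kinds are modeled as plain strings.

```python
# Sketch of the zarr v3 pipeline rule: a codec pipeline must contain exactly
# one array->bytes codec. Codecs are modeled as (name, kind) pairs, where kind
# is one of 'array->array', 'array->bytes', or 'bytes->bytes'.

def validate_pipeline(codecs: list[tuple[str, str]]) -> int:
    n = sum(1 for _name, kind in codecs if kind == "array->bytes")
    if n != 1:
        raise ValueError(f"expected exactly one array->bytes codec, found {n}")
    return n

# The implicit default pipeline: just the bytes codec.
validate_pipeline([("bytes", "array->bytes")])
```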

d-v-b commented 1 month ago

the v3 example is creating a zarr v2 array, so the "compressor" concept still applies there.

but it's also worth considering what the default codec pipeline should be when creating zarr v3 arrays. If it was easy to specify, then I think defaulting to a sharding codec with an inner chunk size equal to the outer chunk size (i.e., a minimal shard) would actually be a good default, but would have to see how this looks.
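For concreteness, a "minimal shard" could look something like the following JSON-style codec metadata, where the sharding codec's inner chunk shape equals the array's chunk (shard) shape, so each shard holds exactly one inner chunk. Field names follow the zarr v3 `sharding_indexed` codec spec; the concrete shapes are made up.

```python
# Hypothetical metadata for a "minimal shard": inner chunks == outer chunks.
chunk_shape = [1000]  # illustrative shard shape

minimal_shard = {
    "name": "sharding_indexed",
    "configuration": {
        "chunk_shape": chunk_shape,  # same as the array's chunk (shard) shape
        "codecs": [
            {"name": "bytes", "configuration": {"endian": "little"}},
        ],
        "index_codecs": [
            {"name": "bytes", "configuration": {"endian": "little"}},
            {"name": "crc32c"},
        ],
    },
}
```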

rabernat commented 1 month ago

I think we should make the defaults exactly the same as they are in v2.

I do not think we should make sharding the default until we have spent more time optimizing and debugging it.

d-v-b commented 1 month ago

I think we should make the defaults exactly the same as they are in v2.

The default compressor in v2 was Blosc() with no arguments if Blosc was available, falling back to Zlib otherwise. Blosc itself takes parameters to determine which compression scheme to use, so I think leaving everything at its default in v2 is a mistake that we don't want to repeat.

I'd be fine with keeping Blosc as the default in zarr-python v3 but we should explicitly configure it, and if we are doing that it might be worth considering the pros / cons of using something other than the numcodecs default blosc configuration. In numcodecs, the default configuration for Blosc uses the lz4 compressor, which iirc is rather fast but doesn't compress well. I understand the urge to preserve v2 behavior, but on the other hand if lz4 has significantly worse performance than another blosc compressor (e.g., zstd), then we should consider the value of giving zarr users a good default.

cc @mkitti

mkitti commented 1 month ago

Before defaulting to Blosc in Zarr v3, we should really fix the issue in https://github.com/zarr-developers/zarr-python/issues/2171. That is, there should probably be an ArrayBytesBloscCodec that can actually transmit dtype / typesize information correctly to Blosc. Perhaps one could follow the zarr-java implementation and default the typesize to that of the dtype.

Also, continuing to encode new data with Blosc v1 is unwise given the current stage-of-life of that package. Blosc v1 is in what I will term "community maintenance mode". Unless you are actively thinking about it, I would not assume that anyone is actively maintaining the package.

My recommendations from most favored to least are:

  1. No default compression. Picking a particular compressor now is probably not going to age well. Users should make explicit choices about compression given what they know about their data.
  2. Zstandard. Zstandard is designed to be a flexible compressor that can handle a wide variety of data. The compressor is widely available.
  3. Blosc v2. Blosc v2 has a superior msgpack-based contiguous frame format vs Blosc v1. It is also seeing active development, including the delta and bytedelta filters as well as improved SIMD support.

dstansby commented 1 month ago

I would be reluctant to pick Blosc, given it isn't very actively maintained and the blosc maintainers would rather folks transitioned to Blosc2. In this context I think we have a responsibility to move away from providing Blosc (version 1) as the default compressor, and zarr-python v3 seems like a good point to do that.

I'd be 👍 to defaulting to no compression. This would then force users to learn about compression and choose a compressor that works well for them.

dstansby commented 1 month ago

I think we should make the defaults exactly the same as they are in v2.

@rabernat can you expand a bit on why you think we should keep the same defaults?

d-v-b commented 1 month ago

This would then force users to learn about compression and choose a compressor that works well for them.

Instead of learning how zarr works, people might just get frustrated that their data isn't compressed and conclude that the format isn't worth the trouble. I like the idea but I don't think a "pedagogical default" beats a "good default" here. My preference would be that we pick a solid compressor that offers good all-around performance, and I think Zstd (sans blosc) clears that bar.

rabernat commented 1 month ago

@rabernat can you expand a bit on why you think we should keep the same defaults?

Just for the sake of maintaining consistency and not having to make a decision. However, if there are serious drawbacks (as it appears there are), I'm fine with zstd.

Do we have a sense of the performance implications of this choice?

mkitti commented 1 month ago

Here are blosc's internal benchmarks: https://www.blosc.org/posts/zstd-has-just-landed-in-blosc/

"Not the fastest, but a nicely balanced one" is a good summary. Basically, the default settings balance compression ratio and speed. Lower or negative levels provide more speed. Higher levels provide more compression ratio.
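The level trade-off described here is easy to see with a quick sketch. Using stdlib zlib as a stand-in (Zstd's negative "fast" levels have no zlib equivalent, so this only illustrates the general shape of the trade-off):

```python
# Higher compression levels trade speed for compression ratio.
import zlib

# Repetitive bytes standing in for typical compressible array data.
data = b"0123456789" * 10_000

sizes = {level: len(zlib.compress(data, level)) for level in (1, 6, 9)}
for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({len(data) / size:.1f}x)")
```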

TomAugspurger commented 1 month ago

One note here that's relevant for https://github.com/zarr-developers/zarr-python/pull/2036 and https://github.com/pydata/xarray/issues/9515, the default codec can depend on the dtype of the array:

# zarr-python 2.18.3
>>> g = zarr.group(store={})
>>> g.create(name="b", shape=(3,), dtype=str).filters
[VLenUTF8()]

d-v-b commented 1 month ago

good point @TomAugspurger. In that case we probably should default to a string literal like "auto" for compressor / filters, and then use functions like auto_compressor(dtype) -> Codec / auto_filters(dtype) -> list[Codec] to transform "auto" into a concrete configuration, given the dtype (and whatever else we deem relevant)
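A rough sketch of that dispatch, with the function names taken from the comment above but the codec values as placeholder strings rather than real zarr-python codec objects:

```python
# Hypothetical "auto" sentinel resolution: at array-creation time, the literal
# "auto" is replaced with a concrete configuration based on the dtype.
from typing import Any

def auto_compressor(dtype: str) -> Any:
    # A general-purpose bytes->bytes compressor for fixed-width dtypes;
    # variable-length strings are handled via filters instead.
    return None if dtype == "str" else "zstd"

def auto_filters(dtype: str) -> list[Any]:
    # Mirrors the v2 behavior shown above, where string dtypes got VLenUTF8.
    return ["vlen-utf8"] if dtype == "str" else []

def resolve_compressor(compressor: Any, dtype: str) -> Any:
    return auto_compressor(dtype) if compressor == "auto" else compressor
```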

rabernat commented 1 month ago

Big 👍 to that idea.

rabernat commented 1 month ago

We now have automatic detection of the ArrayBytes codec based on dtype:

https://github.com/zarr-developers/zarr-python/blob/395604dcc86b9154ea673d15d9c9bb9a74db1538/src/zarr/codecs/__init__.py#L37-L46

Next step for this issue is to just add a default BytesBytes compressor.

jhamman commented 3 weeks ago

Based on the discussion today, we landed on:

* for most dtypes, use `Bytes` + `Zstd`
* for string dtypes, `VLenBytesCodec`

Related to this is how to set the default. In 2.x, we told folks to set zarr.storage.default_compressor. Going forward, I'd like us to use zarr.config for this.

akshaysubr commented 3 weeks ago

+1 for using Zstd or Zlib potentially as default compressors. It would be ideal to use a format that is standardized, has a strongly specified stream definition and is widely used and supported.

brokkoli71 commented 6 days ago

Based on the discussion today, we landed on:

* for most dtypes, use `Bytes` + `Zstd`
* for string dtypes, `VLenBytesCodec`

Related to this is how to set the default. In 2.x, we told folks to set zarr.storage.default_compressor. Going forward, I'd like us to use zarr.config for this.

Based on this, I started working on #2470. Feel free to leave a comment.