zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io

Poor blosc compression ratios compared to v2 #2171

Open agoodm opened 1 month ago

agoodm commented 1 month ago

Zarr version

v3

Numcodecs version

n/a

Python Version

n/a

Operating System

n/a

Installation

n/a

Description

While playing around with v3 a bit, I noticed that the blosc codec wasn't compressing my sample data as well as I expected, and I have confirmed after multiple comparisons that I am getting different results between v3 and v2; in some cases v3 blosc-compressed chunks end up being 10-20x larger than in v2 (see below).

Steps to reproduce

import numpy as np
import zarr
import zarr.v2 as zarr_v2
from numcodecs import Blosc

a = np.random.rand(1000000)

# v3
codecs = [zarr.codecs.BytesCodec(), zarr.codecs.BloscCodec()]
z3 = zarr.array(a, chunks=(10000,), codecs=codecs)
v3_size = len(z3.store_path.store._store_dict["c/0"]) # 75136

# v2
compressor = Blosc("zstd") # Should be equivalent to default settings for v3 Blosc codec
z2 = zarr_v2.array(a, chunks=z3.chunks, compressor=compressor)
v2_size = len(z2.store["0"]) 

print(f"v2 compressed chunk size: {v2_size}") 
print(f"v3 compressed chunk size: {v3_size}") 
# v2 compressed chunk size: 70113
# v3 compressed chunk size: 75136

The difference isn't huge in this case, but it can be much more noticeable in others. For example:

b = np.arange(1000000)
z3[:] = b
v3_size = len(z3.store_path.store._store_dict["c/0"]) 

z2[:] = b
v2_size = len(z2.store["0"]) 

print(f"v2 compressed chunk size: {v2_size}") 
print(f"v3 compressed chunk size: {v3_size}") 

# v2 compressed chunk size: 1383
# v3 compressed chunk size: 11348

Cause and Possible Solution

In numcodecs, the blosc compressor is able to improve compression ratios by inferring the item size from the input buffer's numpy array dtype. But in v3, the blosc codec is implemented as a BytesBytesCodec and requires each chunk to be fed as bytes on encode (hence BytesCodec() is required in the list of codecs in my example), so numcodecs infers an item size of 1.
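For reference, a minimal sketch (not from the original report) of the numcodecs behavior described above: the same data compresses very differently depending on whether the codec sees a typed numpy array or raw bytes. The exact sizes will vary with the data.

import numpy as np
from numcodecs import Blosc

codec = Blosc(cname="zstd")  # default clevel=5, byte shuffle
a = np.arange(10000, dtype="float64")

# Array input: blosc infers typesize=8 from the dtype, so the shuffle is effective.
n_array = len(codec.encode(a))

# Bytes input: blosc sees typesize=1, so the byte shuffle is effectively a no-op.
n_bytes = len(codec.encode(a.tobytes()))

print(n_array, n_bytes)  # expect the bytes input to compress noticeably worse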

A simple fix for this is to make the following change in blosc.py:

    async def _encode_single(
        self,
        chunk_bytes: Buffer,
        chunk_spec: ArraySpec,
    ) -> Buffer | None:
        # Since blosc only supports host memory, we convert the input and output of the encoding
        # between numpy array and buffer
        return await to_thread(
            lambda chunk: chunk_spec.prototype.buffer.from_bytes(
                self._blosc_codec.encode(chunk.as_numpy_array().view(chunk_spec.dtype))
            ),
            chunk_bytes,
        )
Thoughts

I am still just getting started with v3, but this has made me curious about one thing. Why is the blosc codec implemented as a BytesBytesCodec rather than as an ArrayBytesCodec, considering that it can accept (and is optimized for) numpy array input? Although the above solution does work, I need to include the BytesCodec first when specifying my codecs in v3, so each chunk is first encoded into bytes and then viewed back as an array in its original dtype, making the bytes codec effectively a pointless no-op in this case.

TomAugspurger commented 1 month ago

Nice find.

Why is the blosc codec implemented as a BytesBytesCodec rather than as an ArrayBytesCodec

I'm also spinning up on codecs, but I wonder if it's related to a pipeline only being able to have one ArrayBytes codec, but many BytesBytes codecs? By writing it as a BytesBytesCodec it composes a bit better with other codecs.

That said... trying to view the bytes as chunk_spec.dtype will I think only work if the BloscCodec is the first BytesBytes codec (which is another way of saying that it could / should be an ArrayBytes codec). If you have [BytesCodec(), GzipCodec(), BloscCodec()] for whatever reason, the bytes after gzipping wouldn't be viewable as dtype.
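For illustration, a hedged sketch of the kind of pipeline Tom describes (assuming GzipCodec is available under zarr.codecs in this version): the bytes reaching blosc are gzip output, so reinterpreting them as the array dtype would be meaningless, and the view could even fail outright if the compressed length is not a multiple of the item size.

import numpy as np
import zarr

# Blosc is not the first bytes->bytes codec here; it receives gzip output,
# not raw array bytes, so a .view(chunk_spec.dtype) on its input would not
# recover anything resembling the original array layout.
codecs = [zarr.codecs.BytesCodec(), zarr.codecs.GzipCodec(), zarr.codecs.BloscCodec()]
z = zarr.array(np.random.rand(100000), chunks=(10000,), codecs=codecs)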

rabernat commented 1 month ago

If blosc is aware of the dtype, I think it makes sense to redefine it as an ArrayBytesCodec.

d-v-b commented 1 month ago

Would it be a solution if we define BloscArrayBytesCodec and BloscBytesBytesCodec as two separate codecs that both use blosc? The former takes an array, the latter takes bytes. Extremely inelegant, but maybe the best we can do without changing the v3 spec.

d-v-b commented 1 month ago

The basic problem is that arrays are a particular arrangement of bytes, so drawing a hard line between arrays and bytes will inevitably lead to situations like this, because they are not two separate categories.

rabernat commented 1 month ago

Extremely inelegant, but maybe the best we can do without changing the v3 spec.

I mean, if a consequence of strictly following the V3 spec is that users experience a 20x degradation in compression rates, then it seems like we should change the spec, no?

drawing a hard line between arrays and bytes will inevitably lead to situations like this, because they are not two separate categories.

All array buffers can be interpreted as bytes. The reverse is not true. To me, an ArrayBytesCodec gets additional metadata about how to interpret the raw bytes as an array. That sounds like exactly what we want here.

d-v-b commented 1 month ago

I mean, if a consequence of strictly following the V3 spec is that users experience a 20x degradation in compression rates, then it seems like we should change the spec, no?

Believe me, I'm not opposed to changing the v3 spec :) But if we reclassify blosc as an ArrayBytesCodec, then it must occur at the array -> bytes boundary in the list of codecs, which means blosc can no longer be used as a BytesBytes codec.

Maybe this is OK if the performance is always terrible in this situation, but I thought blosc was designed to work on arbitrary binary data, including arrays. As I understand it, typesize only has an effect if shuffling is used, but shuffling is itself a free parameter. This means blosc is sometimes a BytesBytesCodec, and other times an ArrayBytesCodec. Hence my point that the array / bytes distinction is blurry. Would love to hear from a blosc expert here.

TomAugspurger commented 1 month ago

I think having both an ArrayBytesBloscCodec and BytesBytesBloscCodec is fine.

The usability issues can be mitigated a bit by having good defaults. In Zarr v2 I typically didn't dig into the encoding / compressor stuff. If we can match that behavior by setting the default codec pipeline to include a BloscArrayBytesCodec first, I think most users will be fine, and advanced users can dig into the thorny details if they want.

d-v-b commented 1 month ago

If we make a version of blosc that fits in the ArrayBytes category, then there are a few considerations to be mindful of:

LDeakin commented 1 month ago

What about just automatically setting the typesize of the blosc codec if it is unspecified and identified as following the bytes codec? This is what the spec already suggests:

Zarr implementations MAY allow users to leave this unspecified and have the implementation choose a value automatically based on the array data type and previous codecs in the chain

I don't think an array->bytes blosc codec is a great idea, because it would then need to take on the role of the bytes codec for fixed-size data types (and need an endian parameter as @d-v-b noted). This would worsen with variable-sized data types in the future. Too much responsibility on one codec.

If so, do all non-sharding ArrayBytesCodec instances take an endian parameter?

Potential array->bytes codecs like zfp and pcodec do not need an endian parameter.
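As a rough illustration of the automatic-typesize idea above, here is a minimal sketch (not the actual zarr-python code; resolve_typesize is a hypothetical helper):

import numpy as np

def resolve_typesize(configured_typesize, dtype):
    # Fall back to the item size of the array's dtype when the user did not
    # set typesize explicitly and blosc directly follows the bytes codec.
    return configured_typesize if configured_typesize else dtype.itemsize

print(resolve_typesize(None, np.dtype("float64")))  # 8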

agoodm commented 1 month ago

Happy to see this garnered a lot of interesting discussion! Here are my own thoughts on some points raised here:

I think having both an ArrayBytesBloscCodec and BytesBytesBloscCodec is fine.

The usability issues can be mitigated a bit by having good defaults. In Zarr v2 I typically didn't dig into the encoding / compressor stuff. If we can match that behavior by setting the default codec pipeline to include a BloscArrayBytesCodec first, I think most users will be fine, and advanced users can dig into the thorny details if they want.

Even if we switch to making Blosc the default codec in v3 (as it is in v2), there are still many times I find myself needing to specify codecs manually when using Blosc since it's a meta-compressor with many different options. So in the situation where I do need to use a configuration of blosc that's different from the default, having two flavors will add some confusion from the end-user perspective. And for the maintainers of this library, that means you now have the weight of two separate implementations which fundamentally do the same thing. As far as the actual blosc compressor is concerned, the only real difference between the two is that one infers the typesize and the other doesn't, so it seems kind of overkill.

Should the BytesBytesBloscCodec allow shuffling? If so, then it's de facto an ArrayBytesCodec. If not, then we are not really using blosc effectively.

I don't see why not. Array inputs are ideal for shuffling since the typesize gets inferred, but with the bit shuffle filter you can still see marginal improvements in the compression ratio even for bytes input.

What about just automatically setting the typesize of the blosc codec if it is unspecified and identified as following the bytes codec? This is what the spec already suggests:

Right now, from what I can see, setting typesize in the BloscCodec constructor does nothing as far as encode/decode is concerned, but on inspection it already automatically infers the correct value from the input dtype. The issue is that the numcodecs wrapper around blosc only infers the typesize correctly when the input is an array; otherwise it assumes a typesize of 1. So perhaps a good first solution would be to modify the numcodecs blosc wrapper to allow overriding the value of typesize, as the underlying python-blosc library already does.
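To make that proposal concrete, here is a hypothetical sketch of what forwarding an explicit typesize would look like at the python-blosc level (blosc_encode is an illustrative helper, not the numcodecs API):

import blosc  # python-blosc

def blosc_encode(buf, typesize, cname="zstd", clevel=5, shuffle=blosc.SHUFFLE):
    # python-blosc accepts typesize directly; the numcodecs wrapper would
    # need to forward a user-specified value instead of always inferring it
    # from the input buffer's itemsize.
    return blosc.compress(buf, typesize=typesize, cname=cname,
                          clevel=clevel, shuffle=shuffle)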

I don't think an array->bytes blosc codec is a great idea, because it would then need to take on the role of the bytes codec for fixed-size data types (and need an endian parameter as @d-v-b noted). This would worsen with variable-sized data types (https://github.com/zarr-developers/zeps/pull/47#issuecomment-1710505141) in the future. Too much responsibility on one codec.

My personal feeling is that keeping blosc as a BytesBytesCodec still seems like a bit of an interface regression. As a user of v2, I never had to think very hard when using the blosc codec, since it seamlessly worked whether the input was an array or raw bytes. One remaining pet peeve would be the need to always manually specify a BytesCodec at the start of a codec pipeline, which is kind of cumbersome. And on top of this, internally you are still doing a conversion from array to bytes and adding that extra bit of unnecessary processing overhead, however slight.

mkitti commented 1 month ago

The main reason Blosc needs to know the dtype is for byte or bitshuffling.

Under zarr-python as used for Zarr v3 above, if shuffle is not set and the dtype is a single byte, then BloscCodec defaults to bit shuffle. For larger dtypes, byte shuffle is used. https://github.com/zarr-developers/zarr-python/blob/726fdfbf569c144310893440a40ee8ee05e6524e/src/zarr/codecs/blosc.py#L142-L146

Meanwhile, numcodecs will default to byte shuffle (just called shuffle) for Zarr v2: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/blosc.pyx#L548

mkitti commented 1 month ago

The 11348 compressed length corresponds to compressing with no effective shuffle: either typesize=1 with no shuffle or byte shuffle (a no-op at that typesize), or typesize=8 with no shuffle.

A compressed length of 1383 corresponds to compressing with typesize=8 and byte shuffle:

In [1]: import blosc, numpy as np

In [2]: c = np.arange(10000, dtype="float64")

In [3]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.NOSHUFFLE))
Out[3]: 11348

In [4]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.SHUFFLE))
Out[4]: 11348

In [5]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.BITSHUFFLE))
Out[5]: 979

In [6]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.NOSHUFFLE))
Out[6]: 11348

In [7]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.SHUFFLE))
Out[7]: 1383

In [8]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.BITSHUFFLE))
Out[8]: 391

d-v-b commented 1 month ago

Does the typesize parameter have to correspond to the size of the elements of an array? For example, when compressing arrays with variable length types (like strings) with blosc, it might make sense to pick a typesize, even though the size of the array dtype is undefined. To phrase it differently, requiring that the blosc typesize be set to the "size" of a dtype will fail for variable length types.

mkitti commented 1 month ago

Does the typesize parameter have to correspond to the size of the elements of an array? For example, when compressing arrays with variable length types (like strings) with blosc, it might make sense to pick a typesize, even though the size of the array dtype is undefined.

Technically, typesize could be anything. It does not have to correspond to the size of the elements of the array. I'm unclear what might happen if the typesize does not evenly divide the number of bytes in the array, though. What typesize does here is parameterize the shuffle.

Using a pre-filter such as shuffle only makes sense if you are providing information that the compressor does not already have.

To phrase it differently, requiring that the blosc typesize be set to the "size" of a dtype will fail for variable length types.

I have no idea why you would try to shuffle variable length types.

d-v-b commented 1 month ago

I have no idea why you would try to shuffle variable length types.

What if the variable length type was something like list[uint32]? I.e., a variable-length collection of some numeric type.

mkitti commented 1 month ago

Are you expecting some correlation to still occur every four bytes in that case? Are the values in the list related somehow?

d-v-b commented 1 month ago

Are you expecting some correlation to still occur every four bytes in that case? Are the values in the list related somehow?

Yes that's what I'm thinking. Suppose you have a data type that's a variable-length collection of points in space, and the points tend to have similar values. Would shuffling be useful here? Honest question, since I don't know much about it.

mkitti commented 1 month ago

Consider a simple run length encoding (RLE) compression scheme where I first provide the length of a run and then the content of the run.

For example, [0x1 0x1 0x1 0x1 0x2 0x2 0x2] would be encoded as [0x4 0x1 0x3 0x2] since there is a run of 4 0x1s followed by 3 0x2s. Terrific, I have encoded 7 bytes into 4 bytes.

Now imagine I had the byte sequence [0x1 0x0 0x0 0x0 0x2 0x0 0x0 0x0]. The encoding would be [0x1 0x1 0x3 0x0 0x1 0x2 0x3 0x0]. Now I have encoded 8 bytes into ... 8 bytes: no compression.

Now let's say I shuffle the bytes as follows: [0x1 0x2 0x0 0x0 0x0 0x0 0x0 0x0]. These are the same bytes, just in a different order. I encode them as [0x1 0x1 0x1 0x2 0x6 0x0]. Now I have some compression again. In this case, the variable byte occurred every 4 bytes since these were 32-bit numbers, and the numbers were of similar magnitude. We took advantage of the fact that the higher-order bytes were not changing.

If the bytes were [0x1 0x1 0x1 0x1 0x2 0x2 0x2 0x2], I could encode this as [0x4 0x1 0x4 0x2]. If I shuffled the bytes to be [0x1 0x2 0x1 0x2 0x1 0x2 0x1 0x2], then RLE is a disaster. There would be 16 encoded bytes, twice the number we started with. Shuffling the bytes made the compression worse!

All compressors use RLE at some stage, so shuffling can be really useful if you expect runs of multibyte values that are related. By putting the rarely changing higher-order bytes together, you can increase runs for encoding. LZ4 is just a fancy RLE encoder, so shuffling really helps if the numbers are of similar magnitude. Zstd also has an entropy coding stage, so shuffling does not help as much.

Thus you probably want to shuffle if you know your numbers are correlated somehow, as in an image. If your numbers are random, shuffling might not help. If your numbers fluctuate across magnitudes, then it is quite possible that shuffling could hurt.
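The shuffling argument above can be reproduced with a toy sketch in plain Python (not blosc itself; shuffle_bytes and rle are illustrative helpers):

import numpy as np

def shuffle_bytes(arr):
    # Blosc-style byte shuffle: group byte 0 of every element, then byte 1, etc.
    return bytes(arr.view(np.uint8).reshape(-1, arr.itemsize).T.ravel())

def rle(data):
    # Naive run-length encoding into (run length, value) pairs.
    runs = []
    for b in data:
        if runs and runs[-1][1] == b:
            runs[-1] = (runs[-1][0] + 1, b)
        else:
            runs.append((1, b))
    return runs

x = np.array([1, 2], dtype="<u4")   # raw bytes: 01 00 00 00 02 00 00 00
print(len(rle(x.tobytes())))        # 4 runs -> 8 encoded bytes
print(len(rle(shuffle_bytes(x))))   # 3 runs -> 6 encoded bytes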

d-v-b commented 1 month ago

Thanks for that explanation. My takeaway then remains the same: the typesize attribute of the blosc codec is not perfectly equivalent to the dtype of an N-dimensional array, and so forcing blosc to be an array -> bytes codec would limit the utility of blosc.

mkitti commented 1 month ago

There is utility in an array-to-bytes codec in that (1) in many scenarios typesize = sizeof(dtype) is a useful default, and (2) this would match the Zarr v2 behavior.

There is also utility in a bytes-to-bytes codec where typesize is a free parameter.

All of this is content-dependent. Setting defaults here assumes the content follows certain patterns, which it may not. I would discourage making assumptions about other people's content.

normanrz commented 1 day ago

What about just automatically setting the typesize of the blosc codec if it is unspecified and identified as following the bytes codec? This is what the spec already suggests:

This is actually implemented already: https://github.com/zarr-developers/zarr-python/blob/680142f6f370d862f305a9b37f3c0ce2ce802e76/src/zarr/codecs/blosc.py#L141-L142

Meanwhile, numcodecs will default to byte shuffle (just called shuffle) for Zarr v2: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/blosc.pyx#L548

Should we just default to shuffle to match the behavior in v2?
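In the meantime, a user who wants the v2-like behavior can presumably request byte shuffle explicitly. A minimal sketch, assuming BloscCodec accepts the shuffle names from the v3 spec as strings:

import numpy as np
import zarr

# Explicitly ask for byte shuffle (the numcodecs / Zarr v2 default) instead
# of relying on the v3 default selection.
codecs = [zarr.codecs.BytesCodec(), zarr.codecs.BloscCodec(cname="zstd", shuffle="shuffle")]
z = zarr.array(np.arange(1000000), chunks=(10000,), codecs=codecs)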