Open agoodm opened 1 month ago
Nice find.
> Why is the blosc codec implemented as a `BytesBytesCodec` rather than as an `ArrayBytesCodec`?

I'm also spinning up on codecs, but I wonder if it's related to a pipeline only being able to have one ArrayBytes codec, but many BytesBytes codecs? By writing it as a `BytesBytesCodec`, it composes a bit better with other codecs.
That said... trying to view the bytes as `chunk_spec.dtype` will, I think, only work if the `BloscCodec` is the first BytesBytes codec (which is another way of saying that it could / should be an ArrayBytes codec). If you have `[BytesCodec(), GzipCodec(), BloscCodec()]` for whatever reason, the bytes after gzipping wouldn't be viewable as `dtype`.
If blosc is aware of the dtype, I think it makes sense to redefine it as an ArrayBytesCodec.
Is it a solution if we define `BloscArrayBytesCodec` and `BloscBytesBytesCodec` as two separate codecs that both use blosc? The former takes an array, the latter takes bytes. Extremely inelegant, but maybe the best we can do without changing the v3 spec.
The basic problem is that arrays are a particular arrangement of bytes, so drawing a hard line between arrays and bytes will inevitably lead to situations like this: they are not two separate categories.
> Extremely inelegant, but maybe the best we can do without changing the v3 spec.
I mean, if a consequence of strictly following the V3 spec is that users experience a 20x degradation in compression rates, then it seems like we should change the spec, no?
> drawing a hard line between arrays and bytes will inevitably lead to situations like this, because they are not two separate categories.
All array buffers can be interpreted as bytes. The reverse is not true. To me, an ArrayBytesCodec gets additional metadata about how to interpret the raw bytes as an array. That sounds like exactly what we want here.
> I mean, if a consequence of strictly following the V3 spec is that users experience a 20x degradation in compression rates, then it seems like we should change the spec, no?
Believe me, I'm not opposed to changing the v3 spec :) But if we reclassify blosc as an `ArrayBytesCodec`, then it must occur at the array -> bytes boundary in the list of codecs, which means blosc can no longer be used as a BytesBytes codec.
Maybe this is OK if the performance is always terrible in this situation, but I thought blosc was designed to work on arbitrary binary data, including arrays. As I understand it, `typesize` only has an effect if shuffling is used, but shuffling is itself a free parameter. This means blosc is sometimes a `BytesBytesCodec`, and other times an `ArrayBytesCodec`. Hence my point that the array / bytes distinction is blurry. Would love to hear from a blosc expert here.
I think having both an `ArrayBytesBloscCodec` and a `BytesBytesBloscCodec` is fine.

The usability issues can be mitigated a bit by having good defaults. In Zarr v2 I typically didn't dig into the encoding / compressor stuff. If we can match that behavior by setting the default codec pipeline to include a `BloscArrayBytesCodec` first, I think most users will be fine, and advanced users can dig into the thorny details if they want.
If we make a version of blosc that fits in the ArrayBytes category, then there are a few considerations to be mindful of:

- `ArrayBytesBlosc` would probably have to have shuffling enabled unconditionally. Should `BytesBytesBloscCodec` allow shuffling? If so, then it's de facto an `ArrayBytesCodec`. If not, then we are not really using blosc effectively.
- `BytesCodec` (the only non-sharding `ArrayBytesCodec`) takes a required `endian` parameter, which specifies the endianness of the output bytes for multi-byte dtypes. Would we need to add this parameter to the `ArrayBytesBlosc` codec? If so, do all non-sharding `ArrayBytesCodec` instances take an `endian` parameter?

What about just automatically setting the `typesize` of the blosc codec if it is unspecified and identified as following the `bytes` codec? This is what the spec already suggests:
> Zarr implementations MAY allow users to leave this unspecified and have the implementation choose a value automatically based on the array data type and previous codecs in the chain
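A minimal sketch of what that auto-inference could look like (`resolve_typesize` is a hypothetical helper for illustration, not zarr-python's actual API):

```python
import numpy as np

# Hypothetical helper: choose blosc's typesize automatically when the
# user left it unspecified, falling back to the array dtype's item size,
# which is the behavior the spec passage above permits.
def resolve_typesize(typesize, dtype):
    if not typesize:  # None or 0 means "unspecified"
        return np.dtype(dtype).itemsize
    return typesize

resolve_typesize(None, "float64")  # falls back to 8
resolve_typesize(2, "float64")     # an explicit value wins
```

An explicitly provided value is always respected, so users who want a nonstandard `typesize` can still set one.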
I don't think an `array -> bytes` blosc codec is a great idea, because it would then need to take on the role of the `bytes` codec for fixed-size data types (and need an `endian` parameter as @d-v-b noted). This would worsen with variable-sized data types in the future. Too much responsibility on one codec.
> If so, do all non-sharding ArrayBytesCodec instances take an endian parameter?
Potential `array -> bytes` codecs like `zfp` and `pcodec` do not need an `endian` parameter.
Happy to see this garnered a lot of interesting discussion! Here are my own thoughts on some points raised here:
> I think having both an `ArrayBytesBloscCodec` and a `BytesBytesBloscCodec` is fine.
>
> The usability issues can be mitigated a bit by having good defaults. In Zarr v2 I typically didn't dig into the encoding / compressor stuff. If we can match that behavior by setting the default codec pipeline to include a `BloscArrayBytesCodec` first, I think most users will be fine, and advanced users can dig into the thorny details if they want.
Even if we switch to making Blosc the default codec in v3 (as it is in v2), there are still many times I find myself needing to specify codecs manually when using Blosc since it's a meta-compressor with many different options. So in the situation where I do need to use a configuration of blosc that's different from the default, having two flavors will add some confusion from the end-user perspective. And for the maintainers of this library, that means you now have the weight of two separate implementations which fundamentally do the same thing. As far as the actual blosc compressor is concerned, the only real difference between the two is that one infers the typesize and the other doesn't, so it seems kind of overkill.
> Should the `BytesBytesBloscCodec` allow shuffling? If so, then it's de facto an `ArrayBytesCodec`. If not, then we are not really using blosc effectively.
I don't see why not. Array inputs are ideal for shuffling since the typesize gets inferred, but with the bit shuffle filter you can still see marginal improvements in the compression ratio even for bytes input.
> What about just automatically setting the `typesize` of the blosc codec if it is unspecified and identified as following the `bytes` codec? This is what the spec already suggests:
Right now, from what I can see, setting `typesize` in the `BloscCodec` constructor does nothing as far as encode/decode is concerned, but on inspection it already automatically infers the correct value from the input dtype. The issue is that the numcodecs wrapper to blosc only infers the typesize correctly when the input is an array; otherwise it assumes a typesize of 1. So perhaps a good first solution would be to modify the numcodecs blosc wrapper to override the value of `typesize`, as this is what the python-blosc library already does.
> I don't think an `array -> bytes` blosc codec is a great idea, because it would then need to take on the role of the `bytes` codec for fixed-size data types (and need an `endian` parameter as @d-v-b noted). This would worsen with https://github.com/zarr-developers/zeps/pull/47#issuecomment-1710505141 in the future. Too much responsibility on one codec.
My personal feeling is that keeping blosc as a `BytesBytesCodec` still seems like a bit of an interface regression. As a user of v2, I never had to think really hard when using the blosc codec, since it seamlessly worked whether the input was an array or raw bytes. One of my remaining pet peeves would be the need to always manually specify a `BytesCodec` at the start of a codec pipeline, which is kind of cumbersome. And on top of this, internally you are still doing a conversion from array to bytes and adding that extra bit of unnecessary processing overhead, however slight.
The main reason Blosc needs to know the dtype is for byte or bitshuffling.
Under Zarr-python as used for Zarr v3 above, if shuffle is not set and the dtype is a single byte, then `BloscCodec` will default to bitshuffling 8 bits. For larger dtypes, byte shuffling is used. https://github.com/zarr-developers/zarr-python/blob/726fdfbf569c144310893440a40ee8ee05e6524e/src/zarr/codecs/blosc.py#L142-L146
Meanwhile, numcodecs will default to byte shuffle (just called shuffle) for Zarr v2: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/blosc.pyx#L548
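The two defaults just described can be summarized as a sketch (my paraphrase of the linked code, not the actual implementations):

```python
# Default shuffle selection, paraphrased from the behavior described
# above. zarr-python v3: bitshuffle for 1-byte dtypes, byte shuffle
# otherwise. numcodecs / Zarr v2: always byte shuffle.
def zarr_v3_default_shuffle(itemsize: int) -> str:
    return "bitshuffle" if itemsize == 1 else "shuffle"

def zarr_v2_default_shuffle(itemsize: int) -> str:
    return "shuffle"  # byte shuffle, regardless of itemsize

zarr_v3_default_shuffle(1)  # "bitshuffle"
zarr_v3_default_shuffle(8)  # "shuffle"
```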
- The `11348` compressed length corresponds to compressing with either `typesize=1` (with no shuffle or byte shuffle) OR `typesize=8` with no shuffle.
- A compressed length of `1383` corresponds to compressing with `typesize=8` with byte shuffle.

```python
In [1]: import blosc, numpy as np

In [2]: c = np.arange(10000, dtype="float64")

In [3]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.NOSHUFFLE))
Out[3]: 11348

In [4]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.SHUFFLE))
Out[4]: 11348

In [5]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.BITSHUFFLE))
Out[5]: 979

In [6]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.NOSHUFFLE))
Out[6]: 11348

In [7]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.SHUFFLE))
Out[7]: 1383

In [8]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.BITSHUFFLE))
Out[8]: 391
```
Does the `typesize` parameter have to correspond to the size of the elements of an array? For example, when compressing arrays with variable-length types (like strings) with blosc, it might make sense to pick a `typesize`, even though the size of the array dtype is undefined. To phrase it differently, requiring that the blosc `typesize` be set to the "size" of a dtype will fail for variable-length types.
> Does the `typesize` parameter have to correspond to the size of the elements of an array? For example, when compressing arrays with variable length types (like strings) with blosc, it might make sense to pick a `typesize`, even though the size of the array dtype is undefined.
Technically, `typesize` could be anything. It does not have to correspond to the size of the elements of the array. I'm unclear what might happen if the `typesize` does not evenly divide the number of bytes in the array, though. What `typesize` does here is parameterize the shuffle.
Using a pre-filter such as shuffle only makes sense if you are providing information that the compressor does not already have.
> To phrase it differently, requiring that the blosc typesize be set to the "size" of a dtype will fail for variable length types.
I have no idea why you would try to shuffle variable length types.
> I have no idea why you would try to shuffle variable length types.
What if the variable-length type was something like `list[uint32]`? I.e., a variable-length collection of some numeric type.
Are you expecting some correlation to still occur every four bytes in that case? Are the values in the list related somehow?
> Are you expecting some correlation to still occur every four bytes in that case? Are the values in the list related somehow?
Yes that's what I'm thinking. Suppose you have a data type that's a variable-length collection of points in space, and the points tend to have similar values. Would shuffling be useful here? Honest question, since I don't know much about it.
Consider a simple run length encoding (RLE) compression scheme where I first provide the length of a run and then the content of the run.
For example, `[0x1 0x1 0x1 0x1 0x2 0x2 0x2]` would be encoded as `[0x4 0x1 0x3 0x2]`, since there is a run of four `0x1`s followed by three `0x2`s. Terrific, I have encoded 7 bytes into 4 bytes.
Now imagine I had the byte sequence `[0x1 0x0 0x0 0x0 0x2 0x0 0x0 0x0]`. The encoding would be `[0x1 0x1 0x3 0x0 0x1 0x2 0x3 0x0]`. Now I have encoded 8 bytes into... 8 bytes: no compression.
Now let's say I shuffle the bytes as follows: `[0x1 0x2 0x0 0x0 0x0 0x0 0x0 0x0]`. These are the same bytes, but in a different order. I encode them as `[0x1 0x1 0x1 0x2 0x6 0x0]`. Now I have some compression again. In this case, the variable byte occurred every 4 bytes since these were 32-bit numbers, and the numbers were of similar magnitude. We took advantage of the fact that the higher-order bytes were not changing.
If the bytes were `[0x1 0x1 0x1 0x1 0x2 0x2 0x2 0x2]`, I could encode this as `[0x4 0x1 0x4 0x2]`. If I shuffled the bytes to be `[0x1 0x2 0x1 0x2 0x1 0x2 0x1 0x2]`, then RLE is a disaster: there would be 16 encoded bytes, twice the number we started with. Shuffling the bytes made the compression worse!
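The walkthrough above can be reproduced with a toy RLE encoder (a sketch for illustration only; real compressors are far more involved):

```python
from itertools import groupby

# Toy byte-level RLE matching the scheme above: emit (run length, value)
# pairs. Runs longer than 255 bytes are out of scope for this sketch.
def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    for value, run in groupby(data):
        out += bytes([len(list(run)), value])
    return bytes(out)

rle_encode(bytes([0x1] * 4 + [0x2] * 3))            # 7 bytes -> 4 bytes
rle_encode(bytes([0x1, 0, 0, 0, 0x2, 0, 0, 0]))     # 8 bytes -> 8 bytes
rle_encode(bytes([0x1, 0x2, 0, 0, 0, 0, 0, 0]))     # shuffled: 8 -> 6 bytes
rle_encode(bytes([0x1, 0x2] * 4))                    # bad shuffle: 8 -> 16 bytes
```

The four calls reproduce the byte counts from the walkthrough: shuffling helps when it lengthens runs and hurts when it breaks them up.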
All compressors use RLE at some stage, so shuffling can be really useful if you expect runs of multibyte values that are related. By putting the rarely changing higher-order bytes together, you can increase run lengths for encoding. LZ4 is essentially a fancy RLE encoder, so shuffling really helps if the numbers are of similar magnitude. Zstd also has an entropy-coding stage, so shuffling does not have as much of an effect there.
Thus you probably want to shuffle if you know your numbers are correlated somehow, as in an image. If your numbers are random, shuffling might not help. If your numbers fluctuate across magnitudes, then it is quite possible that shuffling could hurt.
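The same effect shows up with a general-purpose compressor. Here is a rough sketch of blosc-style byte shuffling using only numpy and zlib (an illustration of the idea, not blosc's actual implementation):

```python
import zlib
import numpy as np

# Byte shuffle via a transpose: group byte 0 of every element together,
# then byte 1, and so on -- the same idea as blosc's SHUFFLE filter.
def byte_shuffle(arr: np.ndarray) -> bytes:
    itemsize = arr.dtype.itemsize
    return arr.view(np.uint8).reshape(-1, itemsize).T.tobytes()

c = np.arange(10000, dtype="float64")  # correlated values of similar magnitude
plain = len(zlib.compress(c.tobytes(), 5))
shuffled = len(zlib.compress(byte_shuffle(c), 5))
# For correlated data like this, the shuffled stream compresses far better,
# because the slowly changing high-order bytes form long runs.
```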
Thanks for that explanation. My takeaway then remains the same: the `typesize` attribute of the `blosc` codec is not perfectly equivalent to the dtype of an N-dimensional array, and so forcing `blosc` to be an `array -> bytes` codec would limit its utility.
There is utility in an array-to-bytes codec, in that (1) in many scenarios `typesize = sizeof(dtype)` is a useful default, and (2) this would match the Zarr v2 behavior. There is also utility in a bytes-to-bytes codec, where `typesize` is a free parameter.
All of this is content dependent. To set defaults here assumes the content follows certain patterns, which it may not. I would discourage making assumptions about other people's content.
> What about just automatically setting the `typesize` of the blosc codec if it is unspecified and identified as following the `bytes` codec? This is what the spec already suggests:
This is actually implemented already: https://github.com/zarr-developers/zarr-python/blob/680142f6f370d862f305a9b37f3c0ce2ce802e76/src/zarr/codecs/blosc.py#L141-L142
> Meanwhile, numcodecs will default to byte shuffle (just called shuffle) for Zarr v2: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/blosc.pyx#L548
Should we just default to `shuffle` to match the behavior in v2?
### Zarr version

v3

### Numcodecs version

n/a

### Python Version

n/a

### Operating System

n/a

### Installation

n/a

### Description
While playing around with v3 a bit, I noticed that the blosc codec wasn't compressing my sample data as well as I expected, and I have confirmed after multiple comparisons that I am getting different results between v3 and v2; in some cases v3 blosc-compressed chunks end up being 10-20x larger than in v2 (see below).
### Steps to reproduce

The difference isn't huge in this case, but it can be much more noticeable in others. For example:
### Cause and Possible Solution

In numcodecs, the blosc compressor is able to improve compression ratios by inferring the item size from the input buffer's numpy array dtype. But in v3, the blosc codec is implemented as a `BytesBytesCodec` and requires each chunk to be fed in as bytes on encode (hence, `BytesCodec()` is required in the list of codecs in my example), and thus numcodecs infers an item size of 1. A simple fix for this is to make the following change in `blosc.py`:

### Thoughts
I am still just getting started with v3, but this has made me curious about one thing. Why is the blosc codec implemented as a `BytesBytesCodec` rather than as an `ArrayBytesCodec`, considering that it can accept (and is optimized for) numpy array input? Although the above solution does work, because I need to include the `BytesCodec` first when specifying my codecs in v3, it essentially first encodes each chunk into bytes, then decodes it back into an array in its original dtype, making the bytes codec effectively a pointless no-op in this case.