Open d-v-b opened 2 months ago
:+1: I like this as an API.
Not sure I am a fan. Codecs are a main difference between the v2 and v3 spec. I think the v3 codec pipeline is superior to the v2 compressors+filters. I would like to see a v3-first interface instead of shoehorning the new pipeline in the legacy interface.
I think this will be more confusing than helpful. All typical compressors for v2 arrays (e.g. blosc, zstd, gzip) are actually BytesBytesCodecs
. Now, that would need to go in compressor
for v2 arrays and post_compressor
for v3 arrays? Essentially, the only compressors
for v3 would be bytes
and sharding_indexed
. That just doesn't make sense.
I don't mind adding compressor
and filters
as optional kwargs to the v3 create interface. If set, they could be used to build a codec pipeline: filters + [BytesCodec(), compressor]
. But, I would like to keep the codecs
kwarg (which should take precedence over compressor+filters) and promote that as the recommended way of setting codecs for v3.
I think this will be more confusing than helpful. All typical compressors for v2 arrays (e.g. blosc, zstd, gzip) are actually
BytesBytesCodecs
. Now, that would need to go incompressor
for v2 arrays andpost_compressor
for v3 arrays?
This is a good point -- I had forgotten that the old compressors are not actually ArrayBytesCodec
, and I do think this fact would be a source of confusion in the proposed API. But I also think users will find the codecs
list confusing. It has structure, but we do not expose that structure in the user-facing API; I think users will find it frustrating to intuit an API that we could make explicit.
Essentially, the only
compressors
for v3 would bebytes
andsharding_indexed
. That just doesn't make sense.
Given that there are only 2 options for the ArrayBytesCodec
, and one of them is sharding, then I wonder if it would make sense to have a top-level sharding
keyword argument that defaults to None
(resulting in the BytesCodec
getting used). I'm not sure exactly how users would customize the BytesCodec
here, but this would at least guide users to sharding more easily than the opaque codecs
list.
Are any other ideas for what a "v3 first" API would look like here? I think codecs: tuple[Codec, ...]
will not be popular.
But I also think users will find the codecs list confusing. It has structure, but we do not expose that structure in the user-facing API; I think users will find it frustrating to intuit an API that we could make explicit.
It is closest to the spec, though.
Given that there are only 2 options for the ArrayBytesCodec, and one of them is sharding, then I wonder if it would make sense to have a top-level sharding keyword argument that defaults to None (resulting in the BytesCodec getting used).
Adding sharding
as a kwarg sounds interesting.
Are any other ideas for what a "v3 first" API would look like here? I think codecs: tuple[Codec, ...] will not be popular.
I think it could work well with some with some UX tweaks such as automatic BytesCodecs
insertion and single value support. That would automatically turn codecs=BloscCodec()
into codecs=(BytesCodec(), BloscCodec())
or codecs=TransposeCodec()
into codecs=(TransposeCodec(), BytesCodec())
or codecs=(TransposeCodec(), ZstdCodec())
into codecs=(TransposeCodec(), BytesCodecs(), ZstdCodec())
.
It is closest to the spec, though.
I do not think that we need to constrain the Python API so closely to the spec. We should think about what would be most clear and convenient for our users. Specs are invisible implementation details to 99% of users. They are necessary for interoperability but not something users need to be exposed to directly. Do you think about the HTTP spec when you submit a comment on GitHub? 😆
I definitely think we need an API compatibility layer with the V2 syntax ("compressor", "codecs").
Might we consider experimenting with the top level API (e.g. #1884) rather than the Array class constructors? I've been thinking of separate signatures (a la mypy overloads) for v2 and v3 arrays. I suspect we may find a reasonable path there but if not, we could always provide a different API that abstracts over the two sets of spec-specific keywords.
That sounds very reasonable to me. Maybe the class constructors can adhere more strictly to the spec and internal structure, while the top-level API provides backwards compatibility and syntactic sugar.
The main downside is that these two APIs violate the Zen of Python: "There should be one-- and preferably only one --obvious way to do it."
I still believe that there is a model that can express v2 and v3, see the following table:
v2 | v3 |
---|---|
filters : tuple[ArrayArrayCodec] |
filters: tuple[ArrayArrayCodec] |
N/A | pre_compressor: ArrayBytesCodec |
compressor : BytesBytesCodec |
compressor: tuple[BytesBytesCodec] |
BytesBytesCodec
, so we go with that, and remove the "post_compressor" kwarg.Thoughts? I will update this PR along these lines. I really want this API to be good. If it's painful or opaque for users, they will make mistakes or fail to use features in the library.
I am also thinking about how we can make the sharding conceptualization simple. One idea would be to express unsharded arrays as simply a special case of sharded arrays.
Thoughts? I will update this PR along these lines. I really want this API to be good. If it's painful or opaque for users, they will make mistakes or fail to use features in the library.
pre_compressor
could just be array_bytes_codec
. I doubt most users will care about it, if there is a dedicated way to set sharding.
I am also thinking about how we can make the sharding conceptualization simple. One idea would be to express unsharded arrays as simply a special case of sharded arrays.
How would that look like? filters and compressor could be used for the internal codecs. How would things like index_location
or index_codecs
(or chunk_layout
in the future) be controlled? While a bit of an edge case, there is also the possibility to have nested sharding.
I don't think the API needs to be the codecs
tuple we have right now, but it shouldn't be more confusing and less expressive.
One idea would be to express unsharded arrays as simply a special case of sharded arrays.
I think that's a great idea.
When it's time to fetch data to satisfy a user query, we have a data structures kind of like this:
class ChunkReference:
store_path: StorePath
range: tuple[int, int] | None # optional range within the path to fetch
ChunkRequest # type: dict[ChunkKey, ChunkReference]
We can produce this data structure after scanning the shard index. It's also the same sort of information that is generated by kerchunk-style virtual Zarr datasets. For non-sharded data, the range
would be unknown.
Once we have this data structure, we can make two potential optimizations:
Is this at all compatible with how the sharding codec is currently implemented?
Is this at all compatible with how the sharding codec is currently implemented?
Currently, the codec issues separate partial get requests, but it could be turned into batched fetching.
I think we could use compressor: tuple[ArrayBytesCodec | BytesBytesCodec, ...] | ArrayBytesCodec | BytesBytesCodec | None
with runtime validation that there is max 1 ArrayBytesCodec
. If no ArrayBytesCodec
is supplied, we can auto-add a BytesCodec
.
@d-v-b Instead, we use the kwarg "filters" to denote ArrayArray codecs
I've just noticed that not all "filters" in Zarr V2 are ArrayArray codecs. For example, vlen-*
codecs are ArrayBytes
. So, I want to echo @normanrz original sentiments that the proposed abstraction is likely to be more confusing than helpful.
Perhaps it makes the most sense to:
codecs
with array_to_array_codecs
, array_to_bytes_codec
, and bytes_to_bytes_codecs
argumentsfilters
and compressor
for backwards compatibility I think we should move this conversation over to https://github.com/zarr-developers/zarr-python/discussions/2052
In terms of abstraction levels, this pushes the
codecs
kwarg below the array creation API. Instead, we use the kwarg "filters" to denote ArrayArray codecs, "compressor" to denote the ArrayBytes codec, and "post_compressors" to denote the BytesBytesCodecs. This makes the top-level array creation API more explicit AND more similar to v2. Implementation of ideas expressed in #1943.