use compressor, filters, post_compressor for Array v3 create

d-v-b commented 2 months ago

In terms of abstraction levels, this pushes the codecs kwarg below the array creation API. Instead, we use the kwarg "filters" to denote ArrayArray codecs, "compressor" to denote the ArrayBytes codec, and "post_compressors" to denote the BytesBytesCodecs. This makes the top-level array creation API more explicit AND more similar to v2. Implementation of ideas expressed in #1943.

rabernat commented 2 months ago

:+1: I like this as an API.

normanrz commented 2 months ago

Not sure I am a fan. Codecs are a main difference between the v2 and v3 spec. I think the v3 codec pipeline is superior to the v2 compressors+filters. I would like to see a v3-first interface instead of shoehorning the new pipeline in the legacy interface.

I think this will be more confusing than helpful. All typical compressors for v2 arrays (e.g. blosc, zstd, gzip) are actually BytesBytesCodecs. Now, that would need to go in compressor for v2 arrays and post_compressor for v3 arrays? Essentially, the only compressors for v3 would be bytes and sharding_indexed. That just doesn't make sense.

I don't mind adding compressor and filters as optional kwargs to the v3 create interface. If set, they could be used to build a codec pipeline: filters + [BytesCodec(), compressor]. But, I would like to keep the codecs kwarg (which should take precedence over compressor+filters) and promote that as the recommended way of setting codecs for v3.

d-v-b commented 2 months ago

I think this will be more confusing than helpful. All typical compressors for v2 arrays (e.g. blosc, zstd, gzip) are actually BytesBytesCodecs. Now, that would need to go in compressor for v2 arrays and post_compressor for v3 arrays?

This is a good point -- I had forgotten that the old compressors are not actually ArrayBytesCodec, and I do think this fact would be a source of confusion in the proposed API. But I also think users will find the codecs list confusing. It has structure, but we do not expose that structure in the user-facing API; I think users will find it frustrating to intuit an API that we could make explicit.

Essentially, the only compressors for v3 would be bytes and sharding_indexed. That just doesn't make sense.

Given that there are only 2 options for the ArrayBytesCodec, and one of them is sharding, then I wonder if it would make sense to have a top-level sharding keyword argument that defaults to None (resulting in the BytesCodec getting used). I'm not sure exactly how users would customize the BytesCodec here, but this would at least guide users to sharding more easily than the opaque codecs list.

Are any other ideas for what a "v3 first" API would look like here? I think codecs: tuple[Codec, ...] will not be popular.

normanrz commented 2 months ago

But I also think users will find the codecs list confusing. It has structure, but we do not expose that structure in the user-facing API; I think users will find it frustrating to intuit an API that we could make explicit.

It is closest to the spec, though.

Given that there are only 2 options for the ArrayBytesCodec, and one of them is sharding, then I wonder if it would make sense to have a top-level sharding keyword argument that defaults to None (resulting in the BytesCodec getting used).

Adding sharding as a kwarg sounds interesting.

Are any other ideas for what a "v3 first" API would look like here? I think codecs: tuple[Codec, ...] will not be popular.

I think it could work well with some with some UX tweaks such as automatic BytesCodecs insertion and single value support. That would automatically turn codecs=BloscCodec() into codecs=(BytesCodec(), BloscCodec()) or codecs=TransposeCodec() into codecs=(TransposeCodec(), BytesCodec()) or codecs=(TransposeCodec(), ZstdCodec()) into codecs=(TransposeCodec(), BytesCodecs(), ZstdCodec()).

rabernat commented 2 months ago

It is closest to the spec, though.

I do not think that we need to constrain the Python API so closely to the spec. We should think about what would be most clear and convenient for our users. Specs are invisible implementation details to 99% of users. They are necessary for interoperability but not something users need to be exposed to directly. Do you think about the HTTP spec when you submit a comment on GitHub? 😆

I definitely think we need an API compatibility layer with the V2 syntax ("compressor", "codecs").

jhamman commented 2 months ago

Might we consider experimenting with the top level API (e.g. #1884) rather than the Array class constructors? I've been thinking of separate signatures (a la mypy overloads) for v2 and v3 arrays. I suspect we may find a reasonable path there but if not, we could always provide a different API that abstracts over the two sets of spec-specific keywords.

rabernat commented 2 months ago

That sounds very reasonable to me. Maybe the class constructors can adhere more strictly to the spec and internal structure, while the top-level API provides backwards compatibility and syntactic sugar.

The main downside is that these two APIs violate the Zen of Python: "There should be one-- and preferably only one --obvious way to do it."

d-v-b commented 2 months ago

I still believe that there is a model that can express v2 and v3, see the following table:

v2	v3
filters : `tuple[ArrayArrayCodec]`	filters: `tuple[ArrayArrayCodec]`
N/A	pre_compressor: `ArrayBytesCodec`
compressor : `BytesBytesCodec`	compressor: `tuple[BytesBytesCodec]`

as @normanrz pointed out, "compressor" actually maps better to BytesBytesCodec, so we go with that, and remove the "post_compressor" kwarg.
V2 compressor will be internally normalized to a tuple with a single element, so that it plays nice with the v3 compression machinery
We introduce "pre_compressor", which we can rename if we want. "pre_compressor" must be empty for v2 arrays. Otherwise, this API handles both v2 and v3.

Thoughts? I will update this PR along these lines. I really want this API to be good. If it's painful or opaque for users, they will make mistakes or fail to use features in the library.

d-v-b commented 2 months ago

I am also thinking about how we can make the sharding conceptualization simple. One idea would be to express unsharded arrays as simply a special case of sharded arrays.

normanrz commented 2 months ago

Thoughts? I will update this PR along these lines. I really want this API to be good. If it's painful or opaque for users, they will make mistakes or fail to use features in the library.

pre_compressor could just be array_bytes_codec. I doubt most users will care about it, if there is a dedicated way to set sharding.

I am also thinking about how we can make the sharding conceptualization simple. One idea would be to express unsharded arrays as simply a special case of sharded arrays.

How would that look like? filters and compressor could be used for the internal codecs. How would things like index_location or index_codecs (or chunk_layout in the future) be controlled? While a bit of an edge case, there is also the possibility to have nested sharding.

I don't think the API needs to be the codecs tuple we have right now, but it shouldn't be more confusing and less expressive.

rabernat commented 2 months ago

One idea would be to express unsharded arrays as simply a special case of sharded arrays.

I think that's a great idea.

When it's time to fetch data to satisfy a user query, we have a data structures kind of like this:

class ChunkReference:
    store_path: StorePath
    range: tuple[int, int] | None  # optional range within the path to fetch

ChunkRequest  # type: dict[ChunkKey, ChunkReference]

We can produce this data structure after scanning the shard index. It's also the same sort of information that is generated by kerchunk-style virtual Zarr datasets. For non-sharded data, the range would be unknown.

Once we have this data structure, we can make two potential optimizations:

Coalescing: combining multiple small requests for the same StorePath into a single request
Splitting: taking large requests (e.g. > 8 MB) and splitting them into smaller requests (to improve throughput)

Is this at all compatible with how the sharding codec is currently implemented?

normanrz commented 2 months ago

Is this at all compatible with how the sharding codec is currently implemented?

Currently, the codec issues separate partial get requests, but it could be turned into batched fetching.

LDeakin commented 2 months ago

I think it is a good idea to simplify the API with something like filters/compressor, but compressor should not strictly map to BytesBytesCodec. Possible future compressors like zfp and pcodec make sense as ArrayBytesCodecs, as they compress arrays/elements, rather than bytes directly.

normanrz commented 2 months ago

I think we could use compressor: tuple[ArrayBytesCodec | BytesBytesCodec, ...] | ArrayBytesCodec | BytesBytesCodec | None with runtime validation that there is max 1 ArrayBytesCodec. If no ArrayBytesCodec is supplied, we can auto-add a BytesCodec.

LDeakin commented 1 month ago

@d-v-b Instead, we use the kwarg "filters" to denote ArrayArray codecs

I've just noticed that not all "filters" in Zarr V2 are ArrayArray codecs. For example, vlen-* codecs are ArrayBytes. So, I want to echo @normanrz original sentiments that the proposed abstraction is likely to be more confusing than helpful.

Perhaps it makes the most sense to:

Replace codecs with array_to_array_codecs, array_to_bytes_codec, and bytes_to_bytes_codecs arguments
Retain filters and compressor for backwards compatibility

normanrz commented 1 month ago

I think we should move this conversation over to https://github.com/zarr-developers/zarr-python/discussions/2052

zarr-developers / zarr-python

use compressor, filters, post_compressor for Array v3 create #1944