Open d-v-b opened 6 days ago
cc @akshaysubr @madsbk
@d-v-b The arguments you make are compelling, but we should think through this fully before committing to this strategy of static association of `Buffer`s and `Store`s.
> require callers of `store.get` who want a different `Buffer` type to cast as needed
I think this is the critical part of this proposal. How do we ensure that callers of `store.get` cast `Buffer` types appropriately? The current design is mostly to allow for easy switching between CPU and GPU by just changing one config parameter. But it is also an all-or-nothing approach. I think having something more fine-grained would be a good idea, but there is also a significant risk that we end up in a situation where data is constantly being copied around, adding a lot of overhead.
I do think it's important to factor in that every Zarr array is composed of two distinct pieces of data: the array schema / metadata, which is JSON, and the array chunks, which are little arrays. I suspect users will mostly care about processing the chunks on specialized hardware (I don't think GPUs or TPUs solve vital performance problems associated with parsing tiny JSON documents). That's not to say that metadata should never be stored outside CPU memory, but only that a basic scenario might be "device pluralism", where array metadata is stored separately from the array chunks. In zarr-python 2, arrays had a `chunk_store`, which was an optional store instance that was purely for fetching chunks. I wonder if we should build an equivalent abstraction in zarr-python 3.
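For reference, this is roughly how the zarr-python 2 API exposed that split (a minimal sketch using `DirectoryStore`; the paths and store classes are just illustrative):

```python
import zarr  # zarr-python 2.x

# Metadata documents (.zarray, .zgroup, .zattrs) are written to `store`,
# while chunk payloads are routed to the separate `chunk_store`.
meta_store = zarr.DirectoryStore("example/meta")
chunk_store = zarr.DirectoryStore("example/chunks")

group = zarr.open_group(store=meta_store, chunk_store=chunk_store, mode="w")
arr = group.create_dataset("x", shape=(1000,), chunks=(100,), dtype="f4")
arr[:] = 1.0  # chunk bytes land in chunk_store; JSON metadata in meta_store
```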
@jhamman has often advocated for making our storage APIs more "zarr-aware". I think the proposal to partition metadata storage and chunk storage fits with that vision. What isn't clear to me is whether we:
(`zarr.json`)
To put the point differently, with our current design of a global `default_buffer_prototype`, changing that from CPU buffers to GPU buffers will force storing the contents of `zarr.json` in GPU memory, regardless of the actual final storage backend. This is pointless when the goal is to run GPU-accelerated compute on chunks. So we need a way to route metadata documents and chunks to different storage media.
IMHO, the key feature we need to support is the ability to specify the output data location (e.g., CPU or GPU) across the entire stack, preferably on a per-chunk basis. This is useful for several reasons:
- We don’t have to handle metadata as a special case. The user (or a frontend like `AsyncArray`) can specify CPU memory for metadata and GPU memory for chunk data. This is what `AsyncArray` does in the current implementation (see later in this comment, and the sketch after this list).
- The user (or a specialized frontend) can use CPU memory for special purposes. E.g., one very helpful feature in a CUDA workflow is spilling: because of limited GPU memory, an array frontend could choose to read into CPU memory initially and then copy to GPU memory on demand.
- Codecs and stores can support GPU memory seamlessly. E.g., `TransposeCodec` and `MemoryStore` are memory-type agnostic. If `MemoryStore` didn’t take a prototype, we would need multiple implementations of the same basic functionality, such as `CpuMemoryStore` and `GpuMemoryStore`, etc.
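As a rough illustration of that first point, the split can already be expressed at the call site by passing different prototypes to `Store.get`. The sketch below assumes the module layout of current zarr-python main (`zarr.core.buffer`, `zarr.abc.store`) and the default v3 chunk-key encoding; the exact paths and signature may differ by version:

```python
from zarr.abc.store import Store
from zarr.core.buffer import default_buffer_prototype, gpu


async def read_array(store: Store, path: str):
    # Small JSON metadata: a plain CPU buffer is all we need.
    meta = await store.get(f"{path}/zarr.json", prototype=default_buffer_prototype())

    # Chunk payload: request it directly as a GPU-backed buffer, so the
    # store (or a wrapping prototype) can pick the right allocation.
    chunk = await store.get(f"{path}/c/0/0", prototype=gpu.buffer_prototype)
    return meta, chunk
```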
> To put the point differently, with our current design of a global `default_buffer_prototype`, changing that from CPU buffers to GPU buffers will force storing the contents of `zarr.json` in GPU memory, regardless of the actual final storage backend. This is pointless when the goal is to run GPU-accelerated compute on chunks. So we need a way to route metadata documents and chunks to different storage media.
Agree, you do not want to change the global `default_buffer_prototype`. This was never my intention; maybe we should rename it to `cpu_buffer_prototype`?

If you want to read into GPU memory, you should specify the GPU prototype when calling `getitem()`, like we do in our tests: https://github.com/zarr-developers/zarr-python/blob/693e11c60b6f7883169b9b6070d19e43dc86eda1/tests/test_buffer.py#L88-L93

This uses the `default_buffer_prototype` for metadata and `gpu.buffer_prototype` for the data chunks.
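Condensed, the pattern in that test looks roughly like this (the `AsyncArray.create` signature and buffer module paths follow current main and have moved around over time, so treat the details as approximate):

```python
import numpy as np

from zarr import AsyncArray
from zarr.core.buffer import gpu
from zarr.storage import MemoryStore


async def gpu_roundtrip():
    # The array itself lives in a CPU-backed MemoryStore.
    arr = await AsyncArray.create(
        MemoryStore(), shape=(9, 9), chunk_shape=(3, 3), dtype="float32"
    )
    await arr.setitem(slice(None), np.ones((9, 9), dtype="float32"))

    # Metadata reads use the default (CPU) prototype; only this data read
    # is routed into GPU memory by passing the GPU prototype explicitly.
    return await arr.getitem(slice(None), prototype=gpu.buffer_prototype)
```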
Now, if we want a more centralized way to set the prototype, we could add an extra `prototype` argument to `AsyncArray.__init__()` that sets the default prototype for `getitem()` and `setitem()` for that instance, or maybe have a static variable `AsyncArray.default_buffer_prototype` for all instances?
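Sketched out, the per-instance option could look something like this (the `prototype` keyword below is hypothetical and does not exist today; `store` and the imports are as in the previous sketch):

```python
# Hypothetical API sketch -- the `prototype` argument is not current zarr-python API.
arr = await AsyncArray.create(
    store,
    shape=(9, 9),
    chunk_shape=(3, 3),
    dtype="float32",
    prototype=gpu.buffer_prototype,  # instance-wide default for getitem()/setitem()
)
out = await arr.getitem(slice(None))  # would implicitly use gpu.buffer_prototype
```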
Thanks for the clarification @madsbk. I'm not sure I fully understand the link between our current `Store` API and the need to support dynamic data locations -- to me, it seems like these dynamic decisions about putting data in CPU or GPU memory happen at the array level; is that accurate? Given that stores like `LocalStore` and `RemoteStore` have no option but to write into CPU memory, what's the difference between `result = store.get('key', prototype=gpu.buffer_prototype)` and `result = gpu.buffer_prototype.from_buffer(store.get('key'))`?

In both cases the caller is responsible for moving the buffer retrieved from storage to the desired medium, but in the second case this is not baked into the `Store` API, which seems cleaner to me.
> Given that stores like `LocalStore` and `RemoteStore` have no option but to write into CPU memory, what's the difference between `result = store.get('key', prototype=gpu.buffer_prototype)` and `result = gpu.buffer_prototype.from_buffer(store.get('key'))`? In both cases the caller is responsible for moving the buffer retrieved from storage to the desired medium, but in the second case this is not baked into the `Store` API, which seems cleaner to me.
In CUDA, we have CPU memory allocations that improve CPU<->GPU transfer performance significantly, such as page-locked and managed memory. Thus, if the final destination is GPU memory, it would be very useful to let the `LocalStore` read directly into page-locked or managed memory.

Another example could be a memory pool implementation; a user could reduce memory allocation overhead by providing a `prototype` that uses a centralized memory pool.
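To make the pinned-memory point concrete: if the store controls the destination allocation, it can read file bytes straight into page-locked host memory, and a later host-to-device copy can use that buffer without an intermediate staging copy. A standalone sketch using numba's pinned allocator (any CUDA pinned-allocation API would do; this is not zarr code):

```python
import os

import numpy as np
from numba import cuda


def read_into_pinned(path: str) -> np.ndarray:
    """Read a whole file directly into page-locked (pinned) host memory.

    If Store.get allocates its destination via a caller-supplied prototype,
    it can use an allocator like this one; returning plain `bytes` instead
    would force an extra copy before any GPU transfer.
    """
    size = os.path.getsize(path)
    buf = cuda.pinned_array(size, dtype=np.uint8)  # page-locked allocation
    with open(path, "rb") as f:
        f.readinto(buf)  # numpy arrays expose a writable buffer protocol
    return buf
```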
> Thanks for the clarification @madsbk. I'm not sure I fully understand the link between our current `Store` API and the need to support dynamic data locations -- to me, it seems like these dynamic decisions about putting data in CPU or GPU memory happen at the array level; is that accurate?
In principle, you are right. We could allow `AsyncArray` to have multiple store-codec backends, one for each scenario, like:

However, I don't think it will reduce code complexity.
It seems a bit odd that the signature of `Store.get` takes a `prototype` parameter, which ultimately defines the return type of that method. Beyond cluttering the `.get` method, which will be used 99.9% of the time for CPU buffers, this makes it rather hard to reason about the data copying behavior of `store.get`, because depending on the `prototype` argument data might get copied, or it might not.

There is a simple alternative -- make `store.get` statically return instances of a fixed `Buffer` class, and require callers of `store.get` who want a different `Buffer` type to cast as needed. This would require associating a `Buffer` class with `Store` classes/instances, which I think aligns with how the stores actually work -- e.g., `LocalStore` and `RemoteStore` are always going to return buffers backed by CPU memory. We can imagine alternative file system or S3 implementations that return GPU-backed buffers, but IMO those would be best expressed as different store classes.

One objection to associating `Store` with `Buffer` that I can think of is the possibility that someone wants one store that mixes CPU-backed buffers and GPU-backed buffers. I would emphasize that to my knowledge this is just a hypothetical, but I think it could be accomplished by defining a store that wraps two sub-stores: `Store[GPUBuffer]` for GPU stuff, and `Store[CPUBuffer]` for CPU stuff.

Thoughts?
cc @kylebarron
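For concreteness, a minimal sketch of what that static association (and the two-sub-store wrapper) could look like; the classes below are stand-ins for illustration, not zarr-python's actual `Buffer`/`Store` types, and the chunk-key routing is made up:

```python
from __future__ import annotations

from typing import Generic, Optional, TypeVar

# Stand-in buffer classes for illustration only.
class Buffer: ...
class CPUBuffer(Buffer): ...
class GPUBuffer(Buffer): ...

B = TypeVar("B", bound=Buffer)


class Store(Generic[B]):
    """A store whose get() is statically fixed to one Buffer type."""

    async def get(self, key: str) -> Optional[B]:
        raise NotImplementedError


class LocalStore(Store[CPUBuffer]):
    # Filesystem reads always land in CPU memory, so the return type of
    # get() is fixed at the class level; callers cast/copy to GPU if needed.
    async def get(self, key: str) -> Optional[CPUBuffer]: ...


class MixedStore(Store[Buffer]):
    """The hypothetical mixed case: wrap a CPU sub-store and a GPU sub-store."""

    def __init__(self, cpu: Store[CPUBuffer], gpu: Store[GPUBuffer]) -> None:
        self.cpu = cpu
        self.gpu = gpu

    async def get(self, key: str) -> Optional[Buffer]:
        # Route chunk keys to the GPU-backed sub-store, everything else to CPU.
        target = self.gpu if "/c/" in key else self.cpu
        return await target.get(key)
```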