Open d-v-b opened 6 days ago
cc @akshaysubr @madsbk
@d-v-b The arguments you make are compelling, but we should think through this fully before committing to this strategy of static association of `Buffer`s and `Store`s.
> require callers of `store.get` who want a different `Buffer` type to cast as needed
I think this is the critical part of this proposal. How do we ensure that callers of `store.get` cast `Buffer` types appropriately? The current design is mostly to allow for easy switching between CPU and GPU by just changing one config parameter. But it is also an all-or-nothing approach. I think having something more fine-grained would be a good idea, but there is also a significant risk that we end up in a situation where data is constantly being copied around, adding a lot of overhead.
I do think it's important to factor in that every Zarr array is composed of two distinct pieces of data: the array schema / metadata, which is JSON, and the array chunks, which are little arrays. I suspect users will mostly care about processing the chunks on specialized hardware (I don't think GPUs or TPUs solve vital performance problems associated with parsing tiny JSON documents). That's not to say that metadata should never be stored outside CPU memory, but only that a basic scenario might be "device pluralism", where array metadata is stored separately from the array chunks. In zarr-python 2, arrays had a `chunk_store`, which was an optional store instance that was purely for fetching chunks. I wonder if we should build an equivalent abstraction in zarr-python 3.
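For reference, this is roughly how the zarr-python 2 API exposed that split (a minimal sketch using `DirectoryStore`; the paths and store classes are just illustrative):

```python
import zarr  # zarr-python 2.x

# Metadata documents (.zarray, .zgroup, .zattrs) are written to `store`,
# while chunk payloads are routed to the separate `chunk_store`.
meta_store = zarr.DirectoryStore("example/meta")
chunk_store = zarr.DirectoryStore("example/chunks")

group = zarr.open_group(store=meta_store, chunk_store=chunk_store, mode="w")
arr = group.create_dataset("x", shape=(1000,), chunks=(100,), dtype="f4")
arr[:] = 1.0  # chunk bytes land in chunk_store; JSON metadata in meta_store
```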
@jhamman has often advocated for making our storage APIs more "zarr-aware". I think the proposal to partition metadata storage and chunk storage fits with that vision. What isn't clear to me is whether we:
(`zarr.json`)
To put the point differently, with our current design of a global `default_buffer_prototype`, changing that from CPU buffers to GPU buffers will force storing the contents of `zarr.json` in GPU memory, regardless of the actual final storage backend. This is pointless when the goal is to run GPU-accelerated compute on chunks. So we need a way to route metadata documents and chunks to different storage media.
IMHO, the key feature we need to support is the ability to specify the output data location (e.g., CPU or GPU) across the entire stack, preferably on a per-chunk basis. This is useful for several reasons:
- We don’t have to handle metadata as a special case. The user (or a frontend like `AsyncArray`) can specify CPU memory for metadata and GPU memory for chunk data. This is what `AsyncArray` does in the current implementation (see later in this comment, and the sketch after this list).
- The user (or a specialized frontend) can use CPU memory for special purposes. E.g., one very helpful feature in a CUDA workflow is spilling: because of limited GPU memory, an array frontend could choose to read into CPU memory initially and then copy to GPU memory on demand.
- Codecs and stores can support GPU memory seamlessly. E.g., `TransposeCodec` and `MemoryStore` are memory-type agnostic. If `MemoryStore` didn’t take a prototype, we would need multiple implementations of the same basic functionality, such as `CpuMemoryStore` and `GpuMemoryStore`, etc.
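As a rough illustration of that first point, the split can already be expressed at the call site by passing different prototypes to `Store.get`. The sketch below assumes the module layout of current zarr-python main (`zarr.core.buffer`, `zarr.abc.store`) and the default v3 chunk-key encoding; the exact paths and signature may differ by version:

```python
from zarr.abc.store import Store
from zarr.core.buffer import default_buffer_prototype, gpu


async def read_array(store: Store, path: str):
    # Small JSON metadata: a plain CPU buffer is all we need.
    meta = await store.get(f"{path}/zarr.json", prototype=default_buffer_prototype())

    # Chunk payload: request it directly as a GPU-backed buffer, so the
    # store (or a wrapping prototype) can pick the right allocation.
    chunk = await store.get(f"{path}/c/0/0", prototype=gpu.buffer_prototype)
    return meta, chunk
```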
> To put the point differently, with our current design of a global `default_buffer_prototype`, changing that from CPU buffers to GPU buffers will force storing the contents of `zarr.json` in GPU memory, regardless of the actual final storage backend. This is pointless when the goal is to run GPU-accelerated compute on chunks. So we need a way to route metadata documents and chunks to different storage media.
Agree, you do not want to change the global `default_buffer_prototype`. This was never my intention; maybe we should rename it to `cpu_buffer_prototype`?

If you want to read into GPU memory, you should specify the GPU prototype when calling `getitem()`, like we do in our tests: https://github.com/zarr-developers/zarr-python/blob/693e11c60b6f7883169b9b6070d19e43dc86eda1/tests/test_buffer.py#L88-L93

This uses the `default_buffer_prototype` for metadata and `gpu.buffer_prototype` for the data chunks.
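Condensed, the pattern in that test looks roughly like this (the `AsyncArray.create` signature and buffer module paths follow current main and have moved around over time, so treat the details as approximate):

```python
import numpy as np

from zarr import AsyncArray
from zarr.core.buffer import gpu
from zarr.storage import MemoryStore


async def gpu_roundtrip():
    # The array itself lives in a CPU-backed MemoryStore.
    arr = await AsyncArray.create(
        MemoryStore(), shape=(9, 9), chunk_shape=(3, 3), dtype="float32"
    )
    await arr.setitem(slice(None), np.ones((9, 9), dtype="float32"))

    # Metadata reads use the default (CPU) prototype; only this data read
    # is routed into GPU memory by passing the GPU prototype explicitly.
    return await arr.getitem(slice(None), prototype=gpu.buffer_prototype)
```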
Now, if we want a more centralized way to set the prototype, we could add an extra `prototype` argument to `AsyncArray.__init__()` that sets the default prototype for `getitem()` and `setitem()` for that instance, or maybe have a static variable `AsyncArray.default_buffer_prototype` for all instances?
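Sketched out, the per-instance option could look something like this (the `prototype` keyword below is hypothetical and does not exist today; `store` and the imports are as in the previous sketch):

```python
# Hypothetical API sketch -- the `prototype` argument is not current zarr-python API.
arr = await AsyncArray.create(
    store,
    shape=(9, 9),
    chunk_shape=(3, 3),
    dtype="float32",
    prototype=gpu.buffer_prototype,  # instance-wide default for getitem()/setitem()
)
out = await arr.getitem(slice(None))  # would implicitly use gpu.buffer_prototype
```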
Thanks for the clarification @madsbk. I'm not sure I fully understand the link between our current `Store` API and the need to support dynamic data locations -- to me, it seems like these dynamic decisions about putting data in CPU or GPU memory happen at the array level; is that accurate? Given that stores like `LocalStore` and `RemoteStore` have no option but to write into CPU memory, what's the difference between `result = store.get('key', prototype=gpu.buffer_prototype)` and `result = gpu.buffer_prototype.from_buffer(store.get('key'))`?

In both cases the caller is responsible for moving the buffer retrieved from storage to the desired medium, but in the second case this is not baked into the `Store` API, which seems cleaner to me.
> Given that stores like `LocalStore` and `RemoteStore` have no option but to write into CPU memory, what's the difference between `result = store.get('key', prototype=gpu.buffer_prototype)` and `result = gpu.buffer_prototype.from_buffer(store.get('key'))`? In both cases the caller is responsible for moving the buffer retrieved from storage to the desired medium, but in the second case this is not baked into the `Store` API, which seems cleaner to me.
In CUDA, we have CPU memory allocations that improve CPU<->GPU transfer performance significantly, such as page-locked and managed memory. Thus, if the final destination is GPU memory, it would be very useful to let the `LocalStore` read directly into page-locked or managed memory.

Another example could be a memory pool implementation; a user could reduce memory allocation overhead by providing a `prototype` that uses a centralized memory pool.
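To make the pinned-memory point concrete: if the store controls the destination allocation, it can read file bytes straight into page-locked host memory, and a later host-to-device copy can use that buffer without an intermediate staging copy. A standalone sketch using numba's pinned allocator (any CUDA pinned-allocation API would do; this is not zarr code):

```python
import os

import numpy as np
from numba import cuda


def read_into_pinned(path: str) -> np.ndarray:
    """Read a whole file directly into page-locked (pinned) host memory.

    If Store.get allocates its destination via a caller-supplied prototype,
    it can use an allocator like this one; returning plain `bytes` instead
    would force an extra copy before any GPU transfer.
    """
    size = os.path.getsize(path)
    buf = cuda.pinned_array(size, dtype=np.uint8)  # page-locked allocation
    with open(path, "rb") as f:
        f.readinto(buf)  # numpy arrays expose a writable buffer protocol
    return buf
```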
> Thanks for the clarification @madsbk. I'm not sure I fully understand the link between our current `Store` API and the need to support dynamic data locations -- to me, it seems like these dynamic decisions about putting data in CPU or GPU memory happen at the array level; is that accurate?
In principle, you are right. We could allow `AsyncArray` to have multiple store-codec backends, one for each scenario, like:

However, I don't think it will reduce code complexity.
It seems a bit odd that the signature of `Store.get` takes a `prototype` parameter, which ultimately defines the return type of that method. Beyond cluttering the `.get` method, which will be used 99.9% of the time for CPU buffers, this makes it rather hard to reason about the data copying behavior of `store.get`, because depending on the `prototype` argument data might get copied, or it might not.

There is a simple alternative -- make `store.get` statically return instances of a fixed `Buffer` class, and require callers of `store.get` who want a different `Buffer` type to cast as needed. This would require associating a `Buffer` class with `Store` classes/instances, which I think aligns with how the stores actually work -- e.g., `LocalStore` and `RemoteStore` are always going to return buffers backed by CPU memory. We can imagine alternative file system or S3 implementations that return GPU-backed buffers, but IMO those would be best expressed as different store classes.

One objection to associating `Store` with `Buffer` that I can think of is the possibility that someone wants one store that mixes CPU-backed buffers and GPU-backed buffers. I would emphasize that to my knowledge this is just a hypothetical, but I think it could be accomplished by defining a store that wraps two sub-stores: `Store[GPUBuffer]` for GPU stuff, and `Store[CPUBuffer]` for CPU stuff.

Thoughts?
cc @kylebarron
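For concreteness, a minimal sketch of what that static association (and the two-sub-store wrapper) could look like; the classes below are stand-ins for illustration, not zarr-python's actual `Buffer`/`Store` types, and the chunk-key routing is made up:

```python
from __future__ import annotations

from typing import Generic, Optional, TypeVar

# Stand-in buffer classes for illustration only.
class Buffer: ...
class CPUBuffer(Buffer): ...
class GPUBuffer(Buffer): ...

B = TypeVar("B", bound=Buffer)


class Store(Generic[B]):
    """A store whose get() is statically fixed to one Buffer type."""

    async def get(self, key: str) -> Optional[B]:
        raise NotImplementedError


class LocalStore(Store[CPUBuffer]):
    # Filesystem reads always land in CPU memory, so the return type of
    # get() is fixed at the class level; callers cast/copy to GPU if needed.
    async def get(self, key: str) -> Optional[CPUBuffer]: ...


class MixedStore(Store[Buffer]):
    """The hypothetical mixed case: wrap a CPU sub-store and a GPU sub-store."""

    def __init__(self, cpu: Store[CPUBuffer], gpu: Store[GPUBuffer]) -> None:
        self.cpu = cpu
        self.gpu = gpu

    async def get(self, key: str) -> Optional[Buffer]:
        # Route chunk keys to the GPU-backed sub-store, everything else to CPU.
        target = self.gpu if "/c/" in key else self.cpu
        return await target.get(key)
```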