jasonkena opened this issue 3 months ago
The problem seems to be that the cached metadata is not updated after the shape is resized in another thread/process, leading to dropped rows.
I found two workarounds:

- replacing `Array.append` with a version that reloads the cached metadata inside the write lock before computing the new shape:

  ```python
  def fixed_append(arr, data, axis=0):
      def fixed_append_nosync(data, axis=0):
          # Refresh the cached metadata so the append sees any resize
          # performed by another process before computing the new shape.
          arr._load_metadata_nosync()
          return arr._append_nosync(data, axis=axis)

      return arr._write_op(fixed_append_nosync, data, axis=axis)
  ```

- specifying `cache_metadata=False` to force reloading on every data access (see the sketch after this list)
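For reference, a minimal sketch of the second workaround, assuming an existing on-disk array at `example.zarr` and a lock directory at `example.sync` (both paths are illustrative):

```python
import zarr

# Cross-process file locks used by zarr to serialize writes.
synchronizer = zarr.ProcessSynchronizer("example.sync")

# cache_metadata=False makes every access re-read the array metadata from
# the store, so a resize performed by another process is picked up before
# the append computes the new shape.
arr = zarr.open_array(
    "example.zarr",
    mode="a",
    synchronizer=synchronizer,
    cache_metadata=False,
)
```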
Perhaps the default value for `cache_metadata` should be `False` when `synchronizer` is specified to prevent this behavior?
I believe this resolves these StackOverflow questions:
- https://stackoverflow.com/questions/61929796/parallel-appending-to-a-zarr-store-via-xarray-to-zarr-and-dask
- https://stackoverflow.com/questions/61799664/how-can-one-write-lock-a-zarr-store-during-append
Oddly enough, both workarounds fail when working with in-memory zarr arrays (initialized with `zarr.zeros(...)`).
@jasonkena - thanks for the report. Your diagnosis seems correct, but I'm not sure what we want to do about it. **It's quite expensive to always reload metadata to protect against metadata modifications by another writer.**
Finally, I should note that we haven't settled on whether or not to keep the synchronizer API around for the 3.0 release (it is not currently included).
Is there any data you can share to support the bolded claim?
@zoj613 - not any specific data, but if every chunk I/O op requires first checking whether the metadata has changed, you can imagine how this would be expensive. In my view, the bigger issue is actually around consistency. One of the design tradeoffs in Zarr is that by splitting the dataset into many objects/files, you can act concurrently on individual components. However, the cost of this is that the user is required to coordinate updates among multiple writers. (You might be interested in reading *Consistency Problems with Zarr* in the Arraylake documentation.)
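To illustrate what that user-level coordination can look like, here is a minimal sketch using the third-party `fasteners` package for an inter-process file lock; the store and lock paths are illustrative, not part of the original report:

```python
import fasteners
import numpy as np
import zarr

STORE = "example.zarr"   # illustrative paths
LOCK = "example.lock"

def coordinated_append(block: np.ndarray) -> None:
    # Hold one application-level lock around the whole read-modify-write:
    # open the array (fresh metadata) and append while no other writer can
    # resize the array underneath us.
    with fasteners.InterProcessLock(LOCK):
        arr = zarr.open_array(STORE, mode="a")
        arr.append(block)
```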
Zarr version: v2.18.2
Numcodecs version: v0.13.0
Python version: 3.10.11
Operating system: Linux
Installation: pip
Description
Appending to zarr arrays is not safe, even with `ProcessSynchronizer`.
Steps to reproduce
Code:
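A minimal sketch of a reproduction consistent with the description above (process count, shapes, and paths are illustrative, not the reporter's original script):

```python
import multiprocessing as mp

import numpy as np
import zarr

STORE = "example.zarr"   # illustrative paths
SYNC = "example.sync"

def append_block(_):
    # Each worker opens the same on-disk array and appends a block of rows,
    # relying on ProcessSynchronizer to serialize the writes.
    arr = zarr.open_array(
        STORE, mode="a", synchronizer=zarr.ProcessSynchronizer(SYNC)
    )
    arr.append(np.ones((10, 4)))

if __name__ == "__main__":
    zarr.open_array(STORE, mode="w", shape=(0, 4), chunks=(10, 4), dtype="f8")
    with mp.Pool(8) as pool:
        pool.map(append_block, range(8))

    result = zarr.open_array(STORE, mode="r")
    # Expected shape is (80, 4); because each worker appends against its own
    # cached (stale) shape, the final array typically ends up with fewer rows.
    print(result.shape)
```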
Output:
Additional output: No response