jasonkena opened 1 month ago:
The problem seems to be that the cached metadata is not updated after the shape is resized in another thread/process, leading to dropped rows.
I found two workarounds:
- using a patched append that reloads the metadata inside the write lock before appending:

```python
def fixed_append(arr, data, axis=0):
    def fixed_append_nosync(data, axis=0):
        # reload shape and other metadata so resizes performed by
        # other processes are seen before this append runs
        arr._load_metadata_nosync()
        return arr._append_nosync(data, axis=axis)

    return arr._write_op(fixed_append_nosync, data, axis=axis)
```
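  With this helper, writers call `fixed_append(arr, data)` instead of `arr.append(data)`; since `_write_op` acquires the synchronizer lock before running the inner function, the metadata reload happens under the lock and picks up any resize made by another process.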
- specifying `cache_metadata=False` so the metadata is reloaded on every data access (sketched below)
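For reference, a minimal sketch of the second workaround; the store paths, shape, chunks, and dtype below are placeholders, not taken from the original report:

```python
import numpy as np
import zarr

# placeholder paths, for illustration only
synchronizer = zarr.ProcessSynchronizer("example.sync")
arr = zarr.open(
    "example.zarr",
    mode="a",
    shape=(0, 5),
    chunks=(1000, 5),
    dtype="f8",
    cache_metadata=False,  # re-read array metadata on every access
    synchronizer=synchronizer,
)

# the current shape is reloaded from the store before the append,
# so a resize performed by another process is not overwritten
arr.append(np.ones((10, 5)))
```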
Perhaps the default value for `cache_metadata` should be `False` when `synchronizer` is specified to prevent this behavior?
I believe this resolves these StackOverflow questions:
- https://stackoverflow.com/questions/61929796/parallel-appending-to-a-zarr-store-via-xarray-to-zarr-and-dask
- https://stackoverflow.com/questions/61799664/how-can-one-write-lock-a-zarr-store-during-append
Oddly enough, both workarounds fail when working with in-memory zarr arrays (initialized with `zarr.zeros(...)`).
@jasonkena - thanks for the report. Your diagnosis seems correct, but I'm not sure what we want to do about it. It's quite expensive to always reload metadata to protect against metadata modifications by another writer.
Finally, I should note that we haven't settled on whether or not to keep the synchronizer API around for the 3.0 release (it is not currently included).
Zarr version
v2.18.2
Numcodecs version
v0.13.0
Python Version
3.10.11
Operating System
Linux
Installation
pip
Description
Appending to zarr arrays is not safe, even with ProcessSynchronizer.
Steps to reproduce
Code:
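The reproduction snippet is not preserved in this excerpt; purely as a hedged illustration of the scenario described above (parallel appends guarded by a `ProcessSynchronizer`; all file names and sizes here are made up), a script could look like:

```python
import multiprocessing as mp

import numpy as np
import zarr

N_WORKERS = 8
ROWS_PER_WORKER = 100


def worker(_):
    # each process opens its own handle; metadata is cached at open time
    synchronizer = zarr.ProcessSynchronizer("repro.sync")
    arr = zarr.open("repro.zarr", mode="a", synchronizer=synchronizer)
    arr.append(np.ones((ROWS_PER_WORKER, 5)))


if __name__ == "__main__":
    synchronizer = zarr.ProcessSynchronizer("repro.sync")
    zarr.open(
        "repro.zarr",
        mode="w",
        shape=(0, 5),
        chunks=(1000, 5),
        dtype="f8",
        synchronizer=synchronizer,
    )
    with mp.Pool(N_WORKERS) as pool:
        pool.map(worker, range(N_WORKERS))

    # expected (800, 5); a smaller first dimension means rows were dropped
    print(zarr.open("repro.zarr", mode="r").shape)
```

With the default `cache_metadata=True`, the final first dimension typically comes out smaller than `N_WORKERS * ROWS_PER_WORKER`, because each worker resizes from its own cached shape and overwrites rows appended by others.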
Output:
Additional output
No response