zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License
1.45k stars 273 forks source link

data wiped from `MemoryStore` after `store.close` + `array[:]` #2067

Open d-v-b opened 1 month ago

d-v-b commented 1 month ago

Over in #1746 @dcherian discovered the following bug:

import zarr
from zarr.array import Array
from zarr.group import Group
from zarr.store import MemoryStore
import numpy as np

store = MemoryStore(mode="w")
root = Group.create(store)
nparray = np.array([1], dtype=np.int8)

a = root.create_array(
    "/0/0",
    shape=nparray.shape,
    chunks=(1,),
    dtype=nparray.dtype.str,
    attributes={},
    # compressor=compressor,  # TODO: FIXME
    fill_value=nparray.dtype.type(0),
)
a[:] = nparray
print(a[:]) # [1]
store.close()
print(a[:]) # [0]

Setting data with a[:] = nparray, followed by a.store_path.store.close(), followed by a[:], results in the store getting wiped. This is not intended behavior.

this PR adds:

TODO:

d-v-b commented 1 month ago

so i figured out where this is coming from. i'm not sure if it's a bug or not.

  1. we close the store, which sets is_open to False.
  2. a[:] invokes MemoryStore.get(), which hits this line:
    if not self._is_open:
    await self._open()
  3. Store._open() does different things depending on its mode parameter. it seems that the default mode is AccessMode(readonly=False, overwrite=True, create=True, update=False). Because overwrite is True here, Store._open wipes the store before fetching data.

I don't fully understand the open / closed semantics here. My intuition is that re-opening a closed store feels weird, and having the store._open() routine invoked outside of Store.__init__ feels incorrect.

@brokkoli71 what's the logic for invoking Store._open in the get and set methods? I feel like it might be cleaner to only open the store in 1 place (__init__), and attempting to call Store.get/set on a closed store should simply error.

d-v-b commented 1 month ago

tests are now passing with this PR. Instead of conditionally calling store._open() in store.get or store.set, a RuntimeError is raised when attempting to use these methods on a store that is closed. I also adjusted all the tests to use sync(Store.open()) instead of Store().

I think we should not merge this until we have a conversation about the store design. I would really like to be able to create store classes in synchronous code without needing sync, but we can't do that now because (some) store initialization requires IO, which is async.

@jhamman and @normanrz any thoughts about this issue (i.e., that we cannot fully initialize a store class in synchronous code without sync)?

dcherian commented 1 month ago

So is the bug here thati was reusing the array object I wrote to? But instead I should create a new one?

d-v-b commented 1 month ago

So is the bug here thati was reusing the array object I wrote to? But instead I should create a new one?

The bug was that the store class self-destructs if a) it was created with overwrite=True, and b) you attempt to do any array indexing after closing the store.

brokkoli71 commented 1 month ago

@d-v-b @dcherian

  1. Store._open() does different things depending on its mode parameter. it seems that the default mode is AccessMode(readonly=False, overwrite=True, create=True, update=False). Because overwrite is True here, Store._open wipes the store before fetching data.

In this example the mode was explicitly set to "w" which will overwrite the data when opening a store. The default value of MemoryStore() is mode="r" which will raise a ValueError in this code snippet as it does not allow writing. I think, what the code snippet was trying to achieve was for reusing a closed array which can be done with mode "r+" or "a". I'd say this illustrates that the effect of the AccessModeLiterals should be communicated better. Currently only the asynchronous open method contains docstrings explaining:

 mode : {'r', 'r+', 'a', 'w', 'w-'}, optional
        Persistence mode: 'r' means read only (must exist); 'r+' means
        read/write (must exist); 'a' means read/write (create if doesn't
        exist); 'w' means create (overwrite if exists); 'w-' means create
        (fail if exists).

Furthermore, the examples in the docstrings of array.py suggest using mode="w". I think overwriting stores on opening them has it's usecase but should not be communicated as a general good idea.

The other problem with implicitly opening a store is a problem, too. I agree with the comments above.

Edit: Also, in store/core.py::make_store_path the mode will fallback to "w":

   elif store_like is None:
        if mode is None:
            mode = "w"  # exception to the default mode = 'r'

I think it make sense here to fallback to "a", too

brokkoli71 commented 1 month ago

I had a similar thought process while working on https://github.com/zarr-developers/zarr-python/pull/2000. On the one hand sync(Store.open()) seems a bit much for the API, on the other hand implicitly calling _open might cause confusion. Currently the only modification of _open on the array is clearing it if overwrite is enabled. Maybe if we make that behavior more clear to the user that might not be a problem? what do you think? @d-v-b