zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License
1.51k stars 280 forks source link

[v3]: TypeError when updating Array.attrs, thanks to incorrect inferred fill_value #2153

Closed TomAugspurger closed 3 weeks ago

TomAugspurger commented 1 month ago

Zarr version

v3

Numcodecs version

n/a

Python Version

n/a

Operating System

n/a

Installation

v3

Description

Discovered while looking at the xarray unit tests for Zarr.

Steps to reproduce

import zarr
import numpy as np
from zarr.api import asynchronous

store = zarr.store.MemoryStore({}, mode="w")
arr = zarr.open_array(store=store, path="a", shape=(), dtype=np.dtype("S6"))
arr.update_attributes({"key": "value"})

The .put call raises with

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 7
      5 store = zarr.store.MemoryStore({}, mode="w")
      6 arr = zarr.open_array(store=store, path="a", shape=(), dtype=np.dtype("S6"))
----> 7 arr.update_attributes({"k": "v"})

File ~/gh/zarr-developers/zarr-python/src/zarr/core/array.py:2055, in Array.update_attributes(self, new_attributes)
   2053 def update_attributes(self, new_attributes: dict[str, JSON]) -> Array:
   2054     return type(self)(
-> 2055         sync(
   2056             self._async_array.update_attributes(new_attributes),
   2057         )
   2058     )

File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:91, in sync(coro, loop, timeout)
     88 return_result = next(iter(finished)).result()
     90 if isinstance(return_result, BaseException):
---> 91     raise return_result
     92 else:
     93     return return_result

File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:50, in _runner(coro)
     45 """
     46 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
     47 exception, the exception will be returned.
     48 """
     49 try:
---> 50     return await coro
     51 except Exception as ex:
     52     return ex

File ~/gh/zarr-developers/zarr-python/src/zarr/core/array.py:601, in AsyncArray.update_attributes(self, new_attributes)
    600 async def update_attributes(self, new_attributes: dict[str, JSON]) -> AsyncArray:
--> 601     new_metadata = self.metadata.update_attributes(new_attributes)
    603     # Write new metadata
    604     await self._save_metadata(new_metadata)

File ~/gh/zarr-developers/zarr-python/src/zarr/core/metadata.py:323, in ArrayV3Metadata.update_attributes(self, attributes)
    322 def update_attributes(self, attributes: dict[str, JSON]) -> Self:
--> 323     return replace(self, attributes=attributes)

File ~/mambaforge/lib/python3.10/dataclasses.py:1454, in replace(obj, **changes)
   1447         changes[f.name] = getattr(obj, f.name)
   1449 # Create the new object, which calls __init__() and
   1450 # __post_init__() (if defined), using all of the init fields we've
   1451 # added and/or left in 'changes'.  If there are values supplied in
   1452 # changes that aren't fields, this will correctly raise a
   1453 # TypeError.
-> 1454 return obj.__class__(**changes)

File ~/gh/zarr-developers/zarr-python/src/zarr/core/metadata.py:189, in ArrayV3Metadata.__init__(self, shape, data_type, chunk_grid, chunk_key_encoding, fill_value, codecs, attributes, dimension_names)
    187 chunk_key_encoding_parsed = ChunkKeyEncoding.from_dict(chunk_key_encoding)
    188 dimension_names_parsed = parse_dimension_names(dimension_names)
--> 189 fill_value_parsed = parse_fill_value_v3(fill_value, dtype=data_type_parsed)
    190 attributes_parsed = parse_attributes(attributes)
    191 codecs_parsed_partial = parse_codecs(codecs)

File ~/gh/zarr-developers/zarr-python/src/zarr/core/metadata.py:656, in parse_fill_value_v3(fill_value, dtype)
    654             raise ValueError(msg)
    655     msg = f"Cannot parse non-string sequence {fill_value} as a scalar with type {dtype}."
--> 656     raise TypeError(msg)
    657 return dtype.type(fill_value)

TypeError: Cannot parse non-string sequence b'' as a scalar with type |S6.

Additional output

The Array.update_attributes call is a bit of a red herring. That just happens to go through Metadata.__init__, which validates the fill value for a given dtype. Only this type we use fill_value=np.bytes(b""), which is incompatible with the dtype np.dtype("S6").

So I think the shorter reproducer is something like

arr = zarr.open_array(store=store, path="a", shape=(), dtype=np.dtype("S6"))
assert arr.metadata.fill_value != np.bytes_("b"))

in other words, maybe(?) our inferred fill value is incorrect for this specific dtype.

TomAugspurger commented 1 month ago

Or maybe the issue is just that zarr-python v3 (and maybe the spec?) don't support this data type yet?

Looking through the spec, I suppose we could support it with the r* type. I'll look into this a bit.

TomAugspurger commented 1 month ago

As a short-term fix, I think we can update https://github.com/zarr-developers/zarr-python/blob/e8800b0d55596fc200d67ef2cb8e6f544dcbb519/src/zarr/core/metadata.py#L642 to skip by bytes and str, not just str.

I do want to spend some type figuring out what the appropriate fillvalue is for np.dtype("S6") (both in practice and according to the spec). Right now we use np.dtype("S6").type(0) to get `np.bytes(b"")as the fill value. Unfortunately, *that* has a dtype ofS, notS6. Isfill_value.dtype == dtypean invariant we want to hold? So we would need to pad thefill_value` to be length 6 in this case?

jhamman commented 1 month ago

This is now erroring much earlier with a reasonable error:

arr = zarr.open_array(store=store, path="a", shape=(), dtype=np.dtype("S6"))

ValueError: Invalid V3 data_type: |S6
rabernat commented 3 weeks ago

This appears to have been fixed by #2036

rabernat commented 3 weeks ago

There is another issue here, which is the fact that I can't figure out how to actually read arr!

 arr[:]
File ~/gh/zarr-developers/zarr-python/src/zarr/core/indexing.py:76, in err_too_many_indices(selection, shape)
     75 def err_too_many_indices(selection: Any, shape: ChunkCoords) -> None:
---> 76     raise IndexError(f"too many indices for array; expected {len(shape)}, got {len(selection)}")

IndexError: too many indices for array; expected 0, got 1
jhamman commented 3 weeks ago

👇

In [8]: np.asarray(arr)
Out[8]: array(b'0', dtype='|S6')

In [9]: np.asarray(arr).item()
Out[9]: b'0'