zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0

MetadataError from ValueError: Could not convert object to NumPy datetime #201

Closed: TomNicholas closed this issue 2 months ago

TomNicholas commented 3 months ago

I'm trying to debug @thodson-usgs's example from https://github.com/cubed-dev/cubed/pull/520 (and originally https://github.com/zarr-developers/VirtualiZarr/pull/197).

He is doing a whole serverless reduction of virtual references to multiple files (!!! - relevant to #123), but there seem to be some more basic errors to be fixed first.

Specifically, if I try to use virtualizarr on just one of his files this happens:

import xarray as xr
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset(
    's3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc',
    indexes={},
    loadable_variables=['Time'],
    cftime_variables=['Time'],
)
vds
<xarray.Dataset> Size: 31MB
Dimensions:        (Time: 1, south_north: 250, west_east: 320,
                    interp_levels: 9, soil_layers_stag: 4)
Coordinates:
    interp_levels  (interp_levels) float32 36B ManifestArray<shape=(9,), dtyp...
    Time           (Time) datetime64[ns] 8B 2060-01-01
Dimensions without coordinates: south_north, west_east, soil_layers_stag
Data variables: (12/39)
    SNOWH          (Time, south_north, west_east) float32 320kB ManifestArray...
    ACSNOW         (Time, south_north, west_east) float32 320kB ManifestArray...
    TSK            (Time, south_north, west_east) float32 320kB ManifestArray...
    XLONG          (south_north, west_east) float32 320kB ManifestArray<shape...
    T              (Time, interp_levels, south_north, west_east) float32 3MB ...
    XLAT           (south_north, west_east) float32 320kB ManifestArray<shape...
    ...             ...
    PSFC           (Time, south_north, west_east) float32 320kB ManifestArray...
    ALBEDO         (Time, south_north, west_east) float32 320kB ManifestArray...
    CLDFRA         (Time, interp_levels, south_north, west_east) float32 3MB ...
    SWDNB          (Time, south_north, west_east) float32 320kB ManifestArray...
    PW             (Time, south_north, west_east) float32 320kB ManifestArray...
    SH2O           (Time, soil_layers_stag, south_north, west_east) float32 1MB ManifestArray<shape=(1, 4, 250, 320), dtype=float32, chunks=(1, 4, 250, 32...
Attributes:
    contact:  rtladerjr@alaska.edu
    data:     Downscaled CCSM4
    date:     Mon Oct 21 11:37:23 AKDT 2019
    format:   version 2
    info:     Alaska CASC
ds = xr.open_dataset('combined.json', engine="kerchunk")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/zarr/meta.py:127, in Metadata2.decode_array_metadata(cls, s)
    126 dimension_separator = meta.get("dimension_separator", None)
--> 127 fill_value = cls.decode_fill_value(meta["fill_value"], dtype, object_codec)
    128 meta = dict(
    129     zarr_format=meta["zarr_format"],
    130     shape=tuple(meta["shape"]),
   (...)
    136     filters=meta["filters"],
    137 )

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/zarr/meta.py:260, in Metadata2.decode_fill_value(cls, v, dtype, object_codec)
    259 else:
--> 260     return np.array(v, dtype=dtype)[()]

ValueError: Could not convert object to NumPy datetime

The above exception was the direct cause of the following exception:

MetadataError                             Traceback (most recent call last)
Cell In[8], line 1
----> 1 ds = xr.open_dataset('combined.json', engine="kerchunk")

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/backends/api.py:571, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    559 decoders = _resolve_decoders_kwargs(
    560     decode_cf,
    561     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    567     decode_coords=decode_coords,
    568 )
    570 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 571 backend_ds = backend.open_dataset(
    572     filename_or_obj,
    573     drop_variables=drop_variables,
    574     **decoders,
    575     **kwargs,
    576 )
    577 ds = _dataset_from_backend_dataset(
    578     backend_ds,
    579     filename_or_obj,
   (...)
    589     **kwargs,
    590 )
    591 return ds

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/kerchunk/xarray_backend.py:12, in KerchunkBackend.open_dataset(self, filename_or_obj, storage_options, open_dataset_options, **kw)
      8 def open_dataset(
      9     self, filename_or_obj, *, storage_options=None, open_dataset_options=None, **kw
     10 ):
     11     open_dataset_options = (open_dataset_options or {}) | kw
---> 12     ref_ds = open_reference_dataset(
     13         filename_or_obj,
     14         storage_options=storage_options,
     15         open_dataset_options=open_dataset_options,
     16     )
     17     return ref_ds

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/kerchunk/xarray_backend.py:46, in open_reference_dataset(filename_or_obj, storage_options, open_dataset_options)
     42     open_dataset_options = {}
     44 m = fsspec.get_mapper("reference://", fo=filename_or_obj, **storage_options)
---> 46 return xr.open_dataset(m, engine="zarr", consolidated=False, **open_dataset_options)

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/backends/api.py:571, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    559 decoders = _resolve_decoders_kwargs(
    560     decode_cf,
    561     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    567     decode_coords=decode_coords,
    568 )
    570 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 571 backend_ds = backend.open_dataset(
    572     filename_or_obj,
    573     drop_variables=drop_variables,
    574     **decoders,
    575     **kwargs,
    576 )
    577 ds = _dataset_from_backend_dataset(
    578     backend_ds,
    579     filename_or_obj,
   (...)
    589     **kwargs,
    590 )
    591 return ds

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/backends/zarr.py:1182, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, zarr_version, store, engine)
   1180 store_entrypoint = StoreBackendEntrypoint()
   1181 with close_on_error(store):
-> 1182     ds = store_entrypoint.open_dataset(
   1183         store,
   1184         mask_and_scale=mask_and_scale,
   1185         decode_times=decode_times,
   1186         concat_characters=concat_characters,
   1187         decode_coords=decode_coords,
   1188         drop_variables=drop_variables,
   1189         use_cftime=use_cftime,
   1190         decode_timedelta=decode_timedelta,
   1191     )
   1192 return ds

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/backends/store.py:43, in StoreBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     29 def open_dataset(  # type: ignore[override]  # allow LSP violation, not supporting **kwargs
     30     self,
     31     filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
   (...)
     39     decode_timedelta=None,
     40 ) -> Dataset:
     41     assert isinstance(filename_or_obj, AbstractDataStore)
---> 43     vars, attrs = filename_or_obj.load()
     44     encoding = filename_or_obj.get_encoding()
     46     vars, attrs, coord_names = conventions.decode_cf_variables(
     47         vars,
     48         attrs,
   (...)
     55         decode_timedelta=decode_timedelta,
     56     )

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/backends/common.py:221, in AbstractDataStore.load(self)
    199 def load(self):
    200     """
    201     This loads the variables and attributes simultaneously.
    202     A centralized loading function makes it easier to create
   (...)
    218     are requested, so care should be taken to make sure its fast.
    219     """
    220     variables = FrozenDict(
--> 221         (_decode_variable_name(k), v) for k, v in self.get_variables().items()
    222     )
    223     attributes = FrozenDict(self.get_attrs())
    224     return variables, attributes

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/backends/zarr.py:563, in ZarrStore.get_variables(self)
    562 def get_variables(self):
--> 563     return FrozenDict(
    564         (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
    565     )

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/core/utils.py:443, in FrozenDict(*args, **kwargs)
    442 def FrozenDict(*args, **kwargs) -> Frozen:
--> 443     return Frozen(dict(*args, **kwargs))

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/backends/zarr.py:563, in <genexpr>(.0)
    562 def get_variables(self):
--> 563     return FrozenDict(
    564         (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
    565     )

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/zarr/hierarchy.py:691, in Group._array_iter(self, keys_only, method, recurse)
    689 if contains_array(self._store, path):
    690     _key = key.rstrip("/")
--> 691     yield _key if keys_only else (_key, self[key])
    692 elif recurse and contains_group(self._store, path):
    693     group = self[key]

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/zarr/hierarchy.py:467, in Group.__getitem__(self, item)
    465 path = self._item_path(item)
    466 try:
--> 467     return Array(
    468         self._store,
    469         read_only=self._read_only,
    470         path=path,
    471         chunk_store=self._chunk_store,
    472         synchronizer=self._synchronizer,
    473         cache_attrs=self.attrs.cache,
    474         zarr_version=self._version,
    475         meta_array=self._meta_array,
    476     )
    477 except ArrayNotFoundError:
    478     pass

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/zarr/core.py:170, in Array.__init__(self, store, path, read_only, chunk_store, synchronizer, cache_metadata, cache_attrs, partial_decompress, write_empty_chunks, zarr_version, meta_array)
    167     self._metadata_key_suffix = self._hierarchy_metadata["metadata_key_suffix"]
    169 # initialize metadata
--> 170 self._load_metadata()
    172 # initialize attributes
    173 akey = _prefix_to_attrs_key(self._store, self._key_prefix)

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/zarr/core.py:193, in Array._load_metadata(self)
    191 """(Re)load metadata from store."""
    192 if self._synchronizer is None:
--> 193     self._load_metadata_nosync()
    194 else:
    195     mkey = _prefix_to_array_key(self._store, self._key_prefix)

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/zarr/core.py:207, in Array._load_metadata_nosync(self)
    204     raise ArrayNotFoundError(self._path) from e
    205 else:
    206     # decode and store metadata as instance members
--> 207     meta = self._store._metadata_class.decode_array_metadata(meta_bytes)
    208     self._meta = meta
    209     self._shape = meta["shape"]

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/zarr/meta.py:141, in Metadata2.decode_array_metadata(cls, s)
    139         meta["dimension_separator"] = dimension_separator
    140 except Exception as e:
--> 141     raise MetadataError("error decoding metadata") from e
    142 else:
    143     return meta

MetadataError: error decoding metadata

At first I assumed there was something wrong with our handling of the loaded cftime_variables, but actually even if I drop the 'Time' variable I still get exactly the same error:

vds = open_virtual_dataset(
    's3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc',
    indexes={},
    drop_variables=['Time'],
)

I don't know why it's even trying to convert anything to a datetime - none of the other variables have units of time.

What's also weird is that this is raised from within meta.py:260, in Metadata2.decode_fill_value(cls, v, dtype, object_codec), which suggests a problem with the fill_value. But I checked, and every variable in this virtual dataset has a fill_value of either a float or NaN in its .encoding; again, nothing to do with datetimes.
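For what it's worth, the failing zarr call can be reproduced in isolation. Per the traceback above, Metadata2.decode_fill_value (meta.py:260) boils down to `np.array(v, dtype=dtype)[()]`, and a plain float fill value cannot be coerced to a datetime64 dtype that way (a minimal sketch of that one line, not of the virtualizarr code path):

```python
import numpy as np

# zarr's Metadata2.decode_fill_value (meta.py:260 in the traceback above)
# effectively does: np.array(v, dtype=dtype)[()]
# A float fill value cannot be coerced to a datetime64 dtype this way:
err = None
try:
    np.array(0.0, dtype=np.dtype("datetime64[ns]"))[()]
except ValueError as e:
    err = e

print(type(err).__name__, err)
```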

TomNicholas commented 3 months ago

@jsignell summoning you in case you have any thoughts / ideas here

TomNicholas commented 2 months ago

@thodson-usgs got a similar-looking error in https://github.com/zarr-developers/VirtualiZarr/pull/203#issue-2436462556, but only on more recent versions of virtualizarr. There must be some kind of regression, which we should narrow down using git bisect.
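The bisect can be automated with `git bisect run`, which marks each commit good or bad from the exit code of a test command. A toy, self-contained session (throwaway repo and made-up history, a stand-in for bisecting VirtualiZarr itself with the failing snippet as the test command):

```python
import os
import subprocess
import tempfile

repo = tempfile.mkdtemp()

def git(*args):
    # Run git in the throwaway repo, with identity supplied inline.
    return subprocess.run(
        ["git", "-c", "user.name=demo", "-c", "user.email=demo@example.com", *args],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout

git("init", "-q")
for i in range(1, 6):  # five commits; the "regression" lands in commit 4
    with open(os.path.join(repo, "status.txt"), "w") as f:
        f.write("ok" if i < 4 else "broken")
    git("add", "status.txt")
    git("commit", "-q", "-m", f"commit {i}")

# Mark HEAD bad and the first commit good, then let git drive the search;
# grep exits 0 (good) while the file says "ok" and 1 (bad) once it doesn't.
git("bisect", "start", "HEAD", "HEAD~4")
out = git("bisect", "run", "grep", "-q", "ok", "status.txt")
print([line for line in out.splitlines() if "first bad commit" in line][0])
```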

jsignell commented 2 months ago

I am taking a look. Are you sure you got the same error when you dropped the Time variable? I am seeing an S3 access issue when I do that (which I take to mean I made it past the original error).

from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset(
    's3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc',
    indexes={},
    drop_variables=["Time"]
)

vds.virtualize.to_kerchunk("combined_no_t.json", format="json")
ds = xr.open_dataset('combined_no_t.json', engine="kerchunk")
Show more output:

```python-traceback
---------------------------------------------------------------------------
NoCredentialsError                        Traceback (most recent call last)
File ~/micromamba/envs/virtualizarr/lib/python3.12/site-packages/fsspec/asyn.py:245, in _run_coros_in_chunks.<locals>._run_coro(coro, i)
    244 try:
--> 245     return await asyncio.wait_for(coro, timeout=timeout), i
    246 except Exception as e:

[... s3fs / aiobotocore frames elided ...]

File ~/micromamba/envs/virtualizarr/lib/python3.12/site-packages/botocore/auth.py:418, in SigV4Auth.add_auth(self, request)
    417 if self.credentials is None:
--> 418     raise NoCredentialsError()
    419 datetime_now = datetime.datetime.utcnow()

NoCredentialsError: Unable to locate credentials

The above exception was the direct cause of the following exception:

ReferenceNotReachable                     Traceback (most recent call last)
Cell In[7], line 1
----> 1 ds = xr.open_dataset('combined_no_t.json', engine="kerchunk")

[... xarray / kerchunk / zarr indexing frames elided ...]

File ~/micromamba/envs/virtualizarr/lib/python3.12/site-packages/zarr/storage.py:1435, in FSStore.getitems(self, keys, contexts)
   1433 elif isinstance(v, Exception):
   1434     # Raise any other exception
-> 1435     raise v
```
thodson-usgs commented 2 months ago

btw, git bisect led me to 10bd53dc3dae08303e57fe5aefe49804d9c4517d. Maybe I can find the pre-squash branch and dig further tomorrow.

thodson-usgs commented 2 months ago

Here's the bug: https://github.com/zarr-developers/VirtualiZarr/blob/179bb2ab42664546dd243a2e04fab8737846229a/virtualizarr/zarr.py#L70

Reverting this line back to https://github.com/zarr-developers/VirtualiZarr/blob/0ad4de5c612d1d632c2acb07ecfad071756eccf4/virtualizarr/zarr.py#L47 causes my test to pass.

I propose changing this to

fill_value: FillValueT = Field(default=np.nan, validate_default=True)

which also passes.

TomAugspurger commented 2 months ago

AFAICT, 0.0 is the appropriate default fill value; that matches what zarr-python does. The line raising the exception is, I think, something like

In [28]: np.array([0.0], dtype=np.dtype("datetime64[ns]"))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[28], line 1
----> 1 np.array([0.0], dtype=np.dtype("datetime64[ns]"))

Called via

zarr.v2.meta.Metadata2.decode_fill_value(np.nan, np.dtype("datetime64[ns]"))

But that line fails with a fill value of both np.nan and 0.0. @thodson-usgs would you be able to get a debugger in there and see what the values of fill_value and dtype are both before and after https://github.com/zarr-developers/VirtualiZarr/commit/10bd53dc3dae08303e57fe5aefe49804d9c4517d? Or share a file somewhere public so I can take a look?

thodson-usgs commented 2 months ago

Thanks @TomAugspurger, I put an example back on #206. These might indeed be the same issue, but I want to be careful about crossing streams here.