Create virtual Zarr stores from archival data files using xarray syntax
MetadataError from ValueError: Could not convert object to NumPy datetime #201

TomNicholas commented 3 months ago

I'm trying to debug @thodson-usgs's example from (and originally

He is doing a whole serverless reduction of virtual references to multiple files (!!! - relevant to #123), but there seem to be some more basic errors to be fixed first.

Specifically, if I try to use virtualizarr on just one of his files this happens:

import xarray as xr
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset(
<xarray.Dataset> Size: 31MB
Dimensions:        (Time: 1, south_north: 250, west_east: 320,
                    interp_levels: 9, soil_layers_stag: 4)
    interp_levels  (interp_levels) float32 36B ManifestArray<shape=(9,), dtyp...
    Time           (Time) datetime64[ns] 8B 2060-01-01
Dimensions without coordinates: south_north, west_east, soil_layers_stag
Data variables: (12/39)
    SNOWH          (Time, south_north, west_east) float32 320kB ManifestArray...
    ACSNOW         (Time, south_north, west_east) float32 320kB ManifestArray...
    TSK            (Time, south_north, west_east) float32 320kB ManifestArray...
    XLONG          (south_north, west_east) float32 320kB ManifestArray<shape...
    T              (Time, interp_levels, south_north, west_east) float32 3MB ...
    XLAT           (south_north, west_east) float32 320kB ManifestArray<shape...
    ...             ...
    PSFC           (Time, south_north, west_east) float32 320kB ManifestArray...
    ALBEDO         (Time, south_north, west_east) float32 320kB ManifestArray...
    CLDFRA         (Time, interp_levels, south_north, west_east) float32 3MB ...
    SWDNB          (Time, south_north, west_east) float32 320kB ManifestArray...
    PW             (Time, south_north, west_east) float32 320kB ManifestArray...
    SH2O           (Time, soil_layers_stag, south_north, west_east) float32 1MB ManifestArray<shape=(1, 4, 250, 320), dtype=float32, chunks=(1, 4, 250, 32...
    data:     Downscaled CCSM4
    date:     Mon Oct 21 11:37:23 AKDT 2019
    format:   version 2
    info:     Alaska CASC
ds = xr.open_dataset('combined.json', engine="kerchunk")

At first I assumed there was something wrong with our handling of the loaded cftime_variables, but actually even if I drop the 'Time' variable I still get exactly the same error:

vds = open_virtual_dataset(

I don't know why it's even trying to convert anything to a datetime - none of the other variables have units of time.

What's also weird is that this is raised from within, in Metadata2.decode_fill_value(cls, v, dtype, object_codec), which suggests a problem with the fill_value. But I checked and all of the variables in this virtual dataset have a fill_value of either a float or nan in their .encoding, again nothing about a datetime.
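
Since the .encoding values look fine, one way to double-check is to look at what actually ends up in the serialized references rather than at the in-memory dataset. The following is only a rough sketch, assuming the version 1 kerchunk reference layout (a "refs" mapping whose ".zarray" entries are JSON strings). The helper name find_suspect_fill_values and the example references are made up for illustration; they are not part of virtualizarr or kerchunk.

```python
import json

# Strings that zarr v2 metadata uses to spell non-finite float fill values.
FLOAT_SENTINELS = {"NaN", "Infinity", "-Infinity"}


def find_suspect_fill_values(references):
    """List (array_name, dtype, fill_value) for datetime/timedelta arrays
    whose serialized fill_value is a float (hypothetical debugging helper)."""
    # Accept either the full reference dict or just its "refs" mapping.
    entries = references.get("refs", references)
    suspects = []
    for key, raw in entries.items():
        if key != ".zarray" and not key.endswith("/.zarray"):
            continue
        # Kerchunk stores array metadata as a JSON string; tolerate a dict too.
        meta = json.loads(raw) if isinstance(raw, str) else raw
        dtype = meta.get("dtype")
        fill_value = meta.get("fill_value")
        # zarr v2 dtype strings look like "<M8[ns]" (datetime) or "<m8[ns]" (timedelta).
        is_time_dtype = isinstance(dtype, str) and dtype[1:3] in ("M8", "m8")
        is_float_fill = isinstance(fill_value, float) or (
            isinstance(fill_value, str) and fill_value in FLOAT_SENTINELS
        )
        if is_time_dtype and is_float_fill:
            suspects.append((key.rpartition("/")[0], dtype, fill_value))
    return suspects


# Made-up references, only to show the shape of the input.
example_refs = {
    "version": 1,
    "refs": {
        "Time/.zarray": json.dumps(
            {"dtype": "<M8[ns]", "fill_value": 0.0, "shape": [1], "chunks": [1]}
        ),
        "SNOWH/.zarray": json.dumps(
            {"dtype": "<f4", "fill_value": "NaN", "shape": [1, 250, 320]}
        ),
        "Time/0": ["hypothetical.nc", 0, 8],
    },
}

print(find_suspect_fill_values(example_refs))
```

On a real file you would pass json.load(open("combined.json")) instead of example_refs. An empty result would not rule out a fill_value problem, it would only show that the written metadata is not pairing a datetime dtype with a float fill value.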

TomNicholas commented 3 months ago

@jsignell summoning you in case you have any thoughts / ideas here

TomNicholas commented 2 months ago

@thodson-usgs got a similar-looking error in, but only on more recent versions of virtualizarr. There must be some kind of regression, which we should narrow down using git bisect.

jsignell commented 2 months ago

I am taking a look. Are you sure you got the same error when you dropped the time component? I am seeing an S3 access issue when I do that (which I am taking to mean I made it past the original error).

from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset(

vds.virtualize.to_kerchunk("combined_no_t.json", format="json")
ds = xr.open_dataset('combined_no_t.json', engine="kerchunk")

thodson-usgs commented 2 months ago

btw, git bisect led me to 10bd53dc3dae08303e57fe5aefe49804d9c4517d. Maybe I can find the pre-squash branch and dig further tomorrow.

thodson-usgs commented 2 months ago

Here's the bug:

Reverting this line back to causes my test to pass.

I propose changing this to

fill_value: FillValueT = Field(default=np.nan, validate_default=True)

which also passes.

TomAugspurger commented 2 months ago

AFAICT, 0.0 is the appropriate default fill value. That matches what zarr-python does. The line raising an exception is, I think, something like

In [28]: np.array([0.0], dtype=np.dtype("datetime64[ns]"))
ValueError                                Traceback (most recent call last)
Cell In[28], line 1
----> 1 np.array([0.0], dtype=np.dtype("datetime64[ns]"))

Called via

zarr.v2.meta.Metadata2.decode_fill_value(np.nan, np.dtype("datetime64[ns]"))

But that line fails with a fill value of either np.nan or 0.0. @thodson-usgs would you be able to get a debugger in there and see what the values of fill_value and dtype are both before and after? Or share a file somewhere public so I can take a look?
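
To make the point above concrete, here is a small NumPy-only sketch of the cast in question. It does not go through zarr at all; try_cast is a throwaway helper written for this comment, and it only shows which candidate fill values NumPy itself will accept for a datetime64[ns] array.

```python
import numpy as np


def try_cast(value, dtype="datetime64[ns]"):
    """Return the one-element array if NumPy accepts the cast,
    otherwise the ValueError message as a string (illustrative helper)."""
    try:
        return np.array([value], dtype=np.dtype(dtype))
    except ValueError as err:
        return f"ValueError: {err}"


# Float fill values (0.0 and nan) versus values that do cast to datetime64.
for candidate in (0.0, float("nan"), 0, "NaT"):
    print(repr(candidate), "->", try_cast(candidate))
```

Both float candidates are rejected, which is why switching the default between 0.0 and np.nan should not matter if the dtype really is datetime64[ns]. An integer 0 (the epoch) or "NaT" does cast, so the open question is how a float fill value gets paired with a datetime dtype in the first place.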

thodson-usgs commented 2 months ago

Thanks @TomAugspurger, I put an example back on #206. These might indeed be the same issue, but I want to be careful about crossing streams here.