xpublish-community / xpublish

Publish Xarray Datasets via a REST API.
https://xpublish.readthedocs.io
Apache License 2.0

Encoding chunks do not match inferred chunks #207

Open caiostringari opened 1 year ago

caiostringari commented 1 year ago

Hi,

I am having problems with the `/zarr/get_zarr_metadata` endpoint. I can start the server and see my dataset, but when I try to read data from the client side, I get ValueError: Encoding chunks do not match inferred chunks.
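
For reference, the client side connects along these lines (just a sketch of a typical fsspec-based connection; the server URL is a placeholder):

import xarray as xr
from fsspec.implementations.http import HTTPFileSystem

fs = HTTPFileSystem()
# placeholder address; point this at wherever the xpublish zarr routes are mounted
http_map = fs.get_mapper("http://0.0.0.0:9000/zarr/")
ds = xr.open_zarr(http_map, consolidated=True)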

I tried to explicitly change the chunks / encoding but it did not seem to work.

My code looks something like this:

import zarr
import xarray as xr
from azure.storage.blob import ContainerClient

container_client = ContainerClient("some_url", container_name="some_name", credential="some_credentials")
store = zarr.ABSStore(client=container_client, prefix="file.zarr")

ds = xr.open_zarr(store, consolidated=True, overwrite_encoded_chunks=True)  # do I need overwrite_encoded_chunks?
ds = ds.chunk({"time": 2**12, "feature_id": 2**16})  # do I need this?
ds.encoding = {"time": 2**12, "feature_id": 2**16} # do I need this?

Any ideas?

Thank you very much

abkfenris commented 1 year ago

Hi @caiostringari, I think you haven't heard from anyone because we may need some more info to start helping you debug.

xaviernogueira commented 1 year ago

@abkfenris So I came to Issues to point out a similar problem. Here is more information.

Bug description

Call Stack

    zjson = jsonify_zmetadata(dataset, zmetadata)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xrnogueira\Miniconda3\envs\catalog_to_xpublish_dev\Lib\site-packages\xpublish\utils\zarr.py", line 142, in jsonify_zmetadata
    compressor = zjson['metadata'][f'{key}/{array_meta_key}']['compressor']
                 ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'air/.zarray'

Background

Version

{
  "python": "3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 17:59:51) [MSC v.1935 64 bit (AMD64)]",
  "python-bits": 64,
  "OS": "Windows",
  "OS-release": "10",
  "Version": "10.0.19043",
  "machine": "AMD64",
  "processor": "Intel64 Family 6 Model 140 Stepping 1, GenuineIntel",
  "byteorder": "little",
  "LC_ALL": "None",
  "LANG": "en_US.UTF-8",
  "LOCALE": "English_United States.1252",
  "libhdf5": "1.14.0",
  "libnetcdf": "4.9.2",
  "xarray": "2023.5.0",
  "zarr": "2.15.0",
  "numcodecs": "0.11.0",
  "fastapi": "0.97.0",
  "starlette": "0.27.0",
  "pandas": "2.0.2",
  "numpy": "1.25.0",
  "dask": "2023.6.0",
  "distributed": "2023.6.0",
  "uvicorn": "0.22.0"
}

Datasets I tested with

OSN Dataset (KeyError)

conus404-hourly-osn:
    driver: zarr
    description: "CONUS404 - OSN pod storage, 70 TB, 40 years of hourly data, CONUS extent with 4 km gridded spatial resolution, 157 variables"
    args:
      urlpath: 's3://rsignellbucket2/hytest/conus404/conus404_hourly_202302.zarr'
      consolidated: true
      storage_options:
        anon: true
        requester_pays: false
        client_kwargs:
          endpoint_url: https://renc.osn.xsede.org

OSN Dataset (Works as expected!)

alaska-et-2020-subset-osn:
    driver: zarr
    description: "Sample subset - OSN pod storage, 863M, Gridded 20km Daily Reference Evapotranspiration for the State of Alaska from 1979 to 2017/CCSM4 historical simulation"
    args:
      urlpath: 's3://rsignellbucket2/nhgf/sample_data/ccsm4.zarr'
      consolidated: true
      storage_options:
        anon: true
        requester_pays: false
        client_kwargs:
          endpoint_url: https://renc.osn.xsede.org

S3 Dataset (ValueError)

conus404-hourly-s3:
    driver: zarr
    description: "CONUS404 - s3 storage, 70 TB, 40 years of hourly data, CONUS extent with 4 km gridded spatial resolution, 157 variables"
    args:
      urlpath: 's3://nhgf-development/conus404/conus404_hourly_202209.zarr'
      consolidated: true
      storage_options:
        requester_pays: true

abkfenris commented 1 year ago

Thanks @xaviernogueira that let me dig into it some. https://gist.github.com/abkfenris/23fe268eb3f3479919a267efe392e4a5

I didn't end up trying with requester pays (or the OSN dataset that works, for that matter), but I was able to reproduce the ValueError with the conus404 OSN dataset.

It looks like an encoding may be set on time, even though it's a numpy array rather than a dask array under the hood, which appears to be causing a mismatch.

If I yoink the .zmetadata directly from OSN, it's got chunks on time/.zarray:

{
    "metadata": {
        ...
        "time/.zarray": {
            "chunks": [
                46008
            ],
            "compressor": {
                "id": "zstd",
                "level": 9
            },
            "dtype": "<i8",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                368064
            ],
            "zarr_format": 2
        },
        "time/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "time"
            ],
            "calendar": "proleptic_gregorian",
            "standard_name": "time",
            "units": "hours since 1979-10-01 00:00:00"
        },
        ...
    }
}
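
For reference, one way to grab that consolidated metadata yourself (a sketch; the bucket path and endpoint come from the conus404-hourly-osn catalog entry above):

import json
import fsspec

fs = fsspec.filesystem(
    "s3", anon=True, client_kwargs={"endpoint_url": "https://renc.osn.xsede.org"}
)
with fs.open("rsignellbucket2/hytest/conus404/conus404_hourly_202302.zarr/.zmetadata") as f:
    zmeta = json.load(f)

# print the time array metadata quoted above
print(json.dumps(zmeta["metadata"]["time/.zarray"], indent=4))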

This is probably over my head for Zarr specifics, so I'm not sure whether we should go with the encoded or the inferred chunks in this case, but maybe @jhamman has some thoughts.

xaviernogueira commented 1 year ago

@abkfenris

So it occurred to me that xarray.Dataset.unify_chunks() could potentially address this. I just tried it, and where I previously got the KeyError ("conus404-hourly-osn") I now get the ValueError instead. But then I realized I couldn't recreate the KeyError, which is weird.

caiostringari commented 1 year ago

@abkfenris,

How are you launching Xpublish for the datasets (ds.rest.serve() vs xpublish.Rest()...)? I am using

ds.rest(
    app_kws=dict(
        title="Some title here",
        description="Some description here.",
        openapi_url="/dataset.json",
    ),
    cache_kws=dict(available_bytes=1e9),  # this is 1 GB worth of cache.
)
followed by `ds.rest.serve()`.

What version of Xpublish, supporting libraries, and any plugins are you using (the output of /plugins and /versions would be fantastic)?

{
  "dataset_info": {
    "path": "xpublish.plugins.included.dataset_info.DatasetInfoPlugin",
    "version": "0.3.0"
  },
  "module_version": {
    "path": "xpublish.plugins.included.module_version.ModuleVersionPlugin",
    "version": "0.3.0"
  },
  "plugin_info": {
    "path": "xpublish.plugins.included.plugin_info.PluginInfoPlugin",
    "version": "0.3.0"
  },
  "zarr": {
    "path": "xpublish.plugins.included.zarr.ZarrPlugin",
    "version": "0.3.0"
  }
}
{
  "python": "3.11.3 | packaged by conda-forge | (main, Apr  6 2023, 08:57:19) [GCC 11.3.0]",
  "python-bits": 64,
  "OS": "Linux",
  "OS-release": "5.15.90.1-microsoft-standard-WSL2",
  "Version": "#1 SMP Fri Jan 27 02:56:13 UTC 2023",
  "machine": "x86_64",
  "processor": "x86_64",
  "byteorder": "little",
  "LC_ALL": "None",
  "LANG": "C.UTF-8",
  "LOCALE": "en_US.UTF-8",
  "libhdf5": "1.12.2",
  "libnetcdf": null,
  "xarray": "2023.4.2",
  "zarr": "2.14.2",
  "numcodecs": "0.11.0",
  "fastapi": "0.95.1",
  "starlette": "0.26.1",
  "pandas": "2.0.1",
  "numpy": "1.24.3",
  "dask": "2023.4.1",
  "uvicorn": "0.22.0"
}

Does this occur with other datasets (say the Xarray tutorial datasets, or others that we can try without credentials)? If I use ds = xr.tutorial.load_dataset('air_temperature') as an example, it works as expected.

Have you tried without overwrite_encoded_chunks and ds.chunk (were those in the docs for the dataset)? Yes, I tried with and without overwrite_encoded_chunks and ds.chunk, and with unify_chunks(). I always got the same error.

Is the server throwing the ValueError on request, or is the client (and how is your client configured to connect to the server)? The server is throwing the errors:

Traceback (most recent call last):
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xpublish/plugins/included/zarr.py", line 39, in get_zarr_metadata
    zmetadata = get_zmetadata(dataset, cache, zvariables)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xpublish/dependencies.py", line 97, in get_zmetadata
    zmeta = create_zmetadata(dataset)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xpublish/utils/zarr.py", line 126, in create_zmetadata
    zmeta['metadata'][f'{key}/{array_meta_key}'] = _extract_zarray(
                                                   ^^^^^^^^^^^^^^^^
  File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xpublish/utils/zarr.py", line 94, in _extract_zarray
    raise ValueError('Encoding chunks do not match inferred chunks')
ValueError: Encoding chunks do not match inferred chunks

Since my original post, I rebuilt my zarr file from scratch but got the same errors. The data is hosted on Azure, and time is encoded as well.

abkfenris commented 1 year ago

@caiostringari sorry it's taken a few days for me to take another look. Thanks for all that info, but I think we might need a bit more information about the dataset, its chunk geometry, and any encoding.

Can you try running this against your dataset? It's largely the guts of xpublish.utils.zarr.create_zmetadata, and it dumps the values created for each variable whenever _extract_zarray throws an error.

from xpublish.utils import zarr

for key, dvar in ds.variables.items():
    da = ds[key]
    encoded_da = zarr.encode_zarr_variable(dvar, name=key)
    encoding = zarr.extract_zarr_variable_encoding(dvar)
    zattrs = zarr._extract_dataarray_zattrs(encoded_da)
    zattrs = zarr._extract_dataarray_coords(da, zattrs)
    try:
        extracted_zarray = zarr._extract_zarray(
            encoded_da, encoding, encoded_da.dtype
        )
    except ValueError:
        print(f"{key=}, {dvar=}")
        print(f"{da=}")
        print(f"{encoded_da=}")
        print(f"{encoding=}")
        print(f"{da.encoding=}")
        print(f"{zattrs=}")

The top-level ds.encoding might also be useful too.

wachsylon commented 1 year ago

What helped in my case: after setting chunks with .chunk, instead of setting a new chunk encoding (e.g. ds.encoding = {"time": 2**12, "feature_id": 2**16}), simply remove all chunk encoding:

for var in ds.data_vars:
    del ds[var].encoding["chunks"]
for var in ds.coords:
    if "chunks" in ds[var].encoding:
        del ds[var].encoding["chunks"]

abkfenris commented 1 year ago

@jhamman mentioned that setting the encoding after specifying chunks is problematic in Xarray anyway and is something they are trying to move away from, and suggested trying ds = ds.reset_encoding().

caiostringari commented 1 year ago

Sorry for the delay,

@wachsylon's solution works for my dataset! =)

@abkfenris here are the outputs

key='feature_id', dvar=<xarray.IndexVariable 'feature_id' (feature_id: 2776738)>
array([       101,        179,        181, ..., 1180001802, 1180001803,
       1180001804], dtype=int32)
Attributes:
    cf_role:    timeseries_id
    comment:    NHDPlusv2 ComIDs within CONUS, arbitrary Reach IDs outside of...
    long_name:  Reach ID
da=<xarray.DataArray 'feature_id' (feature_id: 2776738)>
array([       101,        179,        181, ..., 1180001802, 1180001803,
       1180001804], dtype=int32)
Coordinates:
  * feature_id  (feature_id) int32 101 179 181 ... 1180001803 1180001804
Attributes:
    cf_role:    timeseries_id
    comment:    NHDPlusv2 ComIDs within CONUS, arbitrary Reach IDs outside of...
    long_name:  Reach ID
encoded_da=<xarray.Variable (feature_id: 2776738)>
array([       101,        179,        181, ..., 1180001802, 1180001803,
       1180001804], dtype=int32)
Attributes:
    cf_role:    timeseries_id
    comment:    NHDPlusv2 ComIDs within CONUS, arbitrary Reach IDs outside of...
    long_name:  Reach ID
encoding={'chunks': (173547,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (173547,), 'preferred_chunks': {'feature_id': 173547}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int32')}
zattrs={'cf_role': 'timeseries_id', 'comment': 'NHDPlusv2 ComIDs within CONUS, arbitrary Reach IDs outside of CONUS', 'long_name': 'Reach ID', '_ARRAY_DIMENSIONS': ['feature_id']}
key='nudge', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738nudge, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Amount of stream flow alteration
    units:         m3 s-1
    valid_range:   [-5000000, 5000000]
da=<xarray.DataArray 'nudge' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738nudge, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
  * feature_id  (feature_id) int32 101 179 181 ... 1180001803 1180001804
  * time        (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Amount of stream flow alteration
    units:         m3 s-1
    valid_range:   [-5000000, 5000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738nudge, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Amount of stream flow alteration
    units:         m3 s-1
    valid_range:   [-5000000, 5000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'Amount of stream flow alteration', 'units': 'm3 s-1', 'valid_range': [-5000000, 5000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}
key='qBtmVertRunoff', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBtmVertRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Runoff from bottom of soil to bucket
    units:         m3
    valid_range:   [0, 20000000]
da=<xarray.DataArray 'qBtmVertRunoff' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBtmVertRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
  * feature_id  (feature_id) int32 101 179 181 ... 1180001803 1180001804
  * time        (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Runoff from bottom of soil to bucket
    units:         m3
    valid_range:   [0, 20000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBtmVertRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Runoff from bottom of soil to bucket
    units:         m3
    valid_range:   [0, 20000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'Runoff from bottom of soil to bucket', 'units': 'm3', 'valid_range': [0, 20000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}
key='qBucket', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBucket, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Flux from gw bucket
    units:         m3 s-1
    valid_range:   [0, 2000000000]
da=<xarray.DataArray 'qBucket' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBucket, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
  * feature_id  (feature_id) int32 101 179 181 ... 1180001803 1180001804
  * time        (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Flux from gw bucket
    units:         m3 s-1
    valid_range:   [0, 2000000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBucket, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Flux from gw bucket
    units:         m3 s-1
    valid_range:   [0, 2000000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'Flux from gw bucket', 'units': 'm3 s-1', 'valid_range': [0, 2000000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}
key='qSfcLatRunoff', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qSfcLatRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Runoff from terrain routing
    units:         m3 s-1
    valid_range:   [0, 2000000000]
da=<xarray.DataArray 'qSfcLatRunoff' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qSfcLatRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
  * feature_id  (feature_id) int32 101 179 181 ... 1180001803 1180001804
  * time        (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Runoff from terrain routing
    units:         m3 s-1
    valid_range:   [0, 2000000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qSfcLatRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     Runoff from terrain routing
    units:         m3 s-1
    valid_range:   [0, 2000000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'Runoff from terrain routing', 'units': 'm3 s-1', 'valid_range': [0, 2000000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}
key='streamflow', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738streamflow, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     River Flow
    units:         m3 s-1
    valid_range:   [0, 5000000]
da=<xarray.DataArray 'streamflow' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738streamflow, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
  * feature_id  (feature_id) int32 101 179 181 ... 1180001803 1180001804
  * time        (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     River Flow
    units:         m3 s-1
    valid_range:   [0, 5000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738streamflow, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     River Flow
    units:         m3 s-1
    valid_range:   [0, 5000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'River Flow', 'units': 'm3 s-1', 'valid_range': [0, 5000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}
/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xarray/coding/times.py:618: RuntimeWarning: invalid value encountered in cast
  int_num = np.asarray(num, dtype=np.int64)
sys:1: SerializationWarning: saving variable time with floating point data as an integer dtype without any _FillValue to use for NaNs
/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xarray/coding/variables.py:510: RuntimeWarning: invalid value encountered in cast
  data = data.astype(dtype=dtype)
key='time', dvar=<xarray.IndexVariable 'time' (time: 7)>
array(['2023-07-18T09:00:00.000000000', '2023-07-18T10:00:00.000000000',
       '2023-07-18T11:00:00.000000000',                           'NaT',
                                 'NaT',                           'NaT',
       '2023-07-18T15:00:00.000000000'], dtype='datetime64[ns]')
da=<xarray.DataArray 'time' (time: 7)>
array(['2023-07-18T09:00:00.000000000', '2023-07-18T10:00:00.000000000',
       '2023-07-18T11:00:00.000000000',                           'NaT',
                                 'NaT',                           'NaT',
       '2023-07-18T15:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:00:00
encoded_da=<xarray.Variable (time: 7)>
array([ 1689670800000000000,  1689674400000000000,  1689678000000000000,
       -9223372036854775808, -9223372036854775808, -9223372036854775808,
        1689692400000000000])
Attributes:
    units:     nanoseconds since 1970-01-01
    calendar:  proleptic_gregorian
encoding={'chunks': (512,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512,), 'preferred_chunks': {'time': 512}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
zattrs={'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', '_ARRAY_DIMENSIONS': ['time']}
key='velocity', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738velocity, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     River Velocity
    units:         m s-1
    valid_range:   [0, 5000000]
da=<xarray.DataArray 'velocity' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738velocity, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
  * feature_id  (feature_id) int32 101 179 181 ... 1180001803 1180001804
  * time        (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     River Velocity
    units:         m s-1
    valid_range:   [0, 5000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738velocity, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
    coordinates:   latitude longitude
    grid_mapping:  crs
    long_name:     River Velocity
    units:         m s-1
    valid_range:   [0, 5000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'River Velocity', 'units': 'm s-1', 'valid_range': [0, 5000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}

abkfenris commented 1 year ago

Great! @caiostringari were you able to see if ds = ds.reset_encoding() worked for your dataset too?

As I kind of expected, it looks like your actual chunk size doesn't match your encoded one in all cases. At least from my glance through there, feature_id seems to be the troublemaker.

The encoding has chunks and preferred chunks defined ({'chunks': (173547,), 'preferred_chunks': {'feature_id': 173547}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int32')}), but dask instead went with dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738nudge, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>.

I'm guessing you don't need to be explicitly setting your time chunks after you open, unless you need to re-chunk.
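
If you do need to re-chunk, here's a rough sketch of what I'd try (chunk sizes taken from your snippets above; reset_encoding needs a fairly recent xarray):

import xarray as xr  # `store` is the ABSStore from your first snippet

# Option A: keep the on-disk chunking so the encoding and the dask chunks agree
ds = xr.open_zarr(store, consolidated=True)

# Option B: re-chunk, then drop the now-stale chunk encoding before serving
ds = ds.chunk({"time": 2**12, "feature_id": 2**16})
ds = ds.reset_encoding()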

kthyng commented 1 year ago

I have this same error and ds = ds.reset_encoding() did indeed fix the issue, so I'll incorporate that into my set up for the dataset.

Details: a 3D variable (Pair) has encoded chunks of 1 time step, but the test dataset I am using is read in eagerly and has 48 time steps, so the assumed chunks from var_chunks = da.shape in utils/zarr.py give 48 time steps as the chunk size, setting up this error.
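
A minimal, hypothetical reproduction of that situation (variable name and sizes are made up to mirror the description, and reduced to 1-D for brevity):

import numpy as np
import xarray as xr
from xpublish.utils.zarr import create_zmetadata

ds = xr.Dataset({"Pair": (("ocean_time",), np.zeros(48))})
ds["Pair"].encoding["chunks"] = (1,)  # stale chunk encoding carried over from the source file

# "Pair" is a plain numpy array, so the inferred chunks are da.shape -> (48,),
# which conflicts with the encoded (1,):
create_zmetadata(ds)  # ValueError: Encoding chunks do not match inferred chunks

# calling ds = ds.reset_encoding() first avoids the error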

kthyng commented 1 year ago

Though cf-xarray uses coordinates saved in encoding, so I worry about the consequences of doing this. @dcherian is it an issue for the functionality of cf-xarray to remove all encoding from a dataset?

dcherian commented 1 year ago

I would just copy the coordinates attribute over to attrs before reset_encoding.
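
Something along these lines, for example (a sketch, assuming "coordinates" lives in each variable's encoding as in the outputs above):

# copy the "coordinates" encoding into attrs so cf-xarray can still find it,
# then drop the rest of the encoding
for var in ds.variables.values():
    coords = var.encoding.get("coordinates")
    if coords is not None:
        var.attrs.setdefault("coordinates", coords)

ds = ds.reset_encoding()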

kthyng commented 1 year ago

Ah, I sometimes see "coordinates" as an attribute instead of in encoding, though I started moving it to encoding because I thought it sometimes wasn't recognized when stored in attributes. Should it work with cf-xarray equally well in either location?

kthyng commented 1 year ago

Passing chunks = {} to my xr.open_dataset() also avoided the situation, since I think the dask arrays then took on the chunk sizes specified in the encoding (I might be wrong about that).
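
For reference, that looks something like this (a sketch; the file name is a placeholder):

import xarray as xr

# chunks={} opens the file lazily; dask should then pick up the chunk sizes stored
# in the encoding ("preferred_chunks"), so encoded and inferred chunks agree
ds = xr.open_dataset("my_file.nc", chunks={})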