Open caiostringari opened 1 year ago
Hi @caiostringari I think you haven't heard from anyone, because we may need some more info to start helping you debug.
- How are you launching Xpublish for the datasets (ds.rest.serve() vs xpublish.Rest()...)?
- What version of Xpublish, supporting libraries, and any plugins are you using (the output of /plugins and /versions would be fantastic)?
- Does this occur with other datasets (say the Xarray tutorial datasets, or others that we can try without credentials)?
- Have you tried without overwrite_encoded_chunks and ds.chunk (were those in the docs for the dataset)?
- Is the server throwing the ValueError on request, or is the client (and how is your client configured to connect to the server)?
@abkfenris So I came to Issues to point out a similar problem. Here is more information.
The /zarr/.zmetadata and /zarr/{var}/{chunk} endpoints fail in one of two ways: either utils.zarr.jsonify_zmetadata() fails with a KeyError (see "Error code w/ OSN data" below), or utils.zarr._extract_zarray() fails with a ValueError as described by @caiostringari. Not sure why I get different issues when I try at different times; this to me suggests a caching issue? /zarr/.zgroup and /zarr/.zattrs work fine.
Error code w/ OSN data:
work fine. zjson = jsonify_zmetadata(dataset, zmetadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xrnogueira\Miniconda3\envs\catalog_to_xpublish_dev\Lib\site-packages\xpublish\utils\zarr.py", line 142, in jsonify_zmetadata
compressor = zjson['metadata'][f'{key}/{array_meta_key}']['compressor']
~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'air/.zarray'
{
"python": "3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 17:59:51) [MSC v.1935 64 bit (AMD64)]",
"python-bits": 64,
"OS": "Windows",
"OS-release": "10",
"Version": "10.0.19043",
"machine": "AMD64",
"processor": "Intel64 Family 6 Model 140 Stepping 1, GenuineIntel",
"byteorder": "little",
"LC_ALL": "None",
"LANG": "en_US.UTF-8",
"LOCALE": "English_United States.1252",
"libhdf5": "1.14.0",
"libnetcdf": "4.9.2",
"xarray": "2023.5.0",
"zarr": "2.15.0",
"numcodecs": "0.11.0",
"fastapi": "0.97.0",
"starlette": "0.27.0",
"pandas": "2.0.2",
"numpy": "1.25.0",
"dask": "2023.6.0",
"distributed": "2023.6.0",
"uvicorn": "0.22.0"
}
conus404-hourly-osn:
driver: zarr
description: "CONUS404 - OSN pod storage, 70 TB, 40 years of hourly data, CONUS extent with 4 km gridded spatial resolution, 157 variables"
args:
urlpath: 's3://rsignellbucket2/hytest/conus404/conus404_hourly_202302.zarr'
consolidated: true
storage_options:
anon: true
requester_pays: false
client_kwargs:
endpoint_url: https://renc.osn.xsede.org
alaska-et-2020-subset-osn:
driver: zarr
description: "Sample subset - OSN pod storage, 863M, Gridded 20km Daily Reference Evapotranspiration for the State of Alaska from 1979 to 2017/CCSM4 historical simulation"
args:
urlpath: 's3://rsignellbucket2/nhgf/sample_data/ccsm4.zarr'
consolidated: true
storage_options:
anon: true
requester_pays: false
client_kwargs:
endpoint_url: https://renc.osn.xsede.org
conus404-hourly-s3:
driver: zarr
description: "CONUS404 - s3 storage, 70 TB, 40 years of hourly data, CONUS extent with 4 km gridded spatial resolution, 157 variables"
args:
urlpath: 's3://nhgf-development/conus404/conus404_hourly_202209.zarr'
consolidated: true
storage_options:
requester_pays: true
Thanks @xaviernogueira that let me dig into it some. https://gist.github.com/abkfenris/23fe268eb3f3479919a267efe392e4a5
I didn't end up trying with requester pays (or the OSN that works for that matter), but I was able to reproduce the ValueError
with the OSN dataset.
It looks like an encoding may be set on time, even though it's a numpy array rather than a dask array under the hood, which appears to be causing a mismatch.
If I yoink the .zmetadata directly from OSN, it's got chunks on time/.zarray:
{
"metadata": {
...
"time/.zarray": {
"chunks": [
46008
],
"compressor": {
"id": "zstd",
"level": 9
},
"dtype": "<i8",
"fill_value": null,
"filters": null,
"order": "C",
"shape": [
368064
],
"zarr_format": 2
},
"time/.zattrs": {
"_ARRAY_DIMENSIONS": [
"time"
],
"calendar": "proleptic_gregorian",
"standard_name": "time",
"units": "hours since 1979-10-01 00:00:00"
},
...
}
}
This is probably over my head for Zarr specifics so I'm not sure if we should go for the encoded/inferred chunks in this case, but maybe @jhamman has some thoughts.
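One way to spot this kind of mismatch up front is to compare each variable's encoded chunks against the chunking xpublish would infer. A minimal sketch, where the toy dataset and its stale (5,) chunk encoding are made up for illustration:

```python
import numpy as np
import xarray as xr

# toy eager dataset with a stale chunk encoding, as if left over from a Zarr store
ds = xr.Dataset({"air": (("time",), np.arange(10.0))})
ds["air"].encoding["chunks"] = (5,)

mismatches = {}
for name in ds.variables:
    enc = ds[name].encoding.get("chunks")
    # for numpy-backed variables the inferred chunking is the full shape
    inferred = tuple(ds[name].shape)
    if enc is not None and tuple(enc) != inferred:
        mismatches[name] = (tuple(enc), inferred)

print(mismatches)  # variables whose encoding disagrees with the data
```

Running something like this before serving would flag the offending variables without having to go through the `/zarr/.zmetadata` endpoint.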
@abkfenris
So it occurred to me that xarray.Dataset.unify_chunks() could potentially address this. I just tried it, and where I previously got the KeyError ("conus404-hourly-osn") I now got the ValueError instead. But then I realized I couldn't recreate the KeyError, which is weird.
@abkfenris,
How are you launching Xpublish for the datasets (ds.rest.serve() vs xpublish.Rest()...)? I am using
ds.rest(
    app_kws=dict(
        title="Some title here",
        description="Some description here.",
        openapi_url="/dataset.json",
    ),
    cache_kws=dict(available_bytes=1e9),  # this is 1 GB worth of cache.
)
and then `ds.rest.serve()`.
What version of Xpublish, supporting libraries, and any plugins are you using (the output of /plugins and /versions would be fantastic)?
{
"dataset_info": {
"path": "xpublish.plugins.included.dataset_info.DatasetInfoPlugin",
"version": "0.3.0"
},
"module_version": {
"path": "xpublish.plugins.included.module_version.ModuleVersionPlugin",
"version": "0.3.0"
},
"plugin_info": {
"path": "xpublish.plugins.included.plugin_info.PluginInfoPlugin",
"version": "0.3.0"
},
"zarr": {
"path": "xpublish.plugins.included.zarr.ZarrPlugin",
"version": "0.3.0"
}
}
{
"python": "3.11.3 | packaged by conda-forge | (main, Apr 6 2023, 08:57:19) [GCC 11.3.0]",
"python-bits": 64,
"OS": "Linux",
"OS-release": "5.15.90.1-microsoft-standard-WSL2",
"Version": "#1 SMP Fri Jan 27 02:56:13 UTC 2023",
"machine": "x86_64",
"processor": "x86_64",
"byteorder": "little",
"LC_ALL": "None",
"LANG": "C.UTF-8",
"LOCALE": "en_US.UTF-8",
"libhdf5": "1.12.2",
"libnetcdf": null,
"xarray": "2023.4.2",
"zarr": "2.14.2",
"numcodecs": "0.11.0",
"fastapi": "0.95.1",
"starlette": "0.26.1",
"pandas": "2.0.1",
"numpy": "1.24.3",
"dask": "2023.4.1",
"uvicorn": "0.22.0"
}
Does this occur with other datasets (say the Xarray tutorial datasets, or others that we can try without credentials)?
If I use ds = xr.tutorial.load_dataset('air_temperature')
as an example, it works as expected.
Have you tried without overwrite_encoded_chunks and ds.chunk (were those in the docs for the dataset)?
Yes, I tried with and without overwrite_encoded_chunks and ds.chunk, and with unify_chunks(). I always got the same error.
Is the server throwing the ValueError on request, or is the client (and how is your client configured to connect to the server)? The server is throwing the errors:
Traceback (most recent call last):
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/fastapi/applications.py", line 276, in __call__
await super().__call__(scope, receive, send)
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
^^^^^^^^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/fastapi/routing.py", line 237, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xpublish/plugins/included/zarr.py", line 39, in get_zarr_metadata
zmetadata = get_zmetadata(dataset, cache, zvariables)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xpublish/dependencies.py", line 97, in get_zmetadata
zmeta = create_zmetadata(dataset)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xpublish/utils/zarr.py", line 126, in create_zmetadata
zmeta['metadata'][f'{key}/{array_meta_key}'] = _extract_zarray(
^^^^^^^^^^^^^^^^
File "/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xpublish/utils/zarr.py", line 94, in _extract_zarray
raise ValueError('Encoding chunks do not match inferred chunks')
ValueError: Encoding chunks do not match inferred chunks
Since my original post, I rebuilt my zarr file from scratch but got the same errors. The data is hosted on Azure, and time is encoded as well.
@caiostringari sorry it's taken a few days for me to take another look. Thanks for all that info, but I think we might need a bit more info about the dataset, its chunk geometry, and any encoding.
Can you try running this against your dataset? It's largely the guts of xpublish.utils.zarr.create_zmetadata, and it dumps the values computed for each variable if _extract_zarray throws an error.
from xpublish.utils import zarr

for key, dvar in ds.variables.items():
    da = ds[key]
    encoded_da = zarr.encode_zarr_variable(dvar, name=key)
    encoding = zarr.extract_zarr_variable_encoding(dvar)
    zattrs = zarr._extract_dataarray_zattrs(encoded_da)
    zattrs = zarr._extract_dataarray_coords(da, zattrs)
    try:
        extracted_zarray = zarr._extract_zarray(
            encoded_da, encoding, encoded_da.dtype
        )
    except ValueError:
        print(f"{key=}, {dvar=}")
        print(f"{da=}")
        print(f"{encoded_da=}")
        print(f"{encoding=}")
        print(f"{da.encoding=}")
        print(f"{zattrs=}")
The top level ds.encoding might also be useful too.
What helped in my case: after setting chunks with .chunk, instead of setting a new chunk encoding (~~ds.encoding = {"time": 2**12, "feature_id": 2**16}~~), simply remove all chunk encoding:
for var in ds.data_vars:
    del ds[var].encoding["chunks"]
for var in ds.coords:
    if "chunks" in ds[var].encoding:
        del ds[var].encoding["chunks"]
@jhamman mentioned that setting the encoding after specifying chunks is problematic in Xarray anyway and is something they are trying to move away from, and suggested trying ds = ds.reset_encoding().
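A minimal sketch of that suggestion (the toy dataset and its encoding are made up; note that newer xarray renames reset_encoding to drop_encoding, so the fallback below hedges between the two):

```python
import numpy as np
import xarray as xr

# toy dataset with stale on-disk encoding attached
ds = xr.Dataset({"air": (("time",), np.arange(4.0))})
ds["air"].encoding = {"chunks": (1,), "dtype": "float32"}

# drop all encoding before handing the dataset to xpublish
if hasattr(ds, "drop_encoding"):
    ds = ds.drop_encoding()
else:
    ds = ds.reset_encoding()

print(ds["air"].encoding)  # {}
```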
Sorry for the delay.
@wachsylon's solution works for my dataset! =)
@abkfenris here are the outputs:
key='feature_id', dvar=<xarray.IndexVariable 'feature_id' (feature_id: 2776738)>
array([ 101, 179, 181, ..., 1180001802, 1180001803,
1180001804], dtype=int32)
Attributes:
cf_role: timeseries_id
comment: NHDPlusv2 ComIDs within CONUS, arbitrary Reach IDs outside of...
long_name: Reach ID
da=<xarray.DataArray 'feature_id' (feature_id: 2776738)>
array([ 101, 179, 181, ..., 1180001802, 1180001803,
1180001804], dtype=int32)
Coordinates:
* feature_id (feature_id) int32 101 179 181 ... 1180001803 1180001804
Attributes:
cf_role: timeseries_id
comment: NHDPlusv2 ComIDs within CONUS, arbitrary Reach IDs outside of...
long_name: Reach ID
encoded_da=<xarray.Variable (feature_id: 2776738)>
array([ 101, 179, 181, ..., 1180001802, 1180001803,
1180001804], dtype=int32)
Attributes:
cf_role: timeseries_id
comment: NHDPlusv2 ComIDs within CONUS, arbitrary Reach IDs outside of...
long_name: Reach ID
encoding={'chunks': (173547,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (173547,), 'preferred_chunks': {'feature_id': 173547}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int32')}
zattrs={'cf_role': 'timeseries_id', 'comment': 'NHDPlusv2 ComIDs within CONUS, arbitrary Reach IDs outside of CONUS', 'long_name': 'Reach ID', '_ARRAY_DIMENSIONS': ['feature_id']}
key='nudge', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738nudge, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Amount of stream flow alteration
units: m3 s-1
valid_range: [-5000000, 5000000]
da=<xarray.DataArray 'nudge' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738nudge, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
* feature_id (feature_id) int32 101 179 181 ... 1180001803 1180001804
* time (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Amount of stream flow alteration
units: m3 s-1
valid_range: [-5000000, 5000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738nudge, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Amount of stream flow alteration
units: m3 s-1
valid_range: [-5000000, 5000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'Amount of stream flow alteration', 'units': 'm3 s-1', 'valid_range': [-5000000, 5000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}
key='qBtmVertRunoff', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBtmVertRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Runoff from bottom of soil to bucket
units: m3
valid_range: [0, 20000000]
da=<xarray.DataArray 'qBtmVertRunoff' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBtmVertRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
* feature_id (feature_id) int32 101 179 181 ... 1180001803 1180001804
* time (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Runoff from bottom of soil to bucket
units: m3
valid_range: [0, 20000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBtmVertRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Runoff from bottom of soil to bucket
units: m3
valid_range: [0, 20000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'Runoff from bottom of soil to bucket', 'units': 'm3', 'valid_range': [0, 20000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}
key='qBucket', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBucket, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Flux from gw bucket
units: m3 s-1
valid_range: [0, 2000000000]
da=<xarray.DataArray 'qBucket' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBucket, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
* feature_id (feature_id) int32 101 179 181 ... 1180001803 1180001804
* time (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Flux from gw bucket
units: m3 s-1
valid_range: [0, 2000000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qBucket, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Flux from gw bucket
units: m3 s-1
valid_range: [0, 2000000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'Flux from gw bucket', 'units': 'm3 s-1', 'valid_range': [0, 2000000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}
key='qSfcLatRunoff', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qSfcLatRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Runoff from terrain routing
units: m3 s-1
valid_range: [0, 2000000000]
da=<xarray.DataArray 'qSfcLatRunoff' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qSfcLatRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
* feature_id (feature_id) int32 101 179 181 ... 1180001803 1180001804
* time (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Runoff from terrain routing
units: m3 s-1
valid_range: [0, 2000000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738qSfcLatRunoff, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: Runoff from terrain routing
units: m3 s-1
valid_range: [0, 2000000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'Runoff from terrain routing', 'units': 'm3 s-1', 'valid_range': [0, 2000000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}
key='streamflow', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738streamflow, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: River Flow
units: m3 s-1
valid_range: [0, 5000000]
da=<xarray.DataArray 'streamflow' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738streamflow, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
* feature_id (feature_id) int32 101 179 181 ... 1180001803 1180001804
* time (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: River Flow
units: m3 s-1
valid_range: [0, 5000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738streamflow, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: River Flow
units: m3 s-1
valid_range: [0, 5000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'River Flow', 'units': 'm3 s-1', 'valid_range': [0, 5000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}
/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xarray/coding/times.py:618: RuntimeWarning: invalid value encountered in cast
int_num = np.asarray(num, dtype=np.int64)
sys:1: SerializationWarning: saving variable time with floating point data as an integer dtype without any _FillValue to use for NaNs
/home/cstringari/mambaforge/envs/nwm-api/lib/python3.11/site-packages/xarray/coding/variables.py:510: RuntimeWarning: invalid value encountered in cast
data = data.astype(dtype=dtype)
key='time', dvar=<xarray.IndexVariable 'time' (time: 7)>
array(['2023-07-18T09:00:00.000000000', '2023-07-18T10:00:00.000000000',
'2023-07-18T11:00:00.000000000', 'NaT',
'NaT', 'NaT',
'2023-07-18T15:00:00.000000000'], dtype='datetime64[ns]')
da=<xarray.DataArray 'time' (time: 7)>
array(['2023-07-18T09:00:00.000000000', '2023-07-18T10:00:00.000000000',
'2023-07-18T11:00:00.000000000', 'NaT',
'NaT', 'NaT',
'2023-07-18T15:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:00:00
encoded_da=<xarray.Variable (time: 7)>
array([ 1689670800000000000, 1689674400000000000, 1689678000000000000,
-9223372036854775808, -9223372036854775808, -9223372036854775808,
1689692400000000000])
Attributes:
units: nanoseconds since 1970-01-01
calendar: proleptic_gregorian
encoding={'chunks': (512,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512,), 'preferred_chunks': {'time': 512}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
zattrs={'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', '_ARRAY_DIMENSIONS': ['time']}
key='velocity', dvar=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738velocity, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: River Velocity
units: m s-1
valid_range: [0, 5000000]
da=<xarray.DataArray 'velocity' (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738velocity, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Coordinates:
* feature_id (feature_id) int32 101 179 181 ... 1180001803 1180001804
* time (time) datetime64[ns] 2023-07-18T09:00:00 ... 2023-07-18T15:0...
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: River Velocity
units: m s-1
valid_range: [0, 5000000]
encoded_da=<xarray.Variable (time: 7, feature_id: 2776738)>
dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738velocity, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>
Attributes:
coordinates: latitude longitude
grid_mapping: crs
long_name: River Velocity
units: m s-1
valid_range: [0, 5000000]
encoding={'chunks': (512, 65536), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None}
da.encoding={'chunks': (512, 65536), 'preferred_chunks': {'time': 512, 'feature_id': 65536}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
zattrs={'coordinates': 'latitude longitude', 'grid_mapping': 'crs', 'long_name': 'River Velocity', 'units': 'm s-1', 'valid_range': [0, 5000000], '_ARRAY_DIMENSIONS': ['time', 'feature_id']}
Great! @caiostringari were you able to see if ds = ds.reset_encoding() worked for your dataset too?
As I kind of expected, it looks like your actual chunk size doesn't match your encoded one in all cases. At least from my glance through there, feature_id seems to be the troublemaker.
The encoding has chunks and preferred chunks defined ({'chunks': (173547,), 'preferred_chunks': {'feature_id': 173547}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int32')}), but dask instead went with dask.array<open_dataset-20bffbb6c974fc573210bff50b7b4738nudge, shape=(7, 2776738), dtype=float64, chunksize=(7, 65536), chunktype=numpy.ndarray>.
I'm guessing you don't need to be explicitly setting your time chunks after you open, unless you need to re-chunk.
I have this same error, and ds = ds.reset_encoding() did indeed fix the issue, so I'll incorporate that into my setup for the dataset.
Details: a 3D variable (Pair) has encoded chunks of 1 time step, but the test dataset I am using is read in eagerly and has 48 time steps, so the chunks assumed from var_chunks = da.shape in utils/zarr.py give 48 time steps as the chunk size, setting up this error to occur.
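That failure mode can be reproduced with a toy stand-in for the check described above (the function below is an illustrative paraphrase, not xpublish's exact internals):

```python
import numpy as np

def extract_chunks(data, encoding_chunks):
    # illustrative paraphrase: an eager numpy array is inferred as a single
    # chunk spanning the whole array (var_chunks = da.shape)
    inferred_chunks = tuple(data.shape)
    if encoding_chunks and tuple(encoding_chunks) != inferred_chunks:
        raise ValueError("Encoding chunks do not match inferred chunks")
    return tuple(encoding_chunks) if encoding_chunks else inferred_chunks

pair = np.zeros((48, 4, 4))  # eagerly loaded test variable: 48 time steps
try:
    extract_chunks(pair, (1, 4, 4))  # on-disk encoding said 1 step per chunk
except ValueError as err:
    print(err)  # Encoding chunks do not match inferred chunks
```

Dropping the stale encoding (or opening the dataset lazily so the dask chunks follow the encoding) removes the disagreement.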
Though cf-xarray uses coordinates saved in encoding, so I worry about the consequences of doing this. @dcherian is it an issue for the functionality of cf-xarray to remove all encoding from a dataset?
I would just copy the coordinates attribute over to attrs before reset_encoding.
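Putting that suggestion into code, a sketch that stashes any "coordinates" entry in attrs before wiping the encoding (the toy variable and its encoding are made up for illustration):

```python
import numpy as np
import xarray as xr

# toy variable whose encoding carries a "coordinates" entry, as cf-xarray expects
ds = xr.Dataset({"streamflow": (("time",), np.arange(4.0))})
ds["streamflow"].encoding = {
    "chunks": (2,),
    "coordinates": "latitude longitude",
}

# copy the coordinates entry over to attrs, then drop all encoding
for name in ds.variables:
    coords = ds[name].encoding.get("coordinates")
    if coords is not None:
        ds[name].attrs["coordinates"] = coords
    ds[name].encoding = {}
```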
Ah, I sometimes see "coordinates" as an attribute instead of in encoding. I started moving it to encoding because I thought it sometimes wasn't recognized when left in attributes. Should it work with cf-xarray equally well in either location?
Passing chunks={} to my xr.open_dataset() also avoided the situation, since I think the dask arrays then took on the chunk sizes specified in the encoding (I might be wrong about that).
Hi,
I am having problems with the /zarr/get_zarr_metadata endpoint. I can start the server and I see my dataset, but when I try to read data from the client side, I get ValueError: Encoding chunks do not match inferred chunks. I tried to explicitly change the chunks / encoding but it did not seem to work.
My code looks something like this:
Any ideas?
Thank you very much