xpublish-community / xpublish

Publish Xarray Datasets via a REST API.
https://xpublish.readthedocs.io
Apache License 2.0

Zarr Router drops associated coordinates from DataArray #175

Closed mpiannucci closed 1 year ago

mpiannucci commented 1 year ago

When serving a kerchunked dataset from AWS, not all coordinates are transmitted when the same dataset is accessed through the Zarr router; in this case, lat_rho and lon_rho are missing. The dataset used in the examples below is public.

Accessed with just xarray:

import xarray as xr
import fsspec

# open dataset as zarr object using fsspec reference file system and xarray
fs = fsspec.filesystem(
    "reference",
    fo='s3://nextgen-dmac/nos/nos.dbofs.fields.best.nc.zarr',
    remote_protocol='s3',
    remote_options={'anon': True, 'use_ssl': False},
    target_protocol='s3',
    target_options={'anon': True, 'use_ssl': False},
)
m = fs.get_mapper("")

ds = xr.open_dataset(m, engine="zarr", backend_kwargs=dict(consolidated=False),
                     chunks={'ocean_time': 1})
ds

[Image: Screenshot 2023-04-03 at 5 13 39 PM]

Accessed with xpublish's zarr router:

import xarray as xr
from fsspec.implementations.http import HTTPFileSystem

# We can access our API using fsspec's HTTPFileSystem
fs = HTTPFileSystem()

# The http mapper gives us a dict-like interface to the API, when xpublish is running at localhost:8090
http_map = fs.get_mapper("http://0.0.0.0:8090/datasets/dbofs")

ds = xr.open_zarr(http_map, consolidated=True)

ds.temp

[Image: Screenshot 2023-04-03 at 5 18 54 PM]

As a check, I logged the coords of the same variable on the xpublish side, which gives the following:

[Image: Screenshot 2023-04-03 at 8 40 57 PM]

So I am not sure where, but somewhere along the line the lat_rho and lon_rho coords are dropped by the zarr router. I am not sure if this is by design, and I can look into it, but wanted to raise it in case there is info I am missing before I dig too deeply.

abkfenris commented 1 year ago

That definitely doesn't seem right to me, but I've got less experience with the depths of the Zarr router.

It also looks like forecast_reference_time is getting dropped after a single level is selected.

I'm guessing it's not by design so feel free to start digging.

mpiannucci commented 1 year ago

Comparing the .zattrs that is specified by the source dataset for one variable:

 "temp\/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"ocean_time\",\"s_rho\",\"eta_rho\",\"xi_rho\"],\"cell_methods\":\"ocean_time: point\",\"coordinates\":\"lon_rho lat_rho s_rho ocean_time\",\"field\":\"temperature\",\"grid\":\"grid\",\"location\":\"face\",\"long_name\":\"potential temperature\",\"standard_name\":\"sea_water_potential_temperature\",\"time\":\"ocean_time\",\"units\":\"Celsius\"}",

to the one that the zarr router publishes:

        "temp/.zattrs": {
            "cell_methods": "ocean_time: point",
            "field": "temperature",
            "grid": "grid",
            "location": "face",
            "long_name": "potential temperature",
            "standard_name": "sea_water_potential_temperature",
            "time": "ocean_time",
            "units": "Celsius",
            "_ARRAY_DIMENSIONS": [
                "ocean_time",
                "s_rho",
                "eta_rho",
                "xi_rho"
            ]
        },

So it looks like all of the metadata for the array is returned except the coordinates attribute, for some reason.
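The comparison above can also be done mechanically. The following is a minimal sketch using only the standard library; the two attribute sets are copied from the snippets in this comment, and the variable names (`source_zattrs`, `router_zattrs`, `missing`, `changed`) are illustrative, not part of xpublish.

```python
import json

# In the kerchunk reference file the .zattrs value is itself a JSON string,
# so it has to be decoded before it can be compared.
source_zattrs = json.loads(
    '{"_ARRAY_DIMENSIONS":["ocean_time","s_rho","eta_rho","xi_rho"],'
    '"cell_methods":"ocean_time: point",'
    '"coordinates":"lon_rho lat_rho s_rho ocean_time",'
    '"field":"temperature","grid":"grid","location":"face",'
    '"long_name":"potential temperature",'
    '"standard_name":"sea_water_potential_temperature",'
    '"time":"ocean_time","units":"Celsius"}'
)

# Attributes as published by the Zarr router for the same variable.
router_zattrs = {
    "cell_methods": "ocean_time: point",
    "field": "temperature",
    "grid": "grid",
    "location": "face",
    "long_name": "potential temperature",
    "standard_name": "sea_water_potential_temperature",
    "time": "ocean_time",
    "units": "Celsius",
    "_ARRAY_DIMENSIONS": ["ocean_time", "s_rho", "eta_rho", "xi_rho"],
}

# Keys present in the source but absent from the router's response.
missing = sorted(set(source_zattrs) - set(router_zattrs))

# Keys present in both whose values differ.
changed = sorted(
    key
    for key in set(source_zattrs) & set(router_zattrs)
    if source_zattrs[key] != router_zattrs[key]
)

print(missing)  # ['coordinates']
print(changed)  # []
```

Only the `coordinates` key differs, which matches what the client sees: lat_rho and lon_rho are no longer associated with temp.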

abkfenris commented 1 year ago

The overall metadata is set up here:

https://github.com/xarray-contrib/xpublish/blob/5ab6dcb975de8e27273bdb641b97bdd75e6c4d42/xpublish/utils/zarr.py#L100-L115

But it looks like each variable is encoded here:

https://github.com/xarray-contrib/xpublish/blob/5ab6dcb975de8e27273bdb641b97bdd75e6c4d42/xpublish/utils/zarr.py#L40-L51

I believe Zarr v3 is also landing really soon, which will require a bit more restructuring (consolidated by default, removing some Python JSON parsing quirks from the spec), so if it's going to be a good bit of work, it may be worth waiting until v3 and making one push to get the router compliant. Or maybe we want to split off the V2 router into its own plugin...

mpiannucci commented 1 year ago

Yeah, I was looking at the same thing. So coordinates aren't included as attrs on the data array. If I encode them into the attrs separately, that might fix it.

But it seems that create_zmetadata(dataset) encodes a Variable and not a DataArray, and Variables only have dims and no coords.
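To illustrate the Variable versus DataArray distinction, here is a small sketch on a toy dataset (made-up shapes and zero-filled data, not the DBOFS dataset). The helper `attrs_with_coordinates` is hypothetical: it shows one way a `coordinates` attribute could be derived from a DataArray, and is not the xpublish implementation.

```python
import numpy as np
import xarray as xr

grid = ("eta_rho", "xi_rho")
ds = xr.Dataset(
    {"temp": (grid, np.zeros((2, 3)))},
    coords={
        "lat_rho": (grid, np.zeros((2, 3))),
        "lon_rho": (grid, np.zeros((2, 3))),
    },
)

variable = ds.variables["temp"]  # xarray.Variable: dims, attrs, data only
data_array = ds["temp"]          # xarray.DataArray: also carries coords

# A Variable has no notion of associated coordinates.
has_coords = hasattr(variable, "coords")
# The DataArray knows about the 2D coordinates that share its dimensions.
coord_names = sorted(str(name) for name in data_array.coords)


def attrs_with_coordinates(da):
    """Hypothetical helper: copy attrs and add a CF-style `coordinates` entry.

    Only non-dimension coordinates are listed; names are sorted so the
    result is deterministic.
    """
    attrs = dict(da.attrs)
    extra = sorted(str(name) for name in da.coords if name not in da.dims)
    if extra:
        attrs["coordinates"] = " ".join(extra)
    return attrs


print(has_coords)
print(coord_names)
print(attrs_with_coordinates(data_array))
```

Working from the DataArray makes the associated coordinate names available at encoding time, which is the information the Variable alone cannot provide.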

abkfenris commented 1 year ago

It might be worth looking into xarray.backends.zarr.ZarrStore.store() as that is what ds.to_zarr() ends up calling:

https://github.com/pydata/xarray/blob/d4db16699f30ad1dc3e6861601247abf4ac96567/xarray/backends/zarr.py#L545-L620

https://github.com/pydata/xarray/blob/d4db16699f30ad1dc3e6861601247abf4ac96567/xarray/backends/api.py#L1519C6-L1664

mpiannucci commented 1 year ago

I got it working by encoding attributes from the DataArray instead of the Variable itself. I'm not sure how you'll feel about it, but I will put up a PR and we can go from there.