zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0

open_virtual_dataset returns some coordinates as data variables #189

Open ayushnag opened 3 months ago

ayushnag commented 3 months ago

The xr.Dataset constructed by open_virtual_dataset doesn't seem to correctly identify coordinates when a coordinate has more than one dimension. The bug seems to be in separate_coords on this line. One possible fix would be to read the coordinates attribute in each variable's .zattrs and maintain a set of all coordinate names.
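
A minimal sketch of that idea, assuming each variable's attributes are available as a plain dict and that the coordinates attribute follows the CF convention of a whitespace-separated list of names. The helper name collect_coord_names and the sample attribute values are hypothetical and only for illustration; this is not the actual separate_coords code.

```python
def collect_coord_names(zattrs_by_var):
    """Return every name listed in any variable's CF-style `coordinates` attribute.

    `zattrs_by_var` maps variable name -> attribute dict (hypothetical input shape).
    """
    coord_names = set()
    for attrs in zattrs_by_var.values():
        # The attribute is a whitespace-separated string; it may be absent.
        coord_names.update(attrs.get("coordinates", "").split())
    return coord_names


# Illustrative attributes only, not copied from ROMS_example.nc.
sample_zattrs = {
    "salt": {"coordinates": "lon_rho lat_rho s_rho ocean_time"},
    "zeta": {"coordinates": "lon_rho lat_rho ocean_time"},
    "lon_rho": {"units": "degree_east"},
}

coord_names = collect_coord_names(sample_zattrs)
# Anything never referenced as a coordinate stays a data variable.
data_var_names = set(sample_zattrs) - coord_names
```

With a set like this, a multi-dimensional variable such as lon_rho would be treated as a coordinate because another variable references it, regardless of how many dimensions it has.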

Here is a reproducible example:

>>> import xarray as xr
>>> xr.tutorial.open_dataset("ROMS_example.nc") 
<xarray.Dataset> Size: 19MB
Dimensions:     (ocean_time: 2, s_rho: 30, eta_rho: 191, xi_rho: 371)
Coordinates:
    Cs_r        (s_rho) float64 240B ...
    lon_rho     (eta_rho, xi_rho) float64 567kB ...
    hc          float64 8B ...
    h           (eta_rho, xi_rho) float64 567kB ...
    lat_rho     (eta_rho, xi_rho) float64 567kB ...
    Vtransform  int32 4B ...
  * ocean_time  (ocean_time) datetime64[ns] 16B 2001-08-01 2001-08-08
  * s_rho       (s_rho) float64 240B -0.9833 -0.95 -0.9167 ... -0.05 -0.01667
Dimensions without coordinates: eta_rho, xi_rho
Data variables:
    salt        (ocean_time, s_rho, eta_rho, xi_rho) float32 17MB ...
    zeta        (ocean_time, eta_rho, xi_rho) float32 567kB ...
Attributes: (12/34)
    file:              ../output_20yr_obc/2001/ocean_his_0015.nc
    format:            netCDF-4/HDF5 file
    Conventions:       CF-1.4
    type:              ROMS/TOMS history file
    title:             TXLA ROMS hindcast run with dyes and oxygen
    rst_file:          ../output_20yr_obc/2001/ocean_rst.nc
    ...                ...
    compiler_flags:    -heap-arrays -fp-model fast -mt_mpi -ip -O3 -msse2 -free
    tiling:            010x012
    history:           Tue Jul 24 11:04:43 2018: /opt/nco/ncks -D 4 -t 8 /cop...
    ana_file:          /home/d.kobashi/TXLA_ROMS_reana/Functionals/ana_btflux...
    CPP_options:       TXLA2, ANA_BPFLUX, ANA_BSFLUX, ANA_BTFLUX, ANA_NUDGCOE...
    NCO:               netCDF Operators version 4.7.6-alpha04 (Homepage = htt...
% wget https://github.com/pydata/xarray-data/raw/master/ROMS_example.nc
>>> from virtualizarr import open_virtual_dataset
>>> vds = open_virtual_dataset('ROMS_example.nc', indexes={})
>>> vds
<xarray.Dataset> Size: 19MB
Dimensions:     (ocean_time: 2, eta_rho: 191, xi_rho: 371, s_rho: 30)
Coordinates:
    s_rho       (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
    ocean_time  (ocean_time) float64 16B ManifestArray<shape=(2,), dtype=floa...
Dimensions without coordinates: eta_rho, xi_rho
Data variables:
    zeta        (ocean_time, eta_rho, xi_rho) float32 567kB ManifestArray<sha...
    lon_rho     (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
    Vtransform  int32 4B ManifestArray<shape=(), dtype=int32, chunks=()>
    Cs_r        (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
    hc          float64 8B ManifestArray<shape=(), dtype=float64, chunks=()>
    lat_rho     (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
    h           (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
    salt        (ocean_time, s_rho, eta_rho, xi_rho) float32 17MB ManifestArr...
Attributes: (12/34)
    CPP_options:       TXLA2, ANA_BPFLUX, ANA_BSFLUX, ANA_BTFLUX, ANA_NUDGCOE...
    Conventions:       CF-1.4
    NCO:               netCDF Operators version 4.7.6-alpha04 (Homepage = htt...
    NLM_LBC:           \nEDGE:    WEST   SOUTH  EAST   NORTH  \nzeta:    Che ...
    ana_file:          /home/d.kobashi/TXLA_ROMS_reana/Functionals/ana_btflux...
    avg_base:          ../output_20yr_obc/2001/ocean_avg
    ...                ...
    sta_file:          ocean_sta.nc
    svn_rev:            
    svn_url:           https:://myroms.org/svn/src
    tiling:            010x012
    title:             TXLA ROMS hindcast run with dyes and oxygen
    type:              ROMS/TOMS history file

Note that the underlying kerchunk JSON does carry this coordinate information: when you write the virtual dataset out as kerchunk references and open them with xarray, the coordinates are correct:

>>> refs = vds.virtualize.to_kerchunk(filepath=None, format="dict")
>>> xr.open_dataset("reference://", engine="zarr", chunks={}, backend_kwargs={"storage_options": {"fo": refs, "consolidated": False}})
<xarray.Dataset> Size: 19MB
Dimensions:     (s_rho: 30, eta_rho: 191, xi_rho: 371, ocean_time: 2)
Coordinates:
    Cs_r        (s_rho) float64 240B dask.array<chunksize=(30,), meta=np.ndarray>
    Vtransform  float64 8B ...
    h           (eta_rho, xi_rho) float64 567kB dask.array<chunksize=(191, 371), meta=np.ndarray>
    hc          float64 8B ...
    lat_rho     (eta_rho, xi_rho) float64 567kB dask.array<chunksize=(191, 371), meta=np.ndarray>
    lon_rho     (eta_rho, xi_rho) float64 567kB dask.array<chunksize=(191, 371), meta=np.ndarray>
  * ocean_time  (ocean_time) datetime64[ns] 16B 2001-08-01 2001-08-08
  * s_rho       (s_rho) float64 240B -0.9833 -0.95 -0.9167 ... -0.05 -0.01667
Dimensions without coordinates: eta_rho, xi_rho
Data variables:
    salt        (ocean_time, s_rho, eta_rho, xi_rho) float32 17MB dask.array<chunksize=(1, 15, 96, 186), meta=np.ndarray>
    zeta        (ocean_time, eta_rho, xi_rho) float32 567kB dask.array<chunksize=(1, 191, 371), meta=np.ndarray>
Attributes: (12/34)
    CPP_options:       TXLA2, ANA_BPFLUX, ANA_BSFLUX, ANA_BTFLUX, ANA_NUDGCOE...
    Conventions:       CF-1.4
    NCO:               netCDF Operators version 4.7.6-alpha04 (Homepage = htt...
    NLM_LBC:           \nEDGE:    WEST   SOUTH  EAST   NORTH  \nzeta:    Che ...
    ana_file:          /home/d.kobashi/TXLA_ROMS_reana/Functionals/ana_btflux...
    avg_base:          ../output_20yr_obc/2001/ocean_avg
    ...                ...
    sta_file:          ocean_sta.nc
    svn_rev:            
    svn_url:           https:://myroms.org/svn/src
    tiling:            010x012
    title:             TXLA ROMS hindcast run with dyes and oxygen
    type:              ROMS/TOMS history file

TomNicholas commented 1 day ago

I wouldn't say this issue is fully closed yet. See https://github.com/zarr-developers/VirtualiZarr/issues/281#issuecomment-2445526098 for an explanation. #191 closes an important part of it but #224 is also required.