zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

xarray.merge virtual datasets fails because of missing chunk managers #141

Open ghidalgo3 opened 3 weeks ago

ghidalgo3 commented 3 weeks ago

I'm trying to build a single virtual datastore from a collection of NetCDF files with VirtualiZarr using the files from the Daymet dataset on the Microsoft Planetary Computer. There is a NetCDF file for each (year, variable) combination, so I'm thinking that to build a single datastore I need to

  1. Open all the datasets
  2. xr.concat on time
  3. xr.merge on variables
  4. Write the virtual Zarr store to disk or Azure Storage.

Roughly here's what I'm doing:

import virtualizarr
import xarray as xr

# all_blobs and storage_options are defined elsewhere (not shown)
daily = [blob for blob in all_blobs if blob.startswith("2129")
         and "_hi_" in blob]  # only Hawaii
paths = [f"az://test-update/{blob}" for blob in daily]

def open_daymet_dataset(path, storage_options) -> xr.Dataset:
    return virtualizarr.open_virtual_dataset(
        path,
        reader_options={"storage_options": storage_options},
        drop_variables=["lambert_conformal_conic"])

datasets = [open_daymet_dataset(path, storage_options) for path in paths if "1980" in path or "1981" in path]

# concat works!
dayl_concat = xr.concat([datasets[0], datasets[1]], dim="time")
prcp_concat = xr.concat([datasets[2], datasets[3]], dim="time")
srad_concat = xr.concat([datasets[4], datasets[5]], dim="time")

# merge fails!
merged = xr.merge([dayl_concat, prcp_concat, srad_concat])

And here is the stack trace from xr.merge:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[31], line 1
----> 1 merged = xr.merge([dayl_concat, prcp_concat, srad_concat])

File ~/code/learn/.venv/lib/python3.10/site-packages/xarray/core/merge.py:976, in merge(objects, compat, join, fill_value, combine_attrs)
    973         obj = obj.to_dataset()
    974     dict_like_objects.append(obj)
--> 976 merge_result = merge_core(
    977     dict_like_objects,
    978     compat,
    979     join,
    980     combine_attrs=combine_attrs,
    981     fill_value=fill_value,
    982 )
    983 return Dataset._construct_direct(**merge_result._asdict())

File ~/code/learn/.venv/lib/python3.10/site-packages/xarray/core/merge.py:701, in merge_core(objects, compat, join, combine_attrs, priority_arg, explicit_coords, indexes, fill_value, skip_align_args)
    699 collected = collect_variables_and_indexes(aligned, indexes=indexes)
    700 prioritized = _get_priority_vars_and_indexes(aligned, priority_arg, compat=compat)
--> 701 variables, out_indexes = merge_collected(
    702     collected, prioritized, compat=compat, combine_attrs=combine_attrs
    703 )
    705 dims = calculate_dimensions(variables)
    707 coord_names, noncoord_names = determine_coords(coerced)

File ~/code/learn/.venv/lib/python3.10/site-packages/xarray/core/merge.py:290, in merge_collected(grouped, prioritized, compat, combine_attrs, equals)
    288 variables = [variable for variable, _ in elements_list]
    289 try:
--> 290     merged_vars[name] = unique_variable(
    291         name, variables, compat, equals.get(name, None)
    292     )
    293 except MergeError:
    294     if compat != "minimal":
    295         # we need more than "minimal" compatibility (for which
    296         # we drop conflicting coordinates)

File ~/code/learn/.venv/lib/python3.10/site-packages/xarray/core/merge.py:137, in unique_variable(name, variables, compat, equals)
    133         break
    135 if equals is None:
    136     # now compare values with minimum number of computes
--> 137     out = out.compute()
    138     for var in variables[1:]:
    139         equals = getattr(out, compat)(var)

File ~/code/learn/.venv/lib/python3.10/site-packages/xarray/core/variable.py:998, in Variable.compute(self, **kwargs)
    980 """Manually trigger loading of this variable's data from disk or a
    981 remote source into memory and return a new variable. The original is
    982 left unaltered.
   (...)
    995 dask.array.compute
    996 """
    997 new = self.copy(deep=False)
--> 998 return new.load(**kwargs)

File ~/code/learn/.venv/lib/python3.10/site-packages/xarray/core/variable.py:976, in Variable.load(self, **kwargs)
    959 def load(self, **kwargs):
    960     """Manually trigger loading of this variable's data from disk or a
    961     remote source into memory and return this variable.
    962
   (...)
    974     dask.array.compute
    975     """
--> 976     self._data = to_duck_array(self._data, **kwargs)
    977     return self

File ~/code/learn/.venv/lib/python3.10/site-packages/xarray/namedarray/pycompat.py:129, in to_duck_array(data, **kwargs)
    126 from xarray.namedarray.parallelcompat import get_chunked_array_type
    128 if is_chunked_array(data):
--> 129     chunkmanager = get_chunked_array_type(data)
    130     loaded_data, *_ = chunkmanager.compute(data, **kwargs)  # type: ignore[var-annotated]
    131     return loaded_data

File ~/code/learn/.venv/lib/python3.10/site-packages/xarray/namedarray/parallelcompat.py:165, in get_chunked_array_type(*args)
    159 selected = [
    160     chunkmanager
    161     for chunkmanager in chunkmanagers.values()
    162     if chunkmanager.is_chunked_array(chunked_arr)
    163 ]
    164 if not selected:
--> 165     raise TypeError(
    166         f"Could not find a Chunk Manager which recognises type {type(chunked_arr)}"
    167     )
    168 elif len(selected) >= 2:
    169     raise TypeError(f"Multiple ChunkManagers recognise type {type(chunked_arr)}")

TypeError: Could not find a Chunk Manager which recognises type <class 'virtualizarr.manifests.array.ManifestArray'>

If this is a known issue, feel free to close this and link to it. If this is new, what would it take to implement a new Chunk Manager? Is there a way to make merge work? I'm running VirtualiZarr from main at commit c3f630bfbb6c5.

TomNicholas commented 3 weeks ago

Thanks for raising this @ghidalgo3 ! And for trying out this package :) A minimal reproducible example would be useful if you're up for that.

Is this the same error as https://github.com/zarr-developers/VirtualiZarr/issues/114? If so then merge might be attempting to load data. I would also try passing compat='override' (and maybe join='override') to merge.

You also might need to pass indexes={} to open_virtual_dataset.

You've made me realise that using xr.merge with virtualizarr needs to be more explicitly tested / documented!
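As a rough illustration of why the traceback ends in a TypeError, the lookup shown in its last frame can be mimicked with plain Python stand-ins. This is only a sketch: `FakeChunkManager`, `DaskLikeArray` and `ManifestLikeArray` are hypothetical names, not xarray or VirtualiZarr classes, and the function simply repeats the selection logic visible in the traceback. The point is that once xarray decides it must load values, it looks for a chunk manager for the array type, and nothing is registered for a type that only holds chunk references.

```python
class FakeChunkManager:
    """Hypothetical stand-in for a chunk manager that recognises one array type."""

    def __init__(self, array_cls):
        self.array_cls = array_cls

    def is_chunked_array(self, data):
        return isinstance(data, self.array_cls)


class DaskLikeArray:
    """Stand-in for a chunked array type that has a registered manager."""


class ManifestLikeArray:
    """Stand-in for ManifestArray: chunk references only, no loadable values."""


def get_chunked_array_type(chunked_arr, chunkmanagers):
    # Same selection logic as the lines shown in the traceback.
    selected = [
        manager
        for manager in chunkmanagers.values()
        if manager.is_chunked_array(chunked_arr)
    ]
    if not selected:
        raise TypeError(
            f"Could not find a Chunk Manager which recognises type {type(chunked_arr)}"
        )
    elif len(selected) >= 2:
        raise TypeError(f"Multiple ChunkManagers recognise type {type(chunked_arr)}")
    return selected[0]


managers = {"dask": FakeChunkManager(DaskLikeArray)}

# A recognised type resolves to its manager.
found = get_chunked_array_type(DaskLikeArray(), managers)

# An unrecognised type fails the same way the merge call did.
try:
    get_chunked_array_type(ManifestLikeArray(), managers)
except TypeError as err:
    message = str(err)
```

Seen this way, the suggestions in the comment above aim to stop xarray from attempting the load at all, rather than to supply a manager for it.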

ghidalgo3 commented 3 weeks ago

Yes, I didn't know where to get those daymet files publicly but @TomAugspurger helped me out. Sorry!

Here is a quick runnable repro:

import virtualizarr
import xarray as xr

def open_daymet_dataset(path) -> xr.Dataset:
    print("Opening", path)
    return virtualizarr.open_virtual_dataset(
        path,
        filetype=virtualizarr.kerchunk.FileType.netcdf4,
        drop_variables=["lambert_conformal_conic"],
        reader_options={})

files = [
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_dayl_1980.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_dayl_1981.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_prcp_1980.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_prcp_1981.nc",
]

datasets = [open_daymet_dataset(path) for path in files]
print("concating")
dayl_concat = xr.concat([datasets[0], datasets[1]], dim="time")
prcp_concat = xr.concat([datasets[2], datasets[3]], dim="time")
print("merging")
merged = xr.merge([dayl_concat, prcp_concat])

I don't think it's the same as #114, because if I add compat="equals" to the xr.concat call, the program still works and fails at the merge call with the same TypeError: Could not find a Chunk Manager which recognises type <class 'virtualizarr.manifests.array.ManifestArray'>.

The catalog for these files is here; they are pretty small, about 4 MB each.

ghidalgo3 commented 3 weeks ago

I tried all 3 suggestions (indexes={} for the open_virtual_dataset, compat="override" and join="override" for xr.merge) but the error is still the same:

Traceback (most recent call last):
  File "/home/gustavo/code/learn/merge_bug.py", line 22, in <module>
    dayl_concat = xr.concat([datasets[0], datasets[1]], dim="time", compat='equals')
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/core/concat.py", line 276, in concat
    return _dataset_concat(
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/core/concat.py", line 538, in _dataset_concat
    concat_over, equals, concat_dim_lengths = _calc_concat_over(
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/core/concat.py", line 437, in _calc_concat_over
    process_subset_opt(coords, "coords")
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/core/concat.py", line 391, in process_subset_opt
    v_lhs = datasets[0].variables[k].load()
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/core/variable.py", line 976, in load
    self._data = to_duck_array(self._data, **kwargs)
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/namedarray/pycompat.py", line 129, in to_duck_array
    chunkmanager = get_chunked_array_type(data)
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/namedarray/parallelcompat.py", line 165, in get_chunked_array_type
    raise TypeError(
TypeError: Could not find a Chunk Manager which recognises type <class 'virtualizarr.manifests.array.ManifestArray'>

TomNicholas commented 3 weeks ago

Thanks - I think you need all of indexes={} in open_virtual_dataset, coords='minimal' in the concat, and compat='override' in the merge
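A minimal sketch of how those three settings might be combined, assuming the `open_virtual_dataset` signature used elsewhere in this thread. The helper `combine_virtual` and the `paths_by_variable` mapping are hypothetical, and passing `compat="override"` to `xr.concat` as well is an assumption carried over from the `combine_nested` call quoted later in the thread. The libraries are imported inside the function, so nothing is downloaded until it is called.

```python
# Keyword arguments taken from the suggestions in this thread.
OPEN_KWARGS = {"indexes": {}}
CONCAT_KWARGS = {"dim": "time", "coords": "minimal", "compat": "override"}
MERGE_KWARGS = {"compat": "override"}


def combine_virtual(paths_by_variable):
    """Concatenate each variable's files along time, then merge the variables.

    paths_by_variable is a hypothetical mapping such as
    {"dayl": [url_1980, url_1981], "prcp": [url_1980, url_1981]}.
    """
    # Imported lazily: both libraries are only needed when the function runs.
    import virtualizarr
    import xarray as xr

    per_variable = []
    for paths in paths_by_variable.values():
        virtual = [
            virtualizarr.open_virtual_dataset(
                path,
                drop_variables=["lambert_conformal_conic"],
                **OPEN_KWARGS,
            )
            for path in paths
        ]
        # One virtual dataset per variable, spanning all years.
        per_variable.append(xr.concat(virtual, **CONCAT_KWARGS))

    # Combine the per-variable datasets without comparing values.
    return xr.merge(per_variable, **MERGE_KWARGS)
```

Keeping the keyword arguments in named dictionaries makes it easy to see which setting applies to which step.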

[Screenshot 2024-06-13 at 1 12 16 PM]

TomNicholas commented 3 weeks ago

So in theory you can actually virtualize and combine all the data in this archive with just a few lines:

[Screenshot 2024-06-13 at 2 19 45 PM]

(I blame kerchunk / thredds for the slowness of the open_virtual_dataset step there)
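The screenshot's code is not reproduced here, but the traceback below passes a nested list called `vds_grid` to `xr.combine_nested` with `concat_dim=['time', None]`. With that argument the outer list level is concatenated along time and the inner level is merged. A sketch of how such a grid of paths might be built from the file names in the repro above; the regular expression and the `build_grid` helper are assumptions for illustration, not part of VirtualiZarr:

```python
import re
from collections import defaultdict

files = [
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_dayl_1980.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_dayl_1981.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_prcp_1980.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_prcp_1981.nc",
]

# Assumed naming pattern: daymet_v4_daily_hi_<variable>_<year>.nc
NAME = re.compile(r"daymet_v4_daily_hi_(?P<var>[a-z]+)_(?P<year>\d{4})\.nc$")


def build_grid(urls):
    """Return grid[year_index][variable_index] of URLs, both axes sorted."""
    by_year = defaultdict(dict)
    for url in urls:
        match = NAME.search(url)
        if match is None:
            raise ValueError(f"unexpected file name: {url}")
        by_year[int(match["year"])][match["var"]] = url

    variables = sorted({var for row in by_year.values() for var in row})
    # A missing (year, variable) combination raises KeyError here.
    return [[by_year[year][var] for var in variables] for year in sorted(by_year)]


grid = build_grid(files)
```

Each URL in `grid` would then be opened as a virtual dataset before the nested list is handed to `xr.combine_nested`.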

But it looks like for this data you will run into #5:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[7], line 1
----> 1 combined = xr.combine_nested(
      2     vds_grid,
      3     concat_dim=['time', None],
      4     coords='minimal', 
      5     compat='override',
      6 )

File ~/Documents/Work/Code/xarray/xarray/core/combine.py:577, in combine_nested(datasets, concat_dim, compat, data_vars, coords, fill_value, join, combine_attrs)
    574     concat_dim = [concat_dim]
    576 # The IDs argument tells _nested_combine that datasets aren't yet sorted
--> 577 return _nested_combine(
    578     datasets,
    579     concat_dims=concat_dim,
    580     compat=compat,
    581     data_vars=data_vars,
    582     coords=coords,
    583     ids=False,
    584     fill_value=fill_value,
    585     join=join,
    586     combine_attrs=combine_attrs,
    587 )

File ~/Documents/Work/Code/xarray/xarray/core/combine.py:356, in _nested_combine(datasets, concat_dims, compat, data_vars, coords, ids, fill_value, join, combine_attrs)
    353 _check_shape_tile_ids(combined_ids)
    355 # Apply series of concatenate or merge operations along each dimension
--> 356 combined = _combine_nd(
    357     combined_ids,
    358     concat_dims,
    359     compat=compat,
    360     data_vars=data_vars,
    361     coords=coords,
    362     fill_value=fill_value,
    363     join=join,
    364     combine_attrs=combine_attrs,
    365 )
    366 return combined

File ~/Documents/Work/Code/xarray/xarray/core/combine.py:232, in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join, combine_attrs)
    228 # Each iteration of this loop reduces the length of the tile_ids tuples
    229 # by one. It always combines along the first dimension, removing the first
    230 # element of the tuple
    231 for concat_dim in concat_dims:
--> 232     combined_ids = _combine_all_along_first_dim(
    233         combined_ids,
    234         dim=concat_dim,
    235         data_vars=data_vars,
    236         coords=coords,
    237         compat=compat,
    238         fill_value=fill_value,
    239         join=join,
    240         combine_attrs=combine_attrs,
    241     )
    242 (combined_ds,) = combined_ids.values()
    243 return combined_ds

File ~/Documents/Work/Code/xarray/xarray/core/combine.py:267, in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join, combine_attrs)
    265     combined_ids = dict(sorted(group))
    266     datasets = combined_ids.values()
--> 267     new_combined_ids[new_id] = _combine_1d(
    268         datasets, dim, compat, data_vars, coords, fill_value, join, combine_attrs
    269     )
    270 return new_combined_ids

File ~/Documents/Work/Code/xarray/xarray/core/combine.py:290, in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join, combine_attrs)
    288 if concat_dim is not None:
    289     try:
--> 290         combined = concat(
    291             datasets,
    292             dim=concat_dim,
    293             data_vars=data_vars,
    294             coords=coords,
    295             compat=compat,
    296             fill_value=fill_value,
    297             join=join,
    298             combine_attrs=combine_attrs,
    299         )
    300     except ValueError as err:
    301         if "encountered unexpected variable" in str(err):

File ~/Documents/Work/Code/xarray/xarray/core/concat.py:276, in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
    263     return _dataarray_concat(
    264         objs,
    265         dim=dim,
   (...)
    273         create_index_for_new_dim=create_index_for_new_dim,
    274     )
    275 elif isinstance(first_obj, Dataset):
--> 276     return _dataset_concat(
    277         objs,
    278         dim=dim,
    279         data_vars=data_vars,
    280         coords=coords,
    281         compat=compat,
    282         positions=positions,
    283         fill_value=fill_value,
    284         join=join,
    285         combine_attrs=combine_attrs,
    286         create_index_for_new_dim=create_index_for_new_dim,
    287     )
    288 else:
    289     raise TypeError(
    290         "can only concatenate xarray Dataset and DataArray "
    291         f"objects, got {type(first_obj)}"
    292     )

File ~/Documents/Work/Code/xarray/xarray/core/concat.py:662, in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
    660         result_vars[k] = v
    661 else:
--> 662     combined_var = concat_vars(
    663         vars, dim, positions, combine_attrs=combine_attrs
    664     )
    665     # reindex if variable is not present in all datasets
    666     if len(variable_index) < concat_index_size:

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:2986, in concat(variables, dim, positions, shortcut, combine_attrs)
   2984     return IndexVariable.concat(variables, dim, positions, shortcut, combine_attrs)
   2985 else:
-> 2986     return Variable.concat(variables, dim, positions, shortcut, combine_attrs)

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:1737, in Variable.concat(cls, variables, dim, positions, shortcut, combine_attrs)
   1735 axis = first_var.get_axis_num(dim)
   1736 dims = first_var_dims
-> 1737 data = duck_array_ops.concatenate(arrays, axis=axis)
   1738 if positions is not None:
   1739     # TODO: deprecate this option -- we don't need it for groupby
   1740     # any more.
   1741     indices = nputils.inverse_permutation(np.concatenate(positions))

File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:402, in concatenate(arrays, axis)
    400     xp = get_array_namespace(arrays[0])
    401     return xp.concat(as_shared_dtype(arrays, xp=xp), axis=axis)
--> 402 return _concatenate(as_shared_dtype(arrays), axis=axis)

File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array.py:121, in ManifestArray.__array_function__(self, func, types, args, kwargs)
    118 if not all(issubclass(t, ManifestArray) for t in types):
    119     return NotImplemented
--> 121 return MANIFESTARRAY_HANDLED_ARRAY_FUNCTIONS[func](*args, **kwargs)

File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array_api.py:110, in concatenate(arrays, axis)
    107     raise TypeError()
    109 # ensure dtypes, shapes, codecs etc. are consistent
--> 110 _check_combineable_zarr_arrays(arrays)
    112 _check_same_ndims([arr.ndim for arr in arrays])
    114 # Ensure we handle axis being passed as a negative integer

File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array_api.py:38, in _check_combineable_zarr_arrays(arrays)
     34 _check_same_dtypes([arr.dtype for arr in arrays])
     36 # Can't combine different codecs in one manifest
     37 # see https://github.com/zarr-developers/zarr-specs/issues/288
---> 38 _check_same_codecs([arr.zarray.codec for arr in arrays])
     40 # Would require variable-length chunks ZEP
     41 _check_same_chunk_shapes([arr.chunks for arr in arrays])

File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array_api.py:59, in _check_same_codecs(codecs)
     57 for codec in other_codecs:
     58     if codec != first_codec:
---> 59         raise NotImplementedError(
     60             "The ManifestArray class cannot concatenate arrays which were stored using different codecs, "
     61             f"But found codecs {first_codec} vs {codec} ."
     62             "See https://github.com/zarr-developers/zarr-specs/issues/288"
     63         )

NotImplementedError: The ManifestArray class cannot concatenate arrays which were stored using different codecs, But found codecs compressor=None filters=[{'id': 'zlib', 'level': 4}] vs compressor=None filters=[{'elementsize': 4, 'id': 'shuffle'}, {'id': 'zlib', 'level': 4}] .See https://github.com/zarr-developers/zarr-specs/issues/288

ghidalgo3 commented 3 weeks ago

Thanks for looking into this Tom, I'll stick to variables with the same codec if possible to avoid that issue.
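One way to find out in advance which arrays share a codec is to group them by their compressor and filter settings. The sketch below is plain Python and makes several assumptions: codecs are represented as dictionaries shaped like the two configurations in the error message, the array names are placeholders, and `check_same_codecs` only imitates the check visible in the traceback rather than calling VirtualiZarr.

```python
import json


def check_same_codecs(codecs):
    """Raise if any codec differs from the first, like the check in the traceback."""
    first_codec, *other_codecs = codecs
    for codec in other_codecs:
        if codec != first_codec:
            raise NotImplementedError(
                f"cannot concatenate arrays with different codecs: {first_codec} vs {codec}"
            )


def group_by_codec(codec_by_name):
    """Group array names whose codec settings are identical."""
    groups = {}
    for name, codec in codec_by_name.items():
        # Serialise with sorted keys so equal settings give equal group keys.
        key = json.dumps(codec, sort_keys=True)
        groups.setdefault(key, []).append(name)
    return list(groups.values())


# The two configurations reported in the error message.
zlib_only = {"compressor": None, "filters": [{"id": "zlib", "level": 4}]}
shuffle_zlib = {
    "compressor": None,
    "filters": [{"elementsize": 4, "id": "shuffle"}, {"id": "zlib", "level": 4}],
}

# Placeholder names: the thread does not say which array used which codec.
codec_by_name = {"array_a": zlib_only, "array_b": shuffle_zlib, "array_c": dict(zlib_only)}

groups = group_by_codec(codec_by_name)

try:
    check_same_codecs(list(codec_by_name.values()))
    mixed = False
except NotImplementedError:
    mixed = True
```

Arrays that land in the same group could be concatenated with each other; arrays in different groups would trigger the error above.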