zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0

Ugly error from xarray chunkmanager when loading ManifestArray #114

Open TomNicholas opened 5 months ago

TomNicholas commented 5 months ago

If you create the air1.nc and air2.nc files in the same way as in the docs and then concatenate them with compat='equals', you get a super ugly error.
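For reference, the file-creation step looks roughly like this (a sketch; the exact tutorial dataset and split point are approximations of the docs example, not copied from it):

```python
import xarray as xr

# Split the xarray tutorial dataset into two NetCDF files, roughly as in the
# usage docs (the split point here is approximate).
ds = xr.tutorial.open_dataset("air_temperature")
ds.isel(time=slice(None, 1460)).to_netcdf("air1.nc")
ds.isel(time=slice(1460, None)).to_netcdf("air2.nc")
```

Concatenating the two resulting virtual datasets then fails: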

In [9]: vds1 = open_virtual_dataset('air1.nc', indexes={})

In [10]: vds2 = open_virtual_dataset('air2.nc', indexes={})

In [11]: xr.concat([vds1, vds2], dim='time', compat='equals')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 xr.concat([vds1, vds2], dim='time', compat='equals')

File ~/Documents/Work/Code/xarray/xarray/core/concat.py:276, in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
    263     return _dataarray_concat(
    264         objs,
    265         dim=dim,
   (...)
    273         create_index_for_new_dim=create_index_for_new_dim,
    274     )
    275 elif isinstance(first_obj, Dataset):
--> 276     return _dataset_concat(
    277         objs,
    278         dim=dim,
    279         data_vars=data_vars,
    280         coords=coords,
    281         compat=compat,
    282         positions=positions,
    283         fill_value=fill_value,
    284         join=join,
    285         combine_attrs=combine_attrs,
    286         create_index_for_new_dim=create_index_for_new_dim,
    287     )
    288 else:
    289     raise TypeError(
    290         "can only concatenate xarray Dataset and DataArray "
    291         f"objects, got {type(first_obj)}"
    292     )

File ~/Documents/Work/Code/xarray/xarray/core/concat.py:538, in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
    532     datasets = [
    533         ds.expand_dims(dim, create_index_for_new_dim=create_index_for_new_dim)
    534         for ds in datasets
    535     ]
    537 # determine which variables to concatenate
--> 538 concat_over, equals, concat_dim_lengths = _calc_concat_over(
    539     datasets, dim, dim_names, data_vars, coords, compat
    540 )
    542 # determine which variables to merge, and then merge them according to compat
    543 variables_to_merge = (coord_names | data_names) - concat_over

File ~/Documents/Work/Code/xarray/xarray/core/concat.py:437, in _calc_concat_over(datasets, dim, dim_names, data_vars, coords, compat)
    434         concat_over.update(opt)
    436 process_subset_opt(data_vars, "data_vars")
--> 437 process_subset_opt(coords, "coords")
    438 return concat_over, equals, concat_dim_lengths

File ~/Documents/Work/Code/xarray/xarray/core/concat.py:391, in _calc_concat_over.<locals>.process_subset_opt(opt, subset)
    384     concat_over.add(k)
    386 elif equals[k] is None:
    387     # Compare the variable of all datasets vs. the one
    388     # of the first dataset. Perform the minimum amount of
    389     # loads in order to avoid multiple loads from disk
    390     # while keeping the RAM footprint low.
--> 391     v_lhs = datasets[0].variables[k].load()
    392     # We'll need to know later on if variables are equal.
    393     computed = []

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:976, in Variable.load(self, **kwargs)
    959 def load(self, **kwargs):
    960     """Manually trigger loading of this variable's data from disk or a
    961     remote source into memory and return this variable.
    962 
   (...)
    974     dask.array.compute
    975     """
--> 976     self._data = to_duck_array(self._data, **kwargs)
    977     return self

File ~/Documents/Work/Code/xarray/xarray/namedarray/pycompat.py:129, in to_duck_array(data, **kwargs)
    126 from xarray.namedarray.parallelcompat import get_chunked_array_type
    128 if is_chunked_array(data):
--> 129     chunkmanager = get_chunked_array_type(data)
    130     loaded_data, *_ = chunkmanager.compute(data, **kwargs)  # type: ignore[var-annotated]
    131     return loaded_data

File ~/Documents/Work/Code/xarray/xarray/namedarray/parallelcompat.py:165, in get_chunked_array_type(*args)
    159 selected = [
    160     chunkmanager
    161     for chunkmanager in chunkmanagers.values()
    162     if chunkmanager.is_chunked_array(chunked_arr)
    163 ]
    164 if not selected:
--> 165     raise TypeError(
    166         f"Could not find a Chunk Manager which recognises type {type(chunked_arr)}"
    167     )
    168 elif len(selected) >= 2:
    169     raise TypeError(f"Multiple ChunkManagers recognise type {type(chunked_arr)}")

TypeError: Could not find a Chunk Manager which recognises type <class 'virtualizarr.manifests.array.ManifestArray'>

This is because compat='equals' (which is currently xarray's default, though that should change, see https://github.com/pydata/xarray/issues/8778) tries to load the coordinate variables in order to compare their values. Xarray thinks it needs to load them because it sees the .chunks attribute, assumes each one is a computable chunked array like a dask or cubed array, and then searches for a corresponding ChunkManager to use to compute it.
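To make the failing assumption concrete, here is a hypothetical illustration (the attribute access and import path below are assumptions about the internals, not taken from the traceback):

```python
from xarray.namedarray.pycompat import is_chunked_array

# The coordinate variable wraps a ManifestArray, which exposes `.chunks`
# just like dask/cubed arrays do...
lat = vds1["lat"].variable._data
print(type(lat), lat.chunks)

# ...so xarray's chunked-array check matches it, and xarray then goes looking
# for a ChunkManager that can compute it (and finds none).
print(is_chunked_array(lat))  # True
```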

Basically ManifestArray breaks one of xarray's assumptions by being chunked but not computable. So it's another example of an array that causes the same issue as https://github.com/pydata/xarray/issues/8733.

The behaviour we want here is for xarray to be more lenient and not attempt to load the array. Then it would progress to the __eq__ comparison, and ManifestArray could either raise a more useful error if it gets coerced to an index, or simply return its own definition of equality.
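To sketch what "its own definition of equality" could mean (an illustrative toy class, not the actual virtualizarr implementation): two manifest arrays are elementwise-equal iff they carry the same array metadata and point at the same chunks.

```python
import numpy as np

class ManifestArraySketch:
    """Toy stand-in for virtualizarr's ManifestArray, for illustration only."""

    def __init__(self, shape, zarray, manifest):
        self.shape = shape          # array shape
        self.zarray = zarray        # dict of dtype/chunk/codec metadata
        self.manifest = manifest    # dict mapping chunk key -> (path, offset, length)

    def __eq__(self, other):
        if not isinstance(other, ManifestArraySketch):
            return NotImplemented
        same = (
            self.shape == other.shape
            and self.zarray == other.zarray
            and self.manifest == other.manifest
        )
        # xarray expects an elementwise, array-like result from __eq__
        return np.full(self.shape, same, dtype=bool)
```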

ghidalgo3 commented 4 months ago

@TomNicholas sorry if this is a silly question, I am but a neophyte and I'm trying to place VirtualiZarr in my mind: what can a user do with a collection of ManifestArrays besides concatenating/merging them, since they do not load data?

I think the eventual goal is to:

  1. Read a file and create a ChunkManifest for it.
  2. Repeat for many files and concat & merge them.
  3. Write a Zarr store for the dataset, but use the ChunkManifests instead of the chunks themselves (avoiding copying the data).
  4. Open the Zarr store with Zarr or Xarray, where those libraries do understand how to load data from the serialized ChunkManifests.

Is that correct? If so, I will stop trying to load data from the xr.Dataset returned by virtualizarr.open_virtual_dataset.

TomNicholas commented 4 months ago

@ghidalgo3 that's exactly right. The only thing you've missed is that although writing out a "virtual" zarr store is a future aim, writing out actual kerchunk reference files works today! Which means that you can use this package for the same purpose that lots of people were already using kerchunk.combine.MultiZarrToZarr for (see the sketch below).
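A rough sketch of that end-to-end workflow, matching your steps 1-4 (treat the accessor name and the reference-opening incantation as assumptions based on the docs at the time; note also the concat options that sidestep the error this issue is about):

```python
import fsspec
import xarray as xr
from virtualizarr import open_virtual_dataset

# 1. + 2. Open each file as a virtual dataset and combine them
#    (coords="minimal"/compat="override" avoids comparing coordinate values,
#    which is exactly the comparison that blows up above).
vds1 = open_virtual_dataset("air1.nc", indexes={})
vds2 = open_virtual_dataset("air2.nc", indexes={})
combined = xr.concat([vds1, vds2], dim="time", coords="minimal", compat="override")

# 3. Persist the combined references as a kerchunk reference file.
combined.virtualize.to_kerchunk("combined.json", format="json")

# 4. Open the references as an ordinary, lazily-loading xarray dataset.
fs = fsspec.filesystem("reference", fo="combined.json")
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False})
```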

> I will stop trying to load data from the xr.Dataset returned by virtualizarr.open_virtual_dataset

Yes, you cannot load the ManifestArray objects directly, and there wouldn't be any point in doing so either. The whole point is that those arrays then get serialized back to disk as references (either as virtual zarr via chunk manifests, or as kerchunk reference files).

Serious question: did you read the documentation? If so, how do you think it could be improved to make all this clearer?

EDIT: In my opinion there is no such thing as a silly question, only imperfect documentation :)

ghidalgo3 commented 4 months ago

I did read the documentation. I think what was unclear to me was that there are two kinds of xr.Dataset objects I encountered with VirtualiZarr: the one returned by virtualizarr.open_virtual_dataset and the one returned by xr.open_dataset(virtualizarr_produced_kerchunk_or_zarr_store).

The first xr.Dataset is good for concatenating/merging and writing to storage, so that xarray can later open the result as a new xr.Dataset whose data can actually be loaded. I know this is said on this page, but only in hindsight did I understand the implication:

> VirtualiZarr aims to allow you to use the same xarray incantation you would normally use to open and combine all your files, but cache that result as a virtual Zarr store.

If I could wave a magic wand to make it better, I would not want the return value of open_virtual_dataset to be an xr.Dataset, because that dataset doesn't behave like other xr.Dataset objects; instead it would return something like a VirtualiZarr.Dataset which only has the functions that do work. But I understand that xr.Dataset is close enough, and users (me) shouldn't expect it to load data.

TomNicholas commented 4 months ago

(@ghidalgo3 I replied in #171 so as to preserve the original topic of this issue)