ghidalgo3 opened this issue 3 weeks ago
Thanks for raising this @ghidalgo3! And for trying out this package :) A minimal reproducible example would be useful if you're up for that.

Is this the same error as https://github.com/zarr-developers/VirtualiZarr/issues/114? If so then `merge` might be attempting to load data. I would also try passing `compat='override'` (and maybe `join='override'`) to `merge`. You also might need to pass `indexes={}` to `open_virtual_dataset`.

You've made me realise that using `xr.merge` with virtualizarr needs to be more explicitly tested / documented!
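To make the suggestion concrete, here is a minimal sketch of what `compat="override"` and `join="override"` do, using ordinary numpy-backed datasets as stand-ins for the virtual datasets in this issue (the variable names `dayl` and `prcp` are just borrowed from the Daymet example):

```python
import numpy as np
import xarray as xr

# With compat="override" and join="override", merge takes coordinates from
# the first object instead of comparing values element-wise, so it never
# needs to load (or even look at) the underlying arrays.
ds1 = xr.Dataset({"dayl": ("time", np.ones(3))}, coords={"time": [0, 1, 2]})
ds2 = xr.Dataset({"prcp": ("time", np.zeros(3))}, coords={"time": [0, 1, 2]})

merged = xr.merge([ds1, ds2], compat="override", join="override")
print(sorted(merged.data_vars))  # ['dayl', 'prcp']
```

For ManifestArray-backed datasets, skipping the comparison is the whole point: any code path that compares coordinate values would try to load data through a chunk manager.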
Yes, I didn't know where to get those daymet files publicly but @TomAugspurger helped me out. Sorry!
Here is a quick runnable repro:

```python
import virtualizarr
import xarray as xr

def open_daymet_dataset(path) -> xr.Dataset:
    print("Opening", path)
    return virtualizarr.open_virtual_dataset(
        path,
        filetype=virtualizarr.kerchunk.FileType.netcdf4,
        drop_variables=["lambert_conformal_conic"],
        reader_options={},
    )

files = [
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_dayl_1980.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_dayl_1981.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_prcp_1980.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_prcp_1981.nc",
]

datasets = [open_daymet_dataset(path) for path in files]

print("concating")
dayl_concat = xr.concat([datasets[0], datasets[1]], dim="time")
prcp_concat = xr.concat([datasets[2], datasets[3]], dim="time")

print("merging")
merged = xr.merge([dayl_concat, prcp_concat])
```
I don't think it's the same as #114, because if I add `compat="equals"` to the `xr.concat` call, the program still works and fails at the `merge` call with the same `TypeError: Could not find a Chunk Manager which recognises type <class 'virtualizarr.manifests.array.ManifestArray'>`.

The catalog for these files is here; they are pretty small files, around 4 MB each.
I tried all 3 suggestions (`indexes={}` for `open_virtual_dataset`, `compat="override"` and `join="override"` for `xr.merge`) but the error is still the same:

```
Traceback (most recent call last):
  File "/home/gustavo/code/learn/merge_bug.py", line 22, in <module>
    dayl_concat = xr.concat([datasets[0], datasets[1]], dim="time", compat='equals')
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/core/concat.py", line 276, in concat
    return _dataset_concat(
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/core/concat.py", line 538, in _dataset_concat
    concat_over, equals, concat_dim_lengths = _calc_concat_over(
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/core/concat.py", line 437, in _calc_concat_over
    process_subset_opt(coords, "coords")
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/core/concat.py", line 391, in process_subset_opt
    v_lhs = datasets[0].variables[k].load()
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/core/variable.py", line 976, in load
    self._data = to_duck_array(self._data, **kwargs)
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/namedarray/pycompat.py", line 129, in to_duck_array
    chunkmanager = get_chunked_array_type(data)
  File "/home/gustavo/code/learn/.venv/lib/python3.10/site-packages/xarray/namedarray/parallelcompat.py", line 165, in get_chunked_array_type
    raise TypeError(
TypeError: Could not find a Chunk Manager which recognises type <class 'virtualizarr.manifests.array.ManifestArray'>
```
Thanks - I think you need all of `indexes={}` in `open_virtual_dataset`, `coords='minimal'` in the `concat`, and `compat='override'` in the `merge`.

So in theory you can actually virtualize and combine all the data in this archive with just a few lines:

(I blame kerchunk / thredds for the slowness of the `open_virtual_dataset` step there)
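The pattern above can be sketched on in-memory stand-ins (plain numpy-backed datasets rather than virtual ones; `fake_year` and the grid contents are made up for illustration). The 2D grid is ordered `[year][variable]`: the outer list is concatenated along `time`, the inner list is merged (`concat_dim=None`):

```python
import numpy as np
import xarray as xr

def fake_year(var, t0):
    # Two timesteps per "year" file, one data variable each.
    return xr.Dataset(
        {var: ("time", np.ones(2))}, coords={"time": [t0, t0 + 1]}
    )

vds_grid = [
    [fake_year("dayl", 0), fake_year("prcp", 0)],
    [fake_year("dayl", 2), fake_year("prcp", 2)],
]

# coords='minimal' avoids comparing coordinate values during concat;
# compat='override' avoids comparing them during the merge step.
combined = xr.combine_nested(
    vds_grid,
    concat_dim=["time", None],
    coords="minimal",
    compat="override",
)
print(dict(combined.sizes))  # {'time': 4}
```

With real virtual datasets you would also need `indexes={}` in `open_virtual_dataset`, as noted above, so that no in-memory pandas indexes are built in the first place.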
But it looks like for this data you will run into #5:
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Cell In[7], line 1
----> 1 combined = xr.combine_nested(
2 vds_grid,
3 concat_dim=['time', None],
4 coords='minimal',
5 compat='override',
6 )
File ~/Documents/Work/Code/xarray/xarray/core/combine.py:577, in combine_nested(datasets, concat_dim, compat, data_vars, coords, fill_value, join, combine_attrs)
574 concat_dim = [concat_dim]
576 # The IDs argument tells _nested_combine that datasets aren't yet sorted
--> 577 return _nested_combine(
578 datasets,
579 concat_dims=concat_dim,
580 compat=compat,
581 data_vars=data_vars,
582 coords=coords,
583 ids=False,
584 fill_value=fill_value,
585 join=join,
586 combine_attrs=combine_attrs,
587 )
File ~/Documents/Work/Code/xarray/xarray/core/combine.py:356, in _nested_combine(datasets, concat_dims, compat, data_vars, coords, ids, fill_value, join, combine_attrs)
353 _check_shape_tile_ids(combined_ids)
355 # Apply series of concatenate or merge operations along each dimension
--> 356 combined = _combine_nd(
357 combined_ids,
358 concat_dims,
359 compat=compat,
360 data_vars=data_vars,
361 coords=coords,
362 fill_value=fill_value,
363 join=join,
364 combine_attrs=combine_attrs,
365 )
366 return combined
File ~/Documents/Work/Code/xarray/xarray/core/combine.py:232, in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join, combine_attrs)
228 # Each iteration of this loop reduces the length of the tile_ids tuples
229 # by one. It always combines along the first dimension, removing the first
230 # element of the tuple
231 for concat_dim in concat_dims:
--> 232 combined_ids = _combine_all_along_first_dim(
233 combined_ids,
234 dim=concat_dim,
235 data_vars=data_vars,
236 coords=coords,
237 compat=compat,
238 fill_value=fill_value,
239 join=join,
240 combine_attrs=combine_attrs,
241 )
242 (combined_ds,) = combined_ids.values()
243 return combined_ds
File ~/Documents/Work/Code/xarray/xarray/core/combine.py:267, in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join, combine_attrs)
265 combined_ids = dict(sorted(group))
266 datasets = combined_ids.values()
--> 267 new_combined_ids[new_id] = _combine_1d(
268 datasets, dim, compat, data_vars, coords, fill_value, join, combine_attrs
269 )
270 return new_combined_ids
File ~/Documents/Work/Code/xarray/xarray/core/combine.py:290, in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join, combine_attrs)
288 if concat_dim is not None:
289 try:
--> 290 combined = concat(
291 datasets,
292 dim=concat_dim,
293 data_vars=data_vars,
294 coords=coords,
295 compat=compat,
296 fill_value=fill_value,
297 join=join,
298 combine_attrs=combine_attrs,
299 )
300 except ValueError as err:
301 if "encountered unexpected variable" in str(err):
File ~/Documents/Work/Code/xarray/xarray/core/concat.py:276, in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
263 return _dataarray_concat(
264 objs,
265 dim=dim,
(...)
273 create_index_for_new_dim=create_index_for_new_dim,
274 )
275 elif isinstance(first_obj, Dataset):
--> 276 return _dataset_concat(
277 objs,
278 dim=dim,
279 data_vars=data_vars,
280 coords=coords,
281 compat=compat,
282 positions=positions,
283 fill_value=fill_value,
284 join=join,
285 combine_attrs=combine_attrs,
286 create_index_for_new_dim=create_index_for_new_dim,
287 )
288 else:
289 raise TypeError(
290 "can only concatenate xarray Dataset and DataArray "
291 f"objects, got {type(first_obj)}"
292 )
File ~/Documents/Work/Code/xarray/xarray/core/concat.py:662, in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
660 result_vars[k] = v
661 else:
--> 662 combined_var = concat_vars(
663 vars, dim, positions, combine_attrs=combine_attrs
664 )
665 # reindex if variable is not present in all datasets
666 if len(variable_index) < concat_index_size:
File ~/Documents/Work/Code/xarray/xarray/core/variable.py:2986, in concat(variables, dim, positions, shortcut, combine_attrs)
2984 return IndexVariable.concat(variables, dim, positions, shortcut, combine_attrs)
2985 else:
-> 2986 return Variable.concat(variables, dim, positions, shortcut, combine_attrs)
File ~/Documents/Work/Code/xarray/xarray/core/variable.py:1737, in Variable.concat(cls, variables, dim, positions, shortcut, combine_attrs)
1735 axis = first_var.get_axis_num(dim)
1736 dims = first_var_dims
-> 1737 data = duck_array_ops.concatenate(arrays, axis=axis)
1738 if positions is not None:
1739 # TODO: deprecate this option -- we don't need it for groupby
1740 # any more.
1741 indices = nputils.inverse_permutation(np.concatenate(positions))
File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:402, in concatenate(arrays, axis)
400 xp = get_array_namespace(arrays[0])
401 return xp.concat(as_shared_dtype(arrays, xp=xp), axis=axis)
--> 402 return _concatenate(as_shared_dtype(arrays), axis=axis)
File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array.py:121, in ManifestArray.__array_function__(self, func, types, args, kwargs)
118 if not all(issubclass(t, ManifestArray) for t in types):
119 return NotImplemented
--> 121 return MANIFESTARRAY_HANDLED_ARRAY_FUNCTIONS[func](*args, **kwargs)
File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array_api.py:110, in concatenate(arrays, axis)
107 raise TypeError()
109 # ensure dtypes, shapes, codecs etc. are consistent
--> 110 _check_combineable_zarr_arrays(arrays)
112 _check_same_ndims([arr.ndim for arr in arrays])
114 # Ensure we handle axis being passed as a negative integer
File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array_api.py:38, in _check_combineable_zarr_arrays(arrays)
34 _check_same_dtypes([arr.dtype for arr in arrays])
36 # Can't combine different codecs in one manifest
37 # see https://github.com/zarr-developers/zarr-specs/issues/288
---> 38 _check_same_codecs([arr.zarray.codec for arr in arrays])
40 # Would require variable-length chunks ZEP
41 _check_same_chunk_shapes([arr.chunks for arr in arrays])
File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array_api.py:59, in _check_same_codecs(codecs)
57 for codec in other_codecs:
58 if codec != first_codec:
---> 59 raise NotImplementedError(
60 "The ManifestArray class cannot concatenate arrays which were stored using different codecs, "
61 f"But found codecs {first_codec} vs {codec} ."
62 "See https://github.com/zarr-developers/zarr-specs/issues/288"
63 )
NotImplementedError: The ManifestArray class cannot concatenate arrays which were stored using different codecs, But found codecs compressor=None filters=[{'id': 'zlib', 'level': 4}] vs compressor=None filters=[{'elementsize': 4, 'id': 'shuffle'}, {'id': 'zlib', 'level': 4}] .See https://github.com/zarr-developers/zarr-specs/issues/288
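The two filter chains in that error message, written out as Python data, show why a strict equality check (like the one VirtualiZarr applies before concatenating manifests) has to reject the pair: one set of files compresses with zlib alone, the other adds a shuffle filter before zlib. This snippet is purely illustrative:

```python
# Filter chains copied from the NotImplementedError above.
filters_a = [{"id": "zlib", "level": 4}]
filters_b = [{"elementsize": 4, "id": "shuffle"}, {"id": "zlib", "level": 4}]

# A single Zarr array (and hence a single manifest) can only record one
# codec pipeline, so chunks encoded with different pipelines cannot be
# concatenated into one array without re-encoding the data.
print(filters_a == filters_b)  # False
```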
Thanks for looking into this Tom, I'll stick to variables with the same codec if possible to avoid that issue.
I'm trying to build a single virtual datastore from a collection of NetCDF files with VirtualiZarr, using the files from the Daymet dataset on the Microsoft Planetary Computer. There is a NetCDF file for each (year, variable) combination, so I'm thinking that to build a single datastore I need to:

1. `xr.concat` on time
2. `xr.merge` on variables

Roughly here's what I'm doing:

And here is the stack trace from `xr.merge`:

If this is a known issue, feel free to close this and link to it. If this is new, what would it take to implement a new Chunk Manager? Is there a way to make `merge` work? I'm running VirtualiZarr from `main` at commit `c3f630bfbb6c5`.