TomNicholas opened 5 months ago
@TomNicholas sorry if this is a silly question, I am but a neophyte and I'm trying to place VirtualiZarr in my mind: what can a user do with a collection of `ManifestArray`s besides concatenating/merging them, since they do not load data?
I think the eventual goal is to:

1. Open each file and build a `ChunkManifest` for it.
2. Concatenate/merge datasets by operating on `ChunkManifest`s instead of chunks themselves (avoiding the copies).
3. Write the combined `ChunkManifest`s out to storage.

Is that correct? If so, I will stop trying to load data from the `xr.Dataset` returned by `virtualizarr.open_virtual_dataset`.
@ghidalgo3 that's exactly right. The only thing you've missed is that although writing out a "virtual" zarr store is a future aim, writing out actual kerchunk reference files works today! Which means that you can use this package for the same purpose that lots of people were already using `kerchunk.combine.MultiZarrToZarr` for.
> I will stop trying to load data from the `xr.Dataset` returned by `virtualizarr.open_virtual_dataset`
Yes, you cannot load the `ManifestArray` objects directly. And there wouldn't be any point in doing so either. The whole point is that those arrays then get serialized back to disk as references (either virtual zarr via chunk manifests, or as kerchunk reference files).
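Concretely, the intended flow looks something like this (a minimal sketch, assuming the `air1.nc`/`air2.nc` files from the docs example; the `coords`/`compat` kwargs stop xarray from trying to load coordinate values to compare them):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Open each file "virtually": only metadata and byte ranges are read,
# and each variable is backed by a ManifestArray rather than real data.
vds1 = open_virtual_dataset("air1.nc")
vds2 = open_virtual_dataset("air2.nc")

# Concatenation operates on the ChunkManifests, not the chunks
# themselves, so no data is loaded or copied.
combined = xr.concat(
    [vds1, vds2], dim="time", coords="minimal", compat="override"
)

# Serialize the combined references to a kerchunk reference file.
combined.virtualize.to_kerchunk("combined.json", format="json")
```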
Serious question: Did you read the documentation? If so, how do you think it could be improved so as to make all this clearer?
EDIT: In my opinion there is no such thing as silly questions, only imperfect documentation :)
I did read the documentation. I think what was unclear to me was that there are 2 kinds of `xr.Dataset` objects I encountered with VirtualiZarr: the one returned by `virtualizarr.open_virtual_dataset` and the one returned by `xr.open_dataset(virtualizarr_produced_kerchunk_or_zarr_store)`.

The first `xr.Dataset` is good for concatenating/merging and for writing to storage, so that xarray can later read that output into a new `xr.Dataset` whose data can actually be loaded. I know this is said on this page, but only in hindsight did I understand the implication:
> VirtualiZarr aims to allow you to use the same xarray incantation you would normally use to open and combine all your files, but cache that result as a virtual Zarr store.
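For example, the second kind of dataset comes from pointing xarray at the references VirtualiZarr wrote out. A sketch, assuming the `combined.json` kerchunk file from above and the `air` variable from the docs example (the `backend_kwargs` follow the usual fsspec `reference://` pattern):

```python
import xarray as xr

# The second kind of xr.Dataset: opened from the kerchunk references.
# This one behaves like any other lazy xarray dataset.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "combined.json"},
    },
)

# Unlike the virtual dataset, this one really can load values.
print(ds["air"].mean().compute())
```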
If I could wave a magic wand to make it better, I would not want the return value of `open_virtual_dataset` to be an `xr.Dataset`, because that dataset doesn't behave like other `xr.Dataset`s; I'd instead return something like a `VirtualiZarr.Dataset` which only has the functions that do work. But I understand that `xr.Dataset` is close enough, and users (me) shouldn't expect it to load data.
(@ghidalgo3 I replied in #171 so as to preserve the original topic of this issue)
If you create the `air1.nc` and `air2.nc` files in the same way as in the docs, then concatenate them with `compat='equals'`, you get a super ugly error.

This is because `compat='equals'` (which is currently xarray's default, though that should change, see https://github.com/pydata/xarray/issues/8778) tries to load the coordinate variables in order to compare their values. Xarray thinks it needs to load them because it sees the `.chunks` attribute, assumes it's a computable array like dask or cubed, then searches for a corresponding ChunkManager to use to compute this chunked array.

Basically `ManifestArray` breaks one of xarray's assumptions by being chunked but not computable, so it's another example of an array that causes the same issue as https://github.com/pydata/xarray/issues/8733.

The behaviour we want here is for xarray to be more lenient and not attempt to load the array. Then it will progress to the `__eq__` comparison, and `ManifestArray` can report a more useful error if it gets coerced to an index, or actually just return its own definition of equality.
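As a sketch (reusing the `air1.nc`/`air2.nc` files from the docs), the failing call and the workaround that works today:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

vds1 = open_virtual_dataset("air1.nc")
vds2 = open_virtual_dataset("air2.nc")

# Fails: compat='equals' makes xarray try to load the coordinate
# variables to compare their values, but ManifestArrays are chunked yet
# not computable, so the search for a ChunkManager blows up.
# xr.concat([vds1, vds2], dim="time", compat="equals")

# Works: compat='override' (with coords='minimal') skips the value
# comparison entirely, so nothing needs to be loaded.
combined = xr.concat(
    [vds1, vds2], dim="time", coords="minimal", compat="override"
)
```

Until xarray becomes more lenient here, skipping the comparison like this is the practical way around the error.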