Closed TomNicholas closed 1 week ago
@ghidalgo that is a totally fair complaint, that distinction between an air-quotes "virtual dataset" and a normal xarray dataset not being reflected in the top-level type is definitely confusing.
Whilst the whole point of this package was to avoid having to write a new data structure with that re-implements concat
and merge
with named dims like xr.Dataset
, we could actually imagine creating a virtualizarr.VirtualDataset
class by subclassing xr.Dataset
. We also wouldn't need the .virtualize
accessor then, because we could just use actual methods on the subclass.
class VirtualDataset(xr.Dataset):
def to_kerchunk(self):
...
def to_zarr(self):
...
(Could you teach such a class that calling xr.merge([virtual_ds, normal_ds])
would always return type VirtualDataset
?)
Subclassing is only kind of supported in xarray, but it could be, and this is a fairly simple application of it (see https://github.com/pydata/xarray/issues/3980).
The downsides of this idea are:
a) That's a lot of effort for something that works perfectly well already,
b) The line between a virtual and normal dataset is somewhat blurred because you can have an xr.Dataset
containing both in-memory numpy arrays and ManifestArrays
. Then doing .drop_vars
on all the variables backed by ManifestArray would get you back to a normal xarray Dataset.
c) I think it technically violates the Liskov Substitution Principle (valid methods on xr.Dataset
are invalid on VirtualDataset
).
Another way to close this would just add more explanation to the documentation, e.g. a section like:
### Virtual datasets are different to normal xarray Datasets!
A "virtual dataset" is any `xr.Dataset` which wraps one or more `ManifestArray` objects.
Although the top-level type is still xr.Dataset, they are intended only as an abstract representation of a set of files, not as something you can do analysis with.
They only support a very limited subset of normal xarray operations, particularly:
- `concat`
- `merge`
- `.rename`...
FWIW here my intuition is that subclassing xr.Dataset
is like the nuclear option - it's potentially possible, it's very powerful (as it allows you to extend the xarray data model), it makes your intentions very clear, but it's also very complicated and dangerous.
The reason we don't need to subclass in this case because we are not adding anything to the xarray data model, we're just using the same data model for a different purpose.
Another way to close this would just add more explanation to the documentation, e.g. a section like:
I do like adding that wording to the documentation to make it very explicit that a virtual dataset is an xr.Dataset
but with limited functionality, and users have to concretize a virtual dataset into a Zarr store or a Kerchunk manifest.
This is a really interesting point and caused me some confusion as well. I am wondering if this is more reason to keep the "special" open_virtual_dataset
rather than trying to register virtualizarr as a backend for open_dataset
.
That's a great point @jsignell - using xr.open_dataset
would only be more confusing (xref #35).
We're realistically not going to actually do class VirtualDataset(xr.Dataset):
, so I think this was really closed by the documentation additions in #173.
What was unclear to me was that there are 2 kinds of
xr.Dataset
objects I encountered with VirtualiZarr: the one returned byvirtualizarr.open_virtual_dataset
and the one returned byxr.open_dataset(virtualizarr_produced_kerchunk_or_zarr_store)
.The first
xr.Dataset
is good for concating/merging and writing to storage so that it can be read by xarray later into a newxr.Dataset
which can be read. I know this is said on this page, but only in hindsight did I understand the implication:If I could wave a magic wand to make it better I would not want the return value of
open_virtual_dataset
to be anxr.Dataset
because that dataset doesn't behave like otherxr.Dataset
and instead return something like aVirtualiZarr.Dataset
which only has the functions that do work, but I understand thatxr.Dataset
is close enough and users (me) shouldn't expect it to load data.Originally posted by @ghidalgo3 in https://github.com/zarr-developers/VirtualiZarr/issues/114#issuecomment-2200627225