zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0
67 stars 10 forks source link

VirtualDataset class instead of "virtual xr.Dataset" #171

Open TomNicholas opened 4 days ago

TomNicholas commented 4 days ago

What was unclear to me was that there are 2 kinds of xr.Dataset objects I encountered with VirtualiZarr: the one returned by virtualizarr.open_virtual_dataset and the one returned by xr.open_dataset(virtualizarr_produced_kerchunk_or_zarr_store).

The first xr.Dataset is good for concating/merging and writing to storage so that it can be read by xarray later into a new xr.Dataset which can be read. I know this is said on this page, but only in hindsight did I understand the implication:

VirtualiZarr aims to allow you to use the same xarray incantation you would normally use to open and combine all your files, but cache that result as a virtual Zarr store.

If I could wave a magic wand to make it better I would not want the return value of open_virtual_dataset to be an xr.Dataset because that dataset doesn't behave like other xr.Dataset and instead return something like a VirtualiZarr.Dataset which only has the functions that do work, but I understand that xr.Dataset is close enough and users (me) shouldn't expect it to load data.

Originally posted by @ghidalgo3 in https://github.com/zarr-developers/VirtualiZarr/issues/114#issuecomment-2200627225

TomNicholas commented 4 days ago

@ghidalgo that is a totally fair complaint, that distinction between an air-quotes "virtual dataset" and a normal xarray dataset not being reflected in the top-level type is definitely confusing.

Whilst the whole point of this package was to avoid having to write a new data structure with that re-implements concat and merge with named dims like xr.Dataset, we could actually imagine creating a virtualizarr.VirtualDataset class by subclassing xr.Dataset. We also wouldn't need the .virtualize accessor then, because we could just use actual methods on the subclass.

class VirtualDataset(xr.Dataset):
    def to_kerchunk(self):
        ...

    def to_zarr(self):
        ...

(Could you teach such a class that calling xr.merge([virtual_ds, normal_ds]) would always return type VirtualDataset?)

Subclassing is only kind of supported in xarray, but it could be, and this is a fairly simple application of it (see https://github.com/pydata/xarray/issues/3980).

The downsides of this idea are: a) That's a lot of effort for something that works perfectly well already, b) The line between a virtual and normal dataset is somewhat blurred because you can have an xr.Dataset containing both in-memory numpy arrays and ManifestArrays. Then doing .drop_vars on all the variables backed by ManifestArray would get you back to a normal xarray Dataset. c) I think it technically violates the Liskov Substitution Principle (valid methods on xr.Dataset are invalid on VirtualDataset).

TomNicholas commented 4 days ago

Another way to close this would just add more explanation to the documentation, e.g. a section like:

### Virtual datasets are different to normal xarray Datasets!

A "virtual dataset" is any `xr.Dataset` which wraps one or more `ManifestArray` objects.

Although the top-level type is still xr.Dataset, they are intended only as an abstract representation of a set of files, not as something you can do analysis with. 

They only support a very limited subset of normal xarray operations, particularly:

- `concat`
- `merge`
- `.rename`...
TomNicholas commented 4 days ago

FWIW here my intuition is that subclassing xr.Dataset is like the nuclear option - it's potentially possible, it's very powerful (as it allows you to extend the xarray data model), it makes your intentions very clear, but it's also very complicated and dangerous.

The reason we don't need to subclass in this case because we are not adding anything to the xarray data model, we're just using the same data model for a different purpose.

ghidalgo3 commented 4 days ago

Another way to close this would just add more explanation to the documentation, e.g. a section like:

I do like adding that wording to the documentation to make it very explicit that a virtual dataset is an xr.Dataset but with limited functionality, and users have to concretize a virtual dataset into a Zarr store or a Kerchunk manifest.

jsignell commented 3 days ago

This is a really interesting point and caused me some confusion as well. I am wondering if this is more reason to keep the "special" open_virtual_dataset rather than trying to register virtualizarr as a backend for open_dataset.

TomNicholas commented 3 days ago

That's a great point @jsignell - using xr.open_dataset would only be more confusing (xref #35).