zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0

Support for groups #84

Open TomNicholas opened 7 months ago

TomNicholas commented 7 months ago

We should support generating references from files containing multiple groups in the same way that xr.open_dataset and datatree.open_datatree work.

So we should add a new open_virtual_datatree function, and a new (optional) group kwarg to open_virtual_dataset.

This can be done right now using the datatree package (as an optional dependency imported inside open_virtual_datatree) but once that gets merged into xarray main (which is happening right now) we can get rid of that dependency.

See https://github.com/TomNicholas/VirtualiZarr/issues/78#issuecomment-2059737479 and #11.

cc @sharkinsspatial

TomNicholas commented 7 months ago

One thing I realized about this is that concatenating multiple DataTree objects is currently a little awkward. I don't know if this is actually a common pattern, but imagine you had two netCDF files each with groups, and you wanted to concatenate group1 in file1 with group1 in file2 etc.

Adding open_virtual_datatree would allow you to open the files like this:

vdt1 = open_virtual_datatree('file1.nc')
vdt2 = open_virtual_datatree('file2.nc')

but currently you can't do

combined_vdt = xr.concat([vdt1, vdt2], dim='time')

because xr.concat doesn't understand DataTree objects. To get around this you should be able to do

from datatree import map_over_subtree

concat_datatrees = map_over_subtree(xr.concat)

combined_vdt = concat_datatrees([vdt1, vdt2], dim='time')

but it raises the question of whether the xarray DataTree upstream integration should include generalizing xr.concat.
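For the use case described above (concatenating group1 in file1 with group1 in file2), one minimal sketch that sidesteps DataTree entirely is to hold each file's groups in a plain dict of xr.Dataset objects and concatenate matching entries. The make_groups helper and the in-memory data are stand-ins invented for illustration; they are not part of VirtualiZarr or xarray, and the sketch assumes both files contain exactly the same group names.

```python
import numpy as np
import xarray as xr


def make_groups(time_values):
    """Hypothetical stand-in for one file: a dict of group name -> Dataset."""
    n = len(time_values)
    return {
        "group1": xr.Dataset(
            {"a": ("time", np.arange(n))}, coords={"time": time_values}
        ),
        "group2": xr.Dataset(
            {"b": ("time", np.arange(n) * 10)}, coords={"time": time_values}
        ),
    }


file1_groups = make_groups([0, 1, 2])
file2_groups = make_groups([3, 4, 5])

# Concatenate matching groups pairwise along the shared "time" dimension.
# Assumes both dicts have identical keys; a KeyError is raised otherwise.
combined = {
    name: xr.concat([file1_groups[name], file2_groups[name]], dim="time")
    for name in file1_groups
}
```

The result is again a dict of Datasets, one per group, which could later be assembled into a tree.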

jonas-spaeth commented 5 months ago

Hi @TomNicholas

I often have the problem that I want to concat different datatrees into a single xr.Dataset again. I came across your code above and tried it, but I get an error.

Generate some sample data:

import datatree
import xarray as xr

ds1 = xr.Dataset(
    data_vars=dict(a=("x", [11, 22, 33])),
    coords=dict(x=[1,2,3])
)
ds2 = xr.Dataset(
    data_vars=dict(a=("x", [111, 222, 333])),
    coords=dict(x=[1,2,3])
)

mytree = datatree.DataTree.from_dict({"two_digits": ds1, "three_digits": ds2})
print(mytree)

output:

DataTree('None', parent=None)
├── DataTree('two_digits')
│       Dimensions:  (x: 3)
│       Coordinates:
│         * x        (x) int64 24B 1 2 3
│       Data variables:
│           a        (x) int64 24B 11 22 33
└── DataTree('three_digits')
        Dimensions:  (x: 3)
        Coordinates:
          * x        (x) int64 24B 1 2 3
        Data variables:
            a        (x) int64 24B 111 222 333

I tried:

# fails
concat_datatrees = datatree.map_over_subtree(xr.concat)
ds_concatenated = concat_datatrees(mytree, dim='digits')
# also fails:
# ds_concatenated = concat_datatrees([mytree.two_digits, mytree.three_digits], dim='digits')

output:

TypeError: can only concatenate xarray Dataset and DataArray objects, got <class 'str'>
Raised whilst mapping function over node with path /two_digits

This works, but it is not very elegant:

# works
ds_concatenated = xr.concat([mytree[subtree].ds for subtree in mytree], dim="digits")
print(ds_concatenated)

output:

<xarray.Dataset> Size: 72B
Dimensions:  (digits: 2, x: 3)
Coordinates:
  * x        (x) int64 24B 1 2 3
Dimensions without coordinates: digits
Data variables:
    a        (digits, x) int64 48B 11 22 33 111 222 333

Do you have a suggestion of how to deal with such situations? Thanks!

TomNicholas commented 5 months ago

Thanks for trying this @jonas-spaeth ! I now realise that I didn't think hard enough before making this suggestion 😅

This is an issue with xarray-datatree, not with virtualizarr at all, so I will re-raise this on the xarray repo instead and we can continue discussion there.

# fails
concat_datatrees = datatree.map_over_subtree(xr.concat)
ds_concatenated = concat_datatrees(mytree, dim='digits')

I think your first attempt is just an incorrect use of the map_over_subtree decorator, as calling xr.concat(mytree.ds, dim='digits') wouldn't work either (notice the lack of square brackets around mytree.ds - xr.concat expects a list).

# also fails:
# ds_concatenated = concat_datatrees([mytree.two_digits, mytree.three_digits], dim='digits')

This is more troubling, but I've realised why it doesn't work. Basically map_over_subtree currently looks for positional and keyword arguments that are DataTree objects, and iterates over the .ds in them.

https://github.com/pydata/xarray/blob/2e0dd6f2779756c9c1c04f14b7937c3b214a0fc9/xarray/core/datatree_mapping.py#L128

But I never thought to make map_over_subtree understand lists of DataTree objects... In general perhaps we should support Iterable[DataTree]? Then this example should work.
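To make the Iterable[DataTree] idea concrete, here is a rough sketch of a wrapper that accepts a sequence of trees and calls the wrapped function once per group with the list of Datasets found at that path. It is not how map_over_subtree is implemented: the name map_over_groups is hypothetical, and each tree is modelled as a flat mapping from group path to xr.Dataset rather than a real DataTree.

```python
from typing import Callable, Mapping, Sequence

import xarray as xr

# Hypothetical stand-in for a tree: group path -> Dataset.
Tree = Mapping[str, xr.Dataset]


def map_over_groups(func: Callable) -> Callable:
    """Wrap func so that its first argument may be a sequence of trees.

    func is called once per group path, receiving the list of Datasets
    stored at that path in every tree.
    """

    def wrapper(trees: Sequence[Tree], *args, **kwargs) -> dict:
        if not trees:
            raise ValueError("need at least one tree")
        paths = set(trees[0])
        for tree in trees[1:]:
            # All trees must share the same set of group paths.
            if set(tree) != paths:
                raise ValueError("all trees must contain the same group paths")
        return {
            path: func([tree[path] for tree in trees], *args, **kwargs)
            for path in trees[0]
        }

    return wrapper


concat_trees = map_over_groups(xr.concat)

tree1 = {"/two_digits": xr.Dataset({"a": ("x", [11, 22, 33])})}
tree2 = {"/two_digits": xr.Dataset({"a": ("x", [44, 55, 66])})}
result = concat_trees([tree1, tree2], dim="digits")
```

The check that every tree has the same group paths is the part a real implementation would need to generalise to nested trees.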

We might imagine changing xr.concat to automatically handle the map_over_subtree part - I'll raise an issue for that too.

# works
ds_concatenated = xr.concat([mytree[subtree].ds for subtree in mytree], dim="digits")

For now I think this is your only option.

TomNicholas commented 4 months ago

Note that https://github.com/pydata/xarray/issues/9077 might affect this: if datatree becomes slightly less general, then there could in theory be some netCDF files that cannot be opened as a DataTree containing ManifestArray objects. You could still open the groups individually and combine them. Also, any virtual datatree that would become forbidden is one that you wouldn't be able to open later for analysis as a normal loadable datatree anyway.

TomNicholas commented 1 month ago

Also this issue wouldn't really be closed until we also have the ability to do dt.to_kerchunk/.to_zarr - i.e. we need a VirtualDataTreeAccessor in addition to our existing VirtualDatasetAccessor, which can write out the references for all the groups at once into one kerchunk file/zarr store.

EDIT: Serialization of DataTree objects is tracked separately in #244
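As a rough illustration of what "write out the references for all the groups at once" could mean for the kerchunk case, the sketch below merges per-group reference dicts into one by prefixing every key with its group path. It assumes the kerchunk version 1 JSON layout, where "refs" maps Zarr store keys to inline metadata strings or [url, offset, length] entries. The merge_group_refs name, the file name and the byte ranges are all made up for the example; this is not the VirtualiZarr API.

```python
import json


def merge_group_refs(refs_by_group: dict) -> dict:
    """Combine per-group reference dicts into one set of references.

    Hypothetical helper: each group's keys are re-rooted under the group
    path, and a root-level .zgroup entry is added for the combined store.
    """
    merged = {".zgroup": json.dumps({"zarr_format": 2})}
    for group, refs in refs_by_group.items():
        prefix = group.strip("/")
        for key, value in refs["refs"].items():
            merged[f"{prefix}/{key}"] = value
    return {"version": 1, "refs": merged}


# Illustrative per-group references (file name and byte ranges are invented).
group1 = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "a/0": ["file1.nc", 100, 24],
    },
}
group2 = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "b/0": ["file1.nc", 200, 24],
    },
}

combined = merge_group_refs({"group1": group1, "group2": group2})
```

Each group's own .zgroup entry ends up at e.g. group1/.zgroup, so the merged references describe a root group with one child group per input.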