xarray-contrib / datatree

WIP implementation of a tree-like hierarchical data structure for xarray.
https://xarray-datatree.readthedocs.io
Apache License 2.0
161 stars 43 forks source link

Resolve dimension coordinates from parents #297

Open dehorsley opened 6 months ago

dehorsley commented 6 months ago

Basically this issue but for datatree https://github.com/pydata/xarray/issues/1982

We have some hourly and weekly data in different groups, and shared coordinates in the root. For example, I have a netCDF with the following structure:

netcdf test {
dimensions:
        storage = 11 ;
        sample = 780 ;
        case = 1 ;
        generator = 31 ;
        line = 1 ;
        region = 1 ;
        price_scenario = 78 ;
variables:
        string storage(storage) ;
        float sample(sample) ;
        string case(case) ;
        string generator(generator) ;
        string line(line) ;
        string region(region) ;
        int price_scenario(price_scenario) ;

group: weekly {
  dimensions:
        time = 549 ;
  variables:
        int time(time) ;
                time:units = "Seconds since 1970" ;
        float generation(time, generator, sample, case) ;
                generation:units = "GWh" ;
        float storage_volume(time, storage, sample, case) ;
                storage_volume:units = "CMD" ;
        float line_flow(time, line, sample, case) ;
                line_flow:units = "GWh" ;
        float line_flow_back(time, line, sample, case) ;
                line_flow_back:units = "GWh" ;
        float generator_offer(time, generator, sample, case) ;
                generator_offer:units = "$/MWh" ;
  } // group weekly

group: hourly {
  dimensions:
        time = 92232 ;
  variables:
        int time(time) ;
                time:units = "Seconds since 1970" ;
        float region_price(time, region, price_scenario, case) ;
                region_price:units = "$/MWh" ;
  } // group hourly
}

I was hoping datatree would add the shared coordinates variables when I access one of the child groups, but instead if I access one of the Datasets, I have a whole bunch of dimensions without coordinates. Eg:

image

Is this something you think datatree should/will address? Thanks!

keewis commented 6 months ago

the trouble here is that there are many datasets where sharing coordinates is not desired or possible, or it mutates the dataset in a way that makes it invalid. So I don't think we should do this by default, and doing so on variable access just makes the code (and behavior) way too complicated.

However, what I could imagine is adding a method that you can explicitly call to (shallow) copy the variables from the parent nodes to the children.

dehorsley commented 5 months ago

If I understand correctly, this is how the CF Conventions scoping is supposed to work. So I'd imagine there is a great number of datasets out there where this is how you'd want to resolve coordinates, and certainly something worth making easy for users. I'd personally love to see CF Conventions by default, and optionally disabled for using non-CF datasets.

marcel-goldschen-ohm commented 4 months ago

I agree with @TomNicholas that this should not be default behavior, but would be a VERY useful option. Here is some code for a tree of xarray Datasets that I'm thinking of abandoning in favor of DataTree, but it does provide some functions for inheriting data or coords from ancestors. See the inherited_data and inherited_coord methods. Might be a good start for what you want.

TomNicholas commented 4 months ago

So the reason I did not implement this "inheritance" of information from parent groups is that (a) it was considerably simpler not to and (b) I thought it could lead to inconsistent states.

However, today I had a long discussion with @shoyer about this (plus @flamingbear, @etienneschalk and @eni-awowale), and now I think this might be possible.

My original concern was what happens if I inherit a variable with a dimension with the same name but different length to what's already present in this node? i.e.

- node A
| dims: {time: 10}
| vars: [time]
|--  node B
     dims: {time: 20}
     vars: [foo, bar, ...]

where the variables foo, bar have a time dimension of length 20. In this situation calling B.ds['time'] would look upwards to find the length 10 time variable to use, but then it wouldn't be consistent with the time dimensions of foo and bar. It would be attempting to create a group that violated xarray's requirements to be a valid Dataset object (i.e. that variables can't have different lengths along dimensions of the same name).

This difficulty is why I did not implement this feature. But looking again today at the CF conventions for groups, it says

If any dimension of an out-of-group variable has the same name as a dimension of the referring variable, the two must be the same dimension (i.e. they must have the same netCDF dimension ID).

In the xarray model without explicit dimension objects, this translates to "the two dimensions must be the same length". Interpreting this "must" to mean "if this is not the case don't try to inherit this variable", I think this caveat resolves the concern I had.


We could imagine changing the behaviour of DataTree.__getitem__(key) to:

  1. If key exists on this node, return the variable under that name,
  2. If not, look upwards through parent groups until we find a variable of that name,
  3. If found, check that dimensions are compatible with the dimensions of variables on the calling group,
  4. If not found or the only found variables have incompatible-length dimensions, raise KeyError.

(This is essentially the "search by proximity" described in the CF conventions, just without the deprecated "lateral search" stuff.)

The questions now are:

TomNicholas commented 4 months ago

A practical API question here is whether or not inherited variables are considered to be "part of" the child DataTree node. For example, given the following (which unlike the example in the previous comment is valid and safe because the dimension lengths match):

dt
- node A
| dims: {time: 10}
| vars: [time]
|--  node B
     dims: {time: 10}
     vars: [foo, bar, ...]

should dt['/B'].variables return a collection that includes time or not? Internally dt['/B']._variables probably should not, else updating the variables on the node B would become messy. dt['/B'].ds.variables should definitely contain time, but it's not clear if dt['/B'] should display "local" or "global" information. This is conceptually somewhat similar to the issues in https://github.com/xarray-contrib/datatree/issues/240.

shoyer commented 4 months ago

dt['/B'].ds.variables should definitely contain time, but it's not clear if dt['/B'] should display "local" or "global" information. This is conceptually somewhat similar to the issues in #240.

I think dt['B'] and dt['/B'] should always resolve to the same object, with "global" information. Otherwise the search mechanism starts to impact the API itself.

We could easily have another interface for surfacing only the "local" dataset (e.g., dt['B'].node_dataset).

Incidentally, this would be a reasonable way to organize the contents of a node:

TomNicholas commented 4 months ago

I think dt['B'] and dt['/B'] should always resolve to the same object, with "global" information. Otherwise the search mechanism starts to impact the API itself.

dt['/B'] and dt['B'] are the same (if you're at the root), the question is if dt['B'] and dt['B'].ds should display the same variables, or if only the latter should display inherited variables.

Incidentally, this would be a reasonable way to organize the contents of a node:

  • A public .local_dataset attribute with a local DatasetView, constructed from the data in _dataset

I quite like this idea. Providing a .local_dataset might also ease the datatree.DataTree->xarray.DataTree transition for existing users because then they can just change .ds->.local_dataset anywhere that they are worried and it will definitely work as before.

  • A private _dataset attribute on each node with an xarray.Dataset containing the underlying data.

I think I can test the inheritance idea orthogonally from this suggested refactor.

TomNicholas commented 4 months ago

Does it conflict with other features of datatree, such as map_over_subtree?

If a variable at the root is inherited down through the whole tree, it's now effectively present everywhere. If I then map a function/method over the whole tree with map_over_subtree, it's going to apply that function again and again to the same inherited variable for every single node in the tree. This seems silly... At the very least we wouldn't want to duplicate that work. This might be an argument for having map_over_subtree only map over leaves of the tree (because now the leaves will be populated by the inherited variables). That would be another big change though.

EDIT: Of course inheriting variables downwards + mapping only over the leaves does not add up to give equivalent behaviour to not inheriting + mapping over the whole tree...