xarray-contrib / xarray-simlab

Xarray extension and framework for computer model simulations
http://xarray-simlab.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Use xarray data structures inside models #141

Open benbovy opened 4 years ago

benbovy commented 4 years ago

from @eho-tacc's https://github.com/benbovy/xarray-simlab/issues/140#issuecomment-709586253

I think it would be really useful to expose variable metadata in the group (and generally across the package, but that's a different point)

Even though I haven't had any use case for this yet, accessing variable metadata from inside process class methods would probably make sense indeed. Right now xarray data structures are used for the model "outer" interface only, but I've been wondering if it would make sense to also leverage it inside models.

Since xsimlab.variable attributes contain all the metadata needed to wrap their values as xarray variables in model inputs/outputs, nothing prevents doing the same in process classes too, i.e.,

@xs.process
class Foo:
    bar = xsimlab.variable(...)

    def initialize(self):
        x = self.bar  # self.bar returns an xarray.Variable or an xarray.DataArray object

We could probably look model-wise for xsimlab.index variables to automatically populate xarray.DataArray coordinates.

That said, I'm not sure if the example above should be the default behavior (this would be a breaking change). Maybe an option or flag exposed somewhere? I don't know where exactly... Or maybe an explicit function? E.g.,

@xs.process
class Foo:
    bar = xsimlab.variable(...)

    def initialize(self):
        x = xsimlab.getattr_as_dataarray(self, 'bar')

I like the latter option, although it might be quite verbose if we want this as the default behavior.
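
A minimal sketch of what such an explicit helper might do, assuming the declared dimensions and the index values are already at hand. `wrap_as_dataarray` and its arguments are hypothetical names, not part of xarray-simlab; the only real API used is the `xarray.DataArray` constructor.

```python
import numpy as np
import xarray as xr


def wrap_as_dataarray(value, dims, index_values):
    """Wrap a raw array as a DataArray (hypothetical helper, not xsimlab API).

    dims: dimension labels declared for the variable, e.g. ('x', 'y').
    index_values: mapping of dimension label -> 1-d coordinate values,
        standing in for values collected from index variables in the model.
    """
    if isinstance(value, xr.DataArray):
        # Already labelled: return it unchanged.
        return value
    data = np.asarray(value)
    if data.ndim != len(dims):
        raise ValueError(f"expected {len(dims)} dimension(s), got {data.ndim}")
    # Only attach coordinates for dimensions that have index values.
    coords = {d: index_values[d] for d in dims if d in index_values}
    return xr.DataArray(data, dims=dims, coords=coords)


bar = wrap_as_dataarray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    index_values={"x": [10, 20], "y": [0.0, 0.5, 1.0]},
)
```

In a real implementation the `dims` and `index_values` arguments would presumably be looked up from the process class and the model rather than passed by hand.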

ethho commented 4 years ago

Thanks for the reply @benbovy.

Accessing variable metadata from inside process class methods would probably make sense indeed.

I have spent little time in the package code, but it seems that one could quite easily expose the underlying Attribute of a variable by implementing a function similar to xsimlab.variable_info (which appears to call utils.variable_dict). From a user standpoint, I could imagine this taking one of several forms:

@xs.process
class Foo:
    # Kwarg in `xsimlab.variable` triggers call to `utils.variable_dict` or similar
    as_attr = xsimlab.variable(..., as_attr=True)

    def initialize(self):
        assert isinstance(self.as_attr, attr.Attribute)

Caveat of above: I don't believe that attr.Attribute includes the variable value. If not, one could imagine a new or modified function like:

@xs.process
class Foo:
    # Retrieve the value separately
    bar = xsimlab.variable(...)

    # This function would now return an item from `utils.variable_dict()`
    bar_metadata = xsimlab.variable_info(..., verbose=False)

    # ...or just a new function
    bat_metadata = xsimlab.variable_metadata(...)

We could probably look model-wise for xsimlab.index variables to automatically populate xarray.DataArray coordinates.

I agree that the second (non-breaking) option is preferable here.

A user-side opinion about this feature: I will almost always prefer to write my own process InitializeArray and construct the xarray.DataArray from scratch. While getattr_as_dataarray would be a nice feature, I think that users should be able to access Attribute.metadata.dims and construct the DataArray manually (as in #140), without having to rely on this new method. In other words, compared to getattr_as_dataarray, the user gets the same, slightly more declarative/explicit behavior by using a group_dict that aggregates a set of xs.index(groups=['coords']).
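
For reference, this is roughly how metadata on an `attrs` field is read in general, and why the field definition does not carry the instance value. The class below is a plain `attrs` class with an illustrative `dims` metadata key; it is not an xsimlab process, and the metadata layout that xsimlab actually uses may differ.

```python
import attr


@attr.s
class Foo:
    # Plain attrs field with an illustrative `dims` entry in its metadata.
    bar = attr.ib(default=0.0, metadata={"dims": ("x", "y")})


foo = Foo(bar=1.5)

# attr.fields() returns the field definitions of the class, not of an instance.
bar_field = attr.fields(Foo).bar

dims = bar_field.metadata["dims"]  # metadata is a read-only mapping
value = foo.bar  # the value lives on the instance only
```

Note that `metadata` is a mapping, so the dimensions are read with a subscript (`metadata["dims"]`) rather than attribute access.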

benbovy commented 4 years ago

Mmm I'm not sure that we really need something like xsimlab.variable_info to get variable metadata inside process classes:

I'm not sure either if it's a good idea to manually create xarray.DataArray from scratch in process classes. It would be too easy to create inconsistent metadata, e.g.,

@xs.process
class Foo:
    bar = xsimlab.variable(dims=('x', 'y'))

    def initialize(self):
        self.bar = xarray.DataArray(..., dims=('y', 'z'))

From your example in #140, it would be definitely possible to automatically create a DataArray for your InitArray.arr attribute with the dimensions and coordinates that you want. This would be less error-prone.
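
The kind of consistency check that would catch the mismatch in the `Foo` example above can be sketched without any library. `check_dims` is a hypothetical helper, and the declared dimensions are passed in by hand here rather than read from the variable's metadata.

```python
def check_dims(declared, actual):
    """Raise if a labelled value's dimensions differ from the declared ones.

    Hypothetical helper: `declared` is the tuple given in the variable
    declaration, `actual` is the `dims` tuple of the value being assigned.
    Order is ignored, since a value could be transposed to fit.
    """
    if set(declared) != set(actual):
        raise ValueError(
            f"value has dimensions {tuple(actual)}, "
            f"but the variable declares {tuple(declared)}"
        )


check_dims(("x", "y"), ("y", "x"))  # same labels, different order: accepted

try:
    check_dims(("x", "y"), ("y", "z"))  # the mismatch from the example above
except ValueError as err:
    message = str(err)
```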

ethho commented 4 years ago

  • Using an attr.Attribute to store an attr.Attribute looks a bit weird to me.

Good point. I overlooked the fact that as_attr in the above example would itself be an Attribute.

I'm not sure that we really need something like xsimlab.variable_info to get variable metadata inside process classes

I see your point here. Once the DataArray is constructed, one already has access to all the coords, dims, attrs, and the rest of the xarray API. While getattr_as_dataarray would be perfectly viable for my use case, I still feel that there are cases in which the user would want full control over the DataArray construction. Workflows like self.bar = pickle.load(open('bar_data_array.pckl', 'rb')) come to mind, but I'll come back here if/when I come up with more concrete examples.

Again, thanks for your responsiveness and great work on this project!

benbovy commented 4 years ago

Thanks!

I'll come back here if/when I come up with more concrete examples.

Yes please! This is very much appreciated!

benbovy commented 4 years ago

Thinking again about this, a possible API would be:

# Get the value of `self.var` as a DataArray. If `self.var` is not a DataArray,
# construct a new DataArray on the fly by retrieving metadata and coordinates
# from the model. If it is already a DataArray, simply return it.

value = xsimlab.getattr_as_dataarray(self, 'var')

# Set `self.var` with `value` coerced into a DataArray. If `value` is not a DataArray,
# try creating a new one by retrieving metadata and coordinates from the model.
# If it is already a DataArray, perform some sanity checks to ensure that dimensions are
# compatible and add missing coordinates / attributes.

xsimlab.setattr_as_dataarray(self, 'var', value)

The user still has full control over the values assigned to 'inout'/'out' variables, but using the functions above provides both a convenient and safe way to get/set values. I think it's safer to let xarray-simlab handle coercing values into DataArray objects -- i.e., infer dimension labels from the shape of the (unlabelled) input array, maybe transpose the dimensions of the input DataArray, etc. -- rather than let the user do it manually. It can all be done automatically.

We could expose a global option in xarray-simlab so that the two functions above are implicitly called in xsimlab variables' getter/setter properties, respectively. With this option activated, the code below would have exactly the same behavior as the code above:

value = self.var

self.var = value
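
Two of the coercion steps mentioned in this comment, inferring dimension labels from the shape of an unlabelled array and finding the transposition for a labelled one, can be sketched in plain Python. The function names, the list of allowed label tuples, and the `dim_sizes` mapping are all assumptions made for illustration; none of this is xarray-simlab API.

```python
def infer_dims(shape, dims_options, dim_sizes):
    """Pick the declared dimension labels that fit an unlabelled array.

    Hypothetical helper, not xsimlab API.
    shape: shape of the unlabelled array, e.g. (3, 2).
    dims_options: allowed label tuples, e.g. [(), ('x',), ('x', 'y')].
    dim_sizes: known dimension lengths, e.g. taken from index variables.
    """
    matches = []
    for dims in dims_options:
        if len(dims) != len(shape):
            continue
        # A dimension of unknown size is accepted; a known size must match.
        if all(dim_sizes.get(d, n) == n for d, n in zip(dims, shape)):
            matches.append(tuple(dims))
    if len(matches) != 1:
        raise ValueError(f"cannot infer dimensions for shape {shape}: {matches}")
    return matches[0]


def transpose_axes(actual_dims, declared_dims):
    """Return the axis order that rearranges actual_dims into declared_dims."""
    if sorted(actual_dims) != sorted(declared_dims):
        raise ValueError(f"{actual_dims} is incompatible with {declared_dims}")
    return tuple(actual_dims.index(d) for d in declared_dims)


sizes = {"x": 2, "y": 3}
options = [("x",), ("x", "y"), ("y", "x")]
dims = infer_dims((3, 2), options, sizes)  # only ('y', 'x') fits this shape
axes = transpose_axes(dims, ("x", "y"))  # axis order to pass to a transpose
```

The sketch raises when the shape is ambiguous (for example a square array with two equally sized dimensions), which is one case where explicit labels from the user would still be needed.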

benbovy commented 4 years ago

We could expose a global option in Xarray-simlab so that the two functions above are implicitly called in xsimlab variables' getter/setter properties

Or maybe the right place to expose this option is the xs.process() decorator.

feefladder commented 3 years ago

A concrete example I am working on now is a model in which a process would obtain, say, ISRIC soil data using OWSLib's WebCoverageService and then load it with rioxarray. Then if you have an input rioxarray raster, convenience functions such as reproject_match could be used to match the downloaded raster(s) to an example raster.