xarray-contrib / xarray-simlab

Xarray extension and framework for computer model simulations
http://xarray-simlab.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Resize index variable dynamically #163

Closed jvail closed 3 years ago

jvail commented 3 years ago

Hi,

First of all: great library! I was about to create something similar myself, and I am happy I found xarray-simlab. It saves a lot of time, and I doubt I could have come up with something as good and solid.

I have two dimensions (time and X, where X is an xs.index) and I do not know upfront how large X will be, i.e. the length of the 1-d array. It is necessary to resize X, i.e. add entries to it, while the simulation runs (the process where X is defined takes care of that). If X is resized, I need to resize all variables in all processes that have the dims property set to ('X',).
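
To illustrate, here is a minimal sketch of the kind of setup I mean (the process and variable names are just placeholders):

import numpy as np
import xsimlab as xs


@xs.process
class Grid:
    # the index variable whose final length is not known upfront
    x = xs.index(dims='X')

    def initialize(self):
        self.x = np.arange(3)

    def run_step(self):
        # entries may be added to X while the simulation runs
        self.x = np.append(self.x, self.x.size)


@xs.process
class SomeProcess:
    x = xs.foreign(Grid, 'x')
    # every variable declared with dims=('X',) has to follow that resizing
    value = xs.variable(dims='X', intent='out')

    def initialize(self):
        self.value = np.zeros(self.x.size)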

I was wondering if you have an idea how to do that most efficiently. Is there a way to avoid resizing all variables inside each process's run_step function?

benbovy commented 3 years ago

Hi @jvail,

Thanks!

There's no built-in way in xarray-simlab to resize data along given dimensions. I'm not sure how xarray-simlab could take care of this, as the dims of process variables are just labels used to deal with xarray datasets in model inputs/outputs.

Automatically resizing variables in all processes is a complex problem that may have many sub-problems (e.g., process dependencies vs. when to resize the variables), so I'm afraid you'd still need to handle this "manually".

Maybe you could use runtime hooks to retrieve all variables with dims=('X',) and then resize their array value at each step if needed. The state argument of runtime hook functions is a read-only dictionary, but you can still modify the values (arrays) in place. That feels hacky to me, though. Also, runtime hooks are not attached to Model objects, so you would have to provide the hooks to Dataset.xsimlab.run each time you want to run a simulation.
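
Roughly, such a hook could look like this (a sketch only, with hypothetical process/variable names: ('foo', 'idx') would be the X index variable and ('bar', 'value') a variable declared with dims=('X',); the state dictionary is keyed by (process_name, variable_name) tuples):

import xsimlab as xs

# hypothetical (process_name, variable_name) keys of the variables to resize
X_VARS = [('bar', 'value')]


@xs.runtime_hook('run_step', level='model', trigger='post')
def resize_x_vars(model, context, state):
    new_size = state[('foo', 'idx')].size

    for key in X_VARS:
        arr = state[key]
        if arr.size != new_size:
            # the state mapping is read-only, so the arrays have to be
            # resized in place (only works if the array owns its data)
            arr.resize(new_size, refcheck=False)


# hooks are not attached to the model, so they must be passed at run time:
# out_ds = in_ds.xsimlab.run(model=model, hooks=[resize_x_vars])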

Maybe #141 would help if we add an option like align=True that would resize the 1-d array to the size of the coordinate found in the model for this dimension. It gets complicated, though (perhaps in other cases we'd want to resize the coordinate to the data instead?).

jvail commented 3 years ago

Hey @benbovy ,

thank you for sharing your ideas. I definitely need to dig deeper into your source code to get a better idea of how this could be done. Ideally, I would like to do something like watch all index vars and, upon change, update all vars that use this dimension. I suppose, since I need access to all processes, it must happen somewhere in the model or the driver. But I'd rather avoid subclassing right now.

I will give it a try with your suggestion (runtime hook) first and see how far I can get.

Maybe #141 would help if we add an option like align=True that would resize the 1-d array to the size of the coordinate found in the model for this dimension. It gets complicated, though (perhaps in other cases we'd want to resize the coordinate to the data instead?).

Not sure if I understand that correctly. If align=True were a property of xs.index, meaning it would automatically update all vars when this index changes, then that is exactly what I am looking for. And right, it might get pretty complicated if you have complex shapes instead of a 1-d array.

If you don't mind I'll leave this issue open for a while.

benbovy commented 3 years ago

What I suggest in #141 is something like xsimlab.getattr_as_dataarray(self, 'var') that would return the value of self.var inside a process class as an xarray.DataArray object. When constructing the DataArray object, we could scan the model variables to see if there is an xs.index variable for each of the dimensions defined for self.var, and if that's the case we could include it as a coordinate.

With xsimlab.getattr_as_dataarray(self, 'var', align=True), we could also check if the size of the index variable matches the size of self.var, and resize it if needed.
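
Usage could then look roughly like this (just a sketch of the #141 proposal; xsimlab.getattr_as_dataarray does not exist yet):

import xsimlab as xs


@xs.process
class Bar:
    var = xs.variable(dims='X', intent='inout')

    def run_step(self):
        # proposed helper (not implemented): wrap self.var as a DataArray,
        # attach the model's X index as the 'X' coordinate and, with
        # align=True, resize the array to match that index
        da = xs.getattr_as_dataarray(self, 'var', align=True)
        ...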

But I'd rather avoid subclassing right now.

Yeah, I wouldn't recommend it.

jvail commented 3 years ago

Thank you for clarifying.

What I suggest in #141 is something like xsimlab.getattr_as_dataarray(self, 'var') that would return the value of self.var inside a process class as an xarray.DataArray object. When constructing the DataArray object, we could scan the model variables to see if there is an xs.index variable for each of the dimensions defined for self.var, and if that's the case we could include it as a coordinate.

Having access to the xarray object itself within a process is certainly useful, and having the dimension labeled as in the index variable (as far as I understand the xarray vocabulary; sorry, I'm just starting to learn to work with xarray) even more so.

With xsimlab.getattr_as_dataarray(self, 'var', align=True), we could also check if the size of the index variable matches the size of self.var, and resize it if needed.

Hm, I don't think this would help me a lot in my use case. Then I would need to replace all self.var with xsimlab.getattr_as_dataarray(self, 'var', align=True) and then convert back to numpy in each process's run_step.

Instead, calling a function inside each process that iterates over each variable and resizes it if necessary seems to be a better way (not ideal, I admit). Just another idea: maybe a prepare_step stage could help here.

I'll keep on digging...

benbovy commented 3 years ago

Then I would need to replace all self.var with xsimlab.getattr_as_dataarray(self, 'var', align=True)

For more convenience, I also suggest in #141 to have an option at the process level, e.g., @xs.process(getattr_as_dataarray=True) so that you could just use self.var inside the process class.

and then convert back to numpy in each process's run_step.

Note that most numpy functions should now work with xarray objects, which are NEP-18 compliant.
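
For instance, passing a DataArray directly to numpy functions keeps it an xarray object:

import numpy as np
import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0], dims='X')

np.mean(da)  # returns a 0-d xarray.DataArray, not a plain numpy scalar
np.sin(da)   # ufuncs also return a DataArray, preserving dims and coords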

Maybe a prepare_step stage could help here

I think that you would still need to manually resize each variable in each process. If I understand correctly, what you'd like to do is to perform some operation (resize) on a set of variables (model-wise) that meet some condition (sharing the same dimension), at a given "time" in the workflow (after the index variable has been updated).

Runtime hooks currently allow doing such a thing, but I wouldn't rely on that too much; those hooks are meant more for simulation monitoring and are not intended to update the state variables in the model (it might not be consistent with the intent given for those variables, and thus might break the simulation workflow).

benbovy commented 3 years ago

Variable groups would have been helpful for, e.g., resizing all the variables of the same group (you can assign a group name to all variables with dimension X) in one process class, e.g., ResizeAllXVars. However, xs.group only supports intent='in', while you would need intent='inout'.

This restriction is needed so that xarray-simlab can automatically sort the processes in a model. If we allow intent='inout', it would be much more difficult to implement a logic that ensures that the ResizeAllXVars process is executed before all processes where those X vars are defined.

We could allow user-defined process ordering in xarray-simlab and relax this restriction, but that would require some work.
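
For reference, the pattern being ruled out here would look roughly like this (a sketch; note that the group keyword is groups in recent releases and group in older ones):

import xsimlab as xs


@xs.process
class Grid:
    x = xs.index(dims='X')


@xs.process
class SomeProcess:
    # tag every X-dimensioned variable with a common group name
    value = xs.variable(dims='X', intent='out', groups='x_vars')


@xs.process
class ResizeAllXVars:
    x = xs.foreign(Grid, 'x')
    # xs.group only supports intent='in': the collected values can be read,
    # but updating them from here is exactly what is not supported today
    x_vars = xs.group('x_vars')

    def run_step(self):
        for arr in self.x_vars:
            ...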

benbovy commented 3 years ago

To save some code duplication, you could use process class inheritance, e.g.,

import numpy as np
import xsimlab as xs


@xs.process
class Foo:
    idx = xs.index(dims='X')

    def initialize(self):
        self.idx = np.arange(3)

    def run_step(self):
        # update idx (which may get resized)
        self.idx = np.append(self.idx, self.idx.size)


@xs.process
class ResizeX:
    idx = xs.foreign(Foo, 'idx')

    def _resize(self, arrays):
        # resize each array in place to match the current index length
        for arr in arrays:
            if arr.size != self.idx.size:
                arr.resize(self.idx.size, refcheck=False)


@xs.process
class Bar(ResizeX):
    var = xs.variable(dims='X', intent='out')

    def initialize(self):
        self.var = np.zeros(self.idx.size)

    def run_step(self):
        self._resize([self.var])

        # do other things

That is still not ideal, though.

jvail commented 3 years ago

To save some code duplication, you could use process class inheritance, e.g.,

Yes, that is what I had in mind to try as a first solution. Maybe I'd name that base class function prepare_step and then iterate over all vars, check their dimensions against the index, and resize if necessary.

P.S.: Of course, for performance it would be nicer to only do this if I knew the index had changed (in our model it will only change once in a while).

Note that most numpy functions should now work with xarray objects, which are NEP-18 compliant.

Ah! Thanks for this hint. That certainly helps.

benbovy commented 3 years ago

Of course, for performance it would be nicer to only do this if I knew the index had changed (in our model it will only change once in a while).

You could do something like:

import xsimlab as xs


@xs.process
class Foo:
    idx = xs.index(dims='X')
    idx_resized = xs.any_object()

    def run_step(self):
        old_size = self.idx.size

        # maybe update idx...
        ...

        # anything that checks if the index has changed
        self.idx_resized = self.idx.size != old_size


@xs.process
class ResizeX:
    idx = xs.foreign(Foo, 'idx')
    idx_resized = xs.foreign(Foo, 'idx_resized')

    def _resize(self, arrays):
        if not self.idx_resized:
            return
        # resize each array in place to match the new index length
        for arr in arrays:
            arr.resize(self.idx.size, refcheck=False)

jvail commented 3 years ago

After some experiments we ended up with an awful hack. But all solutions seem to involve some sort of semi-ideal workaround.

I replaced the set_state function of the ModelBuilder in order to "inject" our own State class. This is just a dictionary with a new setter. When an index variable changes, it goes through all other variables and aligns their shapes to match the index's shape/length.
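
Very roughly, the injected State class looks something like this (a simplified sketch with hypothetical bookkeeping; the real implementation needs to know which keys are index variables and which variables use the 'X' dimension):

import numpy as np


class State(dict):
    """A state dict that realigns X-dimensioned arrays when an index is set."""

    def __init__(self, index_keys, x_var_keys):
        super().__init__()
        # keys are (process_name, variable_name) tuples, like simlab's state keys
        self._index_keys = set(index_keys)
        self._x_var_keys = set(x_var_keys)

    def __setitem__(self, key, value):
        super().__setitem__(key, value)

        if key in self._index_keys:
            new_size = np.asarray(value).size
            for var_key in self._x_var_keys:
                arr = self.get(var_key)
                if arr is not None and arr.size != new_size:
                    # np.resize truncates or repeats entries to the new length
                    super().__setitem__(var_key, np.resize(arr, new_size))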

We just needed something more powerful than a process class, and I did not like the idea of importing the index everywhere. I did a few experiments with introducing some sort of "meta-process" that is not involved in the actual model logic. The idea was to have something powerful enough to do some administrative work on the model: e.g. watch variables, maybe do more complex validation, read input files and pass them on to the initialization functions of processes, manipulate the state, these sorts of things. But I did not really get far and soon found myself hacking around too much in the simlab core.

However, it would be nice if (in some distant future) simlab could provide extension points, "slots" to drill your own hole into the implementation and add custom stuff. This is not a feature request, just an idea. It would be a lot of work, I suppose.

Thank you for your support and ideas!

benbovy commented 3 years ago

Thank you for your support and ideas!

You're welcome.

Feel free to open issues if you have suggestions on how we can refactor xarray-simlab to make it more flexible. I'm sure there are ways to do so without introducing many new concepts.