xarray-contrib / xarray-simlab

Xarray extension and framework for computer model simulations
http://xarray-simlab.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Simple xarray-simlab code fails to set attrs attributes #51

Closed luca-penasa closed 6 years ago

luca-penasa commented 6 years ago

trying to run the following code:

import xsimlab as xs
import numpy as np

@xs.process
class TestProcess(object):
    spacing = xs.variable(description='grid spacing', default=2)

mod = xs.Model({"proc1": TestProcess})

fails with the following error from the attrs-generated `__init__`:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-600a28a8e410> in <module>()
      6     spacing = xs.variable(description='grid spacing', default=2)
      7 
----> 8 mod = xs.Model({"proc1": TestProcess})

~/.local/lib/python3.7/site-packages/xsimlab/model.py in __init__(self, processes)
    357             ensure_process_decorated(cls)
    358 
--> 359         builder = _ModelBuilder(processes)
    360 
    361         builder.bind_processes(self)

~/.local/lib/python3.7/site-packages/xsimlab/model.py in __init__(self, processes_cls)
     42     def __init__(self, processes_cls):
     43         self._processes_cls = processes_cls
---> 44         self._processes_obj = {k: cls() for k, cls in processes_cls.items()}
     45 
     46         self._reverse_lookup = {cls: k for k, cls in processes_cls.items()}

~/.local/lib/python3.7/site-packages/xsimlab/model.py in <dictcomp>(.0)
     42     def __init__(self, processes_cls):
     43         self._processes_cls = processes_cls
---> 44         self._processes_obj = {k: cls() for k, cls in processes_cls.items()}
     45 
     46         self._reverse_lookup = {cls: k for k, cls in processes_cls.items()}

<attrs generated init c3bfe1a9beca995748daa500465cd5be1c8546d3> in __init__(self)
      1 def __init__(self, ):
----> 2     self.spacing = attr_dict['spacing'].default
      3     self.__attrs_post_init__()

AttributeError: can't set attribute
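The conflict shown in the traceback can be sketched without xarray-simlab or attrs at all: the generated `__init__` assigns the default value with a plain attribute assignment, but by then the variable has been replaced with a property that has no setter. A minimal stdlib-only reproduction (the class and store names are illustrative, not xsimlab's actual code):

```python
class FakeProcess:
    # Stand-in for the read-only property created for each variable:
    # there is no fset, so any assignment to it raises AttributeError.
    spacing = property(fget=lambda self: self.__xsimlab_store__['spacing'])

    def __init__(self):
        # What the attrs-generated __init__ effectively does with a default:
        self.spacing = 2

try:
    FakeProcess()
except AttributeError as exc:
    print("AttributeError:", exc)  # same failure mode as in the traceback above
```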

I tested various tagged versions of both xarray-simlab and attrs. Any idea what might be wrong?

BTW, I am working on setting up a large-scale simulation, so I was considering using this package for initializing and keeping track of all the parameters, although the actual modelling in the time domain will be executed by external code. Thus I'd like to use xarray-simlab mostly for the smart variable mechanism and to create the experimental setup in discrete steps. Is there any problem I could run into by only calling the model.initialize() method, or e.g. not passing clocks to the builder?

Thank you very much, this seems a very useful project. I hope I can contribute something. luca p

luca-penasa commented 6 years ago

Brief update after some research, maybe it helps. It seems the following code:

import xsimlab as xs
from xsimlab.variable import _as_dim_tuple, VarIntent, VarType
import attr

def variable(dims=(), intent='in', group=None, default=attr.NOTHING,
             validator=None, description='', attrs=None):

    metadata = {'var_type': VarType.VARIABLE,
                'dims': _as_dim_tuple(dims),
                'intent': VarIntent(intent),
                'group': group,
                'attrs': attrs or {},
                'description': description}

    return attr.attrib(metadata=metadata, default=default, validator=validator,
                       init=False, cmp=False, repr=False)

@xs.process
class test(object):
    spacing = variable(description='grid spacing', default=2)

t = test()

does not work, but this does (only a key of the metadata dict is changed):

import xsimlab as xs
from xsimlab.variable import _as_dim_tuple, VarIntent, VarType
import attr

def variable(dims=(), intent='in', group=None, default=attr.NOTHING,
             validator=None, description='', attrs=None):

    metadata = {'var_type2': VarType.VARIABLE,
                'dims': _as_dim_tuple(dims),
                'intent': VarIntent(intent),
                'group': group,
                'attrs': attrs or {},
                'description': description}

    return attr.attrib(metadata=metadata, default=default, validator=validator,
                       init=False, cmp=False, repr=False)

@xs.process
class test(object):
    spacing = variable(description='grid spacing', default=2)

t = test()
luca-penasa commented 6 years ago

My last comment was not very useful, but I've been trying to investigate the issue further. Here is what I discovered (btw I am on the current git version); I'd really appreciate some input from the devs:

property(fget=get_from_store, fset=put_in_store, doc=var_doc)

and thus those attributes need to exist before setting and getting the values. But when put_in_store is called (e.g. to set the default value), self.__xsimlab_store__ and self.__xsimlab_store_keys__ do not exist yet.

I'd really appreciate some input from the developers... thanks luca

luca-penasa commented 6 years ago

After a lot of trial and error I ended up with these considerations:

benbovy commented 6 years ago

Many thanks for this detailed report @luca-penasa ! And sorry for the delay.

default values support in variables is broken. in fact there is no test using a default value

Actually, support for default values is not yet fully implemented! I admit it wasn't a good idea to already add the default argument in xs.variable()! I didn't expect that this would cause issues like this, sorry! For now we should really leave default=attr.NOTHING.

I would suggest to move the default value to the metadata associated to the variable

Do you mean something like xs.variable(attrs={'default': 2})? The attrs argument is rather meant for holding user custom metadata. If we support default values, they should still be set using a reserved keyword argument IMO.

Or maybe you mean moving the default value to the metadata attribute of the underlying attr.attrib? Like this:

def variable(dims=(), intent='in', group=None, default=attr.NOTHING,
             validator=None, description='', attrs=None):

    metadata = {'default_value': default, ...}

    return attr.attrib(metadata=metadata, validator=validator,
                       init=False, cmp=False, repr=False)

Unless there is a way to handle attr.attrib(default) so that we can avoid the issue here, I think the solution above is the best one.

xarray-simlab uses attrs in a rather uncommon way. For example, xs.process decorated classes turn all declared variables into properties, so it is not always possible to just reuse the attrs features related to instance initialization.

But in this case, there is no need to do anything special with the default value in a process class __init__. Process classes are used here just as holders of some useful metadata about their declared parameters/variables. The logic of setting the default value of a variable, e.g., in a xarray.Dataset or in __xsimlab_store__, is implemented elsewhere and executed after a xs.Model object has been created (i.e., after process classes have been instantiated).
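That deferred logic can be sketched in a few lines (a hypothetical helper, not xsimlab's API): defaults are carried in each variable's metadata, as in the `'default_value'` snippet above, and applied to the store only after the model has been built, so the attrs-generated `__init__` never touches the properties:

```python
def apply_defaults(store, variables):
    """Fill missing store entries from 'default_value' metadata.

    variables: mapping of variable name -> metadata dict, mimicking
    the attr.Attribute.metadata sketched above.
    """
    for name, meta in variables.items():
        if name not in store and 'default_value' in meta:
            store[name] = meta['default_value']
    return store

variables = {'spacing': {'default_value': 2},
             'length': {'default_value': 10}}

store = apply_defaults({'length': 5}, variables)  # user already set 'length'
print(store)  # {'length': 5, 'spacing': 2}
```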

benbovy commented 6 years ago

BTW, I am working on setting up a large-scale simulation, so I was considering using this package for initializing and keeping track of all the parameters, although the actual modelling in the time domain will be executed by external code. Thus I'd like to use xarray-simlab mostly for the smart variable mechanism and to create the experimental setup in discrete steps. Is there any problem I could run into by only calling the model.initialize() method, or e.g. not passing clocks to the builder?

I don't see anything that would prevent you from doing that, but I might be wrong. However, I think that you will still need to set a (dummy) clock with create_setup. It is required, but in your case it would just run an empty loop during the simulation.

Actually, I've been thinking a bit about the possibility of setting up custom execution workflows, i.e., not tied to the rigid initialize->clock*(run_step->finalize_step)->finalize. This would require a fair amount of work, though.

this seems a very useful project. Hope I could contribute anything.

This would be greatly appreciated! :-)

luca-penasa commented 6 years ago

Thank you @benbovy. Now I understand the architecture a little better... and I can see it was meant to work like that.

I am actually trying to replicate a similar structure for my project. One of the major deal breakers for me was the impossibility of dealing with "nested" variables: a process defines a new variable in the form of a complex object, and that object might in turn contain attributes that should be treated as variables themselves, i.e., in pseudo-Python code:

@process
class Process1:
    var1 = variable (...) [instance of Var1]

with Var1, something like

@attr.attrs [or process]
class Var1:
    v1 = attr.ib [or variable](...)
    v2 = attr.ib [or variable](...)
    v3 = attr.ib [or variable](...)
    v4 = attr.ib [or variable](...)

In a different process I might need to read or write one of these nested variables. As it is structured now, I think it is only possible to access the whole var1 object defined in the other process, like this:

@process
class Process2:
    var1_imported = foreign variable from Process1(...) [instance of Var1]

In this case I can import var1 within Process2 as a foreign variable. But then I cannot keep a record of which attribute of var1 is actually changed - e.g. v1? or v2? all of them?

It would be more useful to be able to import nested variables, e.g. something like:

@process
class Process2:
    v1_imported = "Process1.var1.v1" as foreign variable from Process1(...) 

depending on how I define the variable I can track whether the variable is read/write etc.

The main problem I see is that the store is currently implemented as a non-nested dict, so I am not sure how I could improve xsimlab to deal with this... Maybe you have thought about this? Any idea?

My interest is basically in keeping track of where a variable is set/updated/read in the model, and by which module, etc., which is the same idea behind xsimlab as far as I understand...

btw nice to see such a good earth-scientist pythonist! thank you very much for the help

best, luca

benbovy commented 6 years ago

Using flat collections of variables and processes was a design choice. Nested structures would greatly complicate the implementation and don't play well with the xarray data model (I initially started this project with the goal of using xarray objects as simulation I/O). For example, how would one define intent for a complex object having multiple attributes (or for a foreign variable linked to that complex object)?

Looking at your example, would it be a problem to have three process classes like below?

@xs.process
class Var1:
    v1 = xs.variable()
    v2 = xs.variable()
    v3 = xs.variable()
    v4 = xs.variable()

@xs.process
class Process1:
    ref_v1 = xs.foreign(Var1, 'v1')
    ref_v2 = xs.foreign(Var1, 'v2')
    ref_v3 = xs.foreign(Var1, 'v3')
    ref_v4 = xs.foreign(Var1, 'v4')

@xs.process
class Process2:
    ref_v1 = xs.foreign(Var1, 'v1')

You can actually create a process class as a "variable bundle", i.e., just for declaring some variables (no methods), like Var1 above. The downside is that if you want to use those variables in other process classes, you need to declare a foreign variable for each of them (like in Process1), which is a bit tedious. #37 solves this, but only partially (it is not possible to track which specific variable is read, and it shouldn't be used to set values).

btw nice to see such a good earth-scientist pythonist!

Thanks :-)

luca-penasa commented 6 years ago

Just to update you: after some thinking, what about using a nested variable scheme? Like this:

@xs.process
class Process1:
    ref_v1 = xs.foreign(Var1, 'v1.v2.v3')

A super-simplified working implementation is here (I am not using a store like in xsimlab): https://gist.github.com/luca-penasa/89e3ad12ca80d597e0cf14a0a0cff06c

but I don't know how well it would fit the current implementation, or how much work it would take to make it work.

Variables imported with this nested scheme could just live where they are... in fact, they would already be in the store, given that the object containing them is. [I also don't know how xsimlab would react to storing objects in variables...]

btw I think the store is good, but it would definitely be better to leave it as an option. Thanks for your help, luca

benbovy commented 6 years ago

What about something like:

@xs.process
class Process1:
    ref_v1 = xs.foreign(Var1, 'v1')
    ref_v2 = xs.foreign(Var1, 'v2')
    ref_v3 = xs.foreign(Var1, 'v3')

Am I missing any real advantage of using a nested scheme over explicit declaration of each attribute (apart from maybe being a little bit more succinct)?

I'm quite reluctant to add support for nested variables for several reasons:

* This would add too much complexity to the (already quite complex) current implementation in many places, which is not worth the benefits IMHO.

* We would need to extend API support for many things, e.g., `xs.foreign(Var1, ('v1', 'v2', 'v3'), intent={'v1': 'in', 'v2': 'in', 'v3': 'out'})`. For `xs.variable`, the same kind of dictionary would be needed for specifying `dims` or, later, default values, validators, etc. In the end the code is not really shorter; it's even less readable.

* I'm slightly concerned by the overhead this would add when getting/setting the values of the variables in process classes through their properties.

* Nested variables don't play well with the xarray data model. The current one-to-one relationship between a `xs.variable` and a `xarray.Variable` is easy to deal with; maintaining both one-to-one and one-to-many relationships in the case of nested variables would be hard. We could argue the same for other kinds of automated I/O interfaces that we could later support, like command-line (options or input file entries) or UI (widgets). Storing complex objects (i.e., other than arrays or primitive types) as `xs.variable` values wouldn't play well with those I/O interfaces either.

Without taking statements from "The Zen of Python" too religiously, I think that the "flat is better than nested" principle applies rather well here.

benbovy commented 6 years ago

btw I think the store is good but it would definitively better to leave that as an option.

Storing values of process variables directly as class attributes would certainly avoid issues like the one you've reported here. However, using properties has another advantage besides supporting a centralized, abstract store for simulation data: it allows some safeguards, e.g., read-only properties for variables declared with intent='in'.
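That safeguard can be sketched in plain Python (illustrative names, not xsimlab's actual implementation): a variable declared with intent='in' gets a property with no setter, so process code can read it from the store but cannot accidentally overwrite it:

```python
def make_readonly(name):
    """Build a read-only property that reads 'name' from the shared store."""
    def fget(self):
        return self.__xsimlab_store__[name]
    return property(fget)  # no fset -> assignment raises AttributeError

class Grid:
    spacing = make_readonly('spacing')  # stand-in for an intent='in' variable

    def __init__(self, store):
        self.__xsimlab_store__ = store

g = Grid({'spacing': 2})
print(g.spacing)  # 2
try:
    g.spacing = 5  # blocked by the missing setter
except AttributeError:
    print("read-only")
```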

luca-penasa commented 6 years ago

Thank you for your reply @benbovy; btw, mine was just a way of discussing this with somebody who has thought a lot about the problem, since I am facing the issue of implementing something very similar. So I was not suggesting adopting this for xsimlab; I just wanted to know your opinion. Thank you very much for your detailed reply. I agree with the keep-it-simple principle.

Let me describe my problem, in case you are curious why I need to go with something slightly different and cannot reuse xsimlab easily.

What about something like:

@xs.process
class Process1:
    ref_v1 = xs.foreign(Var1, 'v1')
    ref_v2 = xs.foreign(Var1, 'v2')
    ref_v3 = xs.foreign(Var1, 'v3')

Am I missing any real advantage of using a nested scheme over explicit declaration of each attribute (apart from maybe being a little bit more succinct)?

Yes, because in this case Var1 MUST be a process and cannot be a variable. In my case, the instances of Var1 whose members I want to import as "foreign" variables are regular but complex classes that I really don't want to rewrite as processes. This is because they are generic classes I already wrote, and I don't think it would make sense to re-wrap them within "processes". BTW, this would be a common case when one uses external libraries in the modelling process.

In my case I am trying to put together a class-instance builder. The step-like organization is perfect for using processes in this way, although the "time-based" stepping of the simulation is not very useful in my case [it would be cool to have it more "decoupled", imo]. The class instance I am creating is basically a 3D scene that is then used for performing a dynamic simulation, possibly on a cluster [thus using xsimlab as the "time stepper" is maybe not the best choice].

The elements of the scene, or "Props", are analytical representations of 3D objects (spheres, cubes and so on) that are then transformed into "Mesh" objects by a "mesher" class. Each Prop thus has some parameters, and its mesh representation is kept as an attribute of the Prop (I think this makes sense). Each Prop has additional attributes such as the material, the initial velocity, etc., again kept as attributes of the "Prop" class. I think this all makes sense as an object-oriented framework.

I am already using attr.attrs to create slotted attributes for my classes (the Prop class, the Mesh class and so on), so moving towards "variable" would be pretty easy. It would also allow me to document the meaning of the attributes and so on. The only deal breaker here is that I cannot use a unique store for those classes: that would require a lot of rethinking of a codebase that is already up and running, plus problems with testing, etc.

When it comes to the "process" functionality, I think it is very cool, but what it does not allow me to do (apart from the store) is to assign a class instance as a variable. In my case this makes sense because, e.g., the Prop class might be the result of a single process "CreateProps", while the Mesh representation (which must live as an attribute of Props for the code to work) would be the result of a "MeshProps" process. Then I would have other processes like "ComputeInternalMeshing", "AssignInitialVelocities" and so on...

E.g., if each mesh is composed of a list of vertices and indices, I could create many different processes (as many as the number of meshes I need), like "CreateMesh1" and "CreateMesh2", each with the [output] variables "indices" [for the faces] and "vertices" and so on, but that would not be very flexible if I need to create 100 meshes. This also somehow breaks the goodness of using objects in the modelling, which is an important part of most codes out there.

I've been looking for quite some time at packages like param or pypet, and I found them somehow unsatisfying for one reason or another, while xsimlab seems to have all the features that make sense for dealing with scientific parameters within Python objects/models (I am 100% in with your very good idea).

Thus, in my opinion, this is what would make xsimlab a more generic framework that could be applied more easily in more circumstances:

This would render xsimlab more decoupled from the store and the xarray structure, and make it a great "pipelining" tool [I recognize that was not the intent] for replicable scientific computations [which is a big deal now], while still allowing its use in the store/xarray context, or for solving other problems that can be described as "steps" with dependencies. Basically, splitting the "x" from the "sim" and the "lab" into different modules :+1:; the "lab" could be used with or without the "x" and the "sim", depending on what one is doing.

So, long story short: what you did here is great! But it is basically a pipelining framework for science that could be super useful in hundreds of other contexts; it would just require separating the "pipelining" from the "modelling" a bit more. I would keep that in mind, because it could also become a good proposal for enabling scientific replicability and, overall, for describing the "passages" one makes to compute a scientific result, which is of primary importance everywhere. I really liked your take on this problem very, very much :1st_place_medal:.

Furthermore, the dependency resolution would also allow some intelligent "caching" of the results, so that one does not need to recompute processes whose controlling parameters have not changed.
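The caching idea above can be sketched quickly (a hypothetical helper, not part of xsimlab): hash a step's input parameters and rerun the step only when the hash changes:

```python
import hashlib
import json

class CachedStep:
    """Rerun a step function only when its parameters change."""

    def __init__(self, func):
        self.func = func
        self._last_key = None
        self._last_result = None
        self.calls = 0  # how many times func was actually executed

    def run(self, **params):
        # Key the cache on a stable hash of the parameter values.
        key = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()).hexdigest()
        if key != self._last_key:
            self.calls += 1
            self._last_result = self.func(**params)
            self._last_key = key
        return self._last_result

step = CachedStep(lambda spacing: [i * spacing for i in range(3)])
print(step.run(spacing=2))  # [0, 2, 4]  (computed)
print(step.run(spacing=2))  # [0, 2, 4]  (cached, func not called again)
print(step.calls)           # 1
```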

To be clear, these are just ideas and suggestions; I don't want to convince you of any of this or to modify xsimlab in any way, I just really wanted to bring in my case. A more generic framework could attract more users and contributors, although I recognize your goal was very specific, perfectly executed and well reached!

best, luca. PS: are you going to be at EGU next year?

I'm quite reluctant to add support for nested variables for several reasons:

* This would add too much complexity to the (already quite complex) current implementation in many places, which is not worth the benefits IMHO.

* We would need to extend API support for many things, e.g., `xs.foreign(Var1, ('v1', 'v2', 'v3'), intent={'v1': 'in', 'v2': 'in', 'v3': 'out'})`. For `xs.variable`, the same kind of dictionary would be needed for specifying `dims` or, later, default values, validators, etc. In the end the code is not really shorter; it's even less readable.

* I'm slightly concerned by the overhead this would add when getting/setting the values of the variables in process classes through their properties.

* Nested variables don't play well with the xarray data model. The current one-to-one relationship between a `xs.variable` and a `xarray.Variable` is easy to deal with; maintaining both one-to-one and one-to-many relationships in the case of nested variables would be hard. We could argue the same for other kinds of automated I/O interfaces that we could later support, like command-line (options or input file entries) or UI (widgets). Storing complex objects (i.e., other than arrays or primitive types) as `xs.variable` values wouldn't play well with those I/O interfaces either.

Without taking statements from "The Zen of Python" too religiously, I think that the "flat is better than nested" principle applies rather well here.

benbovy commented 6 years ago

First of all, it is always interesting to see which problems people would like to use xarray-simlab for. This is still a young project with much room for improvement, and feedback is very welcome. So thanks for your feedback and for sharing your use case!

I agree that it is currently not straightforward to wrap existing code with xarray-simlab, or at least it is not very clear (and still undocumented) how this could be done.

That said, I think that your use case is not incompatible with xarray-simlab as it currently works. You don't need to refactor your object-oriented code, but you'd need to write some wrapper code in xs.process classes, using composition.

I've uploaded an example similar to what you describe, but very simplified. Hopefully you'll find it helpful:

https://gist.github.com/benbovy/2bb908c745657712e1da213c0a9badd9

As you can see, it's actually possible to declare a xs.variable for complex objects (or object collections) and make references to those objects in other process classes with xs.foreign. You also need to declare some other xs.variable attributes that expose one or more attributes of those objects, because (1) you want to use them as I/O of your processing and/or (2) they determine the workflow (the order of processes). This can all be done without nested variables.
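The composition pattern can be sketched in plain Python, without xarray-simlab (Sphere and MeshSphere are hypothetical stand-ins, not names from the gist or the xsimlab API): the wrapper "process" holds the untouched domain object as one variable and exposes the attributes needed as I/O as separate variables:

```python
class Sphere:
    """Existing domain class, untouched by the framework."""
    def __init__(self, radius):
        self.radius = radius
        self.mesh = None

class MeshSphere:
    """Wrapper 'process': holds the complex object, exposes chosen attributes."""
    def __init__(self, sphere):
        self.sphere = sphere      # complex object held as a single variable
        self.n_vertices = None    # exposed scalar attribute (an output)

    def run_step(self):
        # Toy meshing: vertex count grows with the radius.
        self.sphere.mesh = [(0, 0, i) for i in range(int(self.sphere.radius) * 4)]
        self.n_vertices = len(self.sphere.mesh)

p = MeshSphere(Sphere(radius=2.0))
p.run_step()
print(p.n_vertices)  # 8
```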

Note that the term simulation "store" is a bit misleading here. It is not a container that holds all simulation data; rather, it gathers all references to the data that are useful to share between process classes and/or as I/O. Currently it is a simple dictionary accepting any kind of value. We could later support on-disk or distributed stores, which will still work as long as your objects are serializable.

luca-penasa commented 6 years ago

Super! Thank you very much. I also did not investigate much the use of on_demand variables; I'll have a look at them.

thank you again!