Open jvail opened 3 years ago
Could you provide a concrete example where such quasi-cyclic situation happens?
I wonder if you could refactor your problem into 3 processes A
, B
, C
, or if user-defined process ordering in a Model
(thus relaxing DAG constraints) would help here.
I think that foreign vars with something like is_cyclic=True
would open a big can of worms. It is quite confusing and would complicate further xarray-simlab's internal logic for process sorting (it is already complex).
I think that foreign vars with something like
is_cyclic=True
would open a big can of worms. It is quite confusing and would complicate further xarray-simlab's internal logic for process sorting (it is already complex).
I don't know the internals but I thought - maybe naively - you could probably just ignore the variables with is_cyclic=True
in the process ordering.
Could you provide a concrete example where such quasi-cyclic situation happens?
Fair enough
A - A process for a tree (here leaf) architecture - in practice all sorts of properties B - Photosynthesis C - Carbon allocation
A.leaf_area (out) --> B B.assimilates (out) --> C C.leaf_dry_matter (out) - - > A
In order to compute A.leaf_area
A needs C.leaf_dry_matter
from t-1 (or the initial value at t = 0). If there is a way to model that in xarray-simlab without cycle then I apologize for not having read the documentation more carefully or missing the obvious.
Of course we could just merge it all together in one process but that is where we come from and we would like to split it up in modules/processes.
Would it be an option to declare C.leaf_dry_matter
with intent='inout'
and with some default value (i.e., some placeholder for the value at t = -1)? This way you could avoid the C -> A
dependency thus remove cyclic dependencies.
Ok, a fat apology
I forgot to mention that circular imports where in my way when I first tried to get it working ;) This time I tried a combination of xs.group in A (avoid explicit linking and circular import of C) and an 'inout' in C with the same group.
So, moving everything into groups plus 'inout' is a solution but I loose a bit on the side of your nice model visualization/inspection features since I obscure where the variable comes from. But that's worth it!
Thank you for your outstanding support!
Ok, a fat apology
No worries at all!
Thank you for your outstanding support!
You're welcome!
Sorry, need to bring this up again. I am still thinking about another way to get this "cyclic" dependencies working (this is indeed one of our three pain points apart from sparse and growing index variables):
I sketched a little prove of concept. Here is a notebook: https://github.com/jvail/xarray-simlab/blob/seemingly-cyclic/notebooks/seemingly_cyclic.ipynb
I am not entirely happy with the current work-around:
Declare an 'inout' rather than 'out' variable if you want to let previously run processes to use it) because it requires to deliver an input for each of them although initializing them in the initialize func would be sufficient; i.e. I can not just use intent 'out'.
Also, as far as I can see from the code where you compute the dependencies that you are filtering for 'out' variables to determine process deps and sorting. So simply allowing a "pseudo cyclic" variable with intent 'out' wont work because process ordering will be ambiguous. Therefore, I think, another property is required (for the sake of clarity I called it "seemingly_cyclic") to explicitly exclude this dependency from being evaluated.
It seems clear from my tests that if there is such a dependency the producer must declare a global name (or group - that's a bit confusing I have to say - a group is always global I guess) and the consumer use a global ref (group) in any case to avoid cycles at module import level. Therefore I added the property "seemingly_cyclic" only to "global_ref".
I also added a dashed arrow to the dot graph to visualize this dependency. I can not see why this might open a "pandora's box" and introduce chaos - but I may - most certainly - lack knowledge about the mechanics of your lib.
P.S.: I will try to make a suggestion for #163 as well but that one is really tricky - bear with me.
I kept playing a bit with this idea and tested if we could use it with xs.foreign as well. If we split it into base classes that implement the different aspects of a process I can get around the cyclic import issue and we could arrive at a pretty "plugable", modular model design (in our use case): i.e. just replace a process with some different math as long as it provides the same output.
The advantage of xs.foreign is that the dependency is explicit - "global_ref" hides it. Makes the model more readable, I think.
Thanks for sharing your notebooks, it gives a clear overview of the problem.
I admit that the 'inout' variable with a default value workaround is not ideal in the case where you just would use intialize
. That said, conceptually it makes some sense IMO: in your example, I see b_v
as a variable that pre-exists with some state before the simulation. If you need to set more advanced values as default, you could use a factory (although I'm not sure that the takes_self=True
option would work out of the box in xarray-simlab).
TBH, I'm reluctant to implement "pseudo cyclic" variables in xarray-simlab for the main reason that it breaks a fundamental concept of this framework, i.e., that the process workflow may be represented as DAG. Allowing one exception like this is opening the door to more exceptions, which may end up with something conceptually very hard to understand. It also adds a parameter to xs.variable
, xs.global_ref
, etc., which have already (too) many options. In your notebooks, you need to rely on either global_ref
or foreign
with some base process classes, which both IMO are not simpler workarounds than 'inout' variables with a default value.
TBH, I'm reluctant to implement "pseudo cyclic" variables in xarray-simlab
Sure, understood and accepted - wont bring it up again - however :) , out of curiosity...
for the main reason that it breaks a fundamental concept of this framework
I don't understand why it would break it because you have it already (global/group + 'inout'). I think such a change would just bring this option to the surface and improve it because an input is no longer required. Is there a fundamental difference between input_vars values and values set in initialize? Isn't it both just the state at step -1?
It also adds a parameter to
xs.variable
,xs.global_ref
, etc., which have already (too) many options
No, only to xs.global_ref
and xs.foreign
. Well, I could imagine to have an xs.cyclic(intent='in') instead ... But I admit, there are already quite a lot of different variables and options which could become even more confusing.
I don't understand why it would break it because you have it already (global/group + 'inout'). I think such a change would just bring this option to the surface and improve it because an input is no longer required. Is there a fundamental difference between input_vars values and values set in initialize? Isn't it both just the state at step -1?
Ah I see, you're right. Probably it's not clear enough that the value of a 'inout' variable declared in a process should be updated in a different simulation stage in that case (that was clear only in my mind so far :-) ).
An 'inout' variable
declared in a process A
and that is accessed by a 'in' group
or global_ref
declared in process B
would not be affected by process ordering (i.e., either A -> B
or B -> A
is correct) as long as the value of this variable is not updated in A
and read in B
during the same simulation stage.
This implicit condition is a bit unfortunate, actually, and I don't see how we could properly inspect the process classes to check this condition when building a new model :-/.
Your suggestion of a "seemingly_cyclic" reference at least makes the intention a bit clearer, but it doesn't solve this implicit condition which still applies here. I'm not very comfortable with adding another way to allow this.
I'm not very comfortable with adding another way to allow this.
D'accord. Fair enough.
Ah I see, you're right. Probably it's not clear enough that the value of a 'inout' variable declared in a process should be updated in a different simulation stage in that case (that was clear only in my mind so far :-) ).
An 'inout'
variable
declared in a processA
and that is accessed by a 'in'group
orglobal_ref
declared in processB
would not be affected by process ordering (i.e., eitherA -> B
orB -> A
is correct) as long as the value of this variable is not updated inA
and read inB
during the same simulation stage.
Wow - that's subtle. A bit hard to follow, I find. I think you could put a bit more faith in the modeler who is supposed to know what he is doing. Nevertheless - I regard this as settled and remain quite now and hope you keep up your excellent work! :)
Nevertheless - I regard this as settled and remain quite now and hope you keep up your excellent work! :)
Thanks! Please continue to give helpful feedback, it's very much appreciated! In this case it made me realize that we can actually break the DAG in the current version!
Please continue to give helpful feedback, it's very much appreciated!
:) All right. Though no guarantees on the "helpfulness". Why not think about the DAG this way:
t1 A->B->C
^
\
\
t0 A->B->C
Taking the time dimension into consideration. If you then go from C to A the cycle disappears. It is just offset in time. The interpretation of the DAG might be too "flat", static. This is what actually happens with 'inout' - it is just not transparent in the visualization.
I remain a bit confused by this, I think there should be some clear documentation as to what is the best way to have this quasi-cyclic behaviour.
Like: what is the correct way to have a variable that is updated on every timestep? In fastscape, this is done by refactoring : Surf2erode is the surface that all processes get their data from. However, with multiple different processes that need to be updated (e.g. have this cyclic behaviour), this greatly increases the number of processes, just to implement updating a variable. but the most simple way to work around this is basically the hack of changing some variable in the cycle from 'out' to 'inout'? and then make sure that that process:
However, AFAIK this is not checked (and indeed I could assign values in the run_step
in a cyclic case). One way would be to base the DAG on variables in the run_step, but this becomes really hairy (how do we know which variables are called in that step?)
Maybe this should be for the model designer, and proper documentation is just what is needed?
However, with multiple different processes that need to be updated (e.g. have this cyclic behaviour), this greatly increases the number of processes, just to implement updating a variable.
I agree this is a major limitation in xarray-simlab. I currently use other kinds of hacks in fastscape to circumvent this because computing all uplift and/or erosion processes from the topography at time t may not be the best / most stable solution compared to the chained application of those processes, one after another, each updating the topography. Also, I've never been 100% happy with how output snapshots are saved between run_step
and finalize_step
. It might be simpler to save them just after run_step
is executed for their corresponding process and maybe get rid of finalize_step
.
I think that the cleanest way to overcome this limitation would be to allow user-defined process ordering and relax some constraints on variable intent, e.g., allow 'inout' in multiple processes for the same variable.
This would be highly welcome, but this would require a big refactoring and design effort.
An open question is also how can we support both automated (any DAG) and user-defined (single-chain DAG) process ordering? Combining both approaches in the same model would be challenging, but I still like the possibility to just add a process to a model and let xarray-simlab figure out where it could be inserted in the DAG. That's very convenient when we want to extend a model "at the edges" (e.g., for fastscape adding processes that simulate limit conditions such as climate or tectonics).
I agree this is a major limitation in xarray-simlab. I currently use other kinds of hacks in fastscape to circumvent this because computing all uplift and/or erosion processes from the topography at time t may not be the best / most stable solution compared to the chained application of those processes, one after another, each updating the topography.
Aah, now that explains all the on_demand
variables in fastscape?! :)
Also, I've never been 100% happy with how output snapshots are saved between
run_step
andfinalize_step
. It might be simpler to save them just afterrun_step
is executed for their corresponding process and maybe get rid offinalize_step
.
So, the conclusion here is that the finalize_step
is actually not really necessary, since the order of calculation is determined from the DAG anyways? Or it is necessary because of parallel execution? and therefore, the inout
variable will be calculated last, and it's value, when used will be from the beginning of the timestep.
I think that the cleanest way to overcome this limitation would be to allow user-defined process ordering and relax some constraints on variable intent, e.g., allow 'inout' in multiple processes for the same variable.
This would be highly welcome, but this would require a big refactoring and design effort.
Oof, that indeed sounds like a lot of work... Honestly, I think this limitation is already overcome with proper use of the inout
intent right? Since that is currently the only place where the cycle is 'broken'. Having multiple 'inout' variables would also make the model very complex for the model maintainer I presume? Maybe, just allowing the intent
of variables to be changed after process declaration can accomplish this? But then again, this requires proper documentation, because it is far from trivial from a user perspective.
An open question is also how can we support both automated (any DAG) and user-defined (single-chain DAG) process ordering? Combining both approaches in the same model would be challenging, but I still like the possibility to just add a process to a model and let xarray-simlab figure out where it could be inserted in the DAG. That's very convenient when we want to extend a model "at the edges" (e.g., for fastscape adding processes that simulate limit conditions such as climate or tectonics).
So this could also be accomplished by allowing changing intent of xs.variables
, or allowing changing the location of the 'inout' intent? something like:
xs.change_inout_to_process(before_process, after_process)
That then puts the after_process
last in the calculation chain.
hmm... on second thought, this may become more complicated. Also, I'm not sure if I completely understand what you want to accopmlish.. from looking at fastscape, it seems that you want to be able to change whether tectonics and erosion are done concurrently or simultaneously, and for that change the ordering in the graphs. This should be a (good) other issue I think?
On another note, I do really like what @jvail did with the dotted arrows. It may be nice to have references to an inout
variable as input to another process indicated with a dotted arrow? possibly with an option for that, or as default when the show_only_variable
is active?
I'm not sure if I completely understand what you want to accomplish
Let me show a simple example:
@xs.process
class ProcessA:
foo = xs.variable(intent='inout')
@xs.process
class ProcessB:
foo = xs.foreign(ProcessA, 'foo', intent='in')
Currently it is possible to do:
model = xs.Model({'a': ProcessA, 'b': ProcessB})
Let's say that both a
and b
implement run_step
and that we want to make sure that b.foo
returns the value before it is updated in a
. Unfortunately, we can't (*). A current workaround is to update the foo variable in a.finalize_step()
that is run after b.run_step()
.
What we may also want to do:
@xs.process
class ProcessC:
foo = xs.foreign(ProcessA, 'foo', intent='inout')
But this is not possible either: xs.foreign
doesn't support intent='inout'
(for the obvious reason that no process order can be determined from the variable intents unless it this order is explicitly defined by the user).
This could look like:
model = xs.Model(
{'a': ProcessA, 'b': ProcessB, 'c': ProcessC},
dependencies=[('a', 'b'), ('c', 'a')]
)
Where the dependencies
argument here means that we explicitly want a
be executed after b
and c
be executed after a
(i.e., for each tuple: 1st item is the name of a process that depends on the process named by the 2nd item)
When we know the order of all processes in the model (here b -> a -> c
), instead of specifying the whole chain of process dependencies as a list of 2-items tuples we could have the following options for more convenience (all these options have the same semantics than the code snippet here above):
processes
argument:model = xs.Model([('b', ProcessB), ('a', ProcessA), ('c', ProcessC)])
model = xs.Model.from_ordered_collection({'b': ProcessB, 'a': ProcessA, 'c': ProcessC})
xs.Model
(if we want to always have user-defined process order):model = xs.SequentialModel({'b': ProcessB, 'a': ProcessA, 'c': ProcessC})
processes
, but it's more an implementation detail so I'm not sure it's a good idea to rely on it. I'd prefer a more explicit API like:model = xs.Model({'b': ProcessB, 'a': ProcessA, 'c': ProcessC}, dependencies='strict')
With user-defined dependencies, we could allow intent='inout'
for foreign variables, and therefore update a variable successively in multiple processes during the same simulation stage. This would be a major improvement!
Unless one needs to clear some resources a each time step, we could get rid of finalize_step
. Time-dependent output snapshots of a variable would then be saved just after the execution of run_step
of its corresponding process. If we want to save the values of the same variable that is updated successively in different processes, we could just do:
xs.create_setup(
model=model,
clocks={'clock': ...},
input_vars={'a__foo': ...},
output_vars={'b__foo': 'clock', 'a__foo': 'clock', 'c__foo': 'clock'}
)
There's some potential challenges, though:
intent='inout'
is used in multiple processes (with any intent) and when no explicit dependencies is given for such processesmodel.update_processes()
and model.drop_processes()
Aah, now that explains all the on_demand variables in fastscape?! :)
Not exactly, on_demand
variables are more for some model diagnostics (variables that are not useful in other processes but useful for post-processing, visualization, etc).
A better illustration is the SurfaceToErode process and the SurfaceAfterTectonics process that inherits form it. This creates additional arrays on the grid that could be avoided with user-defined process dependencies.
Great to see the discussion ongoing here. Just some observations while we are porting our model to simlab:
intent='out'
variable to the all-inout-var-process
in order to force ordering (I needed it to run before another process that was consuming one of the 'inout' vars).With user-defined dependencies, we could allow
intent='inout'
for foreign variables, and therefore update a variable successively in multiple processes during the same simulation stage. This would be a major improvement!
Not so sure about this. You could maybe create a new xs.shared
variable. It seems very confusing that a xs.foreign
can be changed in a consuming process. It might also be difficult to debug since it it can not be deduced anymore how the variable changed in each step because you might have several sub-steps now in a single model step. I think it is a good feature that xs.foreign
are read-only outside the process.
So I added an option show_inout_arrows
in dot.py, that adds dotted arrows as in this example notebook (sorry it is not as concise).
Now it shows arrows by default, only in the most basic view. I also enhanced the inout
look a bit for input variables.
this becaome a bit too long....
TL;DR
I made a proof-of-concept PR, that is exhibited in the example notebook.
Actually, both cases are basically the same, if an inout
variable is declared, it does not have automatic dependencies. Therefore, the user can add them. But to still allow for automatic dependency sorting, we use a dictionary such as:
{'process_name__var_name':'dependent_process_name'}
for determining dependencies. Then, only the dependency for that variable is removed (so other processes can be dependencies as well). Strictly speaking, this is not really necessary, and this removing can actually made redundant when graph reduction is implemented #120
What now still can be implemented is the linear model, or a quicker syntax as @benbovy mentioned, that is then converted to the custom_dependencies
dict format.
What we may also want to do:
@xs.process class ProcessC: foo = xs.foreign(ProcessA, 'foo', intent='inout')
But this is not possible either:
xs.foreign
doesn't supportintent='inout'
(for the obvious reason that no process order can be determined from the variable intents unless it this order is explicitly defined by the user).
This is implemented in #177, but as of now, it does not check if xs.foreign(intent='inout')
variables have a dependency. This may need some more work.
* we need to return meaningful error messages when a variable with `intent='inout'` is used in multiple processes (with any intent) and when no explicit dependencies is given for such processes
Actually, this is even more subtle:
a normal variable with intent='inout'
does not cause significant problems (assuming that it's calculation is last, except for explicit dependent variables).
However, for xs.foreign(intent='inout')
variables, it all becomes quite complex. For example in the following configuration:
bad:
in
/
out->inout1->in
\
inout2->in
both inout1
and inout2
mutate the same variable, with unknown order of calculation. So we would have to ensure that all inout variables are chained linearly Also, AFAIK, the order of calculation in these cases is not known:
good:
out->inout1->inout2
\ \ \
in1 in2 in3
apart from implementation, how do we ensure even in the good
case that in1
is used before inout1
(or even inout2
) does its calculation and then mutates the variable? Because the DAG only ensures that processes are calculated after proper execution of the variables.
This would need some kind of lock in execution of xs.foreign(intent='inout')
variables, or a DAG that becomes as complex as:
better:
out->inout1->inout2
\ ^ \ ^ \
\ / \ / \
in1 in2 in3
where an inout
variable depends on all input variables before that.
I think this really complicates things, that should be solved in another PR possibly?
* we need to check that user-defined dependencies are consistent with the DAG defined from the intent of the variables defined in the model
Yes, in the current implementation, where the user specifies dependencies as a {'p_name__var_name':'dep_p_name'}
dictionary. Now I'm stuck at getting the variable of dep_p
from the key, or p_name__var_name
* it might not be straightforward to update the explicit dependencies when creating new models from existing ones using `model.update_processes()` and `model.drop_processes()`
I let the custom_dependencies
be copied over (that should work). However, maybe write a function that filters for processes that are not in the model anymore...
if an
inout
variable is declared, it does not have automatic dependencies. Therefore, the user can add them.
Yes, exactly! The user must add them, except in the two following cases where this isn't required:
inout
variable has no other reference in the modelinout
variable (resp. foreign) declared in process A
has one reference (resp. is the only inout
reference to an) out
variable declared in process B
. In this case I think it's ok to consider that out
takes precedence over inout
, i.e., B
must be executed before A
.But to still allow for automatic dependency sorting, we use a dictionary such as:
{'process_name__var_name':'dependent_process_name'}
for determining dependencies.
I like the idea of using a dictionary for user-defined dependencies. We could even allow lists of process names as dict values.
However I don't understand why we need the variable name in dict keys. Could you expand on why you skip those variables when retrieving the process dependencies in #177 please?
I let the custom_dependencies be copied over (that should work). However, maybe write a function that filters for processes that are not in the model anymore...
I'm afraid this won't be enough for all cases. Suppose we have a model
with a -> b -> c
custom order and we want to create new_model = model.drop_processes('b')
, then we need some logic so that new_model
is built with the custom dependency ('c', 'a')
that may not be explicitly stated in model
.
You could maybe create a new xs.shared variable. It seems very confusing that a xs.foreign can be changed in a consuming process.
Hmm how xs.shared
would be different than xs.foreign
? It seems to me that there wouldn't be much difference to justify having two kinds of variables. xs.shared
looks a bit confusing to me while xs.foreign
makes it clear that the variable is declared in another process (declaration != value).
It might also be difficult to debug since it it can not be deduced anymore how the variable changed in each step because you might have several sub-steps now in a single model step.
This should not be harder to debug if we change how output variable snapshots are saved, like I suggest just before and/or after the execution of every process during the run_step
stage and not anymore once between the run_step
and finalize_step
stages, so that we can save the values for each "sub-step" (i.e. process).
However I don't understand why we need the variable name in dict keys. Could you expand on why you skip those variables when retrieving the process dependencies in #177 please?
We need the variable name, since a process can depend on multiple other processes (with different variables). The variable name is used to create a dictionary of skip_deps = {'p_name',key}
that is then checked (and skipped if present) during normal dependency building.
e.g.:
xs.model({'A':A,'B':B,'C':C,'D',D},
custom_dependencies={'C__b_var':'B'})
A->B->C
/
D
where we find that b_var
is an inout
var in B
, so we have to add it manually. However, normally during normal dependency building, A->C
would be there, so we have to skip that dependency using skip_deps
.
Now, we still use normal dependency building, so C
is still added automatically.
I preferred this method over a {p_name:dep_p_name}
dictionary, because it also allows the dependency building that is already there.
However, normally during normal dependency building, A->C would be there, so we have to skip that dependency using skip_deps.
I don't think it's necessary to skip it. The topological sorting algorithm will yield consistent process ordering with or without A->C
. Also, I think it's better to keep both user-defined and automatically inferred dependencies instead of doing any graph "optimization" here, even though it results in more edges (dependencies).
The checking algortihm works now! (I added a transitive reduce, since it was easy) Both require a descendants/deps dict to work. That would also be very useful in drawing the dotted arrows. @benbovy what are your thoughts on adding that to the model class? EDIT: I have kept it in the _ModelBuilder
for now, since that seems to be the design pattern.
* the `inout` variable has no other reference in the model * the `inout` variable (resp. foreign) declared in process `A` has one reference (resp. is the only `inout` reference to an) `out` variable declared in process `B`. In this case I think it's ok to consider that `out` takes precedence over `inout`, i.e., `B` must be executed before `A`.
in fact, we never have to check for out
variables, since they always come first. So we have two cases: inout
variables with or without corresponding in
variables.
I like the idea of using a dictionary for user-defined dependencies. We could even allow lists of process names as dict values.
done that, maybe a set is better? it needs one conversion less here, but maybe users are not as familiar with sets?
I'm afraid this won't be enough for all cases. Suppose we have a
model
witha -> b -> c
custom order and we want to createnew_model = model.drop_processes('b')
, then we need some logic so thatnew_model
is built with the custom dependency('c', 'a')
that may not be explicitly stated inmodel
.
added that
Great! I commented on #177.
done that, maybe a set is better? it needs one conversion less here, but maybe users are not as familiar with sets?
Yep internally coercing into a set a sequence (tuple, list, etc) that passed through the API seems good to me.
Sorry to open up a new 'battleground' - the last thing that haunts me for now:
What do you, @benbovy, think about the following idea, use case:
Let's say we have two processes
A
andB
.A
has a varx
that is a foreign variable inB
andB
has a variabley
that is a foreign variable inA
. Wont work because you can not derive the DAG in that case.What if I annotate the foreign variable
B.y
inA
with something likeis_cyclic=True
. That would mean:A.y
, when building the process graph, the orderingB -> y -> A
relation in the visualization (maybe with astep - 1
label, dotted line...)NaN
in step = 0 if y is notinout
in B or the value set inB.initialize
It just happens that we have a lot of these "cycles" but typically it is just the value from step
time - 1.
So only quasi-cyclic.