How to represent incremental updates to variables without creating bogus variables?

olyerickson commented 8 years ago

It is common for a script to have a series of blocks that each update the same variable, e.g. a tensor. Currently YW doesn't allow multiple blocks with the same @out variable; we have to fake it with code like...

@begin block_1
@in myTensor
@out myTensor @as myTensor_1
@end block_1

@begin block_2
@in myTensor @as myTensor_1
@out myTensor @as myTensor_2
@end block_2
:

This does not accurately depict what is happening to myTensor. Any ideas?

tmcphillips commented 8 years ago

Yes, this is a common pattern. Currently, the immediate argument to @in or @out is meant to name the actual variable, in the code, that is used to transport a value into or out of the block. Eventually we plan to provide a validation option that confirms, among other things, that a variable referenced in this way actually appears in the corresponding source code block. A variable that does not occur there may indicate an error in the YW annotations, possibly because the name of the variable in the code has been changed since the YW annotations were made.

In contrast, the argument to @as is an optional alias for the variable. It is expected that the alias does not name a variable in the code block, and validation actually should flag cases where it does name a variable because it would be the wrong one.
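
A rough sketch of the two checks described above, assuming a Python host script whose YW annotations live in `#` comment lines. This is not YW's validation code (the thread describes validation as planned, not implemented); the names `strip_comment_lines` and `check_port` are hypothetical.

```python
import re


def strip_comment_lines(block_source):
    """Drop full-line '#' comments so the annotations themselves are not searched."""
    return "\n".join(
        line for line in block_source.splitlines()
        if not line.lstrip().startswith("#")
    )


def check_port(variable, alias, block_source):
    """Return warnings for one @in/@out port of a code block (hypothetical helper)."""
    code = strip_comment_lines(block_source)

    def occurs(name):
        # Whole-word match, so "myTensor" does not match inside "myTensor_1".
        return re.search(rf"\b{re.escape(name)}\b", code) is not None

    warnings = []
    # The argument to @in/@out is expected to name a real variable in the block.
    if not occurs(variable):
        warnings.append(f"variable '{variable}' does not appear in the block")
    # The @as alias is expected NOT to name a variable in the block.
    if alias is not None and alias != variable and occurs(alias):
        warnings.append(f"alias '{alias}' names a variable in the block")
    return warnings
```

For a block whose code is `myTensor = myTensor * 2`, the port `@out myTensor @as myTensor_1` produces no warnings, while a port naming a variable that was since renamed would be flagged.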

The alias is used for matching (i.e. inferring dataflow channels between) @out ports and @in ports if the alias is provided; otherwise the actual variable name is used for this purpose. If present, the alias is also used in the graphical views to label the edges between code blocks in the process view and the data blocks in the data and combined views; otherwise the variable name is displayed.
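
The matching rule above can be sketched as follows. This is only an illustration of "use the alias if given, otherwise the variable name", not YW's actual implementation; the `Port` class and `infer_channels` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Port:
    block: str                   # name of the enclosing @begin ... @end block
    direction: str               # "in" or "out"
    variable: str                # argument to @in / @out
    alias: Optional[str] = None  # argument to @as, if any

    @property
    def channel_name(self):
        # The alias wins when present; otherwise fall back to the variable name.
        return self.alias if self.alias is not None else self.variable


def infer_channels(ports):
    """Pair every @out port with each @in port that shares its channel name."""
    outs = {}
    for port in ports:
        if port.direction == "out":
            outs.setdefault(port.channel_name, []).append(port)

    channels = []
    for port in ports:
        if port.direction == "in":
            for source in outs.get(port.channel_name, []):
                channels.append((source.block, port.channel_name, port.block))
    return channels


# The two blocks from the original question.
ports = [
    Port("block_1", "in", "myTensor"),
    Port("block_1", "out", "myTensor", alias="myTensor_1"),
    Port("block_2", "in", "myTensor", alias="myTensor_1"),
    Port("block_2", "out", "myTensor", alias="myTensor_2"),
]
channels = infer_channels(ports)
# channels == [("block_1", "myTensor_1", "block_2")]
```

Both blocks use the same variable, yet only one channel is inferred, because the aliases alone decide which @out feeds which @in.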

In this sense, what you are doing to represent updates to variables is accurate. The variable is reused, and you are distinguishing via the aliases which state of the variable serves as input to each block. One thing you could do to clarify things is to provide aliases that describe the state of the variable at that point, e.g. "myDataSet_unnormalized" going into a block and "myDataSet_normalized" coming out (i.e. think of the alias as the name of the value rather than the name of the variable).
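
As an illustration of state-describing aliases, here is a small Python script annotated in that style. The block names `normalize` and `center`, the alias "myDataSet_centered", and the data values are made up for this sketch; the annotations follow the pattern from the thread and have not been run through YW, which may also expect an enclosing workflow block.

```python
myDataSet = [2.0, 6.0, 8.0]

# @begin normalize
# @in myDataSet @as myDataSet_unnormalized
# @out myDataSet @as myDataSet_normalized
peak = max(myDataSet)
myDataSet = [value / peak for value in myDataSet]  # scale so the largest value is 1.0
# @end normalize

# @begin center
# @in myDataSet @as myDataSet_normalized
# @out myDataSet @as myDataSet_centered
mean = sum(myDataSet) / len(myDataSet)
myDataSet = [value - mean for value in myDataSet]  # shift so the values average to zero
# @end center
```

The code reuses the single variable `myDataSet` throughout, while each alias names the value the variable holds at that point.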

There has been a request for an explicit YW tag for indicating that a code block updates the value of a variable. This would allow one, e.g., to distinguish between blocks that update variable values and blocks that filter values (without changing them). Does this sound useful? Note that you would still need to provide a unique alias somehow so that YW has unique names with which to wire up the blocks as you intend. Additionally, I don't know how one would distinguish updaters and filters graphically.

olyerickson commented 8 years ago

This is very interesting. What you are describing is (in essence) a type system for blocks.

One can think of a number of examples: updating, filtering, imputation, etc. Since these would be semantic web types, we'd probably want to associate a URI with each type. Off the top of my head, I could see...

@begin My_Code @uri codeblock_uri @typeuri uri_from_workflow_ontology 
:

... in which @uri is just an entity identifier for this block (to get it into some knowledge graph) and @typeuri is the uri of some class from an ontology describing e.g. analytic workflows. So for example it could specify multiple imputation, etc.

John

olyerickson commented 8 years ago

Or, as discussed in another issue, something like...

@begin BlockName @uri uri_assigned_to_block
@desc "Description of block"
@type typeName @uri uri_associated_with_type_of_block
:

yesworkflow-org / yw-prototypes #37