Workflow RO-Crate specification requirements

stuzart commented 4 years ago

A place to collect requirements for the Workflow RO-Crate requirements.

The current specification is described at https://github.com/workflowhub-eu/about/blob/master/Workflow-RO-Crate.md

DrYak commented 4 years ago

Shall we also consider some running environment information, such as docker ?

e.g.:

snakemake workflows such as our will very likely run best in snakemake's docker (with conda and snakemake preinstalled).
conda has the possibility to pre-download and generate conda environment for workflow dependencies (e.g.: in our cluster tutorial ). At some point in the future we're planning to release such ready-to-use dockers for our pipeline.

bedroesb commented 4 years ago

I had something similar in mind for galaxy, like, on which instance can we run this workflow

bedroesb commented 4 years ago

I had a small list with things:

Field for Openaire: Funding Reference
Example data (reference?)
tutorial link
instances/where to run?
docker container?
doi (coming from zenodo?)

simleo commented 4 years ago

Support for describing test cases

Hi!

One of the main things we're trying to figure out while developing the Life Monitor is how test description is going to look like. Up to now, we've created three workflow ro-crate examples, which don't assume anything other than what's already in the current specs, i.e., that the crate COULD contain a "test" subdir.

In practice, as a first stab at imagining how things could be structured, we've arranged the "test" dir in the examples to contain a params file that points at inputs and expected outputs. The check_*.py at the top level of the repo (e.g., check_cwl.py) serve as an example of how these could be used to drive test execution. At the moment, nothing has been formalized about the params file, not even the format (perhaps it should be a YAML file instead?) or whether it should be separate from the main ro-crate-metadata.jsonld.

Here are a couple of questions that came up while trying to reason on this:

What should go in the test description format: inputs, expected outputs, parameters (options), ...? Some kind of mapping between workflow I/O slots and test data seems like a minimum requirement.
Are parameters really a degree of freedom (e.g., Galaxy workflows have a "tool_state" field that presets all parameters)?
How should the test description format fit into the broader ro-crate metadata standard? Should it be part of the workflow ro-crate specs or be separate (i.e., with the "test" dir as a black box from the ro-crate standpoint, while its internals are described by test-specific specs). How about the even broader workflow hub metadata standard in general? Should seek/workflowhub know about tests in its data model, whether they have been uploaded as ro-crates or not (seems reasonable in the latter case, where the ro-crate is generated by the hub)?
How would the workflow author/uploader define tests for the workflow if it's not being uploaded as an RO-Crate? By uploading additional files? By filling in extra form fields (some uploading would be required anyway for test data)?

The last two bullets seem to suggest that ro-crates would make things much simpler whith respect to workflows uploaded as single files.

What do you think about all this? Let me know if you think this should be moved to a separate issue.

simleo commented 4 years ago

Update:

I have done more work on the current test metadata file structure and related code (see https://github.com/crs4/life_monitor/pull/10). I've also tried to better clarify the current state of things in https://github.com/crs4/life_monitor/issues/11, which can hopefully serve as a starting point for further discussion. Any feedback is more than welcome!

fbacall commented 4 years ago

@simleo sorry for very late reply. With regards to how this fits in the Workflow RO-Crate spec, I think the test descriptions should stay in the file you described, and the ro-crate-metadata.jsonld would direct consumers to that file using either a JSON-LD property (can't think of one immediately, but maybe there is one in bioschemas) or just a convention (e.g. the file should be located at test/params.jsonld).

For how users would add tests to a pre-uploaded workflow, I think they would almost always want define the tests externally rather than through a web form. Currently in Workflow Hub is it not possible to upload additional files after the fact, but we will support this soon.

stain commented 4 years ago

Considering @simleo's use case of test data https://github.com/crs4/life_monitor/wiki/Test-Metadata-Draft-Spec

I think we can use https://schema.org/potentialAction from the ComputationalWorkflow to link to a description of running it as an https://schema.org/Action (or even https://schema.org/AssessAction) and then from there link or expand @simleo specific requirements, e.g. inputs.

E.g.

On the Workflow contextual entity:

            "potentialAction": { "@id": "tests/test1.json" }

which we could describe as:

{ "@id": "tests/test1.json",
  "@type": "AssessAction",
  "agent": {
       "@id": "http://lifemonitor.example.com/"
   },
   "instrument": { "@id": "tests/inputs-job.json" },
   "target": { "@id": "#a4b022ac-abc0-46cc-bf75-3402f89a304e" }
},
{ "@id": "#a4b022ac-abc0-46cc-bf75-3402f89a304e",
  "@type": "EntryPoint",
  "urlTemplate": "http://lifemonitor.example.com/test/submit?wf=https://workflowhub.eu/workflows/53",
  "contentType": "application/json"
}

Here there is a cheaky @id on the usually abstract AssessAction to link to the JSON file that Life Monitor needs - you would need to look for actions with the life monitor as agent. I'll admit this is a bit of a hack as http://schema.org/agent does not seem to cover software agents at the moment

We may not need the EntryPoint but if there is a web service in Life Monitor that can be triggered, that is one way it could be described.

If all the things needed by https://github.com/crs4/life_monitor/wiki/Test-Metadata-Draft-Spec can be represented as Action or similar it can be lifted to the RO-Crate metadata file - but I think keep it separate for now. Also then for now you don't need to worry about making sure it's valid JSON-LD terms etc.

simleo commented 4 years ago

Our current assumption is indeed that the test metadata file will be separate, acting as a sort of "plug-in" that adds additional metadata to crates that include testing material. The idea is that basically Life Monitor only expects to find a test/test-metadata.json file in the crate, possibly with references to additional items such as inputs, expected outputs and test/job configuration files for a testing engine (all under the test dir). For instance, this prototype example contains an input and output file and a Planemo test configuration file. The test metadata file only references the Planemo file which, in turn, references the input/output files. This script is a self-contained toy example of interaction with such a crate.

We also assumed you'd link to the test metadata file in some special way (e.g., the Action example you just posted) in a future version of the Workflow RO-Crate specs. However, due to the fixed test/test-metadata.json path, and the fact that Workflow RO-Crates are currently agnostic with respect to the contents of the optional test subdir, a crate with tests can be built even now (as shown in the linked example above). This decoupling should offer maximum flexibility on both sides, allowing the test metadata specs to develop in possibly breaking ways without affecting the Workflow RO-Crate specs.

One thing that occurred to me after reading you post, though, is that we were somewhat assuming Life Monitor to be the only possible testing service, while one might want to add an arbitrary test layout to a crate. The agent entry you showcased seems to solve the problem, since it states that the testing material is meant for a specific service.

One thing I find strange about the entry point example is that it assumes the crate to be registered to (a specific instance of) the WorkflowHub, and with a specific ID. My understanding is that RO-Crates can exist independently from the WorkflowHub, so we are working under the assumption that information about the interaction with the Life Monitor API will be handled in the WorkflowHub backend, not in the crate. I.e., a user registers a new workflow on the WorkflowHub; the WorkflowHub backend makes a POST call to the Life Monitor API (containing a link to the RO-Crate and other info such as uuid, name and version); Life Monitor gets the RO-Crate, reads its contents and acts according to any testing material in it.

workflowhub-eu / about

Workflow RO-Crate specification requirements #10