whole-tale / wt-prov-model

Experiments, design documents, and prototypes supporting a provenance model for Tales and runs.

Graph showing the run as a whole and its inputs and outputs #10

Open tmcphillips opened 4 years ago

tmcphillips commented 4 years ago

The simplest visualization of a run we want to provide is one that shows the entire run as a single graph node (e.g. one box), with each of the run's inputs and outputs flowing into and out of that box.

By default the inputs and outputs should include only those files meaningful to the researcher, i.e. they should not include files provided by the operating system, installed software packages, or language runtimes.
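
For concreteness, the intended shape is roughly the following GraphViz DOT (a minimal sketch with hypothetical file names, not output produced by the tooling):

    // Minimal sketch of the desired view: the run as one box, with
    // researcher-meaningful inputs flowing in and outputs flowing out.
    // The file names here are hypothetical.
    digraph run_overview {
        rankdir=LR
        node [shape=box]
        run [style=filled, fillcolor=lightblue, label="run"]
        "input.csv" -> run
        run -> "results.csv"
        run -> "figure.png"
    }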

tmcphillips commented 4 years ago

One question here is how to handle files that are both written and read by processes within the run. A typical computational pipeline comprising multiple scripts or programs will have many such files, and a file written by one process and subsequently read by another is often considered to be among the "outputs" of the pipeline.

A file written by one process and consumed by another in the same run might not be considered an "input" to the run, however, particularly if that file did not exist before the run began.

To represent the above unambiguously, it would be very useful to know, for each file that is both read and written by a run, whether the file pre-existed the run or whether it was created by the run (either anew or by completely overwriting a pre-existing file). This distinction should also be clear to anyone looking at the resulting visualization.
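
One way a visualization could convey this, sketched below in DOT with hypothetical file names and styling choices: a file that is created by the run and then read by a later step appears only on the output side, marked here with a dashed border as not having existed before the run.

    // Sketch (hypothetical names and styles): cache.dat is written by one step of
    // the run and read by another.  Because it did not exist before the run began,
    // it is shown only as an output, with a dashed border meaning "created by the run".
    digraph run_rw_files {
        rankdir=LR
        node [shape=box]
        run [style=filled, fillcolor=lightblue]
        "input.txt" -> run
        run -> "output.txt"
        run -> "cache.dat"
        "cache.dat" [style=dashed]
    }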

tmcphillips commented 4 years ago

One option for disambiguating outputs is to support designated output directories. Files written to an output directory would be considered an output even if also read during the run.

tmcphillips commented 4 years ago

Here's the Prolog that yields the DOT file for the first version of the visualization of a run and its inputs and outputs:

    gv_graph('wt_run', 'Run Inputs and Outputs', 'LR'),

        wt_node_style__run,
        wt_node__run(),

        gv_borderless_cluster('inputs'),
            wt_node_style__file,
            wt_nodes__run_input_files(),
        gv_cluster_end,

        gv_borderless_cluster('outputs'),
            wt_node_style__file,
            wt_nodes__run_output_files(),
        gv_cluster_end,

        wt_edges__input_files_to_run(),
        wt_edges__run_to_output_files(),

    gv_graph_end.
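
The generated DOT is not reproduced here, but based on the query above it should have roughly the following shape: a graph named wt_run laid out left to right, with the input and output file nodes grouped into borderless clusters and connected to the single run node. The file names and style attributes below are illustrative; the actual values come from the wt_* predicates.

    // Illustrative sketch of the DOT structure produced by the query above.
    digraph wt_run {
        label="Run Inputs and Outputs"
        rankdir=LR
        node [shape=box]
        run [style=filled, fillcolor=lightblue]
        subgraph cluster_inputs  { peripheries=0; "inputs/input.txt" }
        subgraph cluster_outputs { peripheries=0; "outputs/output.txt" }
        "inputs/input.txt" -> run
        run -> "outputs/output.txt"
    }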

And the result for example 05-cat-file-to-file:

[figure: wt_run_inputs_outputs]

...which has the following run.sh:

    #!/bin/bash
    cat inputs/input.txt > outputs/output.txt

tmcphillips commented 4 years ago

The most unsatisfactory characteristic of the above visualization is that shared libraries used by the run are rendered alongside the run's legitimate input files, using the same node style.

Display of software libraries should be optional, and when shown they should have a different appearance from files explicitly read or written by the run.

ludaesch commented 4 years ago

That makes total sense to me! Another option would be to group a set of those into a block (cluster) node.
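
Combining the two suggestions, library nodes could be given a muted style and gathered into their own cluster whose display can be toggled. A DOT sketch of that idea (library names and styling choices are hypothetical):

    // Sketch: explicitly read/written files keep the normal style, while shared
    // libraries are drawn dashed and grey and grouped into a single cluster.
    digraph run_with_libs {
        rankdir=LR
        node [shape=box]
        run [style=filled, fillcolor=lightblue]
        "inputs/input.txt" -> run
        run -> "outputs/output.txt"
        subgraph cluster_libs {
            label="shared libraries"
            node [style=dashed, color=gray50, fontcolor=gray30]
            "libc.so.6"
            "ld-linux.so.2"
        }
        "libc.so.6" -> run [color=gray50]
        "ld-linux.so.2" -> run [color=gray50]
    }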

tmcphillips commented 4 years ago

As discussed in issue #11, an optional configuration file now declares the roles of files in different directories. For the 05-cat-file-to-file example above, this newtrace2facts.yml file...

    ---
    roles:
        os:
        - /lib
        - /etc
        - /usr/lib
        sw:
        - .
        - /bin
        in:
        - ./inputs
        out:
        - ./outputs

...cleans up the graph to the following:

[figure: wt_run_inputs_outputs]

We can parameterize the Prolog query that produces this graph to show or hide files playing different roles, grouped into different clusters.
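
A sketch (in DOT, with illustrative node names and styles) of what such a parameterized graph could look like for the 05-cat-file-to-file example, with one borderless cluster per role declared in the configuration; hiding a role then amounts to omitting its cluster and edges from the generated DOT:

    // Sketch: files grouped by role (in, out, sw) as declared in newtrace2facts.yml.
    // The os role is hidden here; any other role could be dropped the same way.
    digraph run_by_role {
        rankdir=LR
        node [shape=box]
        run [style=filled, fillcolor=lightblue]
        subgraph cluster_in  { peripheries=0; label="in";  "inputs/input.txt" }
        subgraph cluster_out { peripheries=0; label="out"; "outputs/output.txt" }
        subgraph cluster_sw  { peripheries=0; label="sw";  node [style=dashed]; "/bin/cat" }
        "inputs/input.txt" -> run
        run -> "outputs/output.txt"
        "/bin/cat" -> run [style=dashed]
    }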