tmcphillips opened this issue 4 years ago
One question here is how to handle files that are both written and read by processes comprising the run. A typical computational pipeline made up of multiple scripts or programs will have many such files, and a file written by one process and subsequently read by another is often considered among the "outputs" of the pipeline.
A file written by one process and consumed by another in the same run might not be considered an "input" to the run, however, particularly if that file did not exist before the run began.
To represent the above unambiguously, it would be very useful to know, for each file that is both read and written by a run, whether the file preexisted the run or was created by the run (either anew or by completely overwriting a preexisting file). This distinction should also be clear to anyone looking at the resulting visualization.
One option for disambiguating outputs is to support designated output directories. Files written to an output directory would be considered an output even if also read during the run.
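One way to make the preexisted-vs-created distinction concrete: if the first recorded access to a file is a read, the file preexisted the run; if it is a write, the run created (or wholly overwrote) it. A minimal Python sketch of that heuristic, assuming a hypothetical trace represented as chronologically ordered `(operation, path)` pairs (this is not the tool's actual trace format):

```python
def classify_read_write_files(trace):
    """For each file both read and written during a run, report whether it
    preexisted the run (first access is a read) or was created/overwritten
    by the run (first access is a write).

    `trace` is a chronologically ordered list of (operation, path) tuples,
    where operation is "read" or "write".

    Caveat: a first access that appends to an existing file would be
    misclassified as "created_by_run"; a real tracer would need open flags.
    """
    first_op = {}   # path -> first operation observed
    ops_seen = {}   # path -> set of operations observed
    for op, path in trace:
        first_op.setdefault(path, op)
        ops_seen.setdefault(path, set()).add(op)
    result = {}
    for path, ops in ops_seen.items():
        if ops == {"read", "write"}:
            result[path] = "preexisted" if first_op[path] == "read" else "created_by_run"
    return result
```

Files that are only read or only written are unambiguous already, so the sketch reports only the read-and-written cases that need disambiguation.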
Here's the Prolog that yields the DOT file for the first version of the visualization of a run and its outputs:
```prolog
gv_graph('wt_run', 'Run Inputs and Outputs', 'LR'),
wt_node_style__run,
wt_node__run(),
gv_borderless_cluster('inputs'),
wt_node_style__file,
wt_nodes__run_input_files(),
gv_cluster_end,
gv_borderless_cluster('outputs'),
wt_node_style__file,
wt_nodes__run_output_files(),
gv_cluster_end,
wt_edges__input_files_to_run(),
wt_edges__run_to_output_files(),
gv_graph_end.
```
And the result for example 05-cat-file-to-file:
…which has the following run.sh:

```bash
#!/bin/bash
cat inputs/input.txt > outputs/output.txt
```
The most unsatisfactory characteristic of the above visualization is that shared libraries used by the run are rendered alongside legitimate input files, using the same node style. Display of software libraries should be optional, and they should have a different appearance from files explicitly read or written by the executed run.
That makes total sense to me! Another option would be to group a set of those into a block (cluster) node.
As discussed in issue #11, an optional configuration file now declares the roles of files in different directories. For the 05-cat-file-to-file example above, the following newtrace2facts.yml file cleans up the graph:

```yaml
---
roles:
  os:
    - /lib
    - /etc
    - /usr/lib
  sw:
    - .
    - /bin
  in:
    - ./inputs
  out:
    - ./outputs
```
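To illustrate how such a roles declaration could drive the classification, here is a minimal Python sketch (the names and the longest-prefix rule are assumptions for illustration, not the tool's actual implementation) that maps a file path to a role using the longest matching directory prefix from the configuration above:

```python
import posixpath

# Mirrors the roles section of the newtrace2facts.yml example above.
ROLES = {
    "os": ["/lib", "/etc", "/usr/lib"],
    "sw": [".", "/bin"],
    "in": ["./inputs"],
    "out": ["./outputs"],
}

def role_of(path, roles=ROLES):
    """Assign a role to `path` by the longest matching directory prefix.

    The "." prefix matches any relative path, so more specific relative
    prefixes such as ./inputs take precedence over it.
    """
    path_n = posixpath.normpath(path)
    best_role, best_len = None, -1
    for role, prefixes in roles.items():
        for prefix in prefixes:
            norm = posixpath.normpath(prefix)
            if norm == ".":
                matches = not path_n.startswith("/")
            else:
                matches = path_n == norm or path_n.startswith(norm + "/")
            if matches and len(norm) > best_len:
                best_role, best_len = role, len(norm)
    return best_role
```

Longest-prefix matching is what lets the run's own directory default to the `sw` role while its `inputs/` and `outputs/` subdirectories override it.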
We can parameterize the Prolog query that produces this graph to show or hide files playing different roles, grouped in different clusters.
The simplest visualization of a run we want to provide is one that shows the entire run as a single graph node (e.g. one box), with each of the run's inputs and outputs flowing into and out of that box.
By default the inputs and outputs should include only those files meaningful to the researcher, i.e. should exclude files provided by the operating system, installed software packages, or language runtimes.
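As a sketch of what this simplest view could amount to, the following hypothetical Python function (not the actual Prolog implementation) emits GraphViz DOT for a single run node, attaching only files with the `in` and `out` roles and omitting `os`/`sw` files:

```python
def run_io_dot(run_name, files):
    """Emit GraphViz DOT for the simplest run view: one box for the run,
    'in'-role files flowing in, 'out'-role files flowing out.

    `files` maps path -> role; os/sw/runtime files are silently omitted.
    """
    lines = [
        "digraph run {",
        "rankdir=LR;",                    # left-to-right, as in the Prolog 'LR'
        f'"{run_name}" [shape=box];',
    ]
    for path, role in sorted(files.items()):
        if role == "in":
            lines.append(f'"{path}" -> "{run_name}";')
        elif role == "out":
            lines.append(f'"{run_name}" -> "{path}";')
    lines.append("}")
    return "\n".join(lines)
```

For the 05-cat-file-to-file example this yields a three-node graph: input.txt into the run box, output.txt out of it, with libc and friends excluded.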