whole-tale / wt-prov-model

Experiments, design documents, and prototypes supporting a provenance model for Tales and runs.

Configuration file declaring special directories #11

Open tmcphillips opened 4 years ago

tmcphillips commented 4 years ago

The various files accessed during a traced run can come from the base operating system, packages installed via the OS-level package managers, other software installed on the system, and scripts and other programs closely associated with the run itself.

An optional configuration file for rpz2prolog could declare the directories that contain files of each kind. This would supplement ReproZip's automatic detection of files belonging to particular installed packages.

This classification of files would be useful when creating visualizations that focus on different aspects of the run.

tmcphillips commented 4 years ago

Such a configuration file could start off looking like this, with room for more kinds of configuration info as needed:

---
dirs:
  os:
  - /lib
  - /etc
  - /usr/lib
  input:
  - ./inputs
  output:
  - ./outputs
  software:
  - /opt
  - ./scripts

remram44 commented 4 years ago

ReproZip uses a similar list: reprozip/tracer/linux_pkgs.py

It is used to restrict the detection of distribution packages, and the detection of experiment inputs/outputs.

tmcphillips commented 4 years ago

Yes, thank you! I've been playing with this as well. The detection of which files belong to which installed packages is particularly cool. I am planning on incorporating the config.yml contents into our model today so that we can query which processes in a run used a particular package.

The idea of the additional configuration file is to enable the user (or the developers of the Whole Tale framework) to provide authoritative classification of the files before the run takes place. For Whole Tale users in particular, all of the provenance management would be behind the scenes--end users will not have the opportunity to edit config.yml after the run.

One thing I observed is that the semantics of "inputs_outputs" in config.yml might be a little different from what I think of as run inputs and outputs. For example, for 06-hello-python, which just runs a simple Hello World example in Python, I see the following in config.yml:

inputs_outputs:
- name: reprounzip-1.0.16-py3.6-nspkg.pth
  path: /home/tmcphill/.venv/reprozip/lib/python3.5/site-packages/reprounzip-1.0.16-py3.6-nspkg.pth
  written_by_runs: []
  read_by_runs: [0]
- name: python
  path: /home/tmcphill/.venv/reprozip/bin/python
  written_by_runs: []
  read_by_runs: [0]
- name: pyvenv.cfg
  path: /home/tmcphill/.venv/reprozip/pyvenv.cfg
  written_by_runs: []
  read_by_runs: [0]
- name: python3
  path: /home/tmcphill/.venv/reprozip/bin/python3
  written_by_runs: []
  read_by_runs: [0]

With this user- (or Whole Tale)-provided configuration...

---
dirs:
    os:
    - /lib
    - /etc
    - /usr/lib
    sw:
    - .
    - /usr/bin
    - /usr/lib/python3.5
    - /home/tmcphill/.venv
    in:
    - ./inputs
    out:
    - ./outputs

...these Python files are classified as "sw" and can be filtered out in visualizations.
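
A minimal sketch of how such a classification could work, written in Python for illustration. It is not the rpz2prolog implementation. The names `DIRS` and `classify`, the working directory `/home/tmcphill/run`, and the rule that the longest matching directory wins (so that /usr/lib/python3.5 overrides /usr/lib) are all assumptions made for this example.

```python
from pathlib import PurePosixPath

# Hypothetical in-memory form of the "dirs" section shown above.
DIRS = {
    "os": ["/lib", "/etc", "/usr/lib"],
    "sw": [".", "/usr/bin", "/usr/lib/python3.5", "/home/tmcphill/.venv"],
    "in": ["./inputs"],
    "out": ["./outputs"],
}


def classify(path, dirs, cwd):
    """Return the kind whose declared directory is the longest prefix of
    `path`, or None if no declared directory contains it.

    Relative directories are resolved against `cwd` (assumed here to be
    the working directory of the run).
    """
    target = PurePosixPath(path)
    best_kind, best_depth = None, -1
    for kind, roots in dirs.items():
        for root in roots:
            # Joining with an absolute root discards cwd; "." collapses to cwd.
            root_path = PurePosixPath(cwd) / root
            if target == root_path or root_path in target.parents:
                depth = len(root_path.parts)
                # Assumption: the most specific (deepest) directory wins.
                if depth > best_depth:
                    best_kind, best_depth = kind, depth
    return best_kind


if __name__ == "__main__":
    cwd = "/home/tmcphill/run"  # hypothetical run directory
    for p in [
        "/home/tmcphill/.venv/reprozip/bin/python",
        "/home/tmcphill/.venv/reprozip/pyvenv.cfg",
        "/usr/lib/python3.5/os.py",
        "/home/tmcphill/run/inputs/data.csv",
    ]:
        print(p, "->", classify(p, DIRS, cwd))
```

Under these assumptions, the virtualenv files listed under inputs_outputs above all fall under /home/tmcphill/.venv and come out as "sw", while a file under ./inputs is classified as "in" even though "." is also declared as "sw".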

Is there a way to provide ReproZip with hints of this kind prior to tracing a run?

Thanks!

remram44 commented 4 years ago

I am planning on adding language/tool-specific patterns (for example for Python), which would recognize Python environments and remove those files from inputs/outputs. In fact, I want it to recognize Python packages and their versions, in addition to distribution packages. But this is not implemented yet.

The intended use of ReproZip is trace -> edit packing config -> pack. Adding this configuration before running would complicate that workflow into provide tracing config -> trace -> edit packing config -> pack. Of course, editing the packing config is a hassle when a lot of files are wrongly included, but there's a trade-off here that I'm not sure how to deal with.