whole-tale / wt-prov-model

Experiments, design documents, and prototypes supporting a provenance model for Tales and runs.
MIT License

Option to simplify paths extracted from a ReproZip-traced run #7

Open tmcphillips opened 4 years ago

tmcphillips commented 4 years ago

By default, we can trim the working directory from the paths of files and directories below it. Optionally, we can also trim an arbitrary prefix from the working directory itself.

For example, the facts corresponding to a run of Hello World currently look like this:

%---------------------------------------------------------------------------------------------------
% FACT: rpz_process(ProcessID, ParentID, RunID, IsThread, ExitCode, TimeStamp).
%---------------------------------------------------------------------------------------------------
rpz_process(p1, nil, r0, false, 0, 159626246090098).

%---------------------------------------------------------------------------------------------------
% FACT: rpz_executed_file(ExecutionID, RunID, ProcessID, Program, Argv, WorkingDir, TimeStamp).
%---------------------------------------------------------------------------------------------------
rpz_executed_file(e1, r0, p1, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/03-hello-c/./bin/hello_c", "./bin/hello_c", "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/03-hello-c", 159626252507798).

%---------------------------------------------------------------------------------------------------
% FACT: rpz_opened_file(FileID, RunID, ProcessID, File, Mode, IsDirectory, Timestamp).
%---------------------------------------------------------------------------------------------------
rpz_opened_file(f1, r0, p1, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/03-hello-c", 4, true, 159626246096298).
rpz_opened_file(f2, r0, p1, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/03-hello-c", 4, true, 159626246098698).
rpz_opened_file(f3, r0, p1, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/03-hello-c/bin/hello_c", 1, false, 159626255197898).
rpz_opened_file(f4, r0, p1, "/lib/x86_64-linux-gnu/ld-2.24.so", 1, false, 159626255218998).
rpz_opened_file(f5, r0, p1, "/etc/ld.so.cache", 1, false, 159626256427998).
rpz_opened_file(f6, r0, p1, "/lib/x86_64-linux-gnu/libc.so.6", 1, false, 159626257712298).

Replacing the working directory in each path with a single dot (.) and trimming the system-specific prefix /mnt/c/Users/tmcphill/OneDrive/GitRepos from the working directory would give:

%---------------------------------------------------------------------------------------------------
% FACT: rpz_process(ProcessID, ParentID, RunID, IsThread, ExitCode, TimeStamp).
%---------------------------------------------------------------------------------------------------
rpz_process(p1, nil, r0, false, 0, 159626246090098).

%---------------------------------------------------------------------------------------------------
% FACT: rpz_executed_file(ExecutionID, RunID, ProcessID, Program, Argv, WorkingDir, TimeStamp).
%---------------------------------------------------------------------------------------------------
rpz_executed_file(e1, r0, p1, "./bin/hello_c", "./bin/hello_c", "wt-prov-model/examples/03-hello-c", 159626252507798).

%---------------------------------------------------------------------------------------------------
% FACT: rpz_opened_file(FileID, RunID, ProcessID, File, Mode, IsDirectory, Timestamp).
%---------------------------------------------------------------------------------------------------
rpz_opened_file(f1, r0, p1, ".", 4, true, 159626246096298).
rpz_opened_file(f2, r0, p1, ".", 4, true, 159626246098698).
rpz_opened_file(f3, r0, p1, "./bin/hello_c", 1, false, 159626255197898).
rpz_opened_file(f4, r0, p1, "/lib/x86_64-linux-gnu/ld-2.24.so", 1, false, 159626255218998).
rpz_opened_file(f5, r0, p1, "/etc/ld.so.cache", 1, false, 159626256427998).
rpz_opened_file(f6, r0, p1, "/lib/x86_64-linux-gnu/libc.so.6", 1, false, 159626257712298).
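
The rewrite shown above can be sketched in a few lines of Python. This is only an illustration of the proposed trimming, not the rpz2prolog implementation; the function names simplify_path and trim_prefix are hypothetical, and the sketch works purely on path strings without consulting the filesystem.

```python
from pathlib import PurePosixPath


def simplify_path(path, working_dir):
    """Rewrite a path at or below working_dir relative to '.'.

    Paths outside working_dir are returned unchanged.
    (Hypothetical helper, purely lexical: symlinks are not resolved.)
    """
    try:
        rel = str(PurePosixPath(path).relative_to(PurePosixPath(working_dir)))
    except ValueError:
        return path  # not below the working directory
    return "." if rel == "." else "./" + rel


def trim_prefix(working_dir, prefix):
    """Drop a system-specific prefix from the working directory, if present."""
    try:
        return str(PurePosixPath(working_dir).relative_to(PurePosixPath(prefix)))
    except ValueError:
        return working_dir  # prefix does not apply
```

Because the comparison is lexical, a path that reaches the working directory through a symbolic link would not be trimmed by this sketch.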

tmcphillips commented 4 years ago

One problem with this is that paths that contain a symbolic link seem to be represented differently in different rows of the opened_files table in the ReproZip trace. For example, the path to the 01-date-cmd example directory in some rows is represented as:

/home/tmcphill/GitRepos/wt-prov-model/examples/01-date-cmd

and in others as

/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/01-date-cmd

The file /home/tmcphill/GitRepos is a symbolic link to /mnt/c/Users/tmcphill/OneDrive/GitRepos.

The executed_files table gives the former as the working directory, so only some of the paths starting from this directory in executed_files will appear to have this prefix.

remram44 commented 4 years ago

The SQLite database contains the paths as they were accessed by the processes. That is deterministic given the filesystem and the program.

The paths are expanded when writing the configuration file by this code, which will include the final target but also mark all the links traversed as "read".

tmcphillips commented 4 years ago

This is very useful, thanks!

I'm mainly interested in visualizing and querying the process-data graph. Is there a way, using only information in the trace, to determine that two file paths resolve to the same file on the filesystem, even if they take different routes (e.g. one traverses a symbolic link while the other does not)? When I change run.sh in 04-date-to-file to the following...

#!/bin/bash
date > outputs/date.txt
cat `pwd`/outputs/date.txt

... and then type make in that directory, I see these two rows exported from the SQLite database indicating that date.txt is written and then read via two distinct paths:

% FACT: rpz_opened_file(FileID, RunID, ProcessID, File, Mode, IsDirectory, Timestamp).
rpz_opened_file(f36, r0, p2, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt", 2, false, nil).
rpz_opened_file(f59, r0, p4, "/home/tmcphill/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt", 1, false, nil).

However, it's not obvious from the above that this is actually the same file.

In config.yml the file is listed under inputs_outputs as follows:

inputs_outputs:
- name: date.txt
  path: /mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt
  written_by_runs: [0]
  read_by_runs: []

Any suggestions? Thanks, I really appreciate your help!

remram44 commented 4 years ago

Unfortunately the trace.sqlite3 database itself only contains information about the calls the process made. You are going to need to get the list of symlinks from the filesystem or the DATA.tar.gz (since I realize it's only written to config.yml as comments... oops)
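
One way to collect the symlinks from the filesystem for a given traced path is to test each leading portion of the path in turn. The sketch below is an assumption about how this could be done, not ReproZip code; symlinks_traversed is a hypothetical name, and the function only reports links that appear as prefixes of the path as written (it does not chase links inside link targets).

```python
import os


def symlinks_traversed(path):
    """Return (link, target) pairs for every prefix of an absolute path
    that is itself a symbolic link. (Hypothetical helper.)"""
    links = []
    current = os.sep
    for part in path.strip(os.sep).split(os.sep):
        if not part:
            continue  # skip empty components from doubled separators
        current = os.path.join(current, part)
        if os.path.islink(current):
            # readlink returns the target exactly as stored in the link
            links.append((current, os.readlink(current)))
    return links
```

This has to run on the machine where the trace was recorded, while the links still exist.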

tmcphillips commented 4 years ago

Thanks!

Resolving the paths in question to inode numbers seems to confirm they point to the same file:

04-date-to-file$ ls -il  outputs/date.txt `pwd`/outputs/date.txt  /mnt/c/Users/tmcphill/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt
36591746972505925 -rw-r--r-- 1 tmcphill tmcphill 29 Jan 27 22:28 /home/tmcphill/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt
36591746972505925 -rw-r--r-- 1 tmcphill tmcphill 29 Jan 27 22:28 /mnt/c/Users/tmcphill/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt
36591746972505925 -rw-r--r-- 1 tmcphill tmcphill 29 Jan 27 22:28 outputs/date.txt

Qualified with the device ID, this might be enough to solve the identity problem. I'll try it out.
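
A minimal sketch of that idea, assuming the traced files still exist on the filesystem: os.stat follows symbolic links, so the pair (st_dev, st_ino) identifies the file a path finally resolves to. The helper names file_identity and group_by_identity are hypothetical.

```python
import os
from collections import defaultdict


def file_identity(path):
    """Return (device ID, inode number) of the file a path resolves to."""
    st = os.stat(path)  # follows symbolic links
    return (st.st_dev, st.st_ino)


def group_by_identity(paths):
    """Group paths that resolve to the same file on the filesystem."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_identity(p)].append(p)
    return dict(groups)
```

Two paths that land in the same group refer to the same file, regardless of whether either one traverses a symbolic link.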

tmcphillips commented 4 years ago

The rpz2prolog program now exports a fourth table of facts, rpz_accessed_file, which assigns a FileIndex attribute to each executed or opened file such that FileIndex is the same for two file accesses if the inode numbers for the access paths are the same.

In this new table, it is clear that the two alternative paths to date.txt refer to the same file (both have FileIndex i5):

%---------------------------------------------------------------------------------------------------
% FACT: rpz_accessed_file(ID, FilePath, FileIndex).
%---------------------------------------------------------------------------------------------------
rpz_accessed_file(e1, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file/./run.sh", i1).
rpz_accessed_file(e2, "/bin/date", i2).
rpz_accessed_file(e3, "/bin/cat", i3).
rpz_accessed_file(o34, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file", i4).
rpz_accessed_file(o35, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt", i5).
rpz_accessed_file(o36, "/lib/x86_64-linux-gnu/ld-2.24.so", i6).
rpz_accessed_file(o37, "/etc/ld.so.cache", i7).
rpz_accessed_file(o38, "/lib/x86_64-linux-gnu/libc.so.6", i8).
rpz_accessed_file(o39, "/usr/lib/locale/locale-archive", i9).
rpz_accessed_file(o40, "/etc/localtime", i10).
rpz_accessed_file(o41, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file", i4).
rpz_accessed_file(o53, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file", i4).
rpz_accessed_file(o54, "/lib/x86_64-linux-gnu/ld-2.24.so", i6).
rpz_accessed_file(o55, "/etc/ld.so.cache", i7).
rpz_accessed_file(o56, "/lib/x86_64-linux-gnu/libc.so.6", i8).
rpz_accessed_file(o57, "/usr/lib/locale/locale-archive", i9).
rpz_accessed_file(o58, "/home/tmcphill/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt", i5).
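
The index assignment can be sketched as follows. This is an illustration under assumptions, not the actual rpz2prolog code: assign_file_indexes and to_fact are hypothetical names, and identity stands for any function mapping a path to a hashable key such as a (device ID, inode number) pair.

```python
def assign_file_indexes(accesses, identity):
    """Give each access a FileIndex that is shared by all accesses
    whose paths map to the same identity key.

    accesses: iterable of (access_id, path) pairs
    identity: callable mapping a path to a hashable key (assumed here
              to be something like (st_dev, st_ino))
    """
    index_of = {}
    rows = []
    for access_id, path in accesses:
        key = identity(path)
        if key not in index_of:
            # indexes are handed out in order of first appearance
            index_of[key] = f"i{len(index_of) + 1}"
        rows.append((access_id, path, index_of[key]))
    return rows


def to_fact(access_id, path, file_index):
    """Format one row as an rpz_accessed_file fact."""
    return f'rpz_accessed_file({access_id}, "{path}", {file_index}).'
```

Passing the identity function in as a parameter keeps the numbering logic separate from the filesystem lookup, so it can be exercised with a plain dictionary.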

tmcphillips commented 4 years ago

The FileIndex (based on inode number) is now used to trim the working directory (of the first process) from each accessed file path. The alternative paths to outputs/date.txt in the example above now disappear in this case:

%---------------------------------------------------------------------------------------------------
% FACT: rpz_accessed(ID, FilePath, FileIndex).
%---------------------------------------------------------------------------------------------------
rpz_accessed(e1, "./run.sh", i2).
rpz_accessed(e2, "/bin/date", i3).
rpz_accessed(e3, "/bin/cat", i5).
rpz_accessed(o35, ".", i1).
rpz_accessed(o36, "./outputs/date.txt", i6).
rpz_accessed(o37, "/lib/x86_64-linux-gnu/ld-2.24.so", i8).
rpz_accessed(o38, "/etc/ld.so.cache", i11).
rpz_accessed(o39, "/lib/x86_64-linux-gnu/libc.so.6", i13).
rpz_accessed(o40, "/usr/lib/locale/locale-archive", i14).
rpz_accessed(o41, "/etc/localtime", i18).
rpz_accessed(o42, ".", i1).
rpz_accessed(o54, ".", i1).
rpz_accessed(o55, "/lib/x86_64-linux-gnu/ld-2.24.so", i8).
rpz_accessed(o56, "/etc/ld.so.cache", i11).
rpz_accessed(o57, "/lib/x86_64-linux-gnu/libc.so.6", i13).
rpz_accessed(o58, "/usr/lib/locale/locale-archive", i14).
rpz_accessed(o59, "./outputs/date.txt", i6).
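
One way such inode-based trimming could work is sketched below. This is an assumption about the approach rather than the rpz2prolog code, and trim_working_dir is a hypothetical name. The function walks up the ancestors of a path until it finds one whose device ID and inode number match those of the working directory, then replaces that ancestor with a dot.

```python
import os


def _identity(path):
    """(device ID, inode number) of the file a path resolves to."""
    st = os.stat(path)  # follows symbolic links
    return (st.st_dev, st.st_ino)


def trim_working_dir(path, working_dir):
    """Replace the part of path that resolves to working_dir with '.'.

    Paths with no ancestor matching working_dir are returned unchanged.
    (Hypothetical helper; requires the files to exist on disk.)
    """
    wd_id = _identity(working_dir)
    rest = []  # components below the current ancestor, innermost first
    current = path
    while True:
        try:
            if _identity(current) == wd_id:
                return os.path.join(".", *reversed(rest)) if rest else "."
        except OSError:
            pass  # ancestor cannot be stat'ed; keep walking up
        parent, name = os.path.split(current)
        if parent == current:  # reached the root (or an empty path)
            return path
        if name not in ("", "."):
            rest.append(name)
        current = parent
```

Because the match is made on file identity rather than on the path string, both spellings of the working directory (with and without the symbolic link) collapse to the same relative path.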