Open tmcphillips opened 4 years ago
One problem with this is that paths that contain a symbolic link seem to be represented differently in different rows of the opened_files table in the ReproZip trace. For example, the path to the 01-date-cmd example directory in some rows is represented as:
/home/tmcphill/GitRepos/wt-prov-model/examples/01-date-cmd
and in others as
/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/01-date-cmd
The file /home/tmcphill/GitRepos is a symbolic link to /mnt/c/Users/tmcphill/OneDrive/GitRepos.
The executed_files table gives the former as the working directory, so only some of the paths starting from this directory in executed_files will appear to have this prefix.
The SQLite database contains the paths as they were accessed by the processes. That is deterministic given the filesystem and the program.
The paths are expanded when writing the configuration file by this code, which will include the final target but also mark all the links traversed as "read".
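To illustrate the kind of expansion described above, here is a minimal Python sketch that walks a path one component at a time, records each symbolic link it meets, and returns the final target. It is not ReproZip's code; the function name traversed_symlinks is hypothetical, and links nested inside a link's own target are resolved but not reported separately.

```python
import os

def traversed_symlinks(path):
    """Return (final_target, links) for `path`.

    `links` lists every symbolic link met while walking the path
    component by component; `final_target` is the resolved location.
    Hypothetical helper, not part of ReproZip.
    """
    links = []
    current = os.sep
    # abspath() normalises "." and ".." lexically, which is an approximation.
    for part in os.path.abspath(path).split(os.sep):
        if not part:
            continue
        current = os.path.join(current, part)
        if os.path.islink(current):
            links.append(current)
            # Continue the walk from the link's resolved target.
            current = os.path.realpath(current)
    return current, links
```

Applied to a path under /home/tmcphill/GitRepos on the system described above, one would expect the link list to contain /home/tmcphill/GitRepos and the target to start with /mnt/c/Users/tmcphill/OneDrive/GitRepos.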
This is very useful, thanks!
I'm mainly interested in visualizing and querying the process-data graph. Is there a way, using only info in the trace, to determine that two file paths resolved to the same file on the filesystem, even if they take different routes (e.g. one traverses a symbolic link while the other does not)? When I change run.sh in 04-date-to-file to the following...
#!/bin/bash
date > outputs/date.txt
cat `pwd`/outputs/date.txt
... and then type make in that directory, I see these two rows exported from the SQLite database indicating that date.txt is written and then read via two distinct paths:
% FACT: rpz_opened_file(FileID, RunID, ProcessID, File, Mode, IsDirectory, Timestamp).
rpz_opened_file(f36, r0, p2, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt", 2, false, nil).
rpz_opened_file(f59, r0, p4, "/home/tmcphill/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt", 1, false, nil).
However, it's not obvious from the above that this is actually the same file.
In config.yml the file is listed under inputs_outputs as follows:
inputs_outputs:
- name: date.txt
  path: /mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt
  written_by_runs: [0]
  read_by_runs: []
Any suggestions? Thanks, I really appreciate your help!
Unfortunately the trace.sqlite3 database itself only contains information about the calls the process made. You are going to need to get the list of symlinks from the filesystem or the DATA.tar.gz (since I realize it's only written to config.yml as comments... oops)
Thanks!
Resolving the paths in question to inode numbers seems to confirm they point to the same file:
04-date-to-file$ ls -il outputs/date.txt `pwd`/outputs/date.txt /mnt/c/Users/tmcphill/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt
36591746972505925 -rw-r--r-- 1 tmcphill tmcphill 29 Jan 27 22:28 /home/tmcphill/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt
36591746972505925 -rw-r--r-- 1 tmcphill tmcphill 29 Jan 27 22:28 /mnt/c/Users/tmcphill/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt
36591746972505925 -rw-r--r-- 1 tmcphill tmcphill 29 Jan 27 22:28 outputs/date.txt
Qualified with the device ID, this might be enough to solve the identity problem. I'll try it out.
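The same check can be sketched in Python: two paths name the same file when their (device ID, inode number) pairs match. The helper names file_identity and same_file are made up for illustration; os.stat follows symbolic links, and the standard library's os.path.samefile performs an equivalent comparison.

```python
import os

def file_identity(path):
    """Return the (device ID, inode number) pair for `path`.

    os.stat() follows symbolic links, so every route to a file
    yields the same pair.
    """
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

def same_file(path_a, path_b):
    """True if both paths resolve to the same file on the filesystem."""
    return file_identity(path_a) == file_identity(path_b)
```

This only works while the files still exist on the system where the trace was recorded, because the identity comes from the filesystem rather than from the trace.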
The rpz2prolog program now exports a fourth table of facts, rpz_accessed_file, which assigns a FileIndex attribute to each executed or opened file such that FileIndex is the same for two file accesses if the inode numbers for the access paths are the same.
In this new table, it is clear that the two alternative paths to date.txt refer to the same file (both have FileIndex i5):
%---------------------------------------------------------------------------------------------------
% FACT: rpz_accessed_file(ID, FilePath, FileIndex).
%---------------------------------------------------------------------------------------------------
rpz_accessed_file(e1, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file/./run.sh", i1).
rpz_accessed_file(e2, "/bin/date", i2).
rpz_accessed_file(e3, "/bin/cat", i3).
rpz_accessed_file(o34, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file", i4).
rpz_accessed_file(o35, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt", i5).
rpz_accessed_file(o36, "/lib/x86_64-linux-gnu/ld-2.24.so", i6).
rpz_accessed_file(o37, "/etc/ld.so.cache", i7).
rpz_accessed_file(o38, "/lib/x86_64-linux-gnu/libc.so.6", i8).
rpz_accessed_file(o39, "/usr/lib/locale/locale-archive", i9).
rpz_accessed_file(o40, "/etc/localtime", i10).
rpz_accessed_file(o41, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file", i4).
rpz_accessed_file(o53, "/mnt/c/Users/tmcphill/OneDrive/GitRepos/wt-prov-model/examples/04-date-to-file", i4).
rpz_accessed_file(o54, "/lib/x86_64-linux-gnu/ld-2.24.so", i6).
rpz_accessed_file(o55, "/etc/ld.so.cache", i7).
rpz_accessed_file(o56, "/lib/x86_64-linux-gnu/libc.so.6", i8).
rpz_accessed_file(o57, "/usr/lib/locale/locale-archive", i9).
rpz_accessed_file(o58, "/home/tmcphill/GitRepos/wt-prov-model/examples/04-date-to-file/outputs/date.txt", i5).
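One way such an index could be assigned is sketched below in Python. This is an assumption about the approach, not the rpz2prolog implementation: each distinct (device ID, inode number) pair gets the next label i1, i2, ... in order of first appearance, and a path that can no longer be stat'ed is keyed by its own text. The function name assign_file_indexes is hypothetical.

```python
import os

def assign_file_indexes(paths):
    """Pair each accessed path with a FileIndex label.

    Paths that resolve to the same (device, inode) share a label.
    Hypothetical sketch; not taken from rpz2prolog.
    """
    index_of = {}   # identity key -> "iN" label
    facts = []
    for path in paths:
        try:
            st = os.stat(path)              # follows symbolic links
            key = (st.st_dev, st.st_ino)
        except OSError:
            key = ("missing", path)         # fall back to the path text
        label = index_of.setdefault(key, "i%d" % (len(index_of) + 1))
        facts.append((path, label))
    return facts
```

Each resulting (path, label) pair corresponds to the FilePath and FileIndex columns of one fact in the listing above.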
The FileIndex (based on inode number) is now used to trim the working directory (of the first process) from each accessed file path. The alternative paths to outputs/date.txt in the example above now disappear in this case:
%---------------------------------------------------------------------------------------------------
% FACT: rpz_accessed(ID, FilePath, FileIndex).
%---------------------------------------------------------------------------------------------------
rpz_accessed(e1, "./run.sh", i2).
rpz_accessed(e2, "/bin/date", i3).
rpz_accessed(e3, "/bin/cat", i5).
rpz_accessed(o35, ".", i1).
rpz_accessed(o36, "./outputs/date.txt", i6).
rpz_accessed(o37, "/lib/x86_64-linux-gnu/ld-2.24.so", i8).
rpz_accessed(o38, "/etc/ld.so.cache", i11).
rpz_accessed(o39, "/lib/x86_64-linux-gnu/libc.so.6", i13).
rpz_accessed(o40, "/usr/lib/locale/locale-archive", i14).
rpz_accessed(o41, "/etc/localtime", i18).
rpz_accessed(o42, ".", i1).
rpz_accessed(o54, ".", i1).
rpz_accessed(o55, "/lib/x86_64-linux-gnu/ld-2.24.so", i8).
rpz_accessed(o56, "/etc/ld.so.cache", i11).
rpz_accessed(o57, "/lib/x86_64-linux-gnu/libc.so.6", i13).
rpz_accessed(o58, "/usr/lib/locale/locale-archive", i14).
rpz_accessed(o59, "./outputs/date.txt", i6).
By default, we can trim the working directory from the paths to files or directories below that directory. We can also trim an arbitrary prefix from the working directory itself.
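Both kinds of trimming can be sketched in Python as follows. Note the substitution: the thread describes matching on FileIndex (inode number), while this sketch resolves symbolic links with os.path.realpath instead, which should agree in the common case. The function names and the exact output convention (a leading "./", a relative remainder after prefix removal) are assumptions made for illustration.

```python
import os

def trim_working_dir(path, working_dir):
    """Rewrite `path` relative to `working_dir` when it lies at or below it.

    Symbolic links are resolved on both sides, so alternative routes to
    the same file collapse to one relative path. Paths outside the
    working directory are returned unchanged.
    """
    real_path = os.path.realpath(path)
    real_wd = os.path.realpath(working_dir)
    if real_path == real_wd:
        return "."
    if real_path.startswith(real_wd + os.sep):
        return os.path.join(".", os.path.relpath(real_path, real_wd))
    return path

def trim_prefix(working_dir, prefix):
    """Strip a system-specific leading directory from `working_dir`."""
    prefix = prefix.rstrip("/")
    if working_dir.startswith(prefix + "/"):
        return working_dir[len(prefix) + 1:]
    return working_dir
```

With the paths from this thread, trim_prefix applied to the 04-date-to-file working directory and the prefix /mnt/c/Users/tmcphill/OneDrive/GitRepos would leave wt-prov-model/examples/04-date-to-file.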
For example, the facts corresponding to a run of Hello World currently look like this:
Replacing the working directory in each path with . and trimming the system-specific directory /mnt/c/Users/tmcphill/OneDrive/GitRepos from the working directory would give: