whole-tale / wt_versioning

0 stars 1 forks source link

Hard links are used to snapshot RRun workspace #50

Open Xarthisius opened 1 year ago

Xarthisius commented 1 year ago

We're using self.snapshotRecursive on version/workspace to create run/workspace. That's wrong cause it makes a hardlink instead of a copy. Can you guess the result?

How to reproduce?

  1. On your local machine, create a file:
    echo "This and nothing else" > /tmp/file_that_should_have_one_line
  2. On your local machine, create run.sh with the following content:
    #!/bin/sh
    echo "This shouldn't be here..." >> file_that_should_have_one_line
    echo "Nor this. I'm a very sad panda now." >> file_that_should_have_one_line
  3. Create a Tale, upload file_that_should_have_one_line and run.sh to Tale's workspace, perform Recorded Run

Actual Result

file_that_should_have_one_line in 1) Tale's active workspace, 2) version's immutable workspace and 3) RRun's workspace contains:

This and nothing else
This shouldn't be here...
Nor this. I'm a very sad panda now.

Expected Result

file_that_should_have_one_line in 1) Tale's active workspace, 2) version immutable workspace contains:

This and nothing else

whereas file_that_should_have_one_line has 3 lines in RRun's workspace.

hategan commented 1 year ago

That's wrong cause it makes a hardlink instead of a copy. Can you guess the result?

Hmm, once we ran tales in containers which mounted workspace through davfs. It looks like that was essential for the overall scheme to work. The design doc (https://docs.google.com/document/d/1b2xZtIYvgVXz7EVeV-C18So_a7QLGg59dPQMxvBcA5o/edit#) mentions this (and limitations thereof) explicitly. In particular, the pseudocode uses a copy operation that could be replaced with a hard link for efficiency but only if copy-on-write semantics are in place when writing to files in workspace during normal operations. Copy-on-write was more or less an accident with our webdav client, but it did fit the bill.

Under the assumption that workspace only contains source-like files, it would still be significantly more efficient to do copy-on-write when modifying workspace files than when creating versions/runs, since it is unlikely that every workspace file is modified with every version/run.

Xarthisius commented 1 year ago

Yeah, that's definitely my fault. I didn't realize the implication of switching to bind mounts. I see 3 options:

  1. Revert to WebDav (which I'm doing right now to prevent data corruption in versions)
  2. Keep the usage of hardlinks in versions, but implement mounts with COW (overlayfs with syncing back works quite nicely for that)
  3. Change versions backing to something that's not using hard links (git?)

1. is definitely out of the question long term. I'm only bringing it back as an emergency fix. We can ponder 2 or 3 once the dust settles.

hategan commented 1 year ago

Yeah, that's definitely my fault. I didn't realize the implication of switching to bind mounts.

Not my point. Because I didn't see it either. And we might want to leave a warning sign in a more conspicuous place than some dusty doc in the googles.

[...]

  1. Revert to WebDav (which I'm doing right now to prevent data corruption in versions)
  2. Keep the usage of hardlinks in versions, but implement mounts with COW (overlayfs with syncing back works quite nicely for that)
  3. Change versions backing to something that's not using hard links (git?)

1. is definitely out of the question long term. I'm only bringing it back as an emergency fix. We can ponder 2 or 3 once the dust settles.

  1. Use copy instead of hard link in snapshotRecursive.
  2. Use (4) when creating a run, but hard link for snapshotting versions.

I agree that (1) is not ideal long term. I like (2). We are ultimately doing trickery for performance reasons. So, while using git as a backing store might work, I'm not sure (3) will work well. Even the current scheme is not quite right, since every snapshot is O(n) in the number of files in the workspace even if no file has been touched, but any solution based on a generic FS will probably have to be at least that.