wolfcw / libfaketime

libfaketime modifies the system time for a single application
https://github.com/wolfcw/libfaketime
GNU General Public License v2.0
2.64k stars 321 forks source link

Parsing of faketimerc breaks under some circumstances #392

Closed abrudin closed 1 year ago

abrudin commented 2 years ago

We have been using libfaketime for years in our docker-based integration tests, but since yesterdays commit/pr https://github.com/wolfcw/libfaketime/pull/391 we now get intermittent errors when libfaketime tries to read from the file:

libfaketime: In parse_ft_string(), failed to parse FAKETIME timestamp.
Please check specification  with format %Y-%m-%d %T

We have not done any changes to the tests, except that we always use the latest master of libfaketime.

To me it is not obvious what is going on, if there is some race condition or other reason for not being able to read the file.

Noticably, I realize that we seem to be slightly inconsistent whether or not we use a newline at the end of the file when writing to it, maybe there could be something going on there (e.g. if the format is decided once and for all now, whereas previously it was decided at parse time??)

Reverting to 2bfbe19f719c34f07656cd1513b79209f7d22b61 makes everything work again.

Not working: e7ca8378ca6a64e14408849ccdc1e33be4109007 Working: 2bfbe19f719c34f07656cd1513b79209f7d22b61

psychon commented 2 years ago

Hi,

thanks for the ping.

Please check specification with format %Y-%m-%d %T

Judging from the code, the above means that an empty string was read from the file: https://github.com/wolfcw/libfaketime/blob/e7ca8378ca6a64e14408849ccdc1e33be4109007/src/libfaketime.c#L2430-L2431

Since I obviously think I am flawless ( ;-) ), here is an attempt at finding the error on your end:

How exactly are you writing the file? Are you opening the file for writing (which truncated the file to length zero) and then write to it? If so, libfaketime could be re-reading the file just after the file size became zero and before you wrote to it. Can you easily change your code to write to another file and then use rename() to atomically replace the file?

Why would this happen? Well, my guess would be a race: fopen() / fclose() allocate memory and thus are a tiny bit slower than the new code that does not allocate memory. That's why this always worked and now doesn't. If you want, you could try adding

fclose(fopen(custom_filename, "rt"));
fclose(fopen(filename, "rt"));
fclose(fopen("/etc/faketimerc, "rt"));

here: https://github.com/wolfcw/libfaketime/blob/e7ca8378ca6a64e14408849ccdc1e33be4109007/src/libfaketime.c#L3052

(The idea being to re-introduce the memory allocation and checking whether that makes the race go away; if so, I think that confirms my theory)

abrudin commented 2 years ago

Hi,

thanks for the swift reply. You are probably right. As we our mounting the file in several docker containers it is slightly harder to do an atomic replace, because I cannot easily change the inode. However, after moving the file to another directory I was able to mount the directory and so atomically replace it when changing the time.

Unfortunately it seems that Java detects time changes much more aggressively (i.e. many times per time change) when we replace the file, probably due to the inode change, and so the behaviour of the internal scheduler becomes much flakier (don't know how this relates to DONT_FAKE_MONOTONIC).

I will take some more time later to investigate this a little deeper, but in the meantime we will have to pin libfaketime at an older commit, in order for our tests to continue to work. I guess this is a typical case of the old fact that people depend as much on the misbehaviours of systems as on the intended behaviours.

Thanks anyway for your help - we could probably close this issue, but I hope the information could still be useful if more people starts reporting issues.

psychon commented 2 years ago

So, just do make sure I understand correctly: My proposal works, but other stuff now "goes south"? (I have no good ideas on why exactly this happens / what could be done about it; sorry)

Random thought: libfaketime could have a "retry mode": If the faketime file was read as empty, just try again (either immediately or after sleeping for 0.000001 ms, or something like that). I could try implementing this, but I can't say when I'll have time for that.

abrudin commented 2 years ago

Yes, exactly - changing the file atomically prevents the initial parsing error from happening, but also forces a change of inode which seems to trigger a different behavior from the "other stuff" (the way the Java scheduler reacts to the new time).

Our quite big tests seems to reproduce the problem quite consistently (always working with the old behaviour, never with the new, but failing in slightly different ways every time) although to me it seems that this race condition should have had a reasonable chance to happen with the previous code as well. However, I know too little about how time works on Linux to deeply understand what is going on.

I was also thinking about the possibility of "tolerating" the file to not be parsable for a short while, in order to make it possible to change it "in-place", i.e. non-atomically.

EDIT: It seems as if there is something going on with the directory synk/mount in docker(/-compose) - because when I restart a container it picks up an "old" time, even though the content of the mounted file has been replaced. So maybe the old inode is still referred to in some mysterious way, causing this behaviour. When the time is updated again the container picks up the change. Very strange and subtle...

wolfcw commented 2 years ago

Random thought: libfaketime could have a "retry mode": If the faketime file was read as empty, just try again (either immediately or after sleeping for 0.000001 ms, or something like that). I could try implementing this, but I can't say when I'll have time for that.

Sounds like a plan. :-)

Maybe it's also worth to check whether we already have a valid faketime setting that could be used (as in cache) instead of failing. Otherwise, hopefully any system call that does not allocate memory but triggers a context switch would be sufficient. If not, it's hard to tell how long a sleeping time should be; eventually increasing based on the number of the current try, with an upper limit after which it fails.

psychon commented 2 years ago

@abrudin Here is a suggestion for a patch you could try. No testing done beyond "It compiles".

diff --git a/src/libfaketime.c b/src/libfaketime.c
index ac73978..408f5d2 100644
--- a/src/libfaketime.c
+++ b/src/libfaketime.c
@@ -3089,18 +3089,25 @@ int read_config_file()
   static char custom_filename[BUFSIZ];
   static char filename[BUFSIZ];
   int faketimerc;
+  bool first_try = true;
   /* check whether there's a .faketimerc in the user's home directory, or
    * a system-wide /etc/faketimerc present.
    * The /etc/faketimerc handling has been contributed by David Burley,
    * Jacob Moorman, and Wayne Davison of SourceForge, Inc. in version 0.6 */
   (void) snprintf(custom_filename, BUFSIZ, "%s", getenv("FAKETIME_TIMESTAMP_FILE"));
   (void) snprintf(filename, BUFSIZ, "%s/.faketimerc", getenv("HOME"));
+retry:
   if ((faketimerc = open(custom_filename, O_RDONLY)) != -1 ||
       (faketimerc = open(filename, O_RDONLY)) != -1 ||
       (faketimerc = open("/etc/faketimerc", O_RDONLY)) != -1)
   {
     ssize_t length = read(faketimerc, user_faked_time, sizeof(user_faked_time) - 1);
     close(faketimerc);
+    if (length == 0 && first_try) {
+      fprintf(stderr,"Read an empty config file, retrying once (this print exists to do something 'slow')\n");
+      first_try = false;
+      goto retry;
+    }
     if (length < 0) {
       length = 0;
     }