wolfcw / libfaketime

libfaketime modifies the system time for a single application
https://github.com/wolfcw/libfaketime
GNU General Public License v2.0
2.71k stars 325 forks source link

Fix another hang under ASAN #391

Closed psychon closed 2 years ago

psychon commented 2 years ago

We have a long-running program that we want to run under sanitizers for extra error checking and under faketime to speed up the clock. This program hangs after a while. Backtraces suggest that the hangs occur because of recursion in the memory allocator, which apparently locks a non-recursive mutex.

Specifically, what we see is that due to our use of FAKETIME_NO_CACHE=1, libfaketime wants to reload the config file inside a (random) call to clock_gettime(). libfaketime then uses fopen() / fclose() for reading the config files. These function allocate / free a buffer for reading data and specifically the call to free() that happens inside fclose() ends up calling clock_gettime() again. At this point, libfaketime locks up because it has a time_mutex that is locked and none-recursive.

Sadly, I did not manage to come up with a stand-alone reproducer for this problem. Also, the above description is from memory after half a week of vacation, so it might be (partially) wrong.

More information can be found here:

This commit first adds a test case. This new test uses the already existing libmallocintercept.so to cause calls to clock_gettime() during memory allocation routines. The difference to the already existing version is that additionally FAKETIME_NO_CACHE and FAKETIME_TIMESTAMP_FILE are set. This new test hung with its last output suggesting a recursive call to malloc:

Called malloc() from libmallocintercept...successfully Called free() on from libmallocintercept...successfully Called malloc() from libmallocintercept...Called malloc() from libmallocintercept...

Sadly, gdb was unable to provide a meaningful backtrace for this hang.

Next, I replaced the use of fopen()/fgets()/fgets() with open()/read()/close(). This code no longer reads the config file line-by-line, but instead it reads all of it at once and then "filters out" the result (ignores comment lines, removes end of line markers).

I tried to keep the behaviour of the code the same, but I know at least one difference: Previously, the config file was read line-by-line and lines that began with a comment character were immediately ignored. The new code only reads the config once and then removes comment lines. Since the buffer that is used contains only 256 characters, it is possible that config files that were previously parsed correctly now longer parse. A specific example: if a file begins with 500 '#' characters in its first line and then a timestamp in the second line, the old code was able to parse this file while the new code would only see an empty file.

After this change, the new test no longer hangs. Sadly, I do not actually know its effect on the "actual bug" that I wanted to fix, but since there are no longer any calls to fclose(), there cannot be any hangs inside fclose().

Signed-off-by: Uli Schlachter psychon@znc.in


CC @LtdSauce FYI

wolfcw commented 2 years ago

Thanks! I think the new way how the file is read resembles the previous behaviour close enough, so let's hope this also solves your "actual bug". :-)

abrudin commented 2 years ago

@psychon @wolfcw Just wanted to inform that we have successfully been running libfaketime in our integration tests for years, where we start and restart applications which reads and writes to faketimerc.

After this commit we suddenly started to get errors, where libfaketime crashes when reading from faketimerc with the following message:

libfaketime: In parse_ft_string(), failed to parse FAKETIME timestamp.
Please check specification  with format %Y-%m-%d %T

Using a build from the previous commit (2bfbe19f719c34f07656cd1513b79209f7d22b61) works fine. I don't know exactly what is going on, if there is some race condition or tearing happening when it tries to read the content of the file, but we don't write anything strange and have not changed anything.