Closed LtdSauce closed 2 years ago
This apparently has parallels to our long-standing jemalloc issue (e.g., #130), where jemalloc depends on time-related functions calls while libfaketime depends on memory allocation first during its own initialisation. The information about clang asan's malloc on AddressSanitizerIncompatiblity does not make me very confident that there is anything we can do on libfaketime's end, but I would love to get proven wrong by someone with deeper understanding of asan's internal workings.
I'm leaving this open for now in case someone can bring some new ideas into this issue given that we were not able to solve the similar jemalloc incompatibility (#130) for quite some time. However, there are no solutions paths at the moment we can pursue.
Hi! I came up with a fixhack to make this "work". With this, the example from the OP prints for me:
$ clang -fsanitize=address foo.c && LD_PRELOAD=/path/to/libfaketime/src/libfaketimeMT.so.1 ./a.out
libfaketime: Cannot recover from unexpected recursive calls to clock_gettime().
libfaketime: Please check whether any other libraries are in use that clash with libfaketime.
libfaketime: Returning -1 on clock_gettime() to break recursion now... if that does not work, please check other libraries' error handling.
libfaketime: Cannot recover from unexpected recursive calls to clock_gettime().
libfaketime: Please check whether any other libraries are in use that clash with libfaketime.
libfaketime: Returning -1 on clock_gettime() to break recursion now... if that does not work, please check other libraries' error handling.
The hack is based on the already existing hack to detect recursive calls to clock_gettime
. In my case, there was a __constructor__
call to ftpl_init()
and inside that one happened a call to clock_gettime
and thus another call to ftpl_init
. Thus, there was no recursive call to clock_gettime
and the existing detecting did not, well, detect this. Hence, I added a new variable to track whether ftpl_init()
is already running.
(I also had to fix the usage of the recursion_depth
counter since a decrement is missing)
diff --git a/src/libfaketime.c b/src/libfaketime.c
index f92ecf8..9223934 100644
--- a/src/libfaketime.c
+++ b/src/libfaketime.c
@@ -294,6 +294,7 @@ static bool check_missing_real(const char *name, bool missing)
check_missing_real(#name, (NULL == real_##name))
static int initialized = 0;
+static int initializing = 0;
/* prototypes */
static int fake_gettimeofday(struct timeval *tv);
@@ -2287,7 +2288,7 @@ int clock_gettime(clockid_t clk_id, struct timespec *tp)
fprintf(stderr, "libfaketime: Unexpected recursive calls to clock_gettime() without proper initialization. Trying alternative.\n");
DONT_FAKE_TIME(ftpl_init()) ;
}
- else if (recursion_depth == 3)
+ else if (recursion_depth == 3 || initializing)
{
fprintf(stderr, "libfaketime: Cannot recover from unexpected recursive calls to clock_gettime().\n");
fprintf(stderr, "libfaketime: Please check whether any other libraries are in use that clash with libfaketime.\n");
@@ -2297,6 +2298,7 @@ int clock_gettime(clockid_t clk_id, struct timespec *tp)
tp->tv_sec = 0;
tp->tv_nsec = 0;
}
+ recursion_depth--;
return -1;
}
else {
@@ -2557,6 +2559,7 @@ static void ftpl_init(void)
/* moved up here from below the dlsym calls #130 */
dont_fake = true; // Do not fake times during initialization
dont_fake_final = false;
+ initializing = true;
#ifdef __APPLE__
const char *progname = getprogname();
@@ -2948,6 +2951,7 @@ static void ftpl_init(void)
}
dont_fake = dont_fake_final;
+ initializing = false;
}
Doing this in the earlier branch (so that no time is faked during init) did not make the hang go away.
I'm not saying that this is pretty... but it "works" for me!
@psychon and i found out that his above "hack" does not work with c++/clang++. After a little bit of fiddling he came up with another try. The following patch seems to work with clang and clang++ on our use-cases:
diff --git a/src/libfaketime.c b/src/libfaketime.c
index f92ecf8..2e8e94d 100644
--- a/src/libfaketime.c
+++ b/src/libfaketime.c
@@ -294,6 +294,7 @@ static bool check_missing_real(const char *name, bool missing)
check_missing_real(#name, (NULL == real_##name))
static int initialized = 0;
+static int initializing = 0;
/* prototypes */
static int fake_gettimeofday(struct timeval *tv);
@@ -2282,12 +2283,12 @@ int clock_gettime(clockid_t clk_id, struct timespec *tp)
if (!initialized)
{
recursion_depth++;
- if (recursion_depth == 2)
+ if (recursion_depth == 2 && false)
{
fprintf(stderr, "libfaketime: Unexpected recursive calls to clock_gettime() without proper initialization. Trying alternative.\n");
DONT_FAKE_TIME(ftpl_init()) ;
}
- else if (recursion_depth == 3)
+ else if (recursion_depth >= 2 || initializing)
{
fprintf(stderr, "libfaketime: Cannot recover from unexpected recursive calls to clock_gettime().\n");
fprintf(stderr, "libfaketime: Please check whether any other libraries are in use that clash with libfaketime.\n");
@@ -2297,6 +2298,7 @@ int clock_gettime(clockid_t clk_id, struct timespec *tp)
tp->tv_sec = 0;
tp->tv_nsec = 0;
}
+ recursion_depth--;
return -1;
}
else {
@@ -2557,6 +2559,7 @@ static void ftpl_init(void)
/* moved up here from below the dlsym calls #130 */
dont_fake = true; // Do not fake times during initialization
dont_fake_final = false;
+ initializing = true;
#ifdef __APPLE__
const char *progname = getprogname();
@@ -2948,6 +2951,7 @@ static void ftpl_init(void)
}
dont_fake = dont_fake_final;
+ initializing = false;
}
For the record:
armhf
architecture it seemed to work without the patch-shared-libsan
and ASAN_OPTIONS=verify_asan_link_order=0
it somehow seemed to work on both armhf
and x86_64
. But linking against libclang_rt.asan-<arch>.so
is a little bit painfulWe will stick with the patch above for now. @wolfcw if you think this would make a "good-enough" fix i can file a PR from https://github.com/LtdSauce/libfaketime-asan-fixed.
Sounds interesting and glad it works for you so far!
However, I'm still trying to figure out how your patch works. :-) It looks like you effectively cap the recursion depth counter to 2 (based on the new recursion_depth--) and effectively replace it with a flag that is set during initialisation. It also does no longer attempt ftpl_init() wrapped in DONT_FAKE_TIME() as suggested to improve changes for jemalloc compatibility. If that fixes stuff with libasan, I guess we might simplify it a bit (leave out unreachable code) and wrap it in #ifdef for yet another compile-time flag, yes. :-)
(Hopefully the following is related enough!)
We found that ASan and libfaketime have another issue. When freeing memory a deadlock in libfaketimes Code that reads in the FAKETIME_TIMESTAMP_FILE can happen. On our case it happen quite deterministic when ASans quarantine zone is full and ASan actually starts to deallocate memory.
See the following stacktrace:
futex_wait 0x00007fac60c7d160
__GI___lll_lock_wait 0x00007fac60c7d160
lll_mutex_lock_optimized 0x00007fac60c83ee2
___pthread_mutex_lock 0x00007fac60c83ee2
fake_clock_gettime 0x00007fac65f5e3d8
clock_gettime 0x00007fac65f5c5eb
__sanitizer::MonotonicNanoTime() 0x000000000057d41b
__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >::MaybeReleaseToOS(__sanitizer::MemoryMapper<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >*, unsigned long, bool) 0x00000000004e6f08
__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >::ReturnToAllocator(__sanitizer::MemoryMapper<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >*, __sanitizer::AllocatorStats*, unsigned long, unsigned int const*, unsigned long) 0x00000000004e6cb4
__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::DrainHalfMax(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::PerClass*, __sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >*, unsigned long) 0x00000000004e6924
__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::Deallocate(__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >*, unsigned long, void*) 0x00000000004e677e
__sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::DoRecycle(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback) 0x00000000004e61dd
__sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Recycle(unsigned long, __asan::QuarantineCallback) 0x00000000004e5d64
__asan::Allocator::QuarantineChunk(__asan::AsanChunk*, void*, __sanitizer::BufferedStackTrace*) 0x00000000004e7e02
free 0x000000000055f0c6
_IO_deallocate_file 0x00007fac60c6adc7
_IO_new_fclose 0x00007fac60c6adc7
fclose 0x00000000005347b5
read_config_file 0x00007fac65f5e1be
fake_clock_gettime 0x00007fac65f5ea84
clock_gettime 0x00007fac65f5c5eb
clock_gettime 0x0000000000507eab
<unknown> 0x00007fac6141f705
<unknown> 0x00007fac6141ff96
mosquitto_loop 0x00007fac61416597
mosquitto_loop_forever 0x00007fac614167f3
<unknown> 0x00007fac6141f533
start_thread 0x00007fac60c80947
clone 0x00007fac60d10a44
Because ASan calls clock_gettime
when deallocating memory it calls faketime again which wants to read in the file again and deadlocks.
EDIT: Because ASan calls the MonotonicNanoTime
i checked if building with FORCE_MONOTONIC_FIx
resolves this, but it doesn't.
@psychon suggests that this could be fixed by using open
instead of fopen
.
If needed i could try to make a reproducible example. Probably calling malloc/free
in a loop should be sufficient to eventually trigger this deadlock.
I'll check if using -shared-libsan
does have this issue too.
EDIT: At first glance it seems that this issue does not happen with -shared-libsan
... why, we do not know
However, I'm still trying to figure out how your patch works.
Imagine the simplified model of asan's malloc
hook (I actually never looked at the real code):
void *asan_malloc(size_t size) {
pthread_mutex_lock(&some_global_mutex);
clock_gettime(I dont know the arguments and dont care for this example);
void *result = real_malloc(size);
pthread_mutex_unlock(&some_global_mutex);
return result;
}
Important here: some_global_mutex
is non-recursive. Thus, any recursive calls to asan_malloc()
will deadlock.
The other "ingredient" is dlsym()
calling malloc()
.
Let's begin with the simple case:
libfaketime has ftpl_init()
set up with __constructor
, so this function is called after loading the library. ftpl_init()
will now use dlsym()
to find "the real" time functions. This calls malloc
, which we hooked above with asan_malloc()
. This now locks some_global_mutex
and calls clock_gettime()
while the mutex is held.
We are back in libfaketime because it hooked clock_gettime()
with fake_clock_gettime()
(I think I got this function name wrong... whatever). At this point, faketime notices that ftpl_init()
did not yet complete and calls this function again, which calls dlsym()
again, which calls malloc()
which calls pthread_mutex_lock()
again. However, this is a non-recursive mutex and we get a deadlock!
The callchain at this point looks something like this:
pthread_mutex_lock (this hangs because this thread already locked the mutex) was called by
asan_malloc was called by
dlsym was called by
ftpl_init was called by
clock_gettime
asan_malloc
dlsym
ftpl_init
the dynamic loader calling the __constructor__ function of libfaketime.so
In my hack/patch above, I introduced the variable initializing
to "handle" this case. The second call to clock_gettime
will now notice that ftpl_init
is already on the stack and thus will not call it again.
The "new ingredient" here is that the program is linked against libstdc++.so
. This shared object has some constructor that calls malloc()
. I don't know why, but this constructor function is called before libfaketime's. I'll skip the details of the calls and go straight to the resulting call stack (again ordered so that the oldest call is at the end):
pthread_mutex_lock
asan_malloc
dlsym
ftpl_init
clock_gettime
asan_malloc
???
the dynamic loader calling some __constructor__ function of libstdc++.so
Here, there is not get a call to ftpl_init
on the stack, so the new initializing
variable does not help. The other part of my hack/patch breaks the existing recursion_depth
code so that it does not call ftpl_init
for "early calls" (=before ftpl_init()
) of clock_gettime
. That way, we again avoid the recursive call to asan_malloc
.
Edit: This should be fixed with #391
As @LtdSauce wrote:
We found that ASan and libfaketime have another issue
I feel like this is a separate issue, but okay, now I'll have to write down the details here.
We are using libfaketime with the environment variable set that causes it to reload the faketime file all the time (sorry, I forgot the name of this env var; I hope you know what I mean).
Here, we get again into problems with malloc. Well, actually, this time it is free
calling clock_gettime
(but only sometimes?). The backtrace shown above shows that some code calls clock_gettime
(thus, actually fake_clock_gettime
). This reloads the config file. To read the file, this uses fopen
/fgets
/fclose
. These functions maintain an internal buffer of "data already read, but not yet passed to the caller". fclose
has to free this buffer. Somehow, ASAN decides to call clock_gettime
again on this call to free
. This now recursively calls into libfaketime and deadlocks here:
https://github.com/wolfcw/libfaketime/blob/f836ea3eb392378bc5166b3d68283f69e5330542/src/libfaketime.c#L3075
There are two calls to fake_clock_gettime
on the stack. The newest one hangs in pthread_mutex_lock
. That must be in the line above, since that seems to be the only such call.
The earlier call is in read_config_file
-> fclose
. This must be here: https://github.com/wolfcw/libfaketime/blob/f836ea3eb392378bc5166b3d68283f69e5330542/src/libfaketime.c#L3198
The above is inside the critical section which only ends here: https://github.com/wolfcw/libfaketime/blob/f836ea3eb392378bc5166b3d68283f69e5330542/src/libfaketime.c#L3282
Thus, this thread already locked this non-recursive mutex and a second pthread_mutex_lock
is undefined behaviour according to POSIX, which in practice (at least on Linux) means: Deadlock / hang.
Thanks, I'm fine with the initializing
flag, but I guess we could leave out if
s whose conditions will always be false and the recursion depth counter probably could also be simplified to a flag as it effectively can never exceed a value of 2 across function calls and has a minimum value of 1 during operation.
I'm also fine with replacing fopen and related calls with open if that avoids recursive calls due to a buffer we don't need anyway, though it might not be a long-term solution if implementation internals in glibc or libasan change (the force-monontonic-fix you mentioned is a good example for a workaround only needed for some middle versions, but not the older or the current glibc versions).
Hi!
we use AddressSanitizer to test our executables. Recently we wanted to decrease time-of-test and tried libfaketime. We then noticed that programs compiled with clang and with address sanitizer freezes with the following message:
When we don't compile them with address sanitizer everything works as expected. Furthermore, when compiled with gcc and address sanitizer everything works when we load
libasan.so
beforelibfaketime.so
. Gcc links it dynamically by default while clang links it statically. We want to avoid using the non-default linkage in clang and keep it linked statically.I have not found another issue related to address sanitizers.
Environment
OS: Ubuntu Impish 21.10 x86_64 clang: Ubuntu clang version 13.0.0-2 libfaketime: version 0.9.8-9 from apt (and also when build from https://github.com/wolfcw/libfaketime/commit/f26242b655daf672a2e83b6ce669d8dec5f392ff)
When linked against 0.9.6 everything seems to work.
Steps to reproduce
To reproduce this issue we created a little program (or just a simple main):
When compiled with the following:
Running under gdb produces the following stacktrace:
Backtrace for `LD_PRELOAD=libfaketime.so.1 FAKETIME="-15d" gdb --args ./a.out`
``` GNU gdb (Ubuntu 11.1-0ubuntu2) 11.1 Copyright (C) 2021 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or laterBacktrace when compiled with clang++
``` GNU gdb (Ubuntu 11.1-0ubuntu2) 11.1 Copyright (C) 2021 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or laterEDIT: Linking the pthread-enabled .so had no effect. EDIT: linking the sanitizer runtime dynamically like gcc does it had no effect and produced the same backtrace