libfaketime Causes Indefinite Hang Ups When Clock Is Sped Up

fixindan commented 1 year ago

When clock is sped up to x10 or more, pthread_cond_init_232() can be stuck trying to acquire the write lock via pthread_rwlock_wrlock(&monotonic_conds_lock) and unable to do so because another thread is repeatedly calling pthread_cond_timedwait_common() and hogging the monotonic_conds_lock. In other words when one thread has its condition variable initialized and is waiting for notification, it hogs the static monotonic_conds_lock and prevents another thread from initializing another condition variable. Of course this doesn't happen every time, but it happens fairly frequently. The same thing can be observed when one thread is attempting to call pthread_cond_destroy_232() to destroy its condition variable, but can't acquire the same static monotonic_conds_lock. I modified the original lines of code in pthread_cond_init_232() and pthread_cond_destroy_232() from:

if (pthread_rwlock_wrlock(&monotonic_conds_lock) != 0) {
  fprintf(stderr,"can't acquire write monotonic_conds_lock\n");
  exit(-1);
}
HASH_ADD_PTR(monotonic_conds, ptr, e);
pthread_rwlock_unlock(&monotonic_conds_lock);

which hangs sporadically to this:

uint16_t trial = 0;
bool locked = false;
do {
  if (pthread_rwlock_trywrlock(&monotonic_conds_lock) != 0) {
    usleep(1000);
    trial++;
  }
  else {
    locked = true;
    break;
  }
} while (trial < 1000);
if (!locked) {
  fprintf(stderr,"can't acquire write monotonic_conds_lock, but continuing...\n");
}
HASH_ADD_PTR(monotonic_conds, ptr, e);
if (locked) {
  pthread_rwlock_unlock(&monotonic_conds_lock);
}

The modified code basically retries to acquiremonotonic_conds_lock up to 1000 times with a slight delay between each attempt. If it still can't wrestle away the monotonic_conds_lock, it will attempt to proceed as usual rather than exit(-1). So far I've ran my code on repeat with this modified libfaketime speeding up clock and have not seen hang ups or code crashes even when I see can't acquire write monotonic_conds_lock, but continuing... on console from time to time; everything seems to just work. But still it's troubling that the lock can't be acquired sometimes and my modified libfaketime is left to modifying the static structure instance that the lock is protecting without having acquired the lock. Is my modification an acceptable thing to do? If not, what is the recommended thing to do instead? Thanks in advance,

wolfcw commented 1 year ago

Thanks for pointing this out. It seems well possible that the specific libfaketime-internal implementation contributes to some sort of a race condition here and does not only amplify an existing one (which then only shows at "higher speeds").

Arguably, libfaketime's implementation should not bail out with an exit(-1) on a call to pthread_cond_init() but rather, e.g., return 0 with EAGAIN set as error value, or with EBUSY in the pthread_cond_destroy() case.

The approach to re-try for like 1,000 times seems OK to me as a compile-time option, but I'm not sure it's a good default. While libfaketime obviously still works deterministically when solving the problem at hand this way, its behaviour gets harder to predict and I'm not sure whether we'd do deterministic and reproducible tests a favour here, and that is one area where libfaketime is used a lot.

Also, modifying the data even if we failed to acquire the lock after 1,000 tries seems like a suboptimal idea. Sure, it means we tried real hard and not modifying the data will cause different problems, but the root cause of the problem is likely somewhere else and that calls for a different solution - eventually returning with an error value as outlined above?

fixindan commented 1 year ago

Thank you for the quick follow up. I do want to mention that my modified libfaketime is modified via applying a git patch after cloning in the source code, and it's used only to speed up a GoogleTest unit test. As such I would like the unit test to run as much as possible and only terminate if it can no longer do the necessary work. This is why I've decided to continue on despite of failing to acquire the static lock after I've ran my unit test on repeat for, if I'm to guess, thousands of tries; during these thousands of tries I've noticed that it failed to acquire the lock a few dozen times, and each time the unit test was able to continue on and pass. My intent is to inquire if there is a way to "continue on" more safely, perhaps another way to help pthread_cond_init_232() and pthread_cond_destroy_232() wrestle the lock away. If there isn't, I think I prefer to stick to my patch as it seems to work fairly well up to this point.

wolfcw commented 1 year ago

Thanks for #422!

wolfcw / libfaketime

libfaketime Causes Indefinite Hang Ups When Clock Is Sped Up #419