wolfcw / libfaketime

libfaketime modifies the system time for a single application
https://github.com/wolfcw/libfaketime
GNU General Public License v2.0
2.77k stars 331 forks source link

Unexpected return values from pthread_cond_init_232, pthread_cond_destroy_232, pthread_cond_timedwait #464

Open Rob--W opened 8 months ago

Rob--W commented 8 months ago

Even after patching #130 (with #463), I am unable to launch Firefox 123.0 because it crashes. After bisecting libfaketime, I found that the regression is caused by #422. This patch replaced a blocking call with one that does not block in an attempt to address #419. I think that this behavior should be opt-in (at compile time or at runtime), to minimize the risk of triggering unexpected failures at the next libfaketime release.

The Firefox crashes happen because it does not expect these functions to fail:

These functions should ideally not fail, or at the very least libfaketime should try harder to avoid failing too soon. Ordinary programs usually expect these to not fail, and that is also a sentiment echoed through the pthreads documentation:

pthread_cond_init, pthread_cond_signal, pthread_cond_broadcast, and pthread_cond_wait never return an error code. -- https://man.archlinux.org/man/core/man-pages/pthread_cond_init.3.en

oddly, a different source specifies an error as an acceptable result:

The pthread_cond_init() function shall fail if: EAGAIN The system lacked the necessary resources (other than memory) to initialize another condition variable. -- https://www.man7.org/linux/man-pages/man3/pthread_cond_init.3p.html

pthread_cond_timedwait_common in libfaketime can return EAGAIN, but that error is not listed in the pthreads manual: https://www.man7.org/linux/man-pages/man3/pthread_cond_timedwait.3p.html

wolfcw commented 8 months ago

I agree that ordinary programs do not expect any of these calls to fail - libfaketime itself bailed out the hard way before #422, so that's arguably at least some progress. However, EAGAIN apparently is not a good choice for timedwait_common() as it's not documented; on the other hand, none of the documented errors fit here either. Basically, "try it again" would be a quite suitable error message if it'd exist. :-)

Do you have any specific suggestions on making libfaketime "try harder" in these cases? #419 discussed re-trying in a loop like 1,000 times, but as usual with brute force approaches to non-determinism issues, this does not sound very convincing.

Rob--W commented 8 months ago

I would recommend blocking indefinitely like before, or with a long delay (optionally followed by printing an error message), followed by a long or indefinite wait. I don't expect a very long delay in practice, unless there is something such as sync IPVC or I/O blocking the process for some reason.

And then to support the use case that #422 was meant to address, add a compile-time flag, or a runtime environment variable to allow configuration of the behavior.

You expressed concerns about retrying for an arbitrary amount of time, but that exact concern would apply equally to programs that call these library functions. Because programs don't really have a good way to deal with this, I recommend #422 to be opt-in instead of opt-out.