Open Rob--W opened 8 months ago
I agree that ordinary programs do not expect any of these calls to fail - libfaketime itself bailed out the hard way before #422, so that's arguably at least some progress. However, EAGAIN apparently is not a good choice for timedwait_common() as it's not documented; on the other hand, none of the documented errors fit here either. Basically, "try it again" would be a quite suitable error message if it'd exist. :-)
Do you have any specific suggestions on making libfaketime "try harder" in these cases? #419 discussed re-trying in a loop like 1,000 times, but as usual with brute force approaches to non-determinism issues, this does not sound very convincing.
I would recommend blocking indefinitely like before, or with a long delay (optionally followed by printing an error message), followed by a long or indefinite wait. I don't expect a very long delay in practice, unless there is something such as sync IPVC or I/O blocking the process for some reason.
And then to support the use case that #422 was meant to address, add a compile-time flag, or a runtime environment variable to allow configuration of the behavior.
You expressed concerns about retrying for an arbitrary amount of time, but that exact concern would apply equally to programs that call these library functions. Because programs don't really have a good way to deal with this, I recommend #422 to be opt-in instead of opt-out.
Even after patching #130 (with #463), I am unable to launch Firefox 123.0 because it crashes. After bisecting libfaketime, I found that the regression is caused by #422. This patch replaced a blocking call with one that does not block in an attempt to address #419. I think that this behavior should be opt-in (at compile time or at runtime), to minimize the risk of triggering unexpected failures at the next
libfaketime
release.The Firefox crashes happen because it does not expect these functions to fail:
pthread_cond_init
https://searchfox.org/mozilla-central/rev/a3d5a112ddb2d665b0c7ac2919b6f4fc6c97366c/mozglue/misc/ConditionVariable_posix.cpp#76-77pthread_cond_destroy
https://searchfox.org/mozilla-central/rev/a3d5a112ddb2d665b0c7ac2919b6f4fc6c97366c/mozglue/misc/ConditionVariable_posix.cpp#88-89pthread_cond_timedwait
https://searchfox.org/mozilla-central/rev/a3d5a112ddb2d665b0c7ac2919b6f4fc6c97366c/mozglue/misc/ConditionVariable_posix.cpp#140,147,150These functions should ideally not fail, or at the very least
libfaketime
should try harder to avoid failing too soon. Ordinary programs usually expect these to not fail, and that is also a sentiment echoed through the pthreads documentation:oddly, a different source specifies an error as an acceptable result:
pthread_cond_timedwait_common
in libfaketime can returnEAGAIN
, but that error is not listed in the pthreads manual: https://www.man7.org/linux/man-pages/man3/pthread_cond_timedwait.3p.html