andrewrk opened 2 years ago
My (@acarrico) initial deadlock test calls `std.time.sleep`, and in pull request #9850 I mentioned that sleep might not be appropriate for a test suite. I haven't looked at how my gist was adapted into the PR, but @Vexu commented about reducing the sleep time to 1ms which is probably causing the flaky test. I think it is appropriate to disable this test (and close this issue) unless/until someone adapts the `.DEADLK` test to avoid a race condition. Meanwhile, the racy code is useful for developers who are divining actual OS behavior.
The bigger issue is still how to define a comprehensive `FcntlError`. In particular, I've found another case where `std.os.fcntl` will crash if it hits the unreachable `.BADF` arm of the `errno` switch. I've demonstrated this crash on Linux by asking for an exclusive lock on an `O.RDONLY` fd. I didn't write a separate test case yet, but basically:
```zig
var ro_file = try tmpDir.dir.openFile("file_lock.tmp", std.fs.File.OpenFlags{ .read = true });
var struct_flock = std.os.Flock{
    .type = std.os.F.WRLCK,
    .whence = std.os.SEEK.SET,
    .start = 0,
    .len = 1,
    .pid = undefined,
};
if (std.os.fcntl(ro_file.handle, std.os.F.SETLK, @ptrToInt(&struct_flock))) |_| {} else |_| {}
```
I should work up a standalone test and open a separate issue for this one, but it is another data point that the `FcntlError` error set has problems.
> I think it is appropriate to disable this test (and close this issue) unless/until someone adapts the `.DEADLK` test to avoid a race condition.
Actually it would be simple: use a third fcntl lock to signal when the child is ready for the deadlock test. The parent can detect this by polling. This would avoid the race, and avoid bringing a separate semaphore API into the test (this assumes it is acceptable to be forking a process in the test suite in the first place).
Observed to fail on FreeBSD too. Unconditionally disabled in 57539a26b4b1a118c9947116f2873ea3c0ced3da.
Is anyone working on this? If not, I'll take it.
@InfRandomness what do you propose? Fixing the test case?
Yeah
Great. Let me know if you have a question.
@acarrico

> I've demonstrated this crash on Linux by asking for an exclusive lock

I would assume that the file descriptor's open mode doesn't match the type of lock requested, as noted in the ERRORS section of the manual.
> reducing the sleep time to 1ms which is probably causing the flaky test

I think using `os.poll` on 2 timers with an infinite timeout sounds like a good solution here. Corrected usage is given in my PR.
Comment improvements aside, the main issue is that the parent process must ensure that the child process is currently inside the blocking lock call, waiting to get the lock. It would be nice if Linux had an API to detect this kind of thing; however, I suspect that this is not possible.
```diff
diff --git a/lib/std/os.zig b/lib/std/os.zig
index 63be69a265..bc703a0b0f 100644
--- a/lib/std/os.zig
+++ b/lib/std/os.zig
@@ -4869,7 +4869,8 @@ pub fn fcntl(fd: fd_t, cmd: i32, arg: usize) FcntlError!usize {
         .PERM => return error.PermissionDenied,
         .MFILE => return error.ProcessFdQuotaExceeded,
         .NOTDIR => unreachable, // invalid parameter
-        .DEADLK => return error.DeadLock,
+        .FAULT => unreachable, // lock outside accessible address space
+        .DEADLK => return error.DeadLock, // There can be false negative and false positives.
         .NOLCK => return error.LockedRegionLimitExceeded,
         else => |err| return unexpectedErrno(err),
     }
diff --git a/lib/std/os/test.zig b/lib/std/os/test.zig
index 479eef2dea..8d9bb5e3ae 100644
--- a/lib/std/os/test.zig
+++ b/lib/std/os/test.zig
@@ -880,11 +880,6 @@ test "POSIX file locking with fcntl" {
         return error.SkipZigTest;
     }

-    if (true) {
-        // https://github.com/ziglang/zig/issues/11074
-        return error.SkipZigTest;
-    }
-
     var tmp = std.testing.tmpDir(.{});
     defer tmp.cleanup();
@@ -901,38 +896,42 @@ test "POSIX file locking with fcntl" {
     struct_flock.type = std.os.F.RDLCK;
     _ = try std.os.fcntl(fd, std.os.F.SETLK, @ptrToInt(&struct_flock));

-    // Check the locks in a child process:
+    // Deadlock (exclusive lock xl, shared lock sl):
+    // parent and child have both at beginning: xl0, sl1
     const pid = try std.os.fork();
     if (pid == 0) {
-        // child expects be denied the exclusive lock:
+        // child expects be denied xl0:
         struct_flock.start = 0;
         struct_flock.type = std.os.F.WRLCK;
         try expectError(error.Locked, std.os.fcntl(fd, std.os.F.SETLK, @ptrToInt(&struct_flock)));
-        // child expects to get the shared lock:
+        // child expects to get sl1:
         struct_flock.start = 1;
         struct_flock.type = std.os.F.RDLCK;
         _ = try std.os.fcntl(fd, std.os.F.SETLK, @ptrToInt(&struct_flock));
-        // child waits for the exclusive lock in order to test deadlock:
+        // child waits for the xl0.
         struct_flock.start = 0;
         struct_flock.type = std.os.F.WRLCK;
         _ = try std.os.fcntl(fd, std.os.F.SETLKW, @ptrToInt(&struct_flock));
         // child exits without continuing:
         std.os.exit(0);
     } else {
-        // parent waits for child to get shared lock:
+        // parent waits for child to wait blocking for xl0 (fcntl with F.SETLKW):
+        // TODO: poll for parent waiting or
         std.time.sleep(1 * std.time.ns_per_ms);
-        // parent expects deadlock when attempting to upgrade the shared lock to exclusive:
+
+        // parent expects deadlock on upgrade attempt sl1->xl1
         struct_flock.start = 1;
         struct_flock.type = std.os.F.WRLCK;
         try expectError(error.DeadLock, std.os.fcntl(fd, std.os.F.SETLKW, @ptrToInt(&struct_flock)));
-        // parent releases exclusive lock:
+        // parent releases xl0
         struct_flock.start = 0;
        struct_flock.type = std.os.F.UNLCK;
         _ = try std.os.fcntl(fd, std.os.F.SETLK, @ptrToInt(&struct_flock));
-        // parent releases shared lock:
+        // parent releases sl1
         struct_flock.start = 1;
         struct_flock.type = std.os.F.UNLCK;
         _ = try std.os.fcntl(fd, std.os.F.SETLK, @ptrToInt(&struct_flock));
+
         // parent waits for child:
         const result = std.os.waitpid(pid, 0);
         try expect(result.status == 0 * 256);
```
The only reliable solution I know of to detect child processes being stuck/blocked, or to get what calls they are doing, is to use `/proc/[pid]/syscall` (since Linux 2.6.27), as suggested here: https://tanelpoder.com/2013/02/21/peeking-into-linux-kernel-land-using-proc-filesystem-for-quickndirty-troubleshooting/. I have no idea how to make the deadlock test reliable on non-Linux.
@matu3ba, do you have any insight into that EFAULT return value? What does the manual mean by "lock is outside your accessible address space"? These are file locks, so what address space is this?
Your new comments are concise, but I think the old, verbose comments are easier to read, especially when you are new to (or returning to) the issue, as I am right now :).
As for the main problem, waiting for the child: I don't think the proc file system is necessary. There are plenty of standard IPC mechanisms for signalling, but if we want to avoid adding a new IPC mechanism to the test, then why don't we just have the child set a third lock to indicate that it is ready for the parent to proceed with the deadlock test? The parent would have to do a polling loop on that lock with a very short sleep time.
When I submitted the initial patch, I was a little surprised that this test was included at all, since it forks, but if that is ok for the test suite, then a polling loop that will certainly succeed quickly should be ok too.
In real life, I'd use an eventfd, or write to a pipe, or some other event signalling mechanism, but as I said above it would be nice to keep the test free of other IPC mechanisms if possible.
Do you follow?
Thank you for giving attention to the issue, it can be resolved, and that will be one less open issue :).
@matu3ba Another way to avoid the race might be to set up the deadlock before forking, then do the test in the child and report back to the parent. I can't remember the semantics of inheriting POSIX locks, so this would need to be researched.
This brings up another question about forking in the test suite: what exactly happens if you fail a test in the child? I feel like the child should report the results to the parent, and the parent should fail the test after waiting for the child process to terminate.
@matu3ba check out my changes. The child signals the parent, fixing the race condition. Also moves the tests to the parent, so failing tests should happen where the test suite can detect them.
@matu3ba, I only touched the flaky test. I didn't add:

`.FAULT => unreachable, // lock outside accessible address space`

to the library. I did notice that EFAULT is only listed in the Linux man page, fcntl(2), not the POSIX one, fcntl(3p). Still doesn't make sense to me, I guess the kernel source would be the answer.
Anyway, this should probably be a different change set, since the flaky test fix stands alone.
> Still doesn't make sense to me, I guess the kernel source would be the answer.
I think that's viable to add, since F_SETOWN uses EPERM to distinguish lock ownership transfer and memory locks on 1. unmapped memory and 2. inaccessible memory (say, for overlayfs). I'll look around a bit.
UPDATE: Unmapped memory would kill the program, so that's obviously wrong.
@acarrico You can see the usage in fs/fcntl.c:

```c
ret = copy_from_user(&owner, owner_p, sizeof(owner));
if (ret)
    return -EFAULT;

ret = copy_to_user(owner_p, &owner, sizeof(owner));
if (ret)
    ret = -EFAULT;
```
Oh, the `struct flock` could be bad. True enough.
I think this was meant to be resolved with https://github.com/acarrico/zig/commit/cda125fd6476f350b81340fc3357e3fb2b6a10c0? The PR description was probably badly formatted and GitHub left this one open.
That commit was never merged.
Ah, my bad, I saw the message "andrewrk pushed a commit to ..." but I missed that it was in a fork.
As far as I remember, it tests the (accepted) changes to fcntl, but for some reason it failed on an ARM platform, and nobody with access to the platform tried to diagnose it.
Trying to reboot my memory of this issue... let me be a little more clear:
The "flaky test" should not have been committed with the original patch; it was racy code used to develop the patch and not appropriate for the test runner. It should remain disabled, or better yet, be replaced with the proper test.
Unfortunately, that pull request is closed: it failed on the ARM test runner. The comments indicate that I could not reproduce the failure on a Raspberry Pi, and @matu3ba could not reproduce the failure on qemu.
That closed pull request has the test you want, not this disabled test, which was known to be racy from the start. The fact that it uncovered a problem with fcntl on the ARM test runner means that somebody should probably get the test runner to try it again.
Thanks.
Zig Version
0.10.0-dev.1167+e74382b60
Steps to Reproduce
Confirmed to happen on x86_64-linux-gnu and aarch64-linux-gnu.
Expected Behavior
All tests passed.
Actual Behavior