ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
33.33k stars 2.43k forks source link

std.ChildProcess after spawn SIGSTOP signal program will never return thus hang the host program #18548

Open liyu1981 opened 7 months ago

liyu1981 commented 7 months ago

Zig Version

0.12.0-dev.1828+225fe6ddb

Steps to Reproduce and Observed Behavior

let us consider a simple c program

// exit_stop.c
#include <signal.h>
int main() {
    raise(SIGTSTP);
}

it will raise SIGTSTP to itself cause itself be stopped after execution. (In real life this could be a long run external program be kill -STOP by other user). Compile it with zig cc exit_stop.c -o exit_stop

if I use following zig code to run it (within same dir of exit_stop)

/// run_child_test.zig
const std = @import("std");

test "stop test" {
    const rr = try std.ChildProcess.run(.{
        .allocator = @import("std").testing.allocator,
        .argv = &[_][]const u8{"./exit_stop"},
    });
    _ = rr;
}

now run it zig test run_child_test.zig, will see run_child_test.zig run forever, and will not return a rr with rr.term.Stopped be true.

The reason for this is, as I investigated a bit, in std/child_process.zig, around line 406, I see

const res: os.WaitPidResult = res: {
            if (self.request_resource_usage_statistics) {
                switch (builtin.os.tag) {
                    .linux, .macos, .ios => {
                        var ru: std.os.rusage = undefined;
                        const res = os.wait4(self.id, 0, &ru);   // mark 1
                        self.resource_usage_statistics.rusage = ru;
                        break :res res;
                    },
                    else => {},
                }
            }

            break :res os.waitpid(self.id, 0); // mark 2
        };

in either mark 1 or mark 2, we call os.wait4 with option 0, and if we read https://pubs.opengroup.org/onlinepubs/9699919799/functions/waitpid.html, we can find that only with option

WUNTRACED
The status of any child processes specified by pid that are stopped, and whose status has not yet been reported since they stopped, shall also be reported to the requesting process.

will enable the return from a stopped child process. So if we use option 0, will not return.

The second founding is that in later line 484 of child_process.zig

fn statusToTerm(status: u32) Term {
        return if (os.W.IFEXITED(status))
            Term{ .Exited = os.W.EXITSTATUS(status) }
        else if (os.W.IFSIGNALED(status))
            Term{ .Signal = os.W.TERMSIG(status) }
        else if (os.W.IFSTOPPED(status))
            Term{ .Stopped = os.W.STOPSIG(status) }. // mark 1
        else
            Term{ .Unknown = status };  // mark 2
    }

both mark1 and mark2 should be unreachable or @panic. For mark1, we know before we fix the above issue, it can never be true. For mark2, if we read https://pubs.opengroup.org/onlinepubs/9699919799/functions/exit.html, we will notice that

The value of status may be 0, EXIT_SUCCESS, EXIT_FAILURE, [[CX](javascript:open_code('CX'))] [Option Start]  or any other value, though only the least significant 8 bits (that is, status & 0377) shall be available from [wait()](https://pubs.opengroup.org/onlinepubs/9699919799/functions/wait.html) and [waitpid()](https://pubs.opengroup.org/onlinepubs/9699919799/functions/waitpid.html); the full value shall be available from [waitid()](https://pubs.opengroup.org/onlinepubs/9699919799/functions/waitid.html) and in the siginfo_t passed to a signal handler for SIGCHLD. [Option End]

so status will never be bigger than int 255, so it will never reach end of our ifs.

Expected Behavior

at least first SIGSTOP part should be fixed, by defaultly use WUNTRACED | WCONTINUED as suggested by POSIX, or allow user to pass in option param.

for second Unknown, strongly suggest to change to @panic, so if really that happens can leave some traces.

matu3ba commented 7 months ago

The problem analysis is correct.

However, I disagree to bundle multiple return values to enforce upper time bounds of return values, since the user would need to handle all the edge cases now and those are hidden from the user.

Instead we should use WNOHANG in https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/waitid.html and read siginfo_t https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/signal.h.html to fix the more structural problem of incoming signaling being racy, for example if 2 or more children terminate at the same time.

Multiple identical signals from the same process could still be lost, which is a more fundamental OS design problem.

This would still not fix the sending part being racy, see pidfd #18508 for that (sadly linux only).