std.os.ChildProcess should use clone instead of fork on linux

andrewrk commented 6 years ago

vfork pauses the parent process until the child process either calls execve or exit. This prevents Copy On Write pages, which makes spawning child processes avoid double the parent resident memory requirements when overcommit is off.

andrewrk commented 6 years ago

Interesting notes from the man page:

NOTES
       Some  consider the semantics of vfork() to be an architectural blemish,
       and the 4.2BSD man page stated: "This system call  will  be  eliminated
       when  proper  system  sharing mechanisms are implemented.  Users should
       not depend on the memory sharing semantics of vfork() as  it  will,  in
       that case, be made synonymous to fork(2)."  However, even though modern
       memory management hardware has  decreased  the  performance  difference
       between  fork(2)  and  vfork(), there are various reasons why Linux and
       other systems have retained vfork():

       *  Some performance-critical applications require the small performance
          advantage conferred by vfork().

       *  vfork()  can be implemented on systems that lack a memory-management
          unit (MMU), but  fork(2)  can't  be  implemented  on  such  systems.
          (POSIX.1-2008 removed vfork() from the standard; the POSIX rationale
          for the posix_spawn(3) function notes that that function, which pro‐
          vides functionality equivalent to fork(2)+exec(3), is designed to be
          implementable on systems that lack an MMU.)

       *  On systems where memory is constrained, vfork() avoids the  need  to
          temporarily commit memory (see the description of /proc/sys/vm/over‐
          commit_memory in proc(5)) in order to execute a new program.   (This
          can  be especially beneficial where a large parent process wishes to
          execute a small helper program in a child  process.)   By  contrast,
          using  fork(2) in this scenario requires either committing an amount
          of memory equal to the size of the parent process (if  strict  over‐
          committing  is in force) or overcommitting memory with the risk that
          a process is terminated by the out-of-memory (OOM) killer.

bnoordhuis commented 6 years ago

I don't think that can work. With vfork() you're only allowed to call execve() or _exit(). The setup that spawnPosix() does, like changing uid/gid and the cwd, is incompatible with vfork() - it would be visible to other threads in the parent process.

andrewrk commented 6 years ago

My understanding is that changes to memory would affect the parent process, but the actual pid is a new child pid. So things like current directory, uid/gid, would be independent from the parent process. I'm doing some experiments now. Currently I'm getting a confusing segfault with the following code which I am troubleshooting:

const std = @import("std");
const os = std.os;
const linux = os.linux;

pub fn main() void {
    var x: i32 = 1234;
    const rc = linux.vfork();
    if (rc == 0) {
        // we are the child
        x += 1;
        linux.exit(0);
    }
    std.debug.warn("x={}\n", x);
}

$ ./test
Segmentation fault

gdb has no stack trace, and the program prints 1234 with valgrind.

andrewrk commented 6 years ago

Figured it out. vfork is a function, and when the child does a return it clobber's the parent's stack. I fixed it like this:

/// This must be inline, and inline call the syscall function, because if the
/// child does a return it will clobber the parent's stack.
pub inline fn vfork() usize {
    return @inlineCall(syscall0, SYS_vfork);
}

However it's still problematic, because the example code now does this in debug builds:

x=1235

But does this in release builds:

x=1234

This is because the optimizer sees x+=1; unreachable; and then concludes that if you don't hit the if statement, the x += 1 never could have happened. The vfork breaks the compiler's understanding of control flow.

There's a better way to do this, which is to use clone followed by execve. You can see that this is how musl implements posix_spawn.

bnoordhuis commented 6 years ago

Sorry, I thought posix_setregid() and posix_setreuid() did the "signal other threads" dance that glibc and musl do. I see now that they only update the calling thread.

changeCurDir() and posixExecve() on the other hand allocate memory. That will be visible to other threads because the fork shares the address space with the parent.

Apropos signals, signal handlers are one reason I never used vfork() in libuv. Signals delivered to the fork will be observable in the parent when the signal handler writes to memory. And signal handlers exist for their side effects so that's probably an all too common pitfall.

andrewrk commented 6 years ago

Agreed, this issue is no longer to use vfork. Now it's to use clone() instead of fork() on Linux. This is how musl implements posix_spawn. That API doesn't have a way to change directory or setgid/setuid, but that's not a linux limitation; that's an API limitation.

With clone(), the memory allocations will be no problem. The parent can preallocate the memory before cloning, and then free once the child does exit or execve.

daurnimator commented 6 years ago

Why wouldn't you just use posix_spawn directly?

matu3ba commented 2 years ago

Here are perf comparisons between different spawn implementations with tradeoffs and a description with discussion etc.

tldr;

there is a cool library with more advantages than posix_spawn called aspawn [GPLv2 licensed]
paper "A fork() in the road" by Baumann et al. describes things more accurately
privilege separation needs special attention and checks
high-level: (posix_)spawn/vfork, low-level: clone

(eventual) future without a bazillion flags to clone: "new process starts as an empty address space, and an advanced user may manipulate it in a piecemeal fashion, populating its address-space and kernel context prior to execution, without needing to clone the parent nor run code in the context of the child"

benches aspawn vs alternatives

Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_aspawn_no_reuse              18009 ns        17942 ns        38943
BM_aspawn/threads:1             14500 ns        14446 ns        48339
BM_vfork_with_shared_stack      46545 ns        16554 ns        44027
BM_fork                         54583 ns        54527 ns        12810
BM_posix_spawn                 125061 ns        29091 ns        24483

daurnimator commented 2 years ago

BM_posix_spawn 125061 ns 29091 ns 24483

This is surprising

Which posix_spawn implementation is it? Different implementations are likely to be significantly different. Make sure to compare the different libcs, and then even different versions of them.
There may just be an argument here to improve the posix_spawn implementation rather than use something else.

matu3ba commented 2 years ago

Why wouldn't you just use posix_spawn directly?

posix_spawn is too limited in user modification of the startup code. Take as example a lib that wants to create multiple pipes without the user having to worry about, but still providing the user with the open pipe handle for reading/writing.

Standard option: child uses dup() and close() the unused file handles such that writer.writerAll works as expected and doesn't need a "Read until x" to prevent infinite wait on data by the consumer. However, at least to my understanding of the code, posix_spawn doesnt offer functionality to get the open file descriptor (an integer). Non-standard option: User must dup() and close().

And using explicit numbers for file desriptor does only work by chance xor forces user code to use explicit file descriptors.

daurnimator commented 2 years ago

posix_spawn is too limited in user modification of the startup code. Take as example a lib that wants to create multiple pipes without the user having to worry about, but still providing the user with the open pipe handle for reading/writing.

Standard option: child uses dup() and close() the unused file handles such that writer.writerAll works as expected and doesn't need a "Read until x" to prevent infinite wait on data by the consumer. However, at least to my understanding of the code, posix_spawn doesnt offer functionality to get the open file descriptor (an integer). Non-standard option: User must dup() and close().

The standard option is to just do the dup via posix_spawn_file_actions_adddup2.

Create pipe with pipe2(,O_CLOEXEC) (these file descriptors will be any random fd number)
Because of O_CLOEXEC both ends would be closed in any exec
Use posix_spawn_file_actions_adddup2 to dup the file descriptor to the fd where the child expects it (e.g. if you want the pipe to be stdin, then dup it to fd 0). This will also make it inheritable by the child process.

This is outlined in the man page for posix_spawn_file_actions_addclose under "RATIONALE" on my linux system.

matu3ba commented 2 years ago

The standard option is to just do the dup via posix_spawn_file_actions_adddup2.

Yes. The problem however is that you dont know which file descriptor you will get back from calling dup and you cant just use a specific one "that is guaranteed to be available" like the standard streams (stdin=0, stdout=1, stderr=2).

We indeed know the "original pipe ends" before calling fork/vfork/clone, but we need the return of dup before exec is run in the child: https://git.musl-libc.org/cgit/musl/tree/src/process/posix_spawn.c#n150 stored somewhere.

Offering the user "where+how to store system resources" sounds like the best option here to me, but I dont know best practice to "store stuff before exec and access it after exec" for this use case.

daurnimator commented 2 years ago

The problem however is that you dont know which file descriptor you will get back from calling dup

You tell dup which file descriptor you want. So you can e.g. specify 0/1/2 for stdin/stdout/stderr; Or you could just pick a number, say, 42; you could inform the child program where that is expected via e.g. an environment variable or argument etc.

matu3ba commented 2 years ago

You tell dup which file descriptor you want.

The parent process might use the file descriptor already for something different and expect the child not to overwrite/bambuzzle it. For example, file descriptor x might get used by the parent for logging and the child gets that info via stdin. If we accidentically overwrite file descriptor x with dup() in child process, we break the code.

daurnimator commented 2 years ago

The parent process might use the file descriptor already for something different and expect the child not to overwrite/bambuzzle it. For example, file descriptor x might get used by the parent for logging and the child gets that info via stdin. If we accidentically overwrite file descriptor x with dup() in child process, we break the code.

The dup only happens after the fork; it doesn't affect the parent at all.

matu3ba commented 2 years ago

The dup only happens after the fork; it doesn't affect the parent at all.

Yes, though thats not the point.

You cant just willy-nilly use file descriptors, because they might be already used for something else (and intentionally by user not automatically closed with close-on-exec via FD_CLOEXEC).

If you want to get one fresh from Kernel via dup(), posix_spawn does not allow to modify the function child to extract the necessary information (the file descriptor number) ie via pointer to stack, environment variable or other means.

For stdin,stdout and stderr we dont need that, because they must be always available and have a fixed numbering schema.

andrewrk commented 8 months ago

Why wouldn't you just use posix_spawn directly?

Because that's not actually a syscall. That's a libc function. Not to mention, it's trash.

KilianHanich commented 4 days ago

The dup only happens after the fork; it doesn't affect the parent at all.

Yes, though thats not the point.

You cant just willy-nilly use file descriptors, because they might be already used for something else (and intentionally by user not automatically closed with close-on-exec via FD_CLOEXEC).

If you want to get one fresh from Kernel via dup(), posix_spawn does not allow to modify the function child to extract the necessary information (the file descriptor number) ie via pointer to stack, environment variable or other means.

For stdin,stdout and stderr we dont need that, because they must be always available and have a fixed numbering schema.

I would argue that the API (on our side) should require the caller to explicitly specify which handles/file descriptors (or however the OS it runs on names it) the child should inherit.

Everything else is racy (as in, the child could receive unwanted handles when multiple threads try to create a child).

On Unix-likes this can be done on Zig's side by always using CLO_EXEC and then removing it from the specified fds (be it via dup{,2,3}, fcntl etc.

Windows has an option for that in its CreateProcess API (see https://learn.microsoft.com/en-us/windows/win32/procthread/inheritance#inheriting-handles, https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-updateprocthreadattribute and PROC_THREAD_ATTRIBUTE_HANDLE_LIST).

And I know from some research OSes where one must explicitly specify it anyway.

So, being explicit would be less-racy/more secure, can already be implemented, and be more "future-proof". Additionally I would argue that it would be a quite clear API. Sure, one could circumvent it on some platforms, but not on all.

ziglang / zig

std.os.ChildProcess should use clone instead of fork on linux #1620