Open andrewrk opened 6 years ago
Interesting notes from the man page:
NOTES
Some consider the semantics of vfork() to be an architectural blemish,
and the 4.2BSD man page stated: "This system call will be eliminated
when proper system sharing mechanisms are implemented. Users should
not depend on the memory sharing semantics of vfork() as it will, in
that case, be made synonymous to fork(2)." However, even though modern
memory management hardware has decreased the performance difference
between fork(2) and vfork(), there are various reasons why Linux and
other systems have retained vfork():
* Some performance-critical applications require the small performance
advantage conferred by vfork().
* vfork() can be implemented on systems that lack a memory-management
unit (MMU), but fork(2) can't be implemented on such systems.
(POSIX.1-2008 removed vfork() from the standard; the POSIX rationale
for the posix_spawn(3) function notes that that function, which pro‐
vides functionality equivalent to fork(2)+exec(3), is designed to be
implementable on systems that lack an MMU.)
* On systems where memory is constrained, vfork() avoids the need to
temporarily commit memory (see the description of /proc/sys/vm/over‐
commit_memory in proc(5)) in order to execute a new program. (This
can be especially beneficial where a large parent process wishes to
execute a small helper program in a child process.) By contrast,
using fork(2) in this scenario requires either committing an amount
of memory equal to the size of the parent process (if strict over‐
committing is in force) or overcommitting memory with the risk that
a process is terminated by the out-of-memory (OOM) killer.
I don't think that can work. With vfork() you're only allowed to call execve() or _exit(). The setup that spawnPosix() does, like changing uid/gid and the cwd, is incompatible with vfork() - it would be visible to other threads in the parent process.
My understanding is that changes to memory would affect the parent process, but the actual pid is a new child pid. So things like current directory, uid/gid, would be independent from the parent process. I'm doing some experiments now. Currently I'm getting a confusing segfault with the following code which I am troubleshooting:
const std = @import("std");
const os = std.os;
const linux = os.linux;
pub fn main() void {
var x: i32 = 1234;
const rc = linux.vfork();
if (rc == 0) {
// we are the child
x += 1;
linux.exit(0);
}
std.debug.warn("x={}\n", x);
}
$ ./test
Segmentation fault
gdb has no stack trace, and the program prints 1234
with valgrind.
Figured it out. vfork is a function, and when the child does a return
it clobber's the parent's stack. I fixed it like this:
/// This must be inline, and inline call the syscall function, because if the
/// child does a return it will clobber the parent's stack.
pub inline fn vfork() usize {
return @inlineCall(syscall0, SYS_vfork);
}
However it's still problematic, because the example code now does this in debug builds:
x=1235
But does this in release builds:
x=1234
This is because the optimizer sees x+=1; unreachable;
and then concludes that if you don't hit the if statement, the x += 1
never could have happened. The vfork
breaks the compiler's understanding of control flow.
There's a better way to do this, which is to use clone
followed by execve
. You can see that this is how musl implements posix_spawn
.
Sorry, I thought posix_setregid() and posix_setreuid() did the "signal other threads" dance that glibc and musl do. I see now that they only update the calling thread.
changeCurDir() and posixExecve() on the other hand allocate memory. That will be visible to other threads because the fork shares the address space with the parent.
Apropos signals, signal handlers are one reason I never used vfork() in libuv. Signals delivered to the fork will be observable in the parent when the signal handler writes to memory. And signal handlers exist for their side effects so that's probably an all too common pitfall.
Agreed, this issue is no longer to use vfork
. Now it's to use clone()
instead of fork()
on Linux. This is how musl implements posix_spawn
. That API doesn't have a way to change directory or setgid/setuid, but that's not a linux limitation; that's an API limitation.
With clone()
, the memory allocations will be no problem. The parent can preallocate the memory before cloning, and then free once the child does exit
or execve
.
Why wouldn't you just use posix_spawn
directly?
tldr;
(eventual) future without a bazillion flags to clone
: "new process starts as an empty address space,
and an advanced user may manipulate it in a piecemeal fashion, populating its address-space and kernel context prior to
execution, without needing to clone the parent nor run code in the context of the child"
benches aspawn vs alternatives
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_aspawn_no_reuse 18009 ns 17942 ns 38943
BM_aspawn/threads:1 14500 ns 14446 ns 48339
BM_vfork_with_shared_stack 46545 ns 16554 ns 44027
BM_fork 54583 ns 54527 ns 12810
BM_posix_spawn 125061 ns 29091 ns 24483
BM_posix_spawn 125061 ns 29091 ns 24483
This is surprising
Why wouldn't you just use posix_spawn directly?
posix_spawn is too limited in user modification of the startup code. Take as example a lib that wants to create multiple pipes without the user having to worry about, but still providing the user with the open pipe handle for reading/writing.
Standard option: child uses dup() and close() the unused file handles such that writer.writerAll works as expected and doesn't need a "Read until x" to prevent infinite wait on data by the consumer. However, at least to my understanding of the code, posix_spawn doesnt offer functionality to get the open file descriptor (an integer). Non-standard option: User must dup() and close().
And using explicit numbers for file desriptor does only work by chance xor forces user code to use explicit file descriptors.
posix_spawn is too limited in user modification of the startup code. Take as example a lib that wants to create multiple pipes without the user having to worry about, but still providing the user with the open pipe handle for reading/writing.
Standard option: child uses dup() and close() the unused file handles such that writer.writerAll works as expected and doesn't need a "Read until x" to prevent infinite wait on data by the consumer. However, at least to my understanding of the code, posix_spawn doesnt offer functionality to get the open file descriptor (an integer). Non-standard option: User must dup() and close().
The standard option is to just do the dup
via posix_spawn_file_actions_adddup2
.
pipe2(,O_CLOEXEC)
(these file descriptors will be any random fd number)O_CLOEXEC
both ends would be closed in any exec
posix_spawn_file_actions_adddup2
to dup the file descriptor to the fd where the child expects it (e.g. if you want the pipe to be stdin, then dup it to fd 0). This will also make it inheritable by the child process.This is outlined in the man page for posix_spawn_file_actions_addclose
under "RATIONALE" on my linux system.
The standard option is to just do the dup via posix_spawn_file_actions_adddup2.
Yes. The problem however is that you dont know which file descriptor you will get back from calling dup
and you cant just use a specific one "that is guaranteed to be available" like the standard streams (stdin=0, stdout=1, stderr=2).
We indeed know the "original pipe ends" before calling fork/vfork/clone
, but we need the return of dup
before exec
is run in the child: https://git.musl-libc.org/cgit/musl/tree/src/process/posix_spawn.c#n150 stored somewhere.
Offering the user "where+how to store system resources" sounds like the best option here to me, but I dont know best practice to "store stuff before exec and access it after exec" for this use case.
The problem however is that you dont know which file descriptor you will get back from calling
dup
You tell dup
which file descriptor you want.
So you can e.g. specify 0/1/2 for stdin/stdout/stderr;
Or you could just pick a number, say, 42; you could inform the child program where that is expected via e.g. an environment variable or argument etc.
You tell dup which file descriptor you want.
The parent process might use the file descriptor already for something different and expect the child not to overwrite/bambuzzle it.
For example, file descriptor x might get used by the parent for logging and the child gets that info via stdin.
If we accidentically overwrite file descriptor x with dup()
in child process, we break the code.
The parent process might use the file descriptor already for something different and expect the child not to overwrite/bambuzzle it. For example, file descriptor x might get used by the parent for logging and the child gets that info via stdin. If we accidentically overwrite file descriptor x with
dup()
in child process, we break the code.
The dup
only happens after the fork
; it doesn't affect the parent at all.
The dup only happens after the fork; it doesn't affect the parent at all.
Yes, though thats not the point.
You cant just willy-nilly use file descriptors, because they might be already used for something else (and intentionally by user not automatically closed with close-on-exec via FD_CLOEXEC
).
If you want to get one fresh from Kernel via dup()
, posix_spawn does not allow to modify the function child
to extract the necessary information (the file descriptor number) ie via pointer to stack, environment variable or other means.
For stdin,stdout and stderr we dont need that, because they must be always available and have a fixed numbering schema.
Why wouldn't you just use
posix_spawn
directly?
Because that's not actually a syscall. That's a libc function. Not to mention, it's trash.
The dup only happens after the fork; it doesn't affect the parent at all.
Yes, though thats not the point.
You cant just willy-nilly use file descriptors, because they might be already used for something else (and intentionally by user not automatically closed with close-on-exec via
FD_CLOEXEC
).If you want to get one fresh from Kernel via
dup()
, posix_spawn does not allow to modify the functionchild
to extract the necessary information (the file descriptor number) ie via pointer to stack, environment variable or other means.For stdin,stdout and stderr we dont need that, because they must be always available and have a fixed numbering schema.
I would argue that the API (on our side) should require the caller to explicitly specify which handles/file descriptors (or however the OS it runs on names it) the child should inherit.
Everything else is racy (as in, the child could receive unwanted handles when multiple threads try to create a child).
On Unix-likes this can be done on Zig's side by always using CLO_EXEC and then removing it from the specified fds (be it via dup{,2,3}, fcntl etc.
Windows has an option for that in its CreateProcess API (see https://learn.microsoft.com/en-us/windows/win32/procthread/inheritance#inheriting-handles, https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-updateprocthreadattribute and PROC_THREAD_ATTRIBUTE_HANDLE_LIST).
And I know from some research OSes where one must explicitly specify it anyway.
So, being explicit would be less-racy/more secure, can already be implemented, and be more "future-proof". Additionally I would argue that it would be a quite clear API. Sure, one could circumvent it on some platforms, but not on all.
vfork pauses the parent process until the child process either calls execve or exit. This prevents Copy On Write pages, which makes spawning child processes avoid double the parent resident memory requirements when overcommit is off.