ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
34.61k stars 2.53k forks source link

handle impossible errors from the kernel with error.Unexpected instead of unreachable #6389

Open jorangreef opened 4 years ago

jorangreef commented 4 years ago

On Linux, e.g. in os.zig the std lib will often interrogate errno and map to Zig errors, or unreachable if the std lib wants to assert that the std lib implementation would never cause an EINVAL or EFAULT, e.g.:

EINVAL => unreachable,
EFAULT => unreachable,

However, while implementing https://github.com/ziglang/zig/pull/6356, I was about to follow this pattern but then I realized that the kernel often overloads errors in new kernel versions, which is particularly the case for the io_uring syscalls.

This means that we might think our std lib implementation cannot cause EINVAL, and then the kernel adds a new feature which could, leading to undefined behavior instead of a safe error.

In other words, we need to start going through the std lib and make this usage of unreachable an anti-pattern because there's no way we can assert what the kernel can or cannot be returning like this.

jorangreef commented 4 years ago

Either that, or we need an unreachable which doesn't become undefined behavior in fast builds, or another separate keyword. To be clear, I don't mean to suggest it's necessarily a good idea to add keywords...

pfgithub commented 4 years ago

Either that, or we need an unreachable which doesn't become undefined behavior in fast builds, or another separate keyword.

I think @panic() does this

Rocknest commented 4 years ago

Do you have any example of a syscall that have changed its behaviour, except ioring stuff? If its specific to that feature, then do not use this pattern with these apis.

jorangreef commented 4 years ago

I haven't looked, but it's not about changing behavior so much as adding behavior, i.e. two error states (one old, one new) map to one error code, and Zig only checks one error state (because that was the only error state that existed at the time in the man pages) or assumes unreachable.

rohlem commented 4 years ago

I don't know much about Linux; maybe I'm spouting nonsense, but couldn't the opposite, hypothetically, be true as well? If a syscall is "upgraded" to loosen its preconditions, then what was previously an error now succeeds. If the program relies on that logic, it breaks semantically (which is its own class of undefined behaviour).

To me it seems like the actual solution would be to instead put an upper limit on the supported kernel version (so do a version check and @panic at run time if it is too high/new). While it's a shame supporting every new release would mean having to manually verify none of the syscall behaviour has changed, fwict that's the only reliable way to go about it. (Really there'd need to be a "compatibility profile" mode when these changes happen, like OpenGL has, though it's of course easier to just not change the interface. Assuming the latter, I too would be inclined to think this is an io_uring-specific growth pain / symptom, though again, I know nothing.)

jorangreef commented 4 years ago

but couldn't the opposite, hypothetically, be true as well? If a syscall is "upgraded" to loosen its preconditions, then what was previously an error now succeeds.

Thanks @rohlem, for the interesting inverse example. That's certainly true, but I think at least it wouldn't lead to undefined behavior from a safety point of view? The program simply wouldn't receive an error from the kernel, so this kind of code branch wouldn't be reached.

Unless, then again, it could be a safety issue if the program required the kernel to return an error, and fell back to unreachable if it did not.

To me it seems like the actual solution would be to instead put an upper limit on the supported kernel version (so do a version check and @panic at run time if it is too high/new). While it's a shame supporting every new release would mean having to manually verify none of the syscall behaviour has changed, fwict that's the only reliable way to go about it.

Perhaps at first glance, but remember the kernel is free to backport features. Ultimately, checking kernel versions is a slippery slope and should be avoided in favor of relying on the error mechanism the kernel already exposes, albeit without using unreachable to assert what the kernel can/cannot return as an error.

Rocknest commented 4 years ago

Its expected that OSes have some backward compatibility, if the system has a breaking change thats not our problem (everything would break if for example read syscall suddenly changed). All documented errors should be handled and nothing more.

andrewrk commented 2 years ago

My only concern here is with the debug experience. For example the following application:

const std = @import("std");

test "close() twice" {
    var file = try std.fs.cwd().createFile("aoeuaoeu", .{});
    defer file.close();

    file.close();
}

The debug experience here is excellent:

[nix-shell:~/Downloads/zig/build-release]$ ./zig test test.zig 
Test [1/1] test "close() twice"... thread 2813 panic: reached unreachable code
/home/andy/Downloads/zig/lib/std/os.zig:252:18: 0x209312 in std.os.close (test)
        .BADF => unreachable, // Always a race condition.
                 ^
/home/andy/Downloads/zig/lib/std/fs/file.zig:188:21: 0x2080fa in std.fs.file.File.close (test)
            os.close(self.handle);
                    ^
/home/andy/Downloads/zig/build-release/test.zig:5:21: 0x20689d in test "close() twice" (test)
    defer file.close();
                    ^
/home/andy/Downloads/zig/lib/std/special/test_runner.zig:77:28: 0x22d302 in std.special.main (test)
        } else test_fn.func();
                           ^
/home/andy/Downloads/zig/lib/std/start.zig:525:22: 0x225eec in std.start.callMain (test)
            root.main();
                     ^
/home/andy/Downloads/zig/lib/std/start.zig:477:12: 0x20986e in std.start.callMainWithArgs (test)
    return @call(.{ .modifier = .always_inline }, callMain, .{});
           ^
/home/andy/Downloads/zig/lib/std/start.zig:391:17: 0x208486 in std.start.posixCallMainAndExit (test)
    std.os.exit(@call(.{ .modifier = .always_inline }, callMainWithArgs, .{ argc, argv, envp }));
                ^
/home/andy/Downloads/zig/lib/std/start.zig:304:5: 0x208292 in std.start._start (test)
    @call(.{ .modifier = .never_inline }, posixCallMainAndExit, .{});
    ^
error: the following test command crashed:
zig-cache/o/82032e287c7c598235bf900b138d0986/test /home/andy/Downloads/zig/build-release/zig

Let's just make sure it remains excellent after the changes incurred by this proposal.

jorangreef commented 2 years ago

My only concern here is with the debug experience.

Yes, thanks for the example.

On second thought I think we should actually leave the status quo as is, because it's a better experience, and safe in Debug and ReleaseSafe even if the kernel does overload errors. It's also something that can be maintained across different kernel versions and they're not going to be overloading errors that often.

Alternatively, simply a @panic with clear explanation.

richiejp commented 1 year ago

Just to complicate things. Any syscall can produce any error due to seccompbpf or an LSM although usually it would be PERM, but maybe some people would use NOSYS. This is a very common scenario now with sandboxing and containers.

Syscalls that act on FDs can produce any error because FDs can point to anything. You can create a file system that represents DNS and start returning NSRCNAMELOOP on writes or reads. It doesn't have to be a mainline kernel, it can be an external module or FUSE.

Even with just the mainline kernel, if you actually mapped all the error codes that a syscall like read or setsockopt can return and someone may want to deal with, then you'll have a big long list of expected errors.