ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

use case: ability to recover from illegal behavior in safe build modes #3516

Open mogud opened 4 years ago

mogud commented 4 years ago

In my situation, most game servers I have designed so far use a service as the abstraction for everything. And millions of services could be in only a single process. For robustness, the service manager catches errors/exceptions from all running services and chooses a proper action for each (kill the service or just ignore the error). It seems Zig will panic at runtime when something like division by zero occurs, and that is not recoverable. So is it possible to add an option for this situation? As far as I know, Nim has many compiler check switches that turn these edge-case errors into runtime exceptions, Rust can do catch_unwind after a panic, and Go has a recover() builtin function.

Rocknest commented 4 years ago

There are no runtime exceptions in zig. Also it is undefined behavior if runtime safety is turned off (release-fast etc.)

JesseRMeyer commented 4 years ago

And millions of services could be in only a single process.

This is probably not what you want. If a single fatal error occurs, then the entire process is destroyed. Instead, you want each service to run in its own process that communicates to other processes using some standard format. That way, if, say, the login service fails, players who are already logged in and playing are not booted from their session. Also, it makes it trivial to distribute across machines in a network. While that is a notable increase in complexity, the alternative of working out who catches which exception, thrown by what, and when, is probably just a rat's nest waiting to happen.

mogud commented 4 years ago

Instead, you want each service to run in its own process that communicates to other processes using some standard format.

It's not possible to have millions of processes.

Also, it makes it trivial to distribute across machines in a network.

In fact, user-space code always uses RPC for communication and does not need to care whether a call crosses machines or not. A gateway may keep the players' connections, but typically the game logic for more than a thousand players must be handled within a single process. It is not acceptable that all of those players are kicked out just because of a division-by-zero error. I think a proper way is to record a log entry and report it to the maintainers, who then decide whether it is necessary to shut down the game server and fix it. So, if we can be sure that defers/errdefers work well, and a compiler option gives us a way to stop the unwind when a division by zero happens, we have more choices.

DaseinPhaos commented 4 years ago

Instead, you want each service to run in its own process that communicates to other processes using some standard format.

It's not possible to have millions of processes.

Besides that, the question remains on who gets to decide how fatal an error is.

DaseinPhaos commented 4 years ago

Probably relevant: #395, @thejoshwolfe's comment on error handling

JesseRMeyer commented 4 years ago

It's not possible to have millions of processes.

Yes, it is possible, especially on a distributed network of servers. But its feasibility depends on your definition of a service, kernel and related architectural choices. Whether we process a single user or tens of thousands of them on a single process is an important decision, and error propagation does seem to have a say here, regardless of my architecture comments.

Zig maintains the mantra of no hidden control flow, and software exceptions violate that principle outright. But I agree that users should wield full control over error handling. If the runtime already catches these errors, it should first propagate them to the user process to see if it cares and wants to handle it directly, and if not, return it back for the default behavior.

emekoi commented 4 years ago

what's wrong with

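fn safe_div(a: anytype, b: @TypeOf(a)) !@TypeOf(a) {
    // The explicit check below replaces the compiler-inserted safety check.
    @setRuntimeSafety(false);
    if (b == 0) return error.DivisionByZero;
    return a / b; // for signed integers, use @divTrunc(a, b) instead of `/`
}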

and with #489, some of the performance hit from the check can be optimized away.

JesseRMeyer commented 4 years ago

@emekoi Are you suggesting that as a user or Standard Library function?

Here's why -- I do not want to pollute every div() callsite I make with error handling, especially when I know that the divisor is not 0, as inputs are often sanitized long before computations on them are performed. This is the crux of the problem: if we address this at too fine a granularity, then the whole structure around it pays the cost in support. I suppose in those cases the binary / would suffice, so maybe renaming this to safe_div() would indicate its purpose.

mogud commented 4 years ago

@emekoi

  1. It is verbose: everywhere I must use a div function call instead of a simple binary operator.
  2. It is awful to have to review others' code to make sure they follow the right way; otherwise I have to create a static analyzer.
  3. It is hard to reuse third-party libraries, because obviously they use /.
  4. Overflow is also an unrecoverable error. Does this mean none of the builtin arithmetic operators can be used? I think that is really, really inconvenient.

Rocknest commented 4 years ago

@mogud It is a bad idea to run 'untrusted' code in a single monolithic process, regardless of whether it is Zig, C, or assembly. If you want to run it safely, use some kind of sandbox. For example, you can compile your 'services' to wasm.

emekoi commented 4 years ago

@JesseRMeyer if you know that the divisor is not zero, then just use /. there shouldn't be an issue if your input is already sanitized. as for overflow, we have compiler intrinsics that handle overflow. you can also just catch the exceptions from the OS, and go from there.

furthermore, if you're running a game server i think it is in your and your users' best interests to carefully review all the libraries that you use...

JesseRMeyer commented 4 years ago

you can also just catch the exceptions from the OS, and go from there

How do we accomplish this in Zig?

emekoi commented 4 years ago

it depends on the OS, but for windows you can use Structured Exception Handling like you would in C and for unix systems you can use signal handlers. we already use these to catch segmentation faults in debug mode on supported systems. the relevant code is from this line down.

JesseRMeyer commented 4 years ago

Thanks.

If user code can explicitly override Zig's safety features with their own, then that makes me glad.

rohlem commented 4 years ago

Other related issues that haven't been mentioned yet: #1740 , #426 (note: rejected), #1356 (note: only tangentially related, discussion seemed to disfavour recover-like mechanisms).

In debug builds, a Zig panic calls the root source file's panic handler (it doesn't seem to be documented yet, but is mentioned in the documentation of @panic). You are free to provide an implementation with e.g. a longjmp: anything that honors the noreturn return type, and so doesn't expect to return directly to the panicked stack.

The main issue is that in completely-optimized builds (ReleaseFast, ReleaseSmall), the LLVM IR that is emitted results in undefined behaviour. If you want uncompromised speed, you need to compromise recoverability (as far as I understand it). How recoverable that currently resulting undefined behaviour ends up being is left for the backend, currently LLVM, to decide.

If your main concern is correctness/stability, then allocating separate stack memory for each service invocation and having a longjmp-or-equivalent return plan from the panic handler might be an acceptable solution.

I also thought I remembered (but now can't find) another more in-depth discussion about turning each instance of detectable illegal behaviour into returning a standard error code - again, this prevents full-fledged optimizations. Note that whatever judgement mainline Zig ends up passing, with Zig's parser being part of the standard library, it might be feasible for you to add a compilation step that replaces certain unsafe expressions (like panicking operators) with safer function calls (like the error-returning alternatives from std.math, or a non-error fallback return value).
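
To make the std.math option concrete, here is a minimal sketch of the error-returning alternatives mentioned above. std.math.divTrunc and std.math.add are standard-library functions; costPerPlayer and addScore are hypothetical names invented for this illustration, and the syntax assumes a reasonably recent Zig release.

```zig
const std = @import("std");

// Hypothetical helper: division by zero (and minInt / -1 overflow) comes
// back as an ordinary Zig error instead of triggering a safety panic.
fn costPerPlayer(total: i32, players: i32) !i32 {
    return std.math.divTrunc(i32, total, players);
}

// Hypothetical helper: integer overflow is reported as error.Overflow.
fn addScore(current: u8, delta: u8) !u8 {
    return std.math.add(u8, current, delta);
}
```

A caller can then decide per call site whether to propagate with try, substitute a fallback with catch, or log and continue. This does not help with third-party code that uses the plain operators, which is the objection raised earlier in the thread.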

mogud commented 4 years ago

@mogud It is a bad idea to run 'untrusted' code in a single monolithic process, regardless of whether it is Zig, C, or assembly. If you want to run it safely, use some kind of sandbox. For example, you can compile your 'services' to wasm.

But if it is well structured and won't crash, it's a really useful design for extreme performance (no IPC, no serialization). Think about Go with its many goroutines: a service may consist of two or three goroutines, and I can make it never crash. As a matter of fact, our current game server has used this pattern for 2 years, and it has crashed only once, because of a deep concurrency issue. So this is a real use case: we need such errors or panics to be catchable, or handled more safely.

mogud commented 4 years ago

@emekoi You are right. But as I mentioned above, we at least need the language to guarantee that defers/errdefers are processed as expected. Otherwise it's not safely recoverable.
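
For ordinary error returns, defer and errdefer already behave this way; the open question in this thread is the panic path, where control jumps to the panic handler without unwinding, so neither kind of block runs. A minimal sketch of the error-return case, where Slot and step are hypothetical names used only for illustration:

```zig
const std = @import("std");

// Hypothetical bookkeeping for one service invocation.
const Slot = struct {
    open: bool = false,
    rolled_back: bool = false,
};

fn step(slot: *Slot, divisor: u32) !u32 {
    slot.open = true;
    defer slot.open = false; // runs on every return path
    errdefer slot.rolled_back = true; // runs only when an error is returned
    const q = try std.math.divTrunc(u32, 100, divisor);
    return q;
}
```

If the division were written as 100 / divisor instead, a zero divisor in a safe build mode would panic inside step, and slot.open would stay true.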

mogud commented 4 years ago

@rohlem Thanks. By the way, catch_unwind is exactly what I personally want. Otherwise I have to embed a scripting language like Lua for convenience and robustness.

mogud commented 4 years ago

Rust's recover (catch panic) RFC

Rocknest commented 4 years ago

@mogud panic is a debug tool, do not abuse it. If you want a crash-resilient program, you have to pay for it in some way or another.

mogud commented 4 years ago

@mogud panic is a debug tool, do not abuse it.

I never said I use panic in Zig, as it's indeed a debug tool right now. I use it in Go.

If you want a crash-resilient program, you have to pay for it in some way or another.

Which way? I do know that multiple processes can improve a server's robustness, but that completely breaks the original design, which works perfectly well in Go, Rust, Nim, and all the VM languages.

JesseRMeyer commented 4 years ago

@mogud

Well there's the way offered by @emekoi a few replies up on how to override Zig's panic handler. Maybe this facility can be expanded on.

andrewrk commented 4 years ago

Hi @mogud. Thank you for opening this issue. I want to start by affirming that this is a valid and important use case, and the Zig project needs to have an answer for how this use case is recommended to be solved, even if the language does not address it, and such a recommendation is "use processes" or "a different language would be a better fit for this use case".

I think the Rust RFC you linked does a great job of explaining the situation, especially with regards to broken invariants of data structures.

@rohlem is correct about ReleaseFast mode vs ReleaseSafe mode. In ReleaseFast modes, the optimizer will assume illegal behavior, such as division by zero, does not occur. For a game server where it is important to not crash, ReleaseSafe will be a better choice for the global build mode, and this issue is suggesting that detected illegal behavior can be recovered from. @mogud I hope you don't mind that I rename this issue in light of #2402.

Given that a panic can happen in defer expressions, recovering from a panic is generally unsound, unless one very specific thing is done: arena-based resource management. When one creates an arena for resources, this creates a "recovery" point. If you think about it, this is why process-based recovery works so well - the OS creates an "arena" for you which cleans up all resources if the process crashes. Importantly, it also creates a thread of execution where control flow being abruptly terminated does not affect other threads of execution.
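
A minimal sketch of the arena idea, under the assumption that the failure surfaces as an ordinary error rather than a panic (a panic would still need some non-local jump back to the recovery point). std.heap.ArenaAllocator is part of the standard library; handleRequest and runIsolated are hypothetical names, and the arena.allocator() call assumes a reasonably recent Zig release.

```zig
const std = @import("std");

const Allocator = std.mem.Allocator;

// Hypothetical service handler: every allocation comes from the arena it
// is given, and nothing is freed individually.
fn handleRequest(arena: Allocator, fail: bool) ![]u8 {
    const buf = try arena.alloc(u8, 64);
    _ = try arena.alloc(u32, 16); // extra scratch data
    if (fail) return error.ServiceFailed;
    return buf;
}

// Runs one request inside its own arena. Returns true on success.
// Whatever happened inside, arena.deinit() releases all of its memory.
fn runIsolated(backing: Allocator, fail: bool) bool {
    var arena = std.heap.ArenaAllocator.init(backing);
    defer arena.deinit(); // the single cleanup point for this request
    _ = handleRequest(arena.allocator(), fail) catch return false;
    return true;
}
```

The arena only covers memory. Other resources (sockets, locks, shared data structures) would need the same single-owner discipline for the recovery point to be sound.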

One thing to consider here is that in this use case in zig, it's extremely likely that the software would be written with event-based I/O. So a proposal to make detected illegal behavior recoverable would have to solve the problem that jumping straight to the panic function from an async function would leave the awaiter hanging. If an async function does not make it to the return statement, its awaiter will hang forever, likely leaking resources, or worse, breaking invariants of data structures.

Rocknest commented 4 years ago

@andrewrk panic recovering sounds like runtime exceptions reborn. I don't think it's a good idea to support such a use case in a language without a runtime and with direct access to the system's resources. Illegal behavior means there is a bug in the software, doesn't it?

andrewrk commented 4 years ago

Illegal behavior means there is a bug in the software, doesn't it?

Yes. This use case is to have the ability to handle bugs in a large codebase without crashing.

It is currently considered to be out of scope of the language, and there are no open proposals to change this.

mogud commented 4 years ago

@andrewrk Thanks for your patience.

My English is not very good, so maybe I cannot tell the full story accurately. I will just point out what I think is most important.

  1. I think single-threaded multiprocessing is great for reliable servers.
  2. We use a single process (not strictly accurate) because we built a very reusable RPC-based framework for different categories of games, such as FPS, SLG, and MMOARPG. We do not care whether we have a server named mail or bill; they are all services and can be placed on any node through different launch configs. So we can also use processes, even though that has communication costs. For the whole system to be reliable, the base framework must be very fast and robust; that's why I cannot accept it crashing so easily.

Given that a panic can happen in defer expressions, recovering from a panic is generally unsound, unless one very specific thing is done: arena-based resource management.

In most game server development, logic programmers cannot manage resources directly. For example, they can load data from the db service but cannot hold a handle to the db connection. Resource management code is usually written by senior programmers and does not change for a long time, so it can be fully tested.

One thing to consider here is that in this use case in zig, it's extremely likely that the software would be written with event-based I/O. So a proposal to make detected illegal behavior recoverable would have to solve the problem that jumping straight to the panic function from an async function would leave the awaiter hanging. If an async function does not make it to the return statement, its awaiter will hang forever, likely leaking resources, or worse, breaking invariants of data structures.

The framework must guarantee its safety and should make this transparent to its users.

Lastly, I'm sorry that I cannot open a proposal because of my poor English.

JesseRMeyer commented 4 years ago

@andrewrk Would you please explain why panicking in a defer context is problematic? Can't any defer scenario be composed without defer in the first place?

andrewrk commented 4 years ago

panicking in a defer is not problematic. It just means that when you're in the panic handler, you've already potentially leaked resources and potentially have data structures with broken invariants.

Yes to your second question.

rohlem commented 4 years ago

So a proposal to make detected illegal behavior recoverable would have to solve the problem that jumping straight to the panic function from an async function would leave the awaiter hanging. If an async function does not make it to the return statement, its awaiter will hang forever, likely leaking resources, or worse, breaking invariants of data structures.

Currently @panic only receives a message, and the implementation of std.debug.panic retrieves the stack frame information via other means. Assuming we can (note: limited to safe build modes) query whether the current stack is async, we could expose builtins @currentAwaiter() ?*anyawaiter and @returnToAwaiter(*anyawaiter) noreturn. Then the panic implementation could do:

fn panic(...) noreturn {
    any_panic_impl(...); //print stack trace etc.
    if(@currentAwaiter()) |awaiter| {
        //Potentially fill/initialize the return value the awaiter is awaiting; trickier, see below.
        @returnToAwaiter(awaiter); //note: of type noreturn
    }
    os.abort(); //or whatever else you do if you panic on the main stack (or on a stack currently without an awaiter)
}

This way the recoverability is a completely optional feature (maybe even opt-in compile-time toggle-able, akin to --single-threaded). Since we already have safety features for resuming non-suspended functions, I'm 90% sure that this would already be implementable.

Filling the awaiter's return value seems a little tricky: We could have a builtin to provide *@OpaqueType() that can be cast if the type is consistent across all async functions in the codebase. Maybe error unions could be generalized in their layout to the point where the builtin can provide a *anyerror for any anyerror!T ; that would make it quite elegant to use, actually.

Otherwise switching on the type would require some runtime representation of the type (maybe via an auto-collected builtin enum similar to how anyerror is populated), but these ideas sound overcomplicating to me; for this particular use case a userland protocol would be sufficient:

//scheduler
var succeeded: bool = false;
const success_result = async failable_afunc(&succeeded, ...);
if(succeeded){
    //use success_result ...
}else{
    //handle failure... | success_result is undefined, do not access!
}

fn failable_afunc(succeeded: *bool) T {
    //we need to somehow prohibit the optimizer from executing the assignment any earlier, which might not appear observable locally.
    defer succeeded.* = true;
    //application logic implementation
}

(As an alternative to @currentAwaiter() we could introduce a separate panicAsync(?*awaiter) T, and @panic decides which one to use depending on if it's called on an async stack. Then the call of @returnToAwaiter and maybe also setting the awaiter's awaited return value could be hidden after the return of panicAsync (maybe of return type anyerror ?). This would reduce both complexity and flexibility/control of the feature in my eyes.)

suirad commented 4 years ago

Since @panic is, out of necessity, somewhat of an exception to the Zig rule of no hidden control flow, perhaps it could be a tool in the modes in which it is available (debug/release-safe). It seems to me that something like a temporary panic handler for a single scope could be feasible. Perhaps the purity of the scope could determine the eligibility of the code/functions used within it, since side effects affect the recoverability of state.

shawnl commented 4 years ago

There are no runtime exceptions in zig. Also it is undefined behavior if runtime safety is turned off (release-fast etc.)

LLVM does not make all of these undefined behavior, but downgrades what it can to only produce undefined values. Zig should know the difference, and be able to recover from undefined values. The big exception to this is divide by zero, which raises SIGFPE.

The general fix for this is to add setjmp()/longjmp() support to zig, which is #1656.

ityonemo commented 4 years ago

Just wanted to add that in my use case (FFI with the Erlang VM) I'd like to turn on a panic-trapping feature when I drive external unit test suites, so that a Zig panic can record unreachable/undefined behavior inside Zig from the calling VM in release-safe/release-debug, and not disrupt test counts/test tracking/CI. An opt-in ability to somehow trap a panic would be very useful. Conceptually this could easily take the form of a setjmp/longjmp that I can drop in at the Zig/Erlang boundary and recover from in the event of a panic. If Zig doesn't want to support this, that's fine; a panic during unit tests is also a valid way of alerting that there's something wrong with the code.

iacore commented 1 year ago

@panic seems to raise SIGABRT. You can catch this inside the process, or in its parent process.

Use shared memory and the exit code to send the error message back to the parent.

matu3ba commented 1 year ago

recover from illegal behavior

Recovering from errors (so as not to run into failures) requires specifying what the safe and well-defined states are. How should Zig know this? And once you can specify them, why can you not code them?

More specifically: which classes of recoverable state and of execution context should Zig support?

It seems zig will panic at runtime when something like division by zero occurs, and it's not recoverable.

The purpose of optimizing compilers, and of the languages that define them, is to provide optimal machine code for the supported performance use cases, which requires the possible code semantics to be written out explicitly. Zig has performance-oriented defaults for arithmetic, which includes a / b trapping and crashing your program in safe modes.

As I understand it, you are asking to let the caller change source code semantics, the way macros or operator overloading are used in C/C++ etc. (typically to work around bugs in called code, i.e. code not intended for the use case). Is that correct?