zellij-org / zellij

A terminal workspace with batteries included
https://zellij.dev
MIT License
21.63k stars 654 forks source link

Thread 'stdin_handler' panicked along suspending system into sleep mode #1429

Open matu3ba opened 2 years ago

matu3ba commented 2 years ago

Unfortunately the issue is not easily reproducable and I dont have a backtrace/the panic handler did not tell what code broke. Further, I broke the logs in /tmp/zellij-1000/zellij-log/ by starting a fresh instance, but I expect those would not tell much either.

The following output information were printed:

  × Thread 'stdin_handler' panicked.
  ├─▶ At /home/misterspoon/.cargo/registry/src/github.com-1ecc6299db9ec823/zellij-client-0.29.1/src/os_input_output.rs:256:14
  ╰─▶ could not poll stdin for readiness: Os { code: 4, kind: Interrupted, message: "Interrupted system call" }
  help: If you are seeing this message, it means that something went wrong.
        Please report this error to the github issue.
        (https://github.com/zellij-org/zellij/issues)

        Also, if you want to see the backtrace, you can set the `RUST_BACKTRACE` environment variable to `1`.

Basic information

zellij --version: zellij 0.29.1 tput lines: 30 (not relevant) tput cols: 126 (not relevant) uname -av or ver(Windows): Linux pc 5.17.9-arch1-1

List of programs you interact with as, PROGRAM --version: output cropped meaningful, for example: nvim --version: NVIM v0.5.0-dev+1299-g1c2e504d5 (used the appimage release) alacritty --version: alacritty 0.7.2 (5ac8060b)

Further information To reproduce, I suspect that suspending system into sleep mode and waking up should reproduce the issue eventually. Signaling code is very hard, so I would not be surprised that some dependency does not handle the signal interrupt mask or handlers correctly.

If you give me some pointers what you use to poll stdin I'll fix the signaling code myself. Ideally a portable solution should be used. My suspection is that signal interrupts are not handled at all such that the bubbled up failure to poll crashes the application.

tlinford commented 2 years ago

i think it could be related to https://github.com/zellij-org/zellij/issues/1334, which might explain why you are having trouble reproducing it consistently.

raphCode commented 2 years ago

Does this crash your session / the zellij server or are you able to reattach to your session afterwards?

raphCode commented 2 years ago

We actually have a dedicated signal handling thread here: https://github.com/zellij-org/zellij/blob/a7ddfe1acc798adb4c8dfba2b674fd9e133b4c15/zellij-client/src/lib.rs#L239 https://github.com/zellij-org/zellij/blob/a7ddfe1acc798adb4c8dfba2b674fd9e133b4c15/zellij-client/src/os_input_output.rs#L155

So I am not sure why any signal still interrupts the polling syscall (probably poll or select returning EINTR) here: https://github.com/zellij-org/zellij/blob/a7ddfe1acc798adb4c8dfba2b674fd9e133b4c15/zellij-client/src/os_input_output.rs#L254-L256

matu3ba commented 2 years ago

Thanks for the pointers!

  1. Is EINTR already handled in Rust by retrying or are C semantics used?
  2. SIGSTOP and SIGCONT signals are not handled, but I need to check first what the signaling semantics of mio poll are.
  3. .expect("could not poll stdin for readiness"); reads like it doesnt handle this edge case (but I must check first)
raphCode commented 2 years ago

Disclaimer: I do not know this area of the code well, take the following with a grain of salt:

Judging from my Rust knowlegde, poll() very likely returns a Result which indicates success or failure of the call and contains the return value (on success) or error information (on failure). unwrap() called on a Result returns the wrapped success value, or panics if the Result indicates failure. Since the latter happened, I assume that poll() returned a failure. To answer question 1, I think nothing is retried automatically, but instead the error is bubbled up.

The best way to fix this would be in my opinion to make sure the poll() doesn't get interrupted / all signals are delivered only to the signal thread (do signals work this way?). Second best option is probably to just retry the poll if we receive that error.

matu3ba commented 2 years ago

do signals work this way?

Signal semantics on the OS work in the following way: There is a process mask and thread mask of "always guaranteed to be delivered signals" and realtime signals (user-definable), which are defined as list. Signal handlers are registered process-wide, while signals can be send to each individual thread. Each individual thread has a signalmask it (or other threads) can can modify (sigprocmask). Except for pid 1, SIGKILL can not be ignored and leads to thread process teardown after Kernel tasks succeded (no custom handler possible). SIGABRT can neither be ignored (unless pid 1), but the application can install a custom handler with LONGJMP or do nothing to not run execute the process teardown. man signal-safety and man 7 signal are a good read.

Overall, signal handling is a pile of non-portable mess as signals are kinda like small interrupt handlers. So its best to mask everything and only allow necessary things in specified regions of code.

See also my PR on the abort() call in Zig https://github.com/ziglang/zig/pull/11565 or musl abort(). One of the worst nightmares is that its not UB to call the signal handler start code from the signal handler (leading to recursive, potentially infinite, loops) and the horrible recursive locking and unlocking code in glibc.