High CPU usage on Windows, happening in wezterm_ssh / WSAPoll

rgwood commented 2 weeks ago

What Operating System(s) are you seeing this problem on?

Windows

Which Wayland compositor or X11 Window manager(s) are you using?

No response

WezTerm version

20240513-190755-f1a08426

Did you try the latest nightly build to see if the issue is better (or worse!) than your current version?

Yes, and I updated the version box above to show the version of the nightly that I tried

Describe the bug

I started using WezTerm on Windows a few months ago, and once or twice a week WezTerm starts using 100% of a CPU core until I close some tabs. I have not experienced this issue with WezTerm on Linux or macOS.

I hadn't been able to figure out why, but earlier this week I compiled WezTerm from main with debug symbols and I was able to profile WezTerm (using Superluminal) when it happened again this morning. The full ETL file from the profile is available here.

It looks like WezTerm was spending most of its time in wezterm_ssh:sessioninner::SessionInner::request_loop:

I have to run but I wanted to get this written down while it's still fresh in my mind. If there's any other info I can offer that might help, please let me know!

To Reproduce

Open WezTerm on Windows, launch a few tabs (maybe with 1 SSH'd to a server), wait a few days. Sorry, I wish I could reproduce this more reliably!

Configuration

Possibly relevant: I set ssh_backend to Ssh2 to get my Windows SSH agent working on WSL: https://github.com/rgwood/dotfiles/blob/b99df1466e19d3bebdadb9f0e3961247ae7f8abf/wezterm/wezterm.lua#L32-L34

Expected Behavior

No response

Logs

Debug Overlay wezterm version: 20240513-190755-f1a08426 x86_64-pc-windows-msvc Window Environment: Windows Lua Version: Lua 5.4 OpenGL: Intel(R) Iris(R) Xe Graphics 4.5.0 - Build 31.0.101.5448 Enter lua statements or expressions and hit Enter. Press ESC or CTRL-D to exit 19:54:14.026 ERROR mux::connui > Configuration Error: runtime error: [string "C:\Users\reill.wezterm.lua"]:36: attempt to compare string with nil stack traceback: [string "C:\Users\reill.wezterm.lua"]:36: in main chunk 19:54:14.043 ERROR wezterm_gui::termwindow > Failed to apply config overrides to window: runtime error: [string "C:\Users\reill.wezterm.lua"]:36: attempt to compare string with nil stack traceback: [string "C:\Users\reill.wezterm.lua"]:36: in main chunk: nil 21:06:17.381 ERROR wezterm_ssh::sessioninner > error in resize_pty: [Session(-7)] Unable to send window-change packet 21:06:17.638 ERROR wezterm_ssh::sessioninner > error in resize_pty: [Session(-7)] Unable to send window-change packet 07:34:37.791 ERROR wezterm_ssh::sessioninner > error in resize_pty: [Session(-7)] Unable to send window-change packet 07:34:38.066 ERROR wezterm_ssh::sessioninner > error in resize_pty: [Session(-7)] Unable to send window-change packet

Anything else?

Possibly relevant: this might have been happening in a tab that was SSH'd to a remote Linux server. It looked like I had 1 of those open when I noticed the issue (saw it in the tab bar) but sometime when I was investigating the issue it disappeared and CPU usage went back down to ~0.

rgwood commented 2 weeks ago

Expanding the call graph a bit, it seems like most of the time is being spent in the internals of various WS* Windows functions. My guess is that WezTerm is polling too often but I haven't confirmed that yet.

wez commented 2 weeks ago

Thanks for this! I don't have time to investigate this personally, but wanted to share my insight in case you wanted to dig further:

In general, when an application is repeatedly calling WSAPoll or a similar function, and it turns into a busy loop like this, it suggests that the underlying descriptors/handles are in a signalled/ready state and are somehow not being cleared out of that state, so instead of poll putting the thread to sleep, it immediately returns.

There are two broad categories of situation that might cause this sort of thing: 1) A socket is readable, but not all of the available data was consumed 2) A socket is in an error state, but that hasn't been fully noticed by the application, which continues to poll it instead of closing it out and stop passing it to poll.

The first case might happen if wezterm-ssh is somehow returning early from its own internal dispatch loop, so never gets to read every descriptor that was signalled as ready.

The second case might be something like forgetting to clear out some element in the set of tracking descriptors/handles when EOF is detected.

rgwood commented 2 weeks ago

That’s very helpful, thanks. Will see if I can figure this out when I have time.

rgwood commented 2 weeks ago

There are two broad categories of situation that might cause this sort of thing:

A socket is readable, but not all of the available data was consumed

A socket is in an error state, but that hasn't been fully noticed by the application, which continues to poll it instead of closing it out and stop passing it to poll.

I suspect 2 but I'm not sure. It seems notable that revents is not checked for error values (it is only confirmed that it is nonzero):

https://github.com/wez/wezterm/blob/396c8d9a886d06c4f839a24d3942f8612b6b84d0/wezterm-ssh/src/sessioninner.rs#L461-L466

I imagine that it would make sense to stop polling when we encounter some of the possible error values (POLLHUP and POLLNVAL sound unrecoverable):

POLLERR | An error has occurred. POLLHUP | A stream-oriented connection was either disconnected or aborted. POLLNVAL | An invalid socket was used.

wez commented 1 week ago

the pattern that is used here is that we use a non-zero revents to decide that we should call into the underlying ssh library to get it to advance its state; we shouldn't decide to close immediately at the wezterm-ssh layer because the underlying library may have buffered state that needs to be "clocked through" (such as: pending output) before we just give up on it.

If it is the error case that is bad, it may imply that there is an issue with either the wezterm-ssh interface logic for the ssh backend, or the ssh backend's own rust-wrapping, or the underlying ssh backend library. (TBH, the same reasoning applies to both of the cases I outlined)

rgwood commented 1 week ago

Happened again and I was able to get a bit more info.

The culprit was an SSH tab (I'd opened it with the Shell: New Tab (Domain SSH:my-host-name) action a few days ago). I ctrl-tabbed to that tab, pressed a letter key ('L', I was about to type ls assuming the session was still open), the tab instantly closed itself and CPU usage went back down to ~zero.

wez / wezterm