zellij-org / zellij

A terminal workspace with batteries included
https://zellij.dev
MIT License

High memory usage and latency when a program produces output too quickly #525

Closed kxt closed 3 years ago

kxt commented 3 years ago

version 0.12.0

Reproduction

Run time cat 1_000_000_lines.txt and compare the time it reports against wall-clock time. time reports significantly less than what is actually observed, because cat finishes writing long before the output is rendered. Having a slow computer and using a non-optimized (non-release) build makes this easier to reproduce.

Root cause

A program's output is being processed by the Pty thread. This thread reads the output and propagates it to the Screen thread for rendering through an unbounded channel called to_screen. If the program produces output faster than what Screen can render, to_screen's queue starts to grow indefinitely. This results in growing memory usage and input latency.

An alternative explanation (kudos to @aeinstein): running nearly at the speed of light, cat is experiencing time dilation compared to a stationary observer.

As PR #523 increases the throughput of the Pty thread, it exacerbates this issue.

kxt commented 3 years ago

Currently there is a workaround in the Pty thread that calls task::sleep to slow it down.

However, as there is no amount of sleep that is right for every machine, the proper solution would be back pressure between Pty and Screen: if Screen can't keep up, Pty stops reading from the program until the queue clears up. This needs a bounded queue between Pty and Screen. However, as Pty is async, we can't simply swap the queue for an mpsc sync_channel, as its SyncSender blocks the whole thread on send. This means an async queue is needed, like async-std's bounded channel.

kxt commented 3 years ago

I made a proof-of-concept using a bounded to_screen queue, and I ran into a deadlock:

TL;DR: the pty thread has filled to_screen, the wasm thread is blocked by the filled to_screen, and the screen thread (which should clear up to_screen) is blocked waiting on the blocked wasm thread.

So simply changing to_screen to a bounded queue is not a solution.

kxt commented 3 years ago

I found a workable way to add back pressure between pty and screen threads without triggering the mentioned deadlock.

The idea is to have a dedicated bounded channel between pty and screen, so we gain back pressure, while keeping the unbounded channel between every other component and screen, to avoid deadlocks.

This required a change from mpsc to crossbeam, as mpsc lacks a way to select between multiple Receivers.

@imsnif If this is an acceptable approach (poc patch here) I'll whip up a PR-able implementation.

imsnif commented 3 years ago

@kxt - at a brief glance this looks like a good direction (though I haven't run it or looked too deeply at it yet). I'm good with moving forward with a PR so we can look deeper. I might want to bring in @kunalmohan to also take a look (if he has the time) since he's done a lot of work in this area.

kunalmohan commented 3 years ago

@kxt great work in #523! The throughput has increased by a LOT! Regarding this issue, I personally didn't notice it in release builds (on main as well as in #523), but it is quite evident in the debug builds. Your proposal of using crossbeam channels with the select feature looks like a very good solution. Since it might be some time before the rendering throughput is increased, I think we should go ahead with this. So go ahead and draft a PR and we'll discuss this further :)

kxt commented 3 years ago

I've created two PRs. The first one contains some refactors preparing for the fix, including switching from mpsc to crossbeam; it is ready for review and merge.

The second draft PR contains the actual fix and seems to work OK. However, some of the testcases fail unless sleeps are introduced at random points, which is not great. I wasn't able to figure out why they are necessary.

I've noticed that many testcases already contain sleeps, and there is also a generous amount of sleeping in read_from_stdin, which makes running the testcases slow.

Can someone help me figure out what's going on with the testcases? @kunalmohan ?

imsnif commented 3 years ago

Hey @kxt - I added a review in the first PR. About the tests in the draft PR (which I think you didn't yet create? unless I missed it?): our testing infrastructure is one of the most broken parts of the code-base. It was one of the first pieces of Zellij and is very long overdue for an overhaul.

Since it works by comparing snapshots of the app state (essentially what's on screen in certain situations) and the app includes some asynchronous elements, instead of creating a proper synchronization we just kept adding sleeps to make it work properly. I am not proud of this, but I am honest about it. :) I plan on devoting attention to it very soon after all the other urgent stuff.

So I'm afraid the best answer I can give you now is to add sleeps where necessary to wait for data to come from the fake pty before sending the next command. (also, there might still be some atomicity issues left where stdin is concerned, though I think we got most of those already).

If you want further specific help with this, let me know.

kxt commented 3 years ago

Sorry, I did not link the draft PR properly; it's in my fork of zellij right now.

I figured it's because all the cross-thread communication makes it hard to tell whether something has already happened or not.

If you say sleeps are ok as of now, then I'll open a proper PR with those two patches.

Do you have any concrete plans on the test overhaul?

imsnif commented 3 years ago

If you say sleeps are ok as of now, then I'll open a proper PR with those two patches.

Sounds good.

Do you have any concrete plans on the test overhaul?

An idea we've been throwing around the team is to move whatever makes sense from the integration tests into unit tests, and turn the rest into proper end-to-end tests. Our idea is to spin up a docker container and communicate with it through ssh, thus simulating the full app cycle. This still doesn't solve the synchronization problem, but I have a feeling that's the sort of issue you solve as you work on it.

If you have any other ideas, they would of course be most welcome.