
parser worker isolation considerations #3368

Closed jerch closed 2 years ago

jerch commented 3 years ago

Once again some considerations about the idea to separate the parser / VT emulation from the main thread, this time in its own dedicated issue.

Where are we at? Currently xterm.js is 100% main-thread bound, thus everything from reading chunks in term.write to rendering happens on the main thread.

The Good... This design works pretty nicely for casual integrations, where the terminal does not do much. Furthermore it is quite convenient to deal with the terminal object and its API for custom extensions - everything is at hand in the main thread context.

The Bad... The picture changes for more complex integrations, where the main thread is already under high pressure for whatever reason, and any further computational stress might cause lag spikes or make things feel sluggish. This scenario can be triggered quite easily with xterm.js - just run multiple terminal instances within one embedding document (like multiplexing) and give the terminals some IO. Here the terminals behave greedily, with main thread runtime as the limited resource: every instance will try to saturate the main thread on its own (based on a 60 fps approximation, thus 2 instances --> 30 fps, 4 --> 15 fps and so on), not taking outer stress into the equation. Any outer stress will simply prolong the update cycles further (up to lag spikes), and multiple terminals in one page will slow each other down.

The Ugly... Workers to the rescue? As we all know, the real limiting resource beneath all that is the horsepower of your machine's CPU (plus a few more like memory, cache sizes, bus speed, ...). But JS with its strong single-thread affiliation does not do a good job in utilizing CPU cores per se - we are basically bound to one core with cooperative multitasking in JS, while the underlying OS offers measures to saturate more cores preemptive-style. Don't want to go into the "Why is it that bad in JS" rant; I understand that the language's flexibility creates a global context burden that's hard to parallelize. Still I wonder about the general future of JS here, as CPU designs clearly go the multicore path (time to get an Erlang clone into the browser?). Well, in JS we got workers as a rather expensive way to still benefit from multicore designs. Can we use them for xterm.js to lower the main thread stress? Well, yes and no - welcome to the Ugly.

What could be "workered"? The main workload in xterm.js under high input comes from the parsing and state changes themselves, which is synchronous blocking code with none to a few references to the outer browser (main thread) env. With some rework of the internal interfaces this would be a hot candidate for a worker. The second highest entry in the workload books is the rendering task, with quite some differences across the selected renderer type (as parsing:rendering ratio):

By its nature rendering has to deal with many more browser env main thread references, and it is almost impossible to offload it solely into a worker (beside going full OffscreenCanvas, which I don't consider a viable goal atm as the support is still pretty lousy across browser engines). The remaining tasks are user input / event handling, which are typically <1% in workload, thus not worth delegating to a worker. (Some actions like resize still create very high workload, but are buffer bound, thus would automatically happen in a worker if we moved parsing there.) As an upper approximation - we could relieve the main thread of 60% - 90% of the workload created by xterm.js, depending on the selected renderer.

Why "The Ugly"? Offloading the parser/VT emulation introduces several hard to swallow downsides:

Is it worth the trouble? I am not sure about that. The promise to lower main thread occupation by 60% - 90% is a really big one; on the other hand the coding efforts to get something reliable done with SharedArrayBuffers (SABs) are huge as well (baby steps into that direction are already outlined by those nodejs build target PRs). Furthermore we prolly would need a main thread fallback, as SABs cannot be used everywhere due to their security constraints (do we need a different build target for the SAB-worker variant?). In any case a worker solution always creates higher integration needs due to separated JS resources.

Personally I like challenges like that and I am up for trying at least. But from a more realistic perspective I think this will involve tons of changes to xterm.js, both internally and in its API (>30% rewrite), and I am not sure if we can manage to maintain the same convenience for integrators in the end. So any attempt in the worker direction might turn out as a dead end for not being easily usable/customizable.

Up for discussion. :smile_cat:

Tyriar commented 3 years ago

I've gone back and forth a bit over time on this and ended up landing here:

I think we should not do only workers, but offload only when incoming data pressure is high (e.g. if the parsing time limit was hit in a frame). If we took this approach it should be a lot easier to manage, as we'd keep snappy latency by not jumping to another thread in most cases, but do switch to another thread when latency isn't as important (sketched below). Additionally, because of this we don't need to go all in on ArrayBuffers for data transfer; we would probably still need to refactor such that we have a model object or something that can easily be transferred back and forth.
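A minimal sketch of that decision point - all names here are hypothetical, it only illustrates switching between two interchangeable parser frontends once a frame budget is exceeded:

interface ParserFrontend { write(chunk: Uint8Array): void; }
declare const mainThreadParser: ParserFrontend;   // hypothetical in-place parser
declare const workerParser: ParserFrontend;       // hypothetical proxy posting chunks to a Worker

const FRAME_BUDGET_MS = 12;                       // assumed parsing budget per frame

class ParserDispatcher {
  private offloaded = false;

  write(chunk: Uint8Array): void {
    if (this.offloaded) {
      // high pressure: latency matters less, keep the main thread free
      workerParser.write(chunk);
      return;
    }
    const start = performance.now();
    mainThreadParser.write(chunk);                // low-latency default path
    if (performance.now() - start > FRAME_BUDGET_MS) {
      this.offloaded = true;                      // budget blown - offload until pressure drops
    }
    // switching back once the worker queue drains is omitted here
  }
}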

Graceful degradation is also extremely important here:

This approach would also be a lot easier to do incrementally: model refactors, worker adoption, SAB support with fallback, etc.

jerch commented 3 years ago

@Tyriar Yes, I agree with the degradation needs; a proper worker+SAB setup is not easy anymore. In any case we would need a failover logic (maybe just dropping back to the fully main-thread-driven variant).

I am not so sure about your middle ground idea - doesn't that mean we'd have to implement the full deal for both cases? Additionally, we'd need to implement the "high pressure" transition logic. And about "all in on ArrayBuffers" - you mean partially using array buffers, but transmitting other state by messaging? Doesn't that create a much more cumbersome situation, as we'd have to synchronize across several worker interaction models (e.g. SAB atomics + the message states)? Furthermore, as soon as we use messaging for any state transmission, we need active code parts on both ends providing/digesting the messages. Not so with a pure SAB solution.

Well I still have to think about that idea, maybe I just don't see the benefits yet. :thinking:

Tyriar commented 3 years ago

doesn't that mean we'd have to implement the full deal for both cases

I'd think the code would be almost entirely shared between main/worker.

Doesn't that create a much more cumbersome situation, as we'd have to synchronize across several worker interaction models (e.g. SAB atomics + the message states)?

It's trying to solve it only when it becomes a problem: if the main thread is about to be busy, offload to a worker until it calms down. If we do the fallback right it probably wouldn't be too cumbersome to switch between the two modes.

mofux commented 3 years ago

From my experimentation with rendering the terminal through a Monaco Editor, I just recently faced the same questions.

My solution now is to run xterm.js in a node.js thread which also spawns the pty.

I'm hooking into xterm._core._inputHandler.onRequestRefreshRows(), which fires whenever the buffer content gets changed. I'm then computing the diff between those changes and sending it to the frontend, which will apply these changes to the monaco editor text model.

In my case these changes operate at line-level granularity; they basically contain the line text that has changed and an array of line decorations (range, foreground, background colors, font styles etc.). I'm also sending some buffer state to the frontend that is necessary for styling (e.g. whether mouse events are enabled, so we can draw a pointer):

{
  buffer: {
    normal: boolean,
    rows: number,
    cols: number,
    startId: number,
    endId: number,
    mouseEvents: boolean,
    hasScrollback: boolean,
    scrollbackVisible: boolean,
    applicationCursorKeys: boolean,
    bracketedPasteMode: boolean,
    sendFocus: boolean,
    cursor: {
      visible: boolean,
      lineId: number,
      colIndex: number
    }
  },
  edits: [
    startLineNumber: number,
    startColumn: number,
    endLineNumber: number,
    endColumn: number,
    text?: string
  ][],
  decorations: [
    lineId: number,
    decorations: [
      lineNumber: number,
      startColumn: number,
      endColumn: number,
      attrs: TerminalRendererDecorationAttributes
    ][]
  ][]
}

Note that with this kind of design I don't have to synchronise the terminal buffer of the backend with the frontend. I only send "render" instructions to the frontend, which really lowers the memory consumption and overall computation time of the frontend.

I discovered this design has several advantages:

Caveats

Running xterm.js in a node.js process requires some hacks at the moment. There is internal code that relies on the global window and self objects being available at load time - these globals do not exist in a node context. This can be easily worked around via:

global.self = global.window = {};  // stub the browser globals before loading xterm.js
const { Terminal } = require('xterm');
const { Unicode11Addon } = require('xterm-addon-unicode11');
delete global.self;                // remove the stubs again after loading
delete global.window;

Additionally, you cannot initialise the rendering part of xterm.js via terminal.open(element: HTMLElement). Unfortunately some of the mouse and keyboard event handling code is not abstract enough to be used in a non-browser environment, so I had to reimplement those parts myself.

Last but not least, you need a way to track an individual line in the buffer via an ID. This is very important for diff computing, because our Buffer is circular and will remove lines at the top of the viewport once the maximum number of scrollback lines is reached. It's therefore not feasible to use the line index of the buffer for this. I'm hooking into the Buffer.onTrim() event to increment a counter whenever a line at the top of the viewport gets removed. My ID then is this counter plus the line index.
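A minimal sketch of that ID scheme, assuming an onTrim event that reports how many lines were dropped (the exact event shape here is an assumption):

// Assumed: onTrim fires with the number of lines trimmed from the buffer top.
declare const buffer: { onTrim(listener: (amount: number) => void): void };

let trimmedCount = 0;
buffer.onTrim((amount: number) => { trimmedCount += amount; });

// A stable ID for the line currently sitting at `index` in the buffer:
function lineIdOf(index: number): number {
  return trimmedCount + index;
}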

Conclusion

IMHO utilising a WebWorker for the parser may create more overhead than it tries to solve. I think we should rather optimize towards a scenario where the xterm.js core can be run in the same process as the pty (in the node.js backend), and then create an interface that is specifically crafted to update a renderer in the frontend. There also have to be interfaces in the other direction to propagate abstracted mouse and keyboard events from the renderer to the core.

Extra

Here's a capture of running seq 1000000 in my Monaco Editor implementation (left) and VSCode (right) in parallel.

https://user-images.githubusercontent.com/2785983/121707805-420f2400-cad7-11eb-9338-c6d39a32ad64.mp4

jerch commented 3 years ago

@mofux

Isn't

I think we should rather optimize towards a scenario where the xterm.js core can be run in the same process as the pty (in the node.js backend), and then create an interface that is specifically crafted to update a renderer in the frontend.

the same as utilizing a webworker for the core parts (+ IO chunk handling in your case, which only applies to electron/nodejs apps), while you say

IMHO utilising a WebWorker for the parser may create more overhead than it tries to solve.

Can you explain what would be different from a webworker isolation of the parser/VT emulation?

Edit: Maybe my text above was misleading - with parser/VT emulation I mean the whole inner state handling, including the terminal buffer itself (not only the parser).

mofux commented 3 years ago

Yes it's technically the same, except for the data flow.

Using a Web Worker means that all of the PTY stream data (backend) has to get to the render thread of the browser (usually via WebSockets or IPC), then from there it has to get to the WebWorker where the xterm.js core is running (usually via postMessage).

I think it would be smarter if the PTY stream and the xterm.js core were living in the same backend process, where the stream data Buffers could be easily shared (by reference) without getting copied across multiple communication layers (as outlined above).

The idea is very simple: The parser (xterm.js core) must consume the whole PTY stream without gaps to keep the internal Terminal Buffer state consistent. A renderer does not. It can at any point in time simply get a view on that Terminal Buffer state and it can skip frames if the updates come in too often.

Imagine you were implementing a new Canvas renderer for xterm.js. With every render cycle you iterate the Buffer lines that have changed since your last render cycle and create draw instructions from them. Now, instead of actually drawing them to the Canvas, you serialize the instructions as JSON. And imagine this renderer lived in the xterm.js core and ran in the backend.

In the Browser frontend, you then create an implementation that can request a render from your backend, receives the serialized instructions and draws them to the real Canvas.

That is basically what I am proposing, and what I've done with the Monaco Editor renderer. The PTY data stream never has to pass through the Browser render thread - it can stay in the backend with minimal overhead. All that passes to the browser are drawing instructions, whenever desired.
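A rough sketch of that split - all type and function names are hypothetical:

// Backend: turn changed buffer lines into serializable draw instructions.
interface DrawInstruction {
  lineId: number;                                   // stable line ID (see above)
  text: string;                                     // current line content
  decorations: [start: number, end: number, fg: number, bg: number][];
}

function renderToJson(changed: { id: number; text: string }[]): string {
  const instructions: DrawInstruction[] = changed.map(line => ({
    lineId: line.id,
    text: line.text,
    decorations: [],                                // attribute extraction omitted
  }));
  return JSON.stringify(instructions);
}

// Frontend: apply received instructions to the real canvas.
function applyInstructions(json: string, ctx: CanvasRenderingContext2D): void {
  for (const ins of JSON.parse(json) as DrawInstruction[]) {
    // map lineId to a viewport row and draw ins.text with its decorations...
  }
}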

jerch commented 3 years ago

Some greenfield out-of-the-box thinking - ideally things could work like this:

VT emulation worker:

mainthread:

The rendering part needs a bit more thinking, depending on whether we are fully SAB'ed or not. Again thinking it through for an ideal world, we have everything in SABs. Then it is quite easy:

This SAB model has one big advantage - it does not need any active messaging code on both ends; it can be orchestrated solely by some locks on the SAB itself for the rendering. Thus no piling up of messages in queues and no need to drop back to the event loop at all on the parser side. Technically even the locks are not needed (well, we still would want them, to not get totally screwed-up output in between).
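A minimal sketch of such a lock, assuming a spinlock word at the start of the shared buffer (the layout and names are made up):

// Shared layout assumption: 4 bytes lock word, buffer cells behind it.
const BUFFER_BYTES = 1 << 20;                        // hypothetical buffer size
const sab = new SharedArrayBuffer(4 + BUFFER_BYTES);
const lock = new Int32Array(sab, 0, 1);              // 0 = free, 1 = held

function withLock(fn: () => void): void {
  // spin until the lock word flips from 0 (free) to 1 (held)
  while (Atomics.compareExchange(lock, 0, 0, 1) !== 0) { /* spin */ }
  try {
    fn();
  } finally {
    Atomics.store(lock, 0, 0);                       // release
  }
}

// worker side:  withLock(() => writeParsedCells());
// main thread:  withLock(() => readDirtyRowsAndRender());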

Ofc it is not that easy in conjunction with key/mouse event delegation anymore; those prolly need some messaging to the worker again (more tricky to abstract with a SAB, though not impossible), and then the worker would have to drop back to the top-level event loop from time to time again.

Well, that's basically what I think could work and would be the most performant solution (not lock-free, but almost without any data copying between mainthread/worker). As far as I understand Mofux's solution, it uses hardcopy messaging of changed bufferlines, which creates the need to store things on both ends (waste of memory) and is much slower in the transfer. (Note: I don't really care if the parser worker is a real webworker or a nodejs worker thread - as long as the memory sharing still works, it should not make much of a difference.)

Using a Web Worker means that all of the PTY stream data (backend) has to get to the render thread of the browser (usually via WebSockets or IPC)

Why do you think so? Imho websockets are fine within webworkers.

mofux commented 3 years ago

Why do you think so? Imho websockets are fine within webworkers.

It's technically possible, but I think impractical for most application architectures - they likely establish only one single websocket connection for the whole frontend session (dealing with authentication and stuff) and dispatch messages over that single connection. Creating additional websocket connections means that you have to expose additional websocket endpoints on your server, map them to sessions, make sure they are authenticated and what not. It's certainly possible, but impractical IMHO.

Also, when looking into the electron world, I'm not so sure if a WebWorker can create an IPC communication directly with the main process.

That being put aside - my point was to keep the PTY data stream as close to the core as possible, whenever possible 😁 Of course, if the backend is not node.js, then you'll likely run the xterm core in the frontend anyway.

I'd love to carry these discussions forward via a video call. Unfortunately my vacation starts tomorrow and I'll not be back on my PC until mid July 🥺

Tyriar commented 3 years ago

I discovered this design has several advantages:

The biggest issue imo is that other actions, like scrolling, become bound to input latency. Also, a model like this breaks some assumptions and ends up breaking complex integrations like task problem matchers and Live Share in VS Code.

I think it would be smarter if the PTY stream and the xterm.js core were living in the same backend process, where the stream data Buffers could be easily shared (by reference) without getting copied across multiple communication layers (as outlined above).

I'd like to support the ability to do this via the node target of xterm.js https://github.com/xtermjs/xterm.js/issues/2749 and the serialize addon (this could build smarter diffs too). I don't think it should be the default or recommended setup though, because it's much more complicated to set up and comes with its own set of problems for little payoff imo. Having this as an "it's possible to do" thing seems like enough imo. Those caveats look like things that could be improved in the lib so you don't need the hacks.

The idea is very simple: The parser (xterm.js core) must consume the whole PTY stream without gaps to keep the internal Terminal Buffer state consistent. A renderer does not. It can at any point in time simply get a view on that Terminal Buffer state and it can skip frames if the updates come in too often.

Another idea is to allow almost all of xterm.js to live inside the worker, use transferControlToOffscreen, and have the main thread side just act as a passthrough. I don't think this would work with the DOM renderer though (a fallback case). That way we don't need to care about transfer of data, it all lives on the worker. The main downside is the extra hop = increased latency, even if it's just local.
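A minimal sketch of that passthrough wiring (the worker script name, socket URL and message shapes are made up):

// Main thread: hand the canvas over to the worker once.
const canvas = document.querySelector('canvas')!;
const offscreen = canvas.transferControlToOffscreen();
const worker = new Worker('terminal-worker.js');
worker.postMessage({ canvas: offscreen }, [offscreen]);   // transferred, not copied

// From here on the main thread only forwards pty data (and input events):
const socket = new WebSocket('wss://example.org/pty');    // assumed transport
socket.onmessage = e => worker.postMessage({ type: 'data', data: e.data });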

Also, when looking into the electron world, I'm not so sure if a WebWorker can create an IPC communication directly with the main process.

I don't think this is possible, I'd love it if it was though https://github.com/microsoft/vscode/issues/113705#issuecomment-756785659

I'd love to carry these discussions forward via a video call. Unfortunately my vacation starts tomorrow and I'll not be back on my PC until mid July 🥺

We can always set one up 👍, it's just tricky aligning everyone's schedule.

jerch commented 3 years ago

It's technically possible, but I think impractical for most application architectures - they likely establish only one single websocket connection for the whole frontend session (dealing with authentication and stuff) and dispatch messages over that single connection. Creating additional websocket connections means that you have to expose additional websocket endpoints on your server, map them to sessions, make sure they are authenticated and what not. It's certainly possible, but impractical IMHO.

Kinda don't get it - wouldn't that be an argument against your own idea to move the chunk IO closer to ??? - sorry, I currently don't know what ??? is for you, because you say it is not a worker. But workers are the only option for a browser env that don't involve really nasty multi-page/iframe hacks (which is a clear no-go). Furthermore, if peeps don't want fast terminal IO delivery and instead push everything through one single socket in the frontend - well, then they have to dispatch things on their own, including the performance penalty. All we could do about it is provide some alternative convenient way to inject data into the worker, reachable from the mainthread.

I'd love to carry these discussions forward via a video call. Unfortunately my vacation starts tomorrow and I'll not be back on my PC until mid July pleading_face

This topic is not pressing at all, we should take all the time we need to get through the cumbersome details. Enjoy your holidays :smiley:

Tyriar commented 3 years ago

@jerch @mofux sent out an invite

jerch commented 3 years ago

I think it would be smarter if the PTY stream and the xterm.js core were living in the same backend process, where the stream data Buffers could be easily shared (by reference) without getting copied across multiple communication layers (as outlined above).

I'd like to support the ability to do this via the node target of xterm.js #2749 and the serialize addon (this could build smarter diffs too). I don't think it should be the default or recommended setup though, because it's much more complicated to set up and comes with its own set of problems for little payoff imo. Having this as an "it's possible to do" thing seems like enough imo. Those caveats look like things that could be improved in the lib so you don't need the hacks.

Doing the VT emulation in the backend, like on the server for browser-based apps? :scream: Imho that's not a good idea as a general-purpose solution; kinda all web-driven cluster orchestrators with potentially tons of terminals attached would fall short here. Beside the really high costs for the server-side emulation, constantly pushing screen updates over the wire will create much more traffic and show bad latency. With the PTY data stream we kinda already have the best deal in terms of packet rate/latency.

What would be possible: once we get the browser refs completely out of common, we can create several build targets:

This might be important to note: for me electron is the special case here and should not be the driving force behind all that. There are many other frameworks with different semantics, and I see a vanilla browser as the default integration env. You might tend to disagree here - I know that you all are very electron-centric - but imho moving xterm.js from a pure browser frontend target to more electron-specific needs by default would be a big mistake.

jerch commented 3 years ago

(Slightly offtopic and more a reminder for later on ...)

There is another possibility linked to the parser worker isolation - a WASM implementation of the utf32 converter + the parser class (the inner sequence parsing logic). Not sure yet if that is a viable goal, as it raises the maintenance complexity, yet it would be easily possible with a C or Rust replacement, since both classes are very C-ish, operating on fixed borrowed memory.

Some rough estimations of possible gains here: I did a quick test with data.sixel from https://github.com/jerch/sixel-bench, which is a 119 MB file of sixel video frames. This type of data triggers roughly the following call path during parsing:

... --> InputParser.parse --> EscapeSequenceParser.parse --> DCS.hook|put|unhook
               |                                                  |
        Utf8ToUtf32.decode                                   (image decoding in worker)
For this 119 MB pile of data I see the following runtime numbers in Chrome and Firefox (all self time in ms):

call                          Chrome    Firefox
InputParser.parse                 30         47
Utf8ToUtf32.decode               335       1030
EscapeSequenceParser.parse       514       1400
main thread runtime             2500       4500
image decoding (worker)         5200    unknown
total runtime                   6800      10400

Ofc most workload is created by the image decoding on the worker, which is not further considered here. The interesting bits on the main thread are the self times of Utf8ToUtf32.decode and EscapeSequenceParser.parse, which add up to 849 ms on Chrome resp. 2430 ms on Firefox (or, as a share of the main thread script runtime: 34% resp. 54%).

These numbers are surprising for 2 reasons:

And that's where a WASM implementation can help to level the field. With WASM there is no JS JIT involved anymore; the code runs pre-optimized from the very first call. Also those big engine differences should go away (it still depends on the engine's WASM speed, which I don't know for Firefox).

Ofc, whether we can gain anything for these tasks remains uncertain until we actually try. Going back to my Rust/WASM experiments during the buffer reshaping 2 years ago, I remember that the parser logic was ~30% faster compared to JS numbers in nodejs/Chrome. I have no real numbers for Firefox here, but if the promise of WASM to perform fairly equally across engines holds true, that would speed up the inner parsing ~3 times on Firefox.

TL;DR: Utf8ToUtf32.decode and EscapeSequenceParser.parse are runtime-hungry tasks during parsing that prolly can perform better with a WASM implementation. Estimated speedup is ~30% on Chrome and ~3 times on Firefox.

Edit: Indeed, a first dummy test just for Utf8ToUtf32.decode shows roughly the same runtime of only 250 ms for both browsers with Rust/WASM. No clue why Firefox performs this badly in the first place - maybe we use V8-idiomatic JS code, idk. The sad part about WASM: it currently does not support copy-free mapping of arraybuffers from JS to WASM and vice versa, thus the dummy test contains a nonsense initial copy step (see https://github.com/WebAssembly/design/issues/1162). This lowers the general usefulness of WASM for xterm.js a lot, as most tasks are quite data-intensive. The nice gains from optimized logic execution would get eaten again by useless mem copies. What a silly trade.
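For illustration, with emscripten-style glue that copy overhead looks roughly like this (_utf8ToUtf32 is a hypothetical export; Module._malloc / HEAPU8 / HEAPU32 are standard emscripten runtime pieces):

declare const Module: any;                // emscripten runtime glue (assumed)
declare const chunk: Uint8Array;          // one incoming input chunk

// Copy the JS chunk into WASM linear memory - the "nonsense" copy step.
const inPtr = Module._malloc(chunk.length);
const outPtr = Module._malloc(chunk.length * 4);   // utf32 needs 4 bytes per code point
Module.HEAPU8.set(chunk, inPtr);
const decoded = Module._utf8ToUtf32(inPtr, chunk.length, outPtr);
// Reading results back means touching WASM memory again (or copying once more).
const codepoints = Module.HEAPU32.subarray(outPtr >>> 2, (outPtr >>> 2) + decoded);
Module._free(inPtr);
Module._free(outPtr);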

Edit2: Oh well, it seems C/WASM is actually faster than looping in JS for uint8 --> uint32 buffer copies ("byte spreading"). Given:

const src = new Uint8Array(FIXED_SIZE);
const dst = new Uint32Array(FIXED_SIZE);

and trying to pump the 119 MB from above through it, chunkified as 64kB blocks (all Chrome numbers):

That's really surprising - both WASM impls are way faster than "native" JS loops, with emscripten quite ahead. Would not have thought that a loop scope/context is that cheap in WASM (or this expensive in JS). Update: Oh dear - with the -msimd128 switch emscripten runs in ~75 ms. (No clue what changed, the file got bigger, but my inspector refuses to disassemble it with the simd instructions.)
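For reference, the "native" JS baseline being compared is just the plain spreading loop over the buffers declared above:

// Widen each uint8 into a uint32 slot - JS engines don't vectorize this well.
function spread(src: Uint8Array, dst: Uint32Array, length: number): void {
  for (let i = 0; i < length; ++i) {
    dst[i] = src[i];              // implicit zero-extension per element
  }
}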

Summary of the short WASM interlude (won't further hijack the thread): with the UTF8 decoder and the sequence parser we have two candidates in core that would benefit a lot from a wasm rewrite (prolly at least doubling the parsing chain throughput, even on Chrome). While the UTF8 decoder is a no-brainer, the parser itself is much work (I have a C parser flying around, but it does not compile with emscripten due to computed gotos and other incompatible interfaces). A lot of rework, up to the bootstrapping code (wasm modules have to be awaited), plus much harder maintenance afterwards with different languages/buildtools in the codebase - guess it is a NO for now. (Though I might come up with a C/WASM SIXEL decoder drop-in for the image addon. Will see.)

jerch commented 2 years ago

Out of scope for me, thus closing.