Buffer performance improvements

Tyriar commented 7 years ago

Problem

Memory

Right now our buffer is taking up too much memory, particularly for an application that launches multiple terminals with large scrollbacks set. For example, the demo using a 160x24 terminal with 5000 scrollback filled takes around 34mb memory (see https://github.com/Microsoft/vscode/issues/29840#issuecomment-314539964), remember that's just a single terminal and 1080p monitors would likely use wider terminals. Also, in order to support truecolor (https://github.com/sourcelair/xterm.js/issues/484), each character will need to store 2 additional number types which will almost double the current memory consumption of the buffer.

Slow fetching of a row's text

There is the other problem of needing to fetch the actual text of a line swiftly. The reason this is slow is due to the way that the data is laid out; a line contains an array of characters, each having a single character string. So we will construct the string and then it will be up for garbage collection immediately afterwards. Previously we didn't need to do this at all because the text is pulled from the line buffer (in order) and rendered to the DOM. However, this is becoming an increasingly useful thing to do though as we improve xterm.js further, features like the selection and links both pull this data. Again using the 160x24/5000 scrollback example, it takes 30-60ms to copy the entire buffer on a Mid-2014 Macbook Pro.

Supporting the future

Another potential problem in the future is when we look at introducing some view model which may need to duplicate some or all of the data in the buffer, this sort of thing will be needed to implement reflow (https://github.com/sourcelair/xterm.js/issues/622) properly (https://github.com/sourcelair/xterm.js/pull/644#issuecomment-298058556) and maybe also needed to properly support screen readers (https://github.com/sourcelair/xterm.js/issues/731). It would certainly be good to have some wiggle room when it comes to memory.

This discussion started in https://github.com/sourcelair/xterm.js/issues/484, this goes into more detail and proposes some additional solution.

I'm leaning towards solution 3 and moving towards solution 5 if there is time and it shows a marked improvement. Would love any feedback! /cc @jerch, @mofux, @rauchg, @parisk

1. Simple solution

This is basically what we're doing now, just with truecolor fg and bg added.

// [0]: charIndex
// [1]: width
// [2]: attributes
// [3]: truecolor bg
// [4]: truecolor fg
type CharData = [string, number, number, number, number];

type LineData = CharData[];

Pros

Very simple

Cons

Too much memory consumed, would nearly double our current memory usage which is already too high.

2. Pull text out of CharData

This would store the string against the line rather than the line, this would probably see very large gains in selection and linkifying and would be more useful as time goes on having quick access to a line's entire string.

interface ILineData {
  // This would provide fast access to the entire line which is becoming more
  // and more important as time goes on (selection and links need to construct
  // this currently). This would need to reconstruct text whenever charData
  // changes though. We cannot lazily evaluate text due to the chars not being
  // stored in CharData
  text: string;
  charData: CharData[];
}

// [0]: charIndex
// [1]: attributes
// [2]: truecolor bg
// [3]: truecolor fg
type CharData = Int32Array;

Pros

No need to reconstruct the line whenever we need it.
Lower memory than today due to the use of an Int32Array

Cons

Slow to update individual characters, the entire string would need to be regenerated for single character changes.

3. Store attributes in ranges

Pulling the attributes out and associating them with a range. Since there can never be overlapping attributes, this can be laid out sequentially.

type LineData = CharData[]

// [0]: The character
// [1]: The width
type CharData = [string, number];

class CharAttributes {
  public readonly _start: [number, number];
  public readonly _end: [number, number];
  private _data: Int32Array;

  // Getters pull data from _data (woo encapsulation!)
  public get flags(): number;
  public get truecolorBg(): number;
  public get truecolorFg(): number;
}

class Buffer extends CircularList<LineData> {
  // Sorted list since items are almost always pushed to end
  private _attributes: CharAttributes[];

  public getAttributesForRows(start: number, end: number): CharAttributes[] {
    // Binary search _attributes and return all visible CharAttributes to be
    // applied by the renderer
  }
}

Pros

Lower memory than today even though we're also storing truecolor data
Can optimize application of attributes, rather than checking every single character's attribute and diffing it to the one before
Encapsulates the complexity of storing the data inside an array (.flags instead of [0])

Cons

Changing attributes of a range of characters inside another range is more complex

4. Put attributes in a cache

The idea here is to leverage the fact that there generally aren't that many styles in any one terminal session, so we should not create as few as necessary and reuse them.

// [0]: charIndex
// [1]: width
type CharData = [string, number, CharAttributes];

type LineData = CharData[];

class CharAttributes {
  private _data: Int32Array;

  // Getters pull data from _data (woo encapsulation!)
  public get flags(): number;
  public get truecolorBg(): number;
  public get truecolorFg(): number;
}

interface ICharAttributeCache {
  // Never construct duplicate CharAttributes, figuring how the best way to
  // access both in the best and worst case is the tricky part here
  getAttributes(flags: number, fg: number, bg: number): CharAttributes;
}

Pros

Similar memory usage to today even though we're also storing truecolor data
Encapsulates the complexity of storing the data inside an array (.flags instead of [0])

Cons

Less memory savings than the ranges approach

5. Hybrid of 3 & 4

type LineData = CharData[]

// [0]: The character
// [1]: The width
type CharData = [string, number];

class CharAttributes {
  private _data: Int32Array;

  // Getters pull data from _data (woo encapsulation!)
  public get flags(): number;
  public get truecolorBg(): number;
  public get truecolorFg(): number;
}

interface CharAttributeEntry {
  attributes: CharAttributes,
  start: [number, number],
  end: [number, number]
}

class Buffer extends CircularList<LineData> {
  // Sorted list since items are almost always pushed to end
  private _attributes: CharAttributeEntry[];
  private _attributeCache: ICharAttributeCache;

  public getAttributesForRows(start: number, end: number): CharAttributeEntry[] {
    // Binary search _attributes and return all visible CharAttributeEntry's to
    // be applied by the renderer
  }
}

interface ICharAttributeCache {
  // Never construct duplicate CharAttributes, figuring how the best way to
  // access both in the best and worst case is the tricky part here
  getAttributes(flags: number, fg: number, bg: number): CharAttributes;
}

Pros

Protentially the fastest and most memory efficient
Very memory efficient when the buffer contains many blocks with styles but only from a few styles (the common case)
Encapsulates the complexity of storing the data inside an array (.flags instead of [0])

Cons

More complex than the other solutions, it may not be worth including the cache if we already keep a single CharAttributes per block?
Extra overhead in CharAttributeEntry object
Changing attributes of a range of characters inside another range is more complex

6. Hybrid of 2 & 3

This takes the solution of 3 but also adds in a lazily evaluates text string for fast access to the line text. Since we're also storing the characters in CharData we can lazily evaluate it.

type LineData = {
  text: string,
  CharData[]
}

// [0]: The character
// [1]: The width
type CharData = [string, number];

class CharAttributes {
  public readonly _start: [number, number];
  public readonly _end: [number, number];
  private _data: Int32Array;

  // Getters pull data from _data (woo encapsulation!)
  public get flags(): number;
  public get truecolorBg(): number;
  public get truecolorFg(): number;
}

class Buffer extends CircularList<LineData> {
  // Sorted list since items are almost always pushed to end
  private _attributes: CharAttributes[];

  public getAttributesForRows(start: number, end: number): CharAttributes[] {
    // Binary search _attributes and return all visible CharAttributes to be
    // applied by the renderer
  }

  // If we construct the line, hang onto it
  public getLineText(line: number): string;
}

Pros

Lower memory than today even though we're also storing truecolor data
Can optimize application of attributes, rather than checking every single character's attribute and diffing it to the one before
Encapsulates the complexity of storing the data inside an array (.flags instead of [0])
Faster access to the actual line string

Cons

Extra memory due to hanging onto line strings
Changing attributes of a range of characters inside another range is more complex

Solutions that won't work

Storing the string as an int inside an Int32Array will not work as it takes far to long to convert the int back to a character.

mofux commented 7 years ago

Another approach that could be mixed in: use indexeddb, websql or filesystem api to page out inactive scrollback entries to disk 🤔

parisk commented 7 years ago

Great proposal. I agree with that 3. is the best way to go for now as it lets us save memory while supporting true color as well.

If we reach there and things continue to go well, we can then optimize as proposed in 5. or in any other way that comes in our minds at that time and makes sense.

3. is great 👍.

parisk commented 7 years ago

@mofux, while there is definitely a use case for using disk-storage-backed techniques to reduce the memory footprint, this can degrade the user experience of the library in browser environments that ask the user for permission for using disk storage.

mofux commented 7 years ago

Regarding Supporting the future: The more I think about it, the more the idea of having a WebWorker that does all the heavy work of parsing the tty data, maintaining the line buffers, matching links, matching search tokens and such appeals to me. Basically doing the heavy work in a separate background thread without blocking the ui. But I think this should be part of a separate discussion maybe towards a 4.0 release 😉

AndrienkoAleksandr commented 7 years ago

+100 about WebWorker in the future, but I think we need change list versions of browsers which we are supporting because not all off them can use it...

Tyriar commented 7 years ago

When I say Int32Array, this will be a regular array if it is not supported by the environment.

@mofux good thinking with WebWorker in the future 👍

@AndrienkoAleksandr yeah if we wanted to use WebWorker we would need to still support the alternative as well via feature detection.

jerch commented 7 years ago

Wow nice list :)

I also tend to lean towards 3. since it promises a big cut in memory consumption for over 90% of the typical terminal usage. Imho memory optimisation should be the main goal at this stage. Further optimization for specific use cases might be applicable on top of this (what comes to my mind: "canvas like apps" like ncurses and such will use tons of single cell updates and kinda degrade the [start, end] list over time).

@AndrienkoAleksandr yeah I like the webworker idea too since it could lift some burden from the mainthread. Problem here is (beside the fact that it might not be supported by all wanted target systems) is the some - the JS part is not such a big deal anymore with all the optimizations, xterm.js has seen over the time. Real issue performance-wise is the layouting/rendering of the browser...

@mofux The paging thing to some "foreign memory" is a good idea, though it should be part of some higher abstraction and not of the "gimme an interactive terminal widget thing" that xterm.js is. This could be achieved by an addon imho.

Offtopic: Did some tests with arrays vs. typedarrays vs. asm.js. All I can say - OMG, it is like 1 : 1,5 : 10 for simple variable loads and sets (on FF even more). If pure JS speed really starts to hurt, "use asm" might be there for the rescue. But I would see this as a last resort since it would imply fundamental changes. And webassembly is not yet ready to ship.

Tyriar commented 7 years ago

Offtopic: Did some tests with arrays vs. typedarrays vs. asm.js. All I can say - OMG, it is like 1 : 1,5 : 10 for simple variable loads and sets (on FF even more)

@jerch to clarify, is that arrays vs typedarrays is 1:1 to 1:5?

jerch commented 7 years ago

Woops nice catch with the comma - i meant 10:15:100 speed wise. But only on FF typed arrays were slightly faster than normal arrays. asm is at least 10 times faster than js arrays on all browsers - tested with FF, webkit (Safari), blink/V8 (Chrome, Opera).

Tyriar commented 7 years ago

@jerch cool, a 50% speed up from typedarrays in addition to better memory would definitely be worth investing in for now.

jerch commented 7 years ago

Idea for memory saving - maybe we could get rid of the width for every character. Gonna try to implement a less expensive wcwidth version.

Tyriar commented 7 years ago

@jerch we need to access it quite a bit, and we can't lazy load it or anything because when reflow comes we will need the width of every character in the buffer. Even if it was fast we might still want to keep it around.

Might be better to make it optional, assuming 1 if it's not specified:

type CharData = [string, number?]; // not sure if this is valid syntax

[
  // 'a'
  ['a'],
  // '文'
  ['文', 2],
  // after wide
  ['', 0],
  ...
]

jerch commented 7 years ago

@Tyriar Yeah - well since Ive already written it, please have a look at it in PR #798 Speedup is 10 to 15 times on my computer for the cost of 16k bytes for the lookup table. Maybe a combination of both is possible if still needed.

Tyriar commented 7 years ago

Some more flags we'll support in the future: https://github.com/sourcelair/xterm.js/issues/580

Tyriar commented 7 years ago

Another thought: Only the bottom portion of the terminal (Terminal.ybase to Terminal.ybase + Terminal.rows) is dynamic. The scrollback which makes up the bulk of the data is completely static, perhaps we can leverage this. I didn't know this until recently, but even things like delete lines (DL, CSI Ps M) don't bring the scrollback back down but rather insert another line. Similarly, scroll up (SU, CSI Ps S) deletes the item at Terminal.scrollTop and inserts an item at Terminal.scrollBottom.

Managing the bottom dynamic portion of the terminal independently and pushing to scrollback when the line is pushed out could lead to some significant gains. For example, the bottom portion could be more verbose to favor modifying of attributes, faster access, etc. whereas the scrollback can be more of an archival format as proposed in the above.

Tyriar commented 7 years ago

Another thought: it's probably a better idea to restrict CharAttributeEntry to rows as that's how most applications seem to work. Also if the terminal is resized then "blank" padding is added to the right which doesn't share the same styles.

eg:

To the right of the red/green diffs are unstyled "blank" cells.

jerch commented 6 years ago

@Tyriar Any chance to get this issue back on the agenda? At least for output intensive programs a different way of holding the terminal data might save a lot of memory and time. Some hybrid of 2/3/4 will give a huge throughput boost, if we could avoid splitting and saving single chars of the input string. Also saving the attributes only once they changed will help to save memory.

Example: With the new parser we could save a bunch of input characters without messing around with the attributes, since we know that they will not change in the middle of that string. The attributes of that string could be saved in some other data structure or attribute along with wcwidths (yeah we still need those to find they line break) and line breaks and stops. This would basically give up the cell model when data is incoming. Problem arises if something steps in and wants to have a drill-down representation of the terminal data (e.g. the renderer or some escape sequence/user wants to move the cursor). We still have to do the cell calculations if that happens, but it should be sufficient to do this only for the content within the terminal cols and rows. (Not sure yet about the scrolled out content, that might be even further cachable and cheap to redraw.)

Tyriar commented 6 years ago

@jerch I'll be meeting @mofux one day in Prague in a couple of weeks and we were going to do/start some internal improvements of how text attributes are handled which covers this 😃

Tyriar commented 6 years ago

From https://github.com/xtermjs/xterm.js/pull/1460#issuecomment-390500944

The algo is kinda expensive since every char needs to be evaluate twice

@jerch if you have any ideas on faster access of text from the buffer, let us know. Currently most of it is just a single character as you know, but it could be an ArrayBuffer, a string, etc. I've been thinking we should think about taking more advantage of scrollback being immutable somehow.

jerch commented 6 years ago

Well I have experimented alot with ArrayBuffers in the past:

they are slightly worse than Array regarding runtime for the typical methods (maybe still less optimized by engine vendors)
new UintXXArray is far worse than literal array creation with []
they pay off several times if you can prealloc and reuse the data structure (up to 10 times), this is where the linked list nature of mixed arrays eats the performance due to heavy alloc and gc behind the scenes
for string data the forth and back conversion eats all benefits - a pity JS does not provide native string to Uint16Array converters (partly doable with TextEncoder though)

My findings about ArrayBuffer suggest not to use them for string data due to the conversion penalty. In theory the terminal could use ArrayBuffer from node-pty up to the terminal data (this would save several conversions on the way to the frontend), not sure if the rendering can be done that way, I think to render stuff it always needs a final uint16_t to string conversion. But even that one last string creation will eat most of the runtime saved - and furthermore, would turn the terminal internally into an ugly C-ish beast. Therefore I gave up this approach.

TL;DR ArrayBuffer is superior if you can prealloc and reuse the data structure. For everything else normal arrays are better. Strings are not worth to be squeezed into ArrayBuffers.

A new idea I came up with tries to lower string creation as much as possible, esp. tries to avoid the nasty splits and joins. It is kinda based on your 2nd idea above with the new InputHandler.print method, wcwidth and line stops in mind:

print now gets whole strings up to several terminal lines
save those strings in a simple pointer list without any alteration (no string alloc or gc, list alloc can be avoided if used with a prealloc'd structure) along with current attributes
advance cursor by wcwidth(string) % cols
special case \n (hard line break): advance cursor by one line, mark position in pointer list as hard break
special case line overflow with wrapAround: mark position in string as soft line break
special case \r: load last line content (from current cursor position to last line break) into some line buffer to get overwritten
data flows like above, despite the \r case no cell abstraction nor string splitting is needed
attribute changes are no problem, as long as noone requests the real cols x rows representation (they just change the attr flag that gets saved along with the whole string)

Btw the wcwidths are a subset of the grapheme algo, so this might be interchangeable in the future.

Now the hazardous part 1 - someone wants to move the cursor within the cols x rows:

move cols backwards in line breaks - the beginning of the current terminal content
every line break denotes a real terminal line
load stuff into cell model for just one page (not sure yet if this can also be omitted with clever string positioning)
do the nasty work: if attribute changes are requested we are kinda out of luck and have to fall back to either the full cell model or a string split & insert model (the latter might introduce bad performance)
data flows again, now with degraded strings & attrs data in the buffer for that page

Now the hazardous part 2 - the renderer wants to draw something:

kinda depends on the renderer if we need to drilldown to a cell model or can just provide the string offsets with line breaks and text attrs

Pros:

very fast data flow
optimized for the most common InputHandler method - print
makes reflowing lines upon terminal resize possible

Cons:

almost every other InputHandler method will be hazardous in the sense of interrupting this flow model and the need of some intermediate cell abstraction
renderer integration unclear (to me at least atm)
might degrade performance for curses like apps (they typically contain more "hazardous" sequences)

Well this is a rough draft of the idea, far from being useable atm since many details are not covered yet. Esp. the "hazardous" parts could get nasty with many performance problems (such as degrading the buffer with even worse gc behavior etc pp)

Tyriar commented 6 years ago

@jerch

Strings are not worth to be squeezed into ArrayBuffers.

I think Monaco stores its buffer in ArrayBuffers and is pretty high performance. I haven't looked too deeply into the implementation yet.

esp. tries to avoid the nasty splits and joins

Which ones?

I've been thinking we should think about taking more advantage of scrollback being immutable somehow.

One idea was to separate scrollback from the viewport section. Once a line goes to scrollback it gets pushed into the scrollback data structure. You could imagine 2 CircularList objects, one whose lines are optimized for never changing, one for the opposite.

jerch commented 6 years ago

@Tyriar About the scrollback - yes since this is never reachable by the cursor it might save some memory to just drop the cell abstraction for scrolled out lines.

jerch commented 6 years ago

@Tyriar It makes sense to store strings in ArrayBuffer if we can limit the conversion down to one (maybe the final one for render output). That is slightly better than string handling all over the place. This would be doable since node-pty can give raw data as well (and also the websocket can give us raw data).

jerch commented 6 years ago

esp. tries to avoid the nasty splits and joins

Which ones?

The whole approach is ~~to avoid~~ minimize splits ~~at all~~. If noone requests cursor jumps into the buffered data, strings would never be split and could go right to the renderer (if supported). No cell splits and later joins at all.

Tyriar commented 6 years ago

@jerch well it can if the viewport is expanded, I think we may also pull in the scrollback when a line is deleted? Not 100% sure on that or even if it's correct behavior.

jerch commented 6 years ago

@Tyriar Ah right. Not sure about the latter either, I think native xterm allows this only for real mouse or scrollbar scrolling. Even SD/SU does not move scrollbuffer content back into the "active" terminal viewport.

Could you point me to the source of the monaco editor, where the ArrayBuffer is used? Seems I cant find it myself :blush:

jerch commented 6 years ago

Hmm just reread the TextEncoder/Decoder spec, with ArrayBuffers from node-pty up to the frontend we are basically stuck with utf-8, unless we translate it the hard way at some point. Making xterm.js utf-8 aware? Idk, this would involve many intermediate codepoint calculations for the higher unicode chars. Proside - it would save memory for ascii chars.

Tyriar commented 6 years ago

@rebornix could you give us some pointers to where monaco stores the buffer?

jerch commented 6 years ago

Here are some numbers for typed arrays and the new parser (was easier to adopt):

UTF-8 (Uint8Array): print action jumps from 190 MB/s to 290 MB/s
UTF-16 (Uint16Array): print action jumps from 190 MB/s to 320 MB/s

Overall UTF-16 performs much better, but that was expected since the parser is optimized for that. UTF-8 suffers from the intermediate codepoint calculation.

The string to typed array conversion eats ~4% JS runtime of my benchmark ls -lR /usr/lib (always far below 100 ms, done via a loop in InputHandler.parse). I did not test the reverse conversion (this is implicitly done atm in InputHandller.print on the cell by cell level). The overall runtime is slightly worse than with strings (the saved time in the parser does not compensate the conversion time). This might change when other parts are also typed array aware.

jerch commented 6 years ago

And the corresponding screenshots (tested with ls -lR /usr/lib):

with strings: grafik

with Uint16Array: grafik

Note difference for EscapeSequenceParser.parse, which can profit from a typed array (~30% faster). The InputHandler.parse does the conversion, thus it is worse for the typed array version. Also GC Minor has more to do for typed array (since I throw the array away).

Edit: Another aspect can be seen in the screenshots - the GC becomes relevant with ~20% runtime, the long running frames (red flagged) are all GC related.

jerch commented 6 years ago

Just another somewhat radical idea:

Create own arraybuffer based virtual memory, something big (>5 MB) If the arraybuffer has a length of multiples of 4 transparent switches from int8 to int16 to int32 types are possible. The allocator returns a free index on the Uint8Array, this pointer can be converted to a Uint16Array or Uint32Array position by a simple bit shift.
Write incoming strings into the memory as uint16_t type for UTF-16.
Parser runs on the string pointers and calls methods in InputHandler with pointers to this memory instead of string slices.

Create the terminal data buffer inside the virtual memory as a ring buffer array of a struct like type instead of native JS objects, maybe like this (still cell based):

struct Cell {
uint32_t *char_start;  // start pointer of cell content (JS with pointers hurray!)
uint8_t length;        // length of content (8 bit here is sufficient)
uint32_t attr;         // text attributes (might grow to hold true color someday)
uint8_t width;         // wcwidth (maybe merge with other member, always < 4)
.....                  // some other cell based stuff
}

Pros:

omits JS objects and thus GC where possible (only few local objects will remain)
only one initial data copy into the virtual memory needed
almost no malloc and free costs (depends on cleverness of the allocator/deallocator)
will save alot of memory (avoids the JS objects memory overhead)

Cons:

Welcome to the Cavascript Horror Show :scream:
hard to implement, changes kinda everything
speed benefit is unclear until really implemented

:smile:

Tyriar commented 6 years ago

~~hard~~fun to implement, changes kinda everything 😉

This is closer to how Monaco works, I remembered this blog post which discusses the strategy for storing character metadata https://code.visualstudio.com/blogs/2017/02/08/syntax-highlighting-optimizations

jerch commented 6 years ago

Yup thats basically the same idea.

rebornix commented 6 years ago

Hope my answer to where monaco stores the buffer is not too late.

Alex and I are in favor of Array Buffer and most of the time it gives us good performance. Some places we use ArrayBuffer:

Line break offsets in Text Buffer https://github.com/Microsoft/vscode/blob/rebornix/review/src/vs/editor/common/model/pieceTreeTextBuffer/pieceTreeBase.ts#L31 .
Offset <--> Position mapping https://github.com/Microsoft/vscode/blob/rebornix/review/src/vs/editor/common/viewModel/prefixSumComputer.ts#L26
Tokens https://github.com/Microsoft/vscode/blob/rebornix/review/src/vs/editor/common/model/textModelTokens.ts#L1 You may be interested in this one. We tokenize the source code in another thread and store them in Array Buffer, and then we use the same array buffer as the backing store of the View Model. I'm even thinking about communicating with the tokenizer (node-oniguruma) in a web worker and then the UI code doesn't need to worry about responsiveness and the tokens can be shared to the main thread by SharedArrayBuffer

We use simple strings for text buffer instead of Array Buffer as V8 string is easier to manipulate

We do the encoding/decoding at the very beginning of loading a file, so files are converted JS string. V8 decides whether to use one byte or two to store a character.
We do edits on the text buffer very often, strings are easier to handle.
We are using nodejs native module and have access to V8 internals when necessary.

Tyriar commented 6 years ago

Pinned @rebornix's links so they don't break when commits are added 😃

jerch commented 6 years ago

The following list is just a quick summary of interesting concepts I stumbled over that might help to lower memory usage and/or runtime:

FlatJS (https://github.com/lars-t-hansen/flatjs) - meta language to help coding with arraybuffer based heaps
http://2ality.com/2017/01/shared-array-buffer.html (announced as part of ES2017, future might be uncertain due to Spectre, beside that very promising idea with real concurrency and real atomics)
webassembly/asm.js (current state? usable yet? Have not followed its development for some time, used emscripten to asm.js years ago with a C lib for a game AI with impressive results though)
https://github.com/AssemblyScript/assemblyscript

jerch commented 6 years ago

To get sumthing rolling here, here is a quick hack how we could "merge" text attributes.

The code is mainly driven by the idea to save memory for the buffer data (runtime will suffer, not tested yet how much). Esp. the text attributes with RGB for foreground and background (once supported) will make xterm.js eat tons of memory by the current cell by cell layout. The code tries to circumvent this by the usage of a resizable ref-counting atlas for attributes. This is imho an option since a single terminal will hardly hold more than 1M cells, which would grow the atlas to 1M * entry_size if all cells differ.

The cell itself just needs to hold the index of the attribute atlas. On cell changes the old index needs to be unref'd and the new one ref'd. The atlas index would replace the attribute attribute of the terminal object and will be changed itself in SGR.

The atlas currently only addresses text attributes, but could be extended to all cell attributes if needed. While the current terminal buffer holds 2 32bit numbers for attribute data (4 with RGB in the current buffer design) the atlas would reduce it to only one 32bit number. The atlas entries can be packed further too.

interface TextAttributes {
    flags: number;
    foreground: number;
    background: number;
}

const enum AtlasEntry {
    FLAGS = 1,
    FOREGROUND = 2,
    BACKGROUND = 3
}

class TextAttributeAtlas {
    /** data storage */
    private data: Uint32Array;
    /** flag lookup tree, not happy with that yet */
    private flagTree: any = {};
    /** holds freed slots */
    private freedSlots: number[] = [];
    /** tracks biggest idx to shortcut new slot assignment */
    private biggestIdx: number = 0;
    constructor(size: number) {
        this.data = new Uint32Array(size * 4);
    }
    private setData(idx: number, attributes: TextAttributes): void {
        this.data[idx] = 0;
        this.data[idx + AtlasEntry.FLAGS] = attributes.flags;
        this.data[idx + AtlasEntry.FOREGROUND] = attributes.foreground;
        this.data[idx + AtlasEntry.BACKGROUND] = attributes.background;
        if (!this.flagTree[attributes.flags])
            this.flagTree[attributes.flags] = [];
        if (this.flagTree[attributes.flags].indexOf(idx) === -1)
            this.flagTree[attributes.flags].push(idx);
    }

    /**
     * convenient method to inspect attributes at slot `idx`.
     * For better performance atlas idx and AtlasEntry
     * should be used directly to avoid number conversions.
     * @param {number} idx
     * @return {TextAttributes}
     */
    getAttributes(idx: number): TextAttributes {
        return {
            flags: this.data[idx + AtlasEntry.FLAGS],
            foreground: this.data[idx + AtlasEntry.FOREGROUND],
            background: this.data[idx + AtlasEntry.BACKGROUND]
        };
    }

    /**
     * Returns a slot index in the atlas for the given text attributes.
     * To be called upon attributes changes, e.g. by SGR.
     * NOTE: The ref counter is set to 0 for a new slot index, thus
     * values will get overwritten if not referenced in between.
     * @param {TextAttributes} attributes
     * @return {number}
     */
    getSlot(attributes: TextAttributes): number {
        // find matching attributes slot
        const sameFlag = this.flagTree[attributes.flags];
        if (sameFlag) {
            for (let i = 0; i < sameFlag.length; ++i) {
                let idx = sameFlag[i];
                if (this.data[idx + AtlasEntry.FOREGROUND] === attributes.foreground
                    && this.data[idx + AtlasEntry.BACKGROUND] === attributes.background) {
                    return idx;
                }
            }
        }
        // try to insert into a previously freed slot
        const freed = this.freedSlots.pop();
        if (freed) {
            this.setData(freed, attributes);
            return freed;
        }
        // else assign new slot
        for (let i = this.biggestIdx; i < this.data.length; i += 4) {
            if (!this.data[i]) {
                this.setData(i, attributes);
                if (i > this.biggestIdx)
                    this.biggestIdx = i;
                return i;
            }
        }
        // could not find a valid slot --> resize storage
        const data = new Uint32Array(this.data.length * 2);
        for (let i = 0; i < this.data.length; ++i)
            data[i] = this.data[i];
        const idx = this.data.length;
        this.data = data;
        this.setData(idx, attributes);
        return idx;
    }

    /**
     * Increment ref counter.
     * To be called for every terminal cell, that holds `idx` as text attributes.
     * @param {number} idx
     */
    ref(idx: number): void {
        this.data[idx]++;
    }

    /**
     * Decrement ref counter. Once dropped to 0 the slot will be reused.
     * To be called for every cell that gets removed or reused with another value.
     * @param {number} idx
     */
    unref(idx: number): void {
        this.data[idx]--;
        if (!this.data[idx]) {
            let treePart = this.flagTree[this.data[idx + AtlasEntry.FLAGS]];
            treePart.splice(treePart.indexOf(this.data[idx]), 1);
        }
    }
}

let atlas = new TextAttributeAtlas(2);
let a1 = atlas.getSlot({flags: 12, foreground: 13, background: 14});
atlas.ref(a1);
// atlas.unref(a1);
let a2 = atlas.getSlot({flags: 12, foreground: 13, background: 15});
atlas.ref(a2);
let a3 = atlas.getSlot({flags: 13, foreground: 13, background: 16});
atlas.ref(a3);
let a4 = atlas.getSlot({flags: 13, foreground: 13, background: 16});
console.log(atlas);
console.log(a1, a2, a3, a4);
console.log('a1', atlas.getAttributes(a1));
console.log('a2', atlas.getAttributes(a2));
console.log('a3', atlas.getAttributes(a3));
console.log('a4', atlas.getAttributes(a4));

Edit: The runtime penalty is almost zero, for my benchmark with ls -lR /usr/lib it adds less than 1 msec to the total runtime of ~2.3 s. Interesting side note - the command sets less than 64 different text attribute slots for the output of 5 MB of data and will save more than 20 MB once fully implemented.

jerch commented 6 years ago

Made some prototype PRs to test some changes to the buffer (see https://github.com/xtermjs/xterm.js/pull/1528#issue-196949371 for the general idea behind the changes):

PR #1528 : attribute atlas
PR #1529 : remove wcwidth and charCode from buffer
PR #1530 : replace string in buffer by codepoints / cell storage index value

Tyriar commented 6 years ago

@jerch it might be a good idea to stay away from the word atlas for this so that "atlas" always means "texture atlas". Something like store or cache would probably be better?

jerch commented 6 years ago

oh ok, "cache" is fine.

jerch commented 6 years ago

Guess I am done with the testbed PRs. Please also have a look at the PR comments to get the background of the following rough summary.

Proposal:

Build an AttributeCache to hold everything needed to style a single terminal cell. See #1528 for an early ref counting version that can also hold true color specs. The cache can also be shared between different terminal instances if needed to saved further memory in multiple terminal apps.
Build a StringStorage to hold short terminal content data strings. The version in #1530 even avoids storing single char strings by "overloading" the pointer meaning. wcwidth should be moved here.
Shrink the current CharData from [number, string, number, number] to [number, number], where the numbers are pointers (index numbers) to:
- AttributeCache entry
- StringStorage entry

The attributes are unlikely to change alot, so a single 32bit number will save much memory over time. The StringStorage pointer is a real unicode codepoint for single chars, therefore can be used as the code entry of CharData. The actual string can be accessed by StringStorage.getString(idx). The fourth field wcwidth of CharData could be accessed by StringStorage.wcwidth(idx) (not yet implemented). There is almost no runtime penalty from getting rid of code and wcwidth in CharData (tested in #1529).

Move the shrunk CharData into a dense Int32Array based buffer implementation. Also tested in #1530 with a stub class (far from fully functional), the final benefits are likely to be:
- 80% smaller memory footprint of the terminal buffer (from 5.5 MB to 0.75 MB)
- ~~slightly faster (not testable yet, I expect to gain 20% - 30% speed)~~
- Edit: much faster - script runtime for ls -lR /usr/lib dropped to 1.3s (master is at 2.1s) while the old buffer is still active for cursor handling, once removed I expect the runtime to drop below 1s

Downside - step 4 is quite alot work since it will need some rework on the buffer interface. But hey - for saving 80% of the RAM and still gaining runtime performance it is no biggy, is it? :smile:

jerch commented 6 years ago

There is another issue I stumbled over - the current empty cell representation. Imho a cell can have 3 states:

empty: initial cell state, nothing got written to it yet or the content was deleted. It has a width of 1 but no content. Currently used in blankLine and eraseChar, but with a space as content.
null: cell after a full width char to indicate, that it has no width for visual representation.
normal: cell holds some content and has a visual width (1 or 2, maybe bigger once we support real grapheme/bidi stuff, not sure about that yet lol)

Problem I see here is that an empty cell is not distinguishable from a normal cell with a space inserted, both look the same at buffer level (same content, same width). I did not write any of the renderer/output code but I expect this to lead to awkward situations at the output front. Esp. the handling of the right end of a line might get cumbersome. A terminal with 15 cols, first some string output, that got wrapped around:

1: 'H', 'e', 'l', 'l', 'o', ' ', 't', 'e', 'r', 'm', 'i', 'n', 'a', 'l', ' '
2: 'w', 'o', 'r', 'l', 'd', '!', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '

versus a folder listing with ls:

1: 'R', 'e', 'a', 'd', 'm', 'e', '.', 'm', 'd', ' ', ' ', ' ', ' ', ' ', ' '
2: 'f', 'i', 'l', 'e', 'A', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '

The first example contains a real space after the word 'terminal', the second example never touched the cells after 'Readme.md'. The way it is represented at buffer level makes perfect sense for the standard case to print the stuff as terminal output to the screen (the room needs to be taken anyways), but for tools that try to deal with the content strings like a mouse selection or a reflow manager it is not clear anymore, where the spaces come from.

More or less this leads to the next question - how to determine the actual content length in a line (amount of cells that contain something from left side)? A simple approach would count the empty cells from the right side, again the double meaning from above makes this hard to determine.

Proposal: Imho this is easy fixable by using some other placeholder for empty cells, e.g. a control char or the empty string and replace those in the render process if needed. Maybe the screen renderer can also benefit from this, since it might not have to handle those cells at all (depends on the way the output is generated).

Btw, for the wrapped string above this also leads to the isWrapped problem, that is essential for a reflowing resize or a correct copy&paste selection handling. Imho we cannot remove that, but need to integrate it better than it is atm.

Tyriar commented 6 years ago

@jerch impressive work! :smiley:

1 Build an AttributeCache to hold everything needed to style a single terminal cell. See #1528 for an early ref counting version that can also hold true color specs. The cache can also be shared between different terminal instances if needed to saved further memory in multiple terminal apps.

Made some comments on #1528.

2 Build a StringStorage to hold short terminal content data strings. The version in #1530 even avoids storing single char strings by "overloading" the pointer meaning. wcwidth should be moved here.

Made some comments on #1530.

4 Move the shrunk CharData into a dense Int32Array based buffer implementation. Also tested in #1530 with a stub class (far from fully functional), the final benefits are likely to be:

Not totally sold on this idea yet, I think it will bite us hard when we implement reflow. It does look like each of these steps can pretty much be done in order so we can see how things go and see if it makes sense to do this once we get 3 done.

There is another issue I stumbled over - the current empty cell representation. Imho a cell can have 3 states

Here's an example of a bug that came out of this https://github.com/xtermjs/xterm.js/issues/1286, :+1: to differentiating whitespace cells and "empty" cells

Btw, for the wrapped string above this also leads to the isWrapped problem, that is essential for a reflowing resize or a correct copy&paste selection handling. Imho we cannot remove that, but need to integrate it better than it is atm.

I see isWrapped going away when we tackle https://github.com/xtermjs/xterm.js/issues/622 as CircularList will only contain unwrapped rows.

jerch commented 6 years ago

Not totally sold on this idea yet, I think it will bite us hard when we implement reflow. It does look like each of these steps can pretty much be done in order so we can see how things go and see if it makes sense to do this once we get 3 done.

Yes I am with you (still fun to play around with this totally different approach). 1 and 2 could be cherrypicked, 3 can be applied depending on 1 or 2. 4 is optional, we could just stick to the current buffer layout. The memory savings are like this:

1 + 2 + 3 in CircularList: saves 50% (~2.8MB of ~5.5 MB)
1 + 2 + 3 + 4 halfway - just put the the line data in a typed array but stick to the row index access: saves 82% (~0.9MB)
1 + 2 + 3 + 4 fully dense array with pointer arithmetics: saves 87% (~0.7MB)

1. is very easy to implement, the memory behavior with bigger scrollBack will still show the bad scaling as shown here https://github.com/xtermjs/xterm.js/pull/1530#issuecomment-403542479 but on a less toxic level 2. Slightly harder to implement (some more indirections at line level needed), but will make it possible to keep the higher API of Buffer intact. Imho the option to go for - big mem save and still easy to integrate. 3. 5% more mem save than option 2, hard to implement, will change touch all API and thus literally the whole code base. Imho more of academical interest or for boring rainy days to be implemented lol.

jerch commented 6 years ago

@Tyriar I did some further tests with rust for webassembly usage and rewrote the parser. Note that my rust skills are abit "rusty" since I did not get yet deeper into it, therefore the following might be the result of weak rust code. Results:

The data handling within the wasm part is slightly faster (5 - 10%).
Calls from JS into wasm create some overhead and eats all the benefits from above. In fact it was ~20% slower.
The "binary" will be smaller than the JS counterpart (not really measured since I did not implement all stuff).
To get JS <--> wasm transition easily done some bloatcode is needed to handle the JS types (only did the string translation).
We cannot avoid JS to wasm translation since the browser DOM and events are not accessible there. It could only be used for core parts, which are not that performance critical anymore (beside the mem consumption).

Unless we want to rewrite the whole core libs in rust (or any other wasm capable language) we cannot gain anything from moving to a wasm lang imho. A plus of nowadays wasm langs is the fact that most support explicit memory handling (could help us with the buffer problem), downsides are the introduction of a totally different language into a primarily TS/JS focused project (a high barrier for code additions) and the translation costs between wasm and JS land.

TL;DR xterm.js is to broadly into general JS stuff like DOM and events to gain anything from webassembly even for a rewrite of the core parts.

Tyriar commented 6 years ago

@jerch nice investigation :smiley:

Calls from JS into wasm create some overhead and eats all the benefits from above. In fact it was ~20% slower.

This was [the major problem for monaco going native](this https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation#_why-not-native) as well which has primarily informed my stance on (though that was with native node module, not wasm). I believe working with ArrayBuffers where ever possible should give us the best balance between perf and simplicity (easy of impl, barrier to entry).

jerch commented 6 years ago

@Tyriar Gonna try to come up with an AttributeStorage to hold the RGB data. Not sure yet about the BST, for the typical use case with only a few color settings in a terminal session this will be worse in runtime, maybe this should be a runtime drop-in once the colors surpass a given theshold. Also memory consumption will raise alot again, though it will still save memory since the attributes are stored only once and not along with every single cell (worst case scenario with every cell holding different attributes will suffer though). Do you know why the current fg and bg 256 colors value is 9 bit based instead of 8 bits? What is the additional bit used for? Here: https://github.com/xtermjs/xterm.js/blob/6691f809069a549b4808cd2e055398d2da15db37/src/InputHandler.ts#L1596 Could you give me the current bit layout of attr? I think a similar approach like the "double meaning" for the StringStorage pointer can further save memory but that would require the MSB of attr to be reserved for the pointer distinction and not used for any other purpose. This might limit the possibility to support further attribute flags later on (because FLAGS already uses 7 bits), are we still missing some fundamental flags that are likely to come?

A 32 bit attr number in the term buffer could be packed like this:

# 256 indexed colors
32:       0 (no RGB color)
31..25:   flags (7 bits)
24..17:   fg (8 bits, see question above)
16..9:    bg
8..1:     unused

# RGB colors
32:       1 (RGB color)
31..25:   flags (7 bits)
24..1:    pointer to RGB data (address space is 2^24, which should be sufficient)

This way the storage only needs to hold the RGB data in two 32 bit numbers while the flags can stay in the attr number.

Tyriar commented 6 years ago

@jerch by the way I sent you an email, probably got eaten by the spam filter again 😛

Tyriar commented 6 years ago

Do you know why the current fg and bg 256 colors value is 9 bit based instead of 8 bits? What is the additional bit used for?

I think it's used for the default fg/bg color (which could be dark or light), so it's actually 257 colors.

https://github.com/xtermjs/xterm.js/pull/756/files

Could you give me the current bit layout of attr?

I think it's this:

19+:     flags (see `FLAGS` enum)
18..18:  default fg flag
17..10:  256 fg
9..9:    default bg flag
8..1:    256 bg

You can see what I landed on for truecolor in the old PR https://github.com/xtermjs/xterm.js/pull/756/files:

/**
 * Character data, the array's format is:
 * - string: The character.
 * - number: The width of the character.
 * - number: Flags that decorate the character.
 *
 *        truecolor fg
 *        |   inverse
 *        |   |   underline
 *        |   |   |
 *   0b 0 0 0 0 0 0 0
 *      |   |   |   |
 *      |   |   |   bold
 *      |   |   blink
 *      |   invisible
 *      truecolor bg
 *
 * - number: Foreground color. If default bit flag is set, color is the default
 *           (inherited from the DOM parent). If truecolor fg flag is true, this
 *           is a 24-bit color of the form 0xxRRGGBB, if not it's an xterm color
 *           code ranging from 0-255.
 *
 *        red
 *        |       blue
 *   0x 0 R R G G B B
 *      |     |
 *      |     green
 *      default color bit
 *
 * - number: Background color. The same as foreground color.
 */
export type CharData = [string, number, number, number, number];

So in this I had 2 flags; one for the default color (whether to ignore all color bits) and one for truecolor (whether to do 256 or 16 mil color).

This might limit the possibility to support further attribute flags later on (because FLAGS already uses 7 bits), are we still missing some fundamental flags that are likely to come?

Yes we want some room for additional flags, for example https://github.com/xtermjs/xterm.js/issues/580, https://github.com/xtermjs/xterm.js/issues/1145, I'd say at least leave > 3 bits where possible.

Instead of pointer data inside the attr itself, there could be another map that holds references to the the rgb data? mapAttrIdxToRgb: { [idx: number]: RgbData

jerch commented 6 years ago

@Tyriar Sorry, was a few days not online and I fear the email got eaten by the spam filter. Could you resend it please? :blush:

Played abit with more clever lookup data structures for the attrs storage. Most promising regarding space and search/insert runtime are trees and a skiplist as a cheaper alternative. In theory lol. In practice neither can outperform my simple array search which seems very weird to me (bug in the code somewhere?) I uploaded a test file here https://gist.github.com/jerch/ff65f3fb4414ff8ac84a947b3a1eec58 with array vs. a left-leaning red–black tree, that tests up to 10M entries (which is almost the complete addressing space). Still the array is far ahead compared to the LLRB, I suspect the break-even to be around 10M though. Tested on my 7ys old laptop, maybe someone can test it as well and even better - point me to some bugs in the impl/tests.

Here are a some results (with running numbers):

prefilled             time for inserting 1000 * 1000 (summed up, ms)
items                 array        LLRB
100-10000             3.5 - 5      ~13
100000                ~12          ~15
1000000               8            ~18
10000000              20-25        21-28

What really surprises me is the fact that the linear array search does not show any growing in the lower regions at all, it is up to 10k entries stable at ~4ms (might be cache related). The 10M test shows for both a worse runtime than expected, maybe due to mem paging whatsoever. Maybe JS is to far away from the machine with the JIT and all the opts/deopts happening, still I think they cant eliminate a complexity step (though the LLRB seems to be heavy on a single n thus moving the break even point for O(n) vs. O(logn) upwards)

Btw with random data the difference is even worse.

xtermjs / xterm.js