xtermjs / xterm.js

A terminal for the web
https://xtermjs.org/
MIT License

character joiner api unsuitable for graphemes #4513

Closed - PerBothner closed this issue 1 year ago

PerBothner commented 1 year ago

The character joiner API is invoked at render time. However, grapheme clusters need to be known no later than line breaking/overflow time, including reflow on window resize. The logical place to detect clusters would seem to be in the InputHandler print function. This can be combined with wide-character detection, using a single lookup. DomTerm uses an efficient combined trie lookup, which I have ported/converted to TypeScript (not yet checked in).
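As a rough illustration of the "single lookup" idea (names and table contents here are purely made up, not DomTerm's or xterm.js' actual code, and a real implementation would need the full UAX #29 grapheme-break rules - a ZWJ, for example, also makes the *following* emoji join):

```ts
// Purely illustrative: one packed per-codepoint value carrying both the column
// width and a "joins with the preceding cluster" hint, so print-time clustering
// and width assignment can share a single table/trie lookup.
const WIDTH_MASK = 0x3;      // bits 0-1: 0, 1 or 2 columns
const JOINS_PREVIOUS = 0x4;  // bit 2: code point extends the preceding cluster

const demoTable = new Map<number, number>([
  [0x200d, JOINS_PREVIOUS],  // ZERO WIDTH JOINER: width 0, joins
  [0x0301, JOINS_PREVIOUS],  // COMBINING ACUTE ACCENT: width 0, joins
  [0x1f469, 2],              // WOMAN emoji: width 2
]);

function lookup(codePoint: number): number {
  return demoTable.get(codePoint) ?? 1;  // default: narrow, starts a new cluster
}

const widthOf = (info: number): number => info & WIDTH_MASK;
const joinsPrevious = (info: number): boolean => (info & JOINS_PREVIOUS) !== 0;
```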

As a "heads up": I'm working on a solution for this. Unfortunately, this involves a re-design of the BufferLine implementation. On the other hand, the re-write has other benefits, which I will discuss later.

Tyriar commented 1 year ago

Wouldn't we be overextending our hand by changing how wrapping works to accommodate graphemes? Yes, it will lead to a less than optimal reading experience, but it will be good enough imo without introducing all this complexity.

For my own info, I see a bunch of Hangul handling; I wouldn't have thought graphemes would have anything to do with Hangul, as they are always double width?

PerBothner commented 1 year ago

Wouldn't we be overextending our hand by changing how wrapping works to accommodate graphemes?

Take a look at this DomTerm screenshot (2nd screenshot on the page).

Consider 'woman+zwj+woman+zwj+boy'. If you calculate the number of columns without being grapheme-aware, you get 6 columns. However, it's supposed to be 2 columns. If you do layout and line-breaking assuming it's 6 columns and then render it as 2 columns, that's not a "less than optimal reading experience" - it's wrong.
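To spell out the arithmetic (assuming a typical per-codepoint wcwidth where emoji are 2 columns and ZWJ is 0): woman (2) + zwj (0) + woman (2) + zwj (0) + boy (2) = 6, whereas grapheme-aware handling sees one cluster of width 2. The standard Intl.Segmenter confirms the sequence is a single cluster:

```ts
// Five code points, one extended grapheme cluster.
const family = '\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F466}'; // woman+zwj+woman+zwj+boy
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
console.log([...segmenter.segment(family)].length); // 1 cluster, so 2 columns
```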

For my own info, I see a bunch of Hangul handling; I wouldn't have thought graphemes would have anything to do with Hangul, as they are always double width?

Hangul can be encoded using either the Hangul Syllables block or the Hangul Jamo block. The former are "pre-composed" but only support the common/modern syllables. Jamo are like building blocks, and you can use them to construct many more forms - similar to how é can be written either pre-composed or as "letter a + acute accent". Note that NFD-normalized text should only contain Jamo.

Both forms of the same text should use 2 columns - but a plain per-codepoint lookup in the EastAsianWidth.txt table gives 5 columns for the example shown. I suspect the expectation for "proper Korean text handling" these days means getting this right. I notice that both GNOME Terminal and KDE Konsole (but not xterm) seem to handle the Hangul example correctly - though most of the other tests are wrong.

In my fork, the grapheme tables and lookup are in a separate addon, xterm-addon-unicode-graphemes, so it is opt-in. However, I believe the cost is modest enough to make it a recommended default: xterm-addon-unicode-graphemes.js is 23724 bytes non-minified. I don't have timing measurements (I haven't done any profiling yet), but I'm guessing it's comparable to wcwidth.

I haven't found a Unicode specification for how grapheme clusters should be mapped to narrow or wide cells, but I use the general rule that a cluster is the width of the widest character in it. Regional Indicators are special: an RI by itself is 1 column, but a pair is 2 columns.
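A minimal sketch of that rule (illustrative only; `wcwidth` is assumed to be supplied by the embedder, and real terminals differ in how they treat Regional Indicators):

```ts
// Width of a grapheme cluster: the width of its widest code point, except that
// Regional Indicators are handled specially (a lone RI is 1 column, a pair is 2).
function clusterWidth(cluster: string, wcwidth: (cp: number) => number): number {
  const codePoints = [...cluster].map(ch => ch.codePointAt(0)!);
  const isRI = (cp: number) => cp >= 0x1f1e6 && cp <= 0x1f1ff;
  if (codePoints.every(isRI)) {
    return codePoints.length >= 2 ? 2 : 1;
  }
  return Math.max(1, ...codePoints.map(wcwidth));
}
```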

Tyriar commented 1 year ago

Consider 'woman+zwj+woman+zwj+boy'. If you calculate the number of columns without being grapheme-aware, you get 6 columns. However, it's supposed to be 2 columns. If you do layout and line-breaking assuming it's 6 columns and then render it as 2 columns, that's not a "less than optimal reading experience" - it's wrong.

I was under the impression that emoji composition would be handled entirely in the parser: it would compose the characters as they come in and then hand the combined char off to the buffer line? One of my concerns is that, due to the size of your branch, it will probably take a long time to get in, as my time for reviewing/testing this sort of thing is quite limited these days.

Also, you seem to have ripped out character joiners altogether; these are how ligatures are implemented?

Hangul can be encoded using either the Hangul Syllables block or the Hangul Jamo block. The former are "pre-composed" but only support the common/modern syllables. Jamo are like building blocks, and you can use them to construct many more forms - similar to how é can be written either pre-composed or as "letter a + acute accent". Note that NFD-normalized text should only contain Jamo.

Both forms of the same text should use 2 columns - but a plain per-codepoint lookup in the EastAsianWidth.txt table gives 5 columns for the example shown. I suspect the expectation for "proper Korean text handling" these days means getting this right. I notice that both GNOME Terminal and KDE Konsole (but not xterm) seem to handle the Hangul example correctly - though most of the other tests are wrong.

I actually know a bit of Korean, but I'm still not sure I understand. We seem to handle it just fine, as it's the IME's job to take care of composing the character before sending it to the shell:

(Screen recording attached: "Recording 2023-05-11 at 06 54 28")

PerBothner commented 1 year ago

I was under the impression that emoji composition would be handled entirely in the parser: it would compose the characters as they come in and then hand the combined char off to the buffer line?

That is impossible - in many cases composed characters don't exist as separate Unicode characters. Consider the regional indicators - or "woman+zwj+woman+zwj+boy" from the screenshot linked in my first message.

One of my concerns is due to the size of your branch it will probably take a long time to get in as my time for reviewing/testing this sort of thing is quite limited these days.

If I wasn't clear: my fork is nowhere near ready for use. Some simple things work, but a bunch of things don't, and some that work need to be optimized. Grapheme clusters don't quite work yet, but it's close - most of the pieces are ready.

No expectation for a review anytime soon.

I also want to emphasize: my changes to BufferLine aren't solely motivated by grapheme clusters. By giving up guaranteed O(1) column indexing we gain a lot of flexibility. (I can talk more about this when more is implemented.) We also get a more compact data structure.

Also, you seem to have ripped out character joiners altogether; these are how ligatures are implemented?

I've ripped them out for now while getting things working. I think the existing logic (i.e. ligatures as a render-time-only functionality) can probably be re-enabled with minor changes. The main difference is that JoinedCellData would no longer be recommended for grapheme clusters, but would be suitable for width-preserving substitutions including ligatures.

Another option is that ligatures could be stored in the BufferLine, which could be a lot faster, as well as supporting non-width-preserving ligatures. The more flexible BufferLine data structure could store both the "logical character(s)" and the "ligature to render" in the buffer, with flags to control which data is used in which context.

About Korean: IME only handles user input - we also need to handle output. Wcwidth returns a width of 2 for base characters (0x1100 to 0x115F) and a width of 0 for "combining characters". That seems to do the right thing for the few cases I've tried, and may be enough to handle most or all cases that appear in practice. I have no idea how robust or general it is. (I don't know Korean.)
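A small worked example (illustrative; the widths are the usual wcwidth values rather than anything taken from xterm.js): the syllable 한 spelled with conjoining jamo is three code points whose per-codepoint widths sum to the same 2 columns the precomposed syllable gets.

```ts
// ᄒ U+1112 (lead, width 2) + ᅡ U+1161 (vowel, width 0) + ᆫ U+11AB (trail, width 0)
const jamo = '\u1112\u1161\u11AB';
// NFC composes the jamo into the precomposed syllable 한 (U+D55C), also 2 columns.
console.log(jamo.normalize('NFC') === '\uD55C'); // true
```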

PerBothner commented 1 year ago

My purpose with these changes is to possibly replace the DomTerm DOM-based layout with one based on xterm.js. The idea is to combine the advantages of xterm.js (primarily performance) with as much as possible of the existing and future DomTerm functionality. That will require some major changes to xterm.js, starting with the BufferLine implementation. I expect I will have to maintain some of the changes in a fork, but the fewer changes I have relative to upstream, the better for everyone. (I imagine that some of the features I'm hoping to add will be useful to both Jupyter and VS Code.)

Some features that I'm hoping will be eased by the BufferLine changes:

- A "line" that contains a general Element constructed from user HTML (images, SVG, rich text). Very useful for REPL-style applications.
- Shell integration.

This is all a long-term dream. I'll work a bit at a time.

Tyriar commented 1 year ago

No expectation for a review anytime soon.

👌 I just skimmed it. I want to set the expectation that it may be tough to get some of this stuff merged; I don't want to waste your time.

That is impossible - in many cases composed characters don't exist as separate Unicode characters. Consider the regional indicators - or "woman+zwj+woman+zwj+boy" from the screenshot linked in my first message.

This is what the _combined property was meant to do: have a slower way to access complex data made up of multiple characters.

https://github.com/Tyriar/xterm.js/blob/8c9e44c2191aa3b18b282eecb1d057b553658cf7/src/common/buffer/BufferLine.ts#L63
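Roughly, the existing mechanism looks like this (a simplified picture of the linked BufferLine.ts, not the exact declarations):

```ts
// The common case lives in a packed typed array; multi-codepoint cells spill into
// a sparse map keyed by column index.
interface SimplifiedBufferLine {
  // 3 packed 32-bit words per cell (content, fg, bg); in the common case the
  // content word encodes a single code point plus its width.
  _data: Uint32Array;
  // When a cell holds more than one code point (base char + combining marks etc.),
  // a flag is set in _data and the full string is stored here.
  _combined: { [column: number]: string };
}
```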

By giving up guaranteed O(1) column indexing we gain a lot of flexibility. (I can talk more about this when more is implemented.) We also get a more compact data structure.

I'm curious about this; it seems like it's basically as compact as it can currently get via the Uint32Array for data. We intentionally optimized for the common case where the data would fit into a Uint32. Any changes to the buffer line would be best discussed in a separate issue, as it seems like its own distinct addition.

Another option is that ligatures could be stored in the BufferLine

Ligatures should be done purely at the renderer level, since whether a set of characters forms a ligature depends on the current font family.

That seems to do the right thing for the few cases I've tried, and may be enough to handle most or all cases that appear in practice. I have no idea how robust or general it is. (I don't know Korean.)

I haven't used any CLIs that use Korean; I suspect you may be trying to support something in Korean that never actually happens in practice on the command line.

A "line" that contain a general Element constructed from user HTML (images, SVG, rich text). Very useful for REPL-style applications.

This sort of thing is why the decorations API was introduced. We use this API to implement find highlighting and line overlays like VS Code's shell integration indicator and command navigation.


Shell integration.

FYI, VS Code has a pretty robust shell integration setup now; we intentionally kept it separate from xterm.js as it's custom/opinionated and relies on a script actually running. This is one of the main reasons the register*Handler calls are on the API. If you're interested:

PerBothner commented 1 year ago

I just skimmed it. I want to set the expectation that it may be tough to get some of this stuff merged; I don't want to waste your time.

If it has to be a private fork, it can still be a win for DomTerm. But of course it is always better if it can be upstreamed.

This is what the _combined property was meant to do: have a slower way to access complex data made up of multiple characters.

I figured that out. The DataKind.CLUSTER_w1 and DataKind.CLUSTER_w2 kinds replace the _combined map. The actual _combined text is stored in the _text string.

I'm curious about this; it seems like it's basically as compact as it can currently get via the Uint32Array for data.

A line consisting of 80 single-width BMP characters with default attributes requires the BufferLine object itself, an 80-character string, and a Uint32Array with room for a single element. That's a lot more compact. Notice that in this case you still have O(1) indexing.
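Purely as an illustration of that shape (made-up names, not the fork's actual types), the plain-ASCII case could look something like this:

```ts
// One JS string holds the logical text; a tiny typed array holds packed per-run
// metadata (kind, width, attributes). Wide chars and clusters would add runs;
// a line of plain narrow characters with default attributes needs only one.
interface RunEncodedLine {
  text: string;        // the logical characters of the line
  runs: Uint32Array;   // one packed descriptor per run of cells of the same kind
}

const line: RunEncodedLine = {
  text: 'hello world'.padEnd(80),
  runs: new Uint32Array([80]),  // a single run: 80 narrow cells, default attributes
};
```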

Any changes to the buffer line would be best discussed in a separate issue, as it seems like its own distinct addition.

Absolutely. I will create a separate issue before I create a pull request. I guess I can do so now. It might make sense to wait until more of the implementation is done, though early feedback might be helpful.

[Embedded DOM elements are the] sort of thing the decorations API was introduced for.

I have to study the decorations API. However, I suspect my use cases aren't exactly a match for decorations (though they could build on top of them). Images/SVG/DOM as part of the output should be part of the data model: they should be included in serialization and selections (including selecting just part of HTML rich text), and navigable in view mode. Preferably, the height would not be constrained to an integer number of rows.

Tyriar commented 1 year ago

For decorations, we could relatively easily add selection/copy support by adding a delegate to IDecorationOptions that fetches selection data; serialization, as in saving and loading state, would need to be done separately though. It's essentially just an element overlaid on top of an existing cell that fires the onRender event when it needs to be updated. I'm hesitant to introduce a competing system when this solves a similar problem. We intentionally wanted to keep the "HTMLElement inside the buffer" idea as a layer on top of buffer lines, not integrated into them, as it's very much a non-standard terminal feature.
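For reference, a rough usage sketch of that API (shape as of xterm.js 5.x; check the typings for exact signatures):

```ts
import { Terminal } from 'xterm';

const term = new Terminal();
term.open(document.getElementById('terminal')!);

// Anchor a decoration to the line the cursor is currently on.
const marker = term.registerMarker(0);
if (marker) {
  const decoration = term.registerDecoration({ marker, x: 0, width: 1 });
  // onRender fires with the overlay element whenever it is (re)rendered,
  // letting the embedder fill or style it.
  decoration?.onRender(element => {
    element.style.backgroundColor = '#ffcc0080';
  });
}
```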

jerch commented 1 year ago

Sorry for chiming in late and not having read any of the sources mentioned here yet - I just want to point out how I intended to solve the grapheme cluster issue in a much earlier draft PR (in 2019 or 2020):

- aggregate clusters in InputHandler.print, basically replacing wcwidth with clustering + wcwidth calcs on that cluster
- allow a single cell to contain clusters with arbitrary width (not strictly in 1 or 2, still multiples of 1 as we are still in a monospace env)

This was what I found to be in line with what most other cluster-aware TEs do. It is still very flawed in itself, but it would resemble what most others do.

The reason I did not go forward with that approach was mostly because of:

- runtime penalty (my clustering approach took ~3 times longer than without)
- size penalty (clustering lookup tables took ~8 KB more space)

which I found not to be justified, given the big ambiguity in the Unicode specs about how to treat any particular constellation. The ambiguity mainly arises from the weak "East Asian Width" property handling (definitely discouraged by Unicode), which gets worse for clusters in non-LGC script systems with dependencies on fonts and their particular glyph metrics (yes, the Unicode spec left that to font devs on purpose, which creates a nightmare for us terminals). I don't know how to solve that, besides either restricting xterm.js' fonts to those that were tested to work as intended (basically every glyph would have to be tested for its width), or calling on the Unicode consortium to spec things more properly for monospace environments. That's the point where I gave up trying to solve that issue...

I still don't know a proper solution to that issue. Unicode as specced at the moment cannot be supported to a satisfying level for terminals. We are in a wicked spec state here...

PerBothner commented 1 year ago

aggregate clusters in InputHandler.print, basically replacing wcwidth with clustering + wcwidth calcs on that cluster

That's my plan - and basically what I'm currently doing in DomTerm.

allow a single cell to contain clusters with arbitrary width (not strictly in 1 or 2, still multiples of 1 as we are still in a monospace env)

Currently planning on sticking to 1 or 2 as I don't know of a need for 3 or more. (I have been thinking about non-monospace text in a TE, but that is not relevant at this point.)

runtime penalty (my clustering approach took ~3 times longer than without)

I'm hoping that by combining the width lookup with the clustering property, and doing a single lookup in an efficient trie data structure, the overhead will be modest - but we'll find out once it is implemented and tuned. In my prototype the actual lookup will be opt-in, in an addon.

size penalty (clustering lookup tables took ~8 KB more space)

Is 8kB really a serious amount of space? Especially if it will be opt-in (in an addon)?

Tyriar commented 1 year ago

It's fine if it's an addon imo; the main thing is we want it to be optional/lazy-loadable, similar to what we're doing with Unicode.

jerch commented 1 year ago

Currently planning on sticking to 1 or 2 as I don't know of a need for 3 or more.

We had a discussion somewhere about 🗺 (U+1F5FA, world map), which renders in 2, 3 or 4 cells depending on the font (yes, font dependent, really annoying). For single codepoints that's the only one I've seen so far breaking the 1-vs-2 "convention". With clusters, widths get funny in some Indian script systems, but I don't know enough about those. Of course RTL scripts are a problem class of their own with their bidi mechanics, and I think there are also combining constellations that would extend widths. For LGC scripts, all combining chars I've come across don't change width (always sticking with the base char's width), but I am not sure if that is a general rule for those, or whether I just missed the exotic stuff.

Is 8kB really a serious amount of space? Especially if it will be opt-in (in an addon)?

No, 8 kB is not a big deal at all, but it's still not very helpful if we cannot lower the ambiguity to a sane level. Without that it just makes things worse (including the runtime + space penalty). So that was a tradeoff decision back then.

Tyriar commented 1 year ago

Mostly actioned with https://github.com/xtermjs/xterm.js/pull/4519; another issue will be opened wrt the proposed buffer changes.

PerBothner commented 1 year ago

See issue #4800 for a discussion on changing the BufferLine data structure.