grapheme cluster & unicode v13 support

jerch commented 3 years ago

Issue to track grapheme cluster and follow-up unicode version support.

Currently we are not grapheme cluster aware, which leads to several issues around complex combining unicode characters like compound emojis and complex scripts. The final goal should be to handle most newer aspects of Unicode 11+ with a dedicated v13 addon.

TODO

- [ ] extend wcwidth handling in InputHandler.print with grapheme ruleset
- [ ] extend v11 addon with grapheme handling
- [ ] create v13 addon

bonus goals

- investigate if v13 can be inlined into v11 addon
- create automatic generator for wcwidth tables (#2668)

Limitations

While support for graphemes and v13 will solve several output issues for compound characters on newer OSes, it certainly will not solve all unicode related issues:

- Apps compiled with an older unicode version in mind will disagree about the runwidth of certain characters (down to not supporting graphemes at all). This cannot be spotted by the emulator at runtime, as we have no "unicode version handshake protocol" in the terminal interface. Here ppl will have to switch to an older version if they encounter issues.
- Unicode is highly uncertain about monospaced environments; in fact the consortium states that there is no good way to derive runwidths from any unicode information. We can only try to mimic here what most other TEs do. This leaves a high chance of still misaligned runwidths.
- Complex graphemes might extend character dimensions in both directions, which might break the grid (characters might get cut in height and width). Whether the font can still generate a grid aligned glyph also depends highly on the monospace font in use. This cannot be taken into account in the parser phase. It might be fixable to some degree later on during rendering by measuring the glyph extent and adjusting its output metrics (not done here).
- Unicode explicitly allows different valid output representations of compound chars/emojis, depending on whether the render system can create a compound symbol or not. Example: 👩 + ZWJ + 🚀 might be rendered as 👩‍🚀 (compound) or as 👩🚀 (sequence of single emojis) (Note: you can only see a difference if your browser supports that particular compound glyph). This needs some thinking about a proper default handling that app side can calculate with (always reserve full sequential space?). It could also be made a terminal setting (useCompoundUnicodeRenderWidth = True) to tell the renderer to always use the real available glyph width. The runwidth ambiguity for app side here could be solved with an additional terminal sequence requesting the real runwidth prior to usage (e.g. a chat app that knows the compound emojis it uses can request all runwidths before entering the main texting loop).
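
The two rendering outcomes in the last point give two different cell counts for the same sequence. A minimal sketch of that ambiguity, assuming a hypothetical `baseWidth` helper with a deliberately coarse emoji range (not real Unicode tables) and treating the `useCompoundUnicodeRenderWidth` idea as a plain boolean parameter:

```typescript
const ZERO_WIDTH_JOINER = 0x200d;

// Hypothetical, coarse width helper for this demo only:
// 2 cells for a slice of the emoji range, 0 for ZWJ, 1 for everything else.
function baseWidth(cp: number): number {
  if (cp === ZERO_WIDTH_JOINER) return 0;
  if (cp >= 0x1f300 && cp <= 0x1faff) return 2;
  return 1;
}

// Width of one ZWJ sequence under the two policies discussed above.
function zwjSequenceWidth(cps: number[], useCompoundUnicodeRenderWidth: boolean): number {
  if (cps.length === 0) return 0;
  // Compound policy: the whole sequence is one glyph, sized by its first codepoint.
  if (useCompoundUnicodeRenderWidth) return baseWidth(cps[0]);
  // Sequential policy: reserve space as if every emoji were drawn on its own.
  let total = 0;
  for (let i = 0; i < cps.length; i++) total += baseWidth(cps[i]);
  return total;
}

// 👩 + ZWJ + 🚀
const astronaut = [0x1f469, 0x200d, 0x1f680];
const compoundCells = zwjSequenceWidth(astronaut, true);
const sequentialCells = zwjSequenceWidth(astronaut, false);
```

With these assumptions the compound policy reserves 2 cells and the sequential policy 4, which is exactly the gap an app cannot know about without asking the terminal.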

christianparpart commented 3 years ago

Thanks for pointing me to this ticket @jerch.

I have actually been implementing grapheme cluster support in my terminal emulator. My motivation behind all the pain was to be able to properly render complex emoji in the terminal. Now I'd like to address some of your points above:

> apps compiled with older unicode version in mind will disagree about the runwidth of certain characters (down to not supporting graphemes at all).

I think that's not a problem, because in non-supporting client apps it will be highly unlikely that that application will render complex Unicode anyways.

> Unicode is highly uncertain about monospaced environments

This is right, but in my opinion that should not affect grapheme cluster aware terminals. In the context of a terminal, historically every displayed character was usually 1 grid cell wide. Because of that, recent terminals that started to support displaying emoji did indeed render emoji over 2 grid cells, but did not advance the cursor by 2 cells, so as not to break existing applications. I think that wasn't that bad, at least we had emoji. But most terminals are not grapheme cluster aware (seriously, are there any?).

Now, grapheme clusters can be used to determine how many consecutive codepoints should be rendered as "one user perceived character". Characters on the terminal are usually in one grid cell, and a few in 2 grid cells for wide characters (such as emoji, kanji, ...). Implementing grapheme cluster segmentation in a terminal would help putting the right "non-breakable" sequence of codepoints into one grid cell - just like the user would expect them to be (as it's one user perceived character when rendered). I'd highly vote for supporting that. :)
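
To make the "one non-breakable sequence per cell" idea concrete, here is a minimal sketch of a segmenter working on codepoint arrays. It is an assumption-heavy subset of the UAX #29 rules: only combining diacritics, variation selectors, skin tone modifiers and ZWJ emoji sequences are handled, and `isPictographic` uses a coarse range instead of the real Extended_Pictographic property. Hangul, regional indicators, Prepend and the other rules are left out.

```typescript
const ZWJ = 0x200d;

// Coarse stand-in for the Extend class (assumption, not the real property table).
function isExtend(cp: number): boolean {
  return (cp >= 0x0300 && cp <= 0x036f)    // combining diacritical marks
      || (cp >= 0xfe00 && cp <= 0xfe0f)    // variation selectors VS1..VS16
      || (cp >= 0x1f3fb && cp <= 0x1f3ff); // emoji skin tone modifiers
}

// Coarse stand-in for Extended_Pictographic (assumption).
function isPictographic(cp: number): boolean {
  return cp >= 0x1f300 && cp <= 0x1faff;
}

// Split codepoints into clusters; each cluster would end up in one grid cell.
function segment(cps: number[]): number[][] {
  const clusters: number[][] = [];
  let current: number[] = [];
  for (let i = 0; i < cps.length; i++) {
    const cp = cps[i];
    const prev = current.length > 0 ? current[current.length - 1] : -1;
    const joins = current.length > 0 && (
      isExtend(cp) ||                       // no break before Extend
      cp === ZWJ ||                         // no break before ZWJ
      (prev === ZWJ && isPictographic(cp) && isPictographic(current[0])) // emoji ZWJ emoji
    );
    if (joins) {
      current.push(cp);
    } else {
      if (current.length > 0) clusters.push(current);
      current = [cp];
    }
  }
  if (current.length > 0) clusters.push(current);
  return clusters;
}

// "e" + combining acute, 👩 + ZWJ + 🚀, "a"
const demoClusters = segment([0x65, 0x301, 0x1f469, 0x200d, 0x1f680, 0x61]);
```

The six codepoints in the demo collapse into three clusters, i.e. three user perceived characters.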

> Complex graphemes might extend character dimensions in both directions, which might break the grid (characters might get cut in height and width).

AFAIR, the first codepoint mandates the East Asian width, unless VS15/VS16 is involved, which explicitly forces the presentation to either text (VS15, narrow, 1 character cell) or emoji (VS16, wide, 2 character cells).
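
A minimal sketch of that width rule as described here (not a statement about what any particular terminal does): the first codepoint decides, and a variation selector inside the cluster overrides it. The `firstWidth` callback and the `demoWidth` lookup are hypothetical stand-ins for a real East Asian Width table.

```typescript
const VS15 = 0xfe0e; // text presentation selector
const VS16 = 0xfe0f; // emoji presentation selector

// Width of one grapheme cluster: first codepoint decides, VS15/VS16 override.
function clusterWidth(cluster: number[], firstWidth: (cp: number) => number): number {
  if (cluster.length === 0) return 0;
  let width = firstWidth(cluster[0]);
  for (let i = 1; i < cluster.length; i++) {
    if (cluster[i] === VS15) width = 1;      // forced text presentation, narrow
    else if (cluster[i] === VS16) width = 2; // forced emoji presentation, wide
  }
  return width;
}

// Hypothetical lookup for the demo: U+231A WATCH is wide, everything else narrow.
const demoWidth = (cp: number): number => (cp === 0x231a ? 2 : 1);

const heartAsEmoji = clusterWidth([0x2764, VS16], demoWidth); // U+2764 + VS16
const watchAsText = clusterWidth([0x231a, VS15], demoWidth);  // U+231A + VS15
```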

Grapheme clusters are not all of a sudden making text look vertical (ref to "both directions"). I think not even mlterm is doing that. Even RTL is a hard (but apparently solved) problem (again: mlterm, gnome-terminal to some degree?). But I think neither should hinder properly interpreting grapheme cluster boundaries.

> Unicode explicitly allows different valid output representations of compound chars/emojis, depending on the render system whether it can create a compound symbol or not

I think what you are referring to is that, for example, the family emoji of course usually looks like a family, but some applications might render it as a sequence of emoji symbols. Or a colored person emoji might alternatively be rendered as the standard (I think: white) colored emoji with the color modification emoji symbol right next to it.

The quote from UTS #51, section 2.2, is as follows:

> However, if that combination is not supported as a single unit, it may show up as a sequence like the following, and the user sees no indication that it was meant to be composed into a single image:

To me, that is not as bad as it sounds, unless one plans to intentionally implement a TE that does fall under the above mentioned category of "not supported". Everyone should at least attempt to support and that should not hinder grapheme segmentation boundary determination in a TE (IMHO).

I hope this is not too much of a wall of text. One last note: the grapheme cluster segmentation algorithm might be expensive to execute, especially in the context of a terminal, where some people have recently taken a liking to terminal output bandwidth benchmarks. I found myself in a weak spot there and realized that the algorithm I implemented was naively following the rule-set as specified in https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules . However, in the context of a TE this algorithm can be optimized for the standard TE case, so that the full algorithm only has to run for the actual non US-ASCII codepoints (my performance win was 50% of total performance in my TE bandwidth benchmark!).
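
One way such an ASCII fast path could look, as a sketch only (this is not the implementation referred to above). It assumes the input holds printable codepoints, with control characters already consumed by the parser, and it ignores the rare Prepend class. An ASCII codepoint followed by another ASCII codepoint is emitted as its own cluster right away; everything else is handed in runs to a full segmenter, passed in as the hypothetical `fullSegment` callback.

```typescript
type Segmenter = (run: number[]) => number[][];

function isAscii(cp: number): boolean {
  return cp < 0x80;
}

function segmentWithFastPath(cps: number[], fullSegment: Segmenter): number[][] {
  const out: number[][] = [];
  let i = 0;
  while (i < cps.length) {
    const nextIsAscii = i + 1 >= cps.length || isAscii(cps[i + 1]);
    if (isAscii(cps[i]) && nextIsAscii) {
      // Fast path: nothing can attach to this codepoint, it is a cluster on its own.
      out.push([cps[i]]);
      i++;
      continue;
    }
    // Slow path: collect the run up to the next "ASCII followed by ASCII" position.
    const start = i++;
    while (i < cps.length && !(isAscii(cps[i]) && (i + 1 >= cps.length || isAscii(cps[i + 1])))) i++;
    const clusters = fullSegment(cps.slice(start, i));
    for (let k = 0; k < clusters.length; k++) out.push(clusters[k]);
  }
  return out;
}

// Stand-in for the full algorithm: only glues combining diacritics, and counts its calls.
let fullRuns = 0;
const combiningOnly: Segmenter = (run) => {
  fullRuns++;
  const res: number[][] = [];
  for (let i = 0; i < run.length; i++) {
    const isMark = run[i] >= 0x300 && run[i] <= 0x36f;
    if (isMark && res.length > 0) res[res.length - 1].push(run[i]);
    else res.push([run[i]]);
  }
  return res;
};

// "ab" + "e" with combining acute + "cd"
const fastPathClusters = segmentWithFastPath([0x61, 0x62, 0x65, 0x301, 0x63, 0x64], combiningOnly);
```

In the demo the full segmenter is invoked once, for the two codepoints around the combining mark; the four plain ASCII letters never reach it.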

p.s.: In case legacy applications might still be of concern to some, we could agree on a feature detection number that could be queried by client apps that would want to distinguish. The user could make that feature per config default enabled/disabled and an SM/RM code could be even introduced to allow apps to decide whether or not grapheme segmentation should be obeyed. (Even though, personally I'd highly advocate to always respect grapheme clusters).

jerch commented 3 years ago

> I hope this is not too much of a wall of text.

Haha it is quite the wall, nonetheless trying to get through.

> I think that's not a problem, because in non-supporting client apps it will be highly unlikely that that application will render complex Unicode anyways.

That's true for clusters, but not so for single codepoint runwidths. There are several unicode releases where they changed the widths, which creates misalignments for any app trying to anticipate wcwidth (I think v8 to v9 was the worst upgrade regarding that). Btw emojis used to be just 1 cell in earlier releases, but with the pictograms most moved to 2 cells. And to make it worse - there are some emojis that still map to 1 cell in text representation, but 2 cells for the pictogram, lol.

> Implementing grapheme cluster segmentation in a terminal would help putting the right "non-breakable" sequence of codepoints into one grid cell - just like the user would expect them to be (as it's one user perceived character when rendered).

Yes I think putting them in one cell is the only right way to deal with them, even if they are longer. But this raises a few questions about cell accounting in TEs, whether the emulator can mark a cell spanning multiple half width cells, and how to deal with those "super cells" during reflow and render. To fix TEs in this regard is imho the hard part of a halfway decent implementation.
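
To illustrate the cell accounting question, a minimal sketch of one possible row model. The `Cell` shape and the "width 0 means continuation of the cell to the left" convention are assumptions for this example; reflow and rendering are not covered. The point is that overwriting any part of a "super cell" has to erase the whole span, otherwise half a glyph is left behind.

```typescript
// width >= 1: lead cell spanning that many columns; width 0: continuation of the cell to the left.
interface Cell {
  content: number[];
  width: number;
}

function blankRow(cols: number): Cell[] {
  const row: Cell[] = [];
  for (let i = 0; i < cols; i++) row.push({ content: [], width: 1 });
  return row;
}

// Erase the whole super cell that covers `col`, wherever its lead cell is.
function eraseSuperCell(row: Cell[], col: number): void {
  let lead = col;
  while (lead > 0 && row[lead].width === 0) lead--;
  const span = Math.max(row[lead].width, 1);
  for (let i = lead; i < lead + span && i < row.length; i++) {
    row[i] = { content: [], width: 1 };
  }
}

// Write one cluster; returns the new column, or -1 if it does not fit (caller has to wrap).
function writeCluster(row: Cell[], col: number, cluster: number[], width: number): number {
  if (width < 1 || col + width > row.length) return -1;
  for (let i = col; i < col + width; i++) eraseSuperCell(row, i);
  row[col] = { content: cluster, width: width };
  for (let i = col + 1; i < col + width; i++) row[i] = { content: [], width: 0 };
  return col + width;
}

const demoRow = blankRow(4);
const afterEmoji = writeCluster(demoRow, 0, [0x1f469, 0x200d, 0x1f680], 2); // occupies columns 0 and 1
const afterLetter = writeCluster(demoRow, 1, [0x61], 1);                    // lands inside the emoji
const noFit = writeCluster(demoRow, 3, [0x1f680], 2);                       // needs columns 3 and 4
```

After the second write, column 0 is blank again because the emoji lost its second half; the third write is rejected since only one column is left in the row.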

> AFAIR, the first codepoint mandates the East Asian width, unless VS15/VS16 is involved, which explicitly forces the presentation to either text (VS15, narrow, 1 character cell) or emoji (VS16, wide, 2 character cells).
>
> Grapheme clusters are not all of a sudden making text look vertical (ref to "both directions"). I think not even mlterm is doing that. Even RTL is a hard (but apparently solved) problem (again: mlterm, gnome-terminal to some degree?). But I think neither should hinder properly interpreting grapheme cluster boundaries.

Well, that "breaking the alignment in both directions" is a thing in browser font renderers, as I experienced when doing the original grapheme PR. If you happen to render glyphs on your own, you can always align things as you please (up to totally unreadable, because 20 codepoints just got painted into one text cell). In xterm.js we are somewhat limited in that regard and have to go with what the renderer offers us for combining codepoints. Have not done any BiDi stuff yet.

> To me, that is not as bad as it sounds, unless one plans to intentionally implement a TE that does fall under the above mentioned category of "not supported". Everyone should at least attempt to support and that should not hinder grapheme segmentation boundary determination in a TE (IMHO).

In xterm.js we are bound to what the browser/system offers us (browser engine, font renderer, installed fonts). There is no way to force the compound glyph; the combination of those outer dependencies can either show it or not. For the family emoji this gets really funny. And what shall appside do with that? The problem is bound to the renderer stage, appside has no knowledge about that. Neither do multiplexers. That's the reason why I think we might need a lookup sequence - to give apps/multiplexers a way to do correct wcwidth calculations upfront.

About segmentation algo speed: I remember trying different approaches, from table lookups to some function with optimized branches cutting off early. Well the function won (JS is not that good with table lookups, guess there is too much indirection/memory noise involved).
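
For illustration, the two approaches side by side on a width lookup, as a sketch with coarse demo ranges (not real Unicode tables; control characters are assumed to be filtered out earlier). Which one is faster depends on the engine and is not measured here. All names are hypothetical.

```typescript
// Branch version: cheap checks first, so common input returns early.
function widthByBranches(cp: number): number {
  if (cp < 0x300) return 1;                      // early exit for ASCII and Latin
  if (cp <= 0x36f) return 0;                     // combining diacritical marks
  if (cp >= 0x4e00 && cp <= 0x9fff) return 2;    // CJK unified ideographs
  if (cp >= 0xac00 && cp <= 0xd7a3) return 2;    // Hangul syllables
  if (cp >= 0x1f600 && cp <= 0x1f64f) return 2;  // emoticons block
  return 1;                                      // everything else treated as narrow in this sketch
}

// Table version: one byte per BMP codepoint, filled once from the branch version.
const BMP_WIDTHS = new Uint8Array(0x10000);
for (let cp = 0; cp < 0x10000; cp++) BMP_WIDTHS[cp] = widthByBranches(cp);

function widthByTable(cp: number): number {
  // Codepoints above the BMP fall back to the branch version to keep the table small.
  return cp < 0x10000 ? BMP_WIDTHS[cp] : widthByBranches(cp);
}

const sampleCodepoints = [0x41, 0x301, 0x4e2d, 0xd55c, 0x1f600];
const sampleWidths = sampleCodepoints.map(widthByTable);
```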

> p.s.: In case legacy applications might still be of concern to some, we could agree on a feature detection number that could be queried by client apps that would want to distinguish. The user could make that feature per config default enabled/disabled and an SM/RM code could be even introduced to allow apps to decide whether or not grapheme segmentation should be obeyed. (Even though, personally I'd highly advocate to always respect grapheme clusters).

Ah, I am not a big fan of that mixed legacy mode, as it just messes up unicode feature support. Imho if a TE claims v13 support, it should also handle graphemes, as they are quite common (guess we had at least 20 emoji issues, and are still chasing the unicode rabbit). If some app is legacy, it prolly runs with an older unicode version in mind, down to v6 without any grapheme stuff. While most newer unicode releases are backwards compatible to some degree (they only add stuff), older release upgrades are not. There is no proper way to treat a v6 app with a v13 TE, way too many things changed. Imho those older apps should run on a corresponding TE (either by multiple unicode version support or an older version of the TE).

Well created another text wall :smile_cat:

jerch commented 2 years ago

Better be tracked by #2668.

xtermjs / xterm.js

grapheme cluster & unicode v13 support #3304