Improve compatability of unicode grapheme width

pascalkuthe commented 1 year ago

I mentioned this on matrix. I personally consider this a bug but it doesn't really fit the the template and depending on your point of view it could also be viewed as a feature request.

While trying to fix the way helix render multi codepoint graphemes (currently truncating afte rthe first char) I noticed that my fix breaks rendering on wezterm. The reason is that wezterm changes the traditional cell width that would be calculated by wcchar (which would be 5 for for a mean_heart_woman emoji for example) by capping the width of a single grapheme to two.

This breaks existing applications (like kakune) and puts me as an application author in a rough spot where I can either use traditional width calculation and get the correct result on most emulators (kitty, alacrity, xterm, gnome terminal, ...) or use the width calculation used by wezterm and get the correct result on wezterm.

Using the wrong definition of width results in horrible visual artifacts (random highlights, disappearing and ghost characters,...).

Traditionally terminal emulators have simply computed the width of unicode cosepoints individually (merging zero width codepoints) without any grapheme segmentation. I personally feel that this is correct because the unicode standard only defines width on a codepoint level (with the exception of emoji representation which affects two codepoints at once buy that still kind of works because the variant selectors is zero width so its atleast a single cell).

The unicode standard does not define any notion of width for graphemes as far as I am aware (basically saying its font dependent). That makes a lot if sense because some graphemes can get really crazy (able to cover multiple lines, get super wide etc.). The width limit of 2 works well for emojis but not for other graphemes (to be fair other graphemes are kind of nieche anyway).

The kitty emulator handles this by keeping the cell width definition unchanged.The unicode segmentation is then simply handled at the font level so multi grapheme emojis still get merged but they simply endup with some trailing empty space.

Would you perhaps consider adoption a similar mechanism for compatibility? If you prefer the current behaviour perphase some kind of escape code based negotiation could be possible like there is for unicode version support?

Right now I am in a bit of an awkward spot where I as an application developer habe to choose which emulator I want to support properly.

wez commented 1 year ago

Can you share the codepoints from the problematic sequences? That will make it easier for me to look into this. In theory, wezterm should be compatible with wcswidth. We use https://github.com/ridiculousfish/widecharwidth which provides a wcwidth implementation, but not a wcswidth implementation.

pascalkuthe commented 1 year ago

So one example is 👩‍❤️‍👨

Its madeuopof 3 seperate emojis (woman, heart, man) joined by a zero width joiners (plus a variatuon selector for the heart).

In uncuode 9 mode I expect this emoji to take up 5 cells. In unicode 14 mode I expect it to takeup 6 cells. In wezterm it takesup 2 cells.

For reference: Traditionally terminal emulators don't support unicode swgmentatuin/graphemes so they render each emoji speratly (that's where the width of 5/6 comes from si.ply the sum of all codepoints widths) For example in alacrity:.

264171016-ca9aeba7-7b98-4340-8b7f-67d8fa8edca7 (Ignore the rest around it the just is that the emojis get render aeperartly).

Kitty keeps the traditional wcchar based cellwidth (and just segments at the font level):

264175970-d755633e-cfb0-4659-94e1-2338b3532cd1(1)

Your wcchar based implementation is quite correct and wcchar dies actually compute the right thing. The problem is that you don't return the result of the sum of all wcchar callsdirectly but instead do width.min(2).

AFAIK wcswidth just calls wcwidth for every Unicode codepoint in the sequence and returns the sum (with the special case of emoji presentation, my crate handles that with a fairly simple one char lookahead without any grapheme segmentation)

I have my own width implementation (unfinished ignore the rest of the repo please :D) which heavily references yours and behaves identical to termwiz except for this upper bound I descrived. See the relevant testcase here:

https://github.com/pascalkuthe/grapheme-width-rs/blob/aef038e78e41f3ec33da04d985aeac1476e2fe02/src/test.rs#L52

wez commented 1 year ago

FWIW iterm2 has similar behavior to wezterm:

have you tried helix and kakoune on that terminal?

I'm open to removing the .min(2) from here, but am unsure of the full ramifications at the moment because it has been some time since I was last deep in this and will need to refresh my memory! https://github.com/wez/wezterm/blob/99c9613966a73d5388344fd0df68d5977b493c7c/termwiz/src/cell.rs#L1008

kovidgoyal commented 1 year ago

Note that kitty's current behavior with ZWJ is wrong. It's on my TODO list to implement, see https://github.com/kovidgoyal/kitty/issues/3810 at which point it will render in two cells as well. IMO, rendering in two cells for extended ZWJ sequences is the correct behavior. I have not implemented this yet, partially because it's a fair bit of work in kitty internals and partially because as you point out, most terminal emulators don't implement it and almost no terminal programs implement it, so implementing it would break many things.

wez commented 1 year ago

I'm leaning towards expanding the unicode version escape sequences (https://wezfurlong.org/wezterm/config/lua/config/unicode_version.html) to allow applications to select the behavior they desire; details need designing.

There's a cell structure used by wezterm that builds on a tiny string type that uses a single bit to represent the column width; that will take a bit of effort to adjust for this concept.

kovidgoyal commented 1 year ago

I would recommend against doing that. It means, for example, one can no longer cat text into a terminal and have predictable results. Imagine a table in a markdown file, its columns will no longer be aligned depending on which terminal you cat it into and even details of the terminals internal state.

When it comes to width considerations, really the only workable longterm solution is for all terminals to converge to a common standard for wcswidth()

pascalkuthe commented 1 year ago

I would recommend against doing that. It means, for example, one can no longer cat text into a terminal and have predictable results. Imagine a table in a markdown file, its columns will no longer be aligned depending on which terminal you cat it into and even details of the terminals internal state.

When it comes to width considerations, really the only workable longterm solution is for all terminals to converge to a common standard for wcswidth()

Yeah this is really the crux of the issue. I agree that for ZJW emojis specifically width 2 makes sense (not so much for other graphemes tough IMO) but really as an app developer I can't take the specifics of various emulators into account. It would not be ideal to essentially test for know TERM_PROGRAMM values at startup to figure out what width to use (even if I did implemennt that, such a solution is pretty brittle over ssh). It's already a bit of a pain that kitty defaults to respecting emoji presentation/unicode 14, my code supports that but I have to figure out of if the emulator does and ultimately expose a user setting for that.

I think the escape sequences approach is pretty nice because atleast it's much more robust regarding ssh and similar. If all terminals that supported more extensive unicode handling (emoji presentation, zwj emojis,...) allowed me to opt out by sending a escape sequences on startup that would atleast prevent my tui from breaking depending on the emulator it's running on.

Ideally there would be some mechanism to go the other way too (the tui can figure out what the emulator supports). Otherwise I would just keep my tui at the lowest common denominator (for example alacritty has no plans to support zwj emojis or even emoji presentation AFAIK).

Terminfo would seem like a good candidate but that also has the ssh problem so that is a bit brittle but atleast an improvement over querying TERM_PROGRAM.

kovidgoyal commented 1 year ago

On Tue, Aug 29, 2023 at 11:42:01PM -0700, Pascal Kuthe wrote:

I would recommend against doing that. It means, for example, one can no longer cat text into a terminal and have predictable results. Imagine a table in a markdown file, its columns will no longer be aligned depending on which terminal you cat it into and even details of the terminals internal state.

When it comes to width considerations, really the only workable longterm solution is for all terminals to converge to a common standard for wcswidth()

Yeah this is really the crux of the issue. I agree that for ZJW emojis specifically width 2 makes sense (not so much for other graphemes tough IMO) but really as an app developer I can't take the specifics of various emulators into account. It would not be ideal to essentially test for know TERM_PROGRAMM values at startup to figure out what width to use (even if I did implemennt that, such a solution is pretty brittle over ssh). It's already a bit of a pain that kitty defaults to respecting emoji presentation/unicode 14, my code supports that but I have to figure out of if the emulator does and ultimately expose a user setting for that.

As an app developer your job is to write code that does the right width calculation. If this causes problems with X terminal emulator, that is a bug in that terminal emulator. Please don't try to do terminal emulator capability detection and workarounds, that way lies insanity. When it comes to width you will have to try to workaround not just terminal emulator differences, but also terminal multiplexer ones. It's simply not doable robustly. Do the right thing, tell users who have problems with X emulator to report the issue to that emulator.

In this case the right thing is pretty clear -- two cells. So everybody should converge on that. If XYZ terminal emulator doesn't want to do that, its not your job to accommodate that wrong decision. Wezterm is doing the right thing here, kitty is not, and I say this as the creator of kitty :) So use two as the width for such sequences in your code.

@wez: I say you dont need to do anything, you are already doing the right thing, keep doing it.

wez commented 1 year ago

Looking into this more, it's a PITA to undo the grapheme clustering work, so I've decided not to do it. Separately over in:

https://github.com/wez/wezterm/issues/4320

wezterm now reports "permanently enabled" to a DECRQM query from an application that wants to reason about the grapheme clustering behavior. wezterm, contour and (soon: there's a PR open) foot implement this query. Some terminals support setting this parameter at runtime. You can read more about this sequence over in https://github.com/contour-terminal/terminal-unicode-core

wez / wezterm

Improve compatability of unicode grapheme width #4223