unicode grapheme segmentation

Mr-Andersen commented 2 years ago

Basic information

zellij --version: 0.30.0
tput lines: 43
tput cols: 174
uname -av or ver(Windows): Linux aka-system 5.15.47 #1-NixOS SMP Tue Jun 14 16:36:28 UTC 2022 x86_64 GNU/Linux

Reproducing

paste "a̶b̶c̶" in Zellij

Expected

letters are strikethrough as GH renders them (hopefully)

Actual

It's just "abc"

I've checked this without Zellij, terminal renders them OK

Log

zellij-9.log

imsnif commented 2 years ago

Hey, so I'm totally able to see strikethrough text inside Zellij by entering the ANSI directly, like: echo -e "\033[9mI am strikethrough".

With pasting it doesn't work for me with or without Zellij (likely due to my pastebuffer not carrying this sort of styling data). Does this normally work for you with other styling? As in, if you copy bold text does it appear bold inside the terminal?

Mr-Andersen commented 2 years ago

@imsnif This is not ANSI, it is a Unicode character prepended before each (normal) character of a text piece, like described here

imsnif commented 2 years ago

I understand, hum. If you put it in an echo (outside of Zellij), do you see it in the echo? Also, which shell are you using?

Mr-Andersen commented 2 years ago

> fish --version
fish, version 3.4.1

Copy-pasting this piece into "echo" outside of zellij works: both the quoted piece and the echoed output render correctly

Inside zellij it doesn't work

> echo -e "a̶b̶c̶"
a̶b̶c̶

Mr-Andersen commented 2 years ago

Try copying a̶b̶c̶ and pasting it into your zellij

raphCode commented 2 years ago

can confirm with Arch and alacritty.

raphCode commented 2 years ago

To continue the discussion here:

strikethrough is achieved by appending a unicode modifier codepoint to any char that should be rendered strikethrough
this and other modifying codepoints have a width of zero
currently, we drop zero-width codepoints
I believe we should generally store zero-width codepoints together with the previous printable character in our character representation. This would also fix future problems with some diacritics or other unicode stuff not being displayed properly.
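To make the points above concrete, here is a minimal sketch that inspects the pasted text. It assumes the strikethrough is U+0336 (COMBINING LONG STROKE OVERLAY), which is what common strikethrough generators emit; `describe` is a hypothetical helper, and filtering on category `Mn` is only a stand-in for "drop zero-width codepoints", not zellij's actual logic.

```python
import unicodedata

# "abc" with U+0336 after every letter (assumed to match the pasted text).
text = "a\u0336b\u0336c\u0336"

def describe(s):
    """Return (codepoint, name, general category) for every codepoint in s."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "?"), unicodedata.category(ch))
        for ch in s
    ]

for row in describe(text):
    print(row)

# Dropping every nonspacing mark models "we drop zero-width codepoints":
# only the plain letters survive.
dropped = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
print(dropped)  # abc
```

The three letters carry three extra codepoints, so anything that discards them can only ever show "abc".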

tlinford commented 2 years ago

at the moment zero width characters are skipped: https://github.com/zellij-org/zellij/blob/7cd355efaf9de3221880da4ea06cbf804f6bca1e/zellij-server/src/panes/grid.rs#L1016-L1020

raphCode commented 2 years ago

Maybe we can draw inspiration from how tmux handles unicode stuff? I tried the strikethrough thing and it worked there.

tlinford commented 2 years ago

found this, but my C isn't really good enough to understand how the code works: https://github.com/tmux/tmux/commit/cf7b384c43b4a2c5a1bde8b4f6bfeee20ecad027

raphCode commented 2 years ago

Ah yes, C can be awful at times, but the comment at the top explains: Zero width characters are just appended onto the UTF-8 data for the previous cell. Which is basically my initial idea too and should cover most of the use cases.
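A rough sketch of the "append to the previous cell" idea, not a port of the tmux code. `approx_width` is a hypothetical stand-in built from Python's `unicodedata`; it is only an approximation of what a wcwidth-style lookup (or the unicode-width crate) would return.

```python
import unicodedata

def approx_width(ch):
    """Approximate column width: 0 for marks/format chars, 2 for wide, else 1."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def to_cells(text):
    """Split text into cells; zero-width codepoints join the previous cell."""
    cells = []
    for ch in text:
        if approx_width(ch) == 0:
            if cells:
                cells[-1] += ch  # append onto the previous cell's data
            # A zero-width codepoint with no previous cell is dropped here.
        else:
            cells.append(ch)
    return cells

print(to_cells("a\u0336b\u0336c\u0336"))
```

With this, each letter and its combining marks land in one cell, so stacked-diacritic text also stays attached to its base letters.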

I mean the craziest UTF8 shenanigans I saw in the wild was probably these z̶̝̜̀͠ả̷̗̎ḻ̶̌̿ģ̸͔̍ȯ̷̰̠ texts like they can be generated here. These work with tmux' approach. If one has concerns about wide characters, here is a list. Not even alacritty renders 𒐫 correctly and just keeps on drawing follow-up characters over the space this wide char occupies:

OnurKader commented 2 years ago

Sorry for commenting under this issue but I can't get non-standard kitty underlines and undercurlies to show up in zellij, I'm using alacritty (specifically alacritty-sixel). Are they supported, if not will it be possible to support them?

Testing with this: printf "\x1b[58:2::255:0:0m\x1b[4:1msingle\x1b[4:2mdouble\x1b[4:3mcurly\x1b[4:4mdotted\x1b[4:5mdashed\x1b[0m\n"

Alacritty without zellij: 2022-07-17_16-41-32

With zellij: 2022-07-17_16-45-16

Tried it out with wezterm, contour, and terminator; got similar results.

I haven't checked the code, maybe they fail to parse. The standard underline sequence, CSI 4 m, works:

2022-07-17_17-34-42
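One visible difference between the test string and plain CSI 4 m is the colon-separated sub-parameters (4:3, 58:2::255:0:0). Below is a minimal sketch of splitting SGR parameters into sub-parameters, purely for illustration. `parse_sgr` is a hypothetical helper, not zellij's parser, and whether zellij actually fails at this step is unverified.

```python
def parse_sgr(params):
    """Split the parameter part of a CSI ... m sequence.

    ';' separates parameters and ':' separates sub-parameters.
    An empty sub-parameter is returned as None.
    """
    result = []
    for param in params.split(";"):
        result.append([int(p) if p else None for p in param.split(":")])
    return result

print(parse_sgr("4"))              # plain underline
print(parse_sgr("4:3"))            # curly underline
print(parse_sgr("58:2::255:0:0"))  # underline color, RGB form
```

A parser that only splits on ';' would see "4:3" as a single malformed parameter, which is one way such sequences can get lost.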

Maybe some useful issues: wezterm and alacritty, iterm2.

Thank you!

christianparpart commented 2 years ago

The title of this ticket should probably be renamed to address the actual issue and not accidentally getting people to think that this is about unsupported VT sequences :)

I am not yet a zellij user but I'd like to try it out. To me however, it sounds like what you want to implement in zellij is proper grapheme cluster segmentation in order to deterministically decide which codepoints are assigned to which grid cell.

I fought hard on this back then, and I've actually created two documents out of it, first one was simply for me to not forget, and the second one to make people adopt what I think is right.

https://github.com/contour-terminal/contour/blob/master/docs/internals/text-stack.md
https://github.com/contour-terminal/terminal-unicode-core (this is a draft spec, but I was writing down the semantics of my implementation, that I think should be the way to go. feedback welcome :) )

imsnif commented 2 years ago

Thanks @christianparpart - this is really helpful! I know @tlinford wanted to work on this at some point (not sure if he has the time for this at the moment?) I think this is also very important to some Eastern Asian languages, whose users are a little mute on this issue tracker because (I guess) language barriers - but I've seen complaints on Twitter :)

tlinford commented 2 years ago

@christianparpart Thanks for the super useful info! I updated the title as suggested :).

Also, I couldn't figure it out from the repo: is the pdf of the spec available somewhere or do I need to generate it from the source? I had a look at the tex file and I think I got the basic idea of it at least :)

Since the spec says: Backwards compatibility is retained by leaving everything as undefined as it used to be without this specification when this mode is not enabled.; what do you think would be a reasonable approach in that situation? Could appending zero-width characters to the previous non-zero-width character be a good starting point?

christianparpart commented 2 years ago

> Could appending zero-width characters to the previous non-zero-width character be a good starting point?

What you want to implement is a Unicode breakable(prev, next) function that tests whether the consecutive codepoints prev and next can be broken up. If they can be broken up, they do not belong to the same grapheme cluster; otherwise they do. See https://github.com/contour-terminal/libunicode/blob/master/src/unicode/grapheme_segmenter.h#L72 for a starting point. Mind, there are some corner cases to watch for, such as cursor movements in between.

EDIT: if anything in the spec is unclear, any kind of feedback to that is welcome.
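A deliberately simplified sketch of such a breakable(prev, next) test. It is not the libunicode implementation and covers only a small part of UAX #29: no break between CR and LF, and no break before an extending codepoint or ZWJ. `is_extend` is a hypothetical approximation of the Extend property, and rules that need more than two codepoints of context (emoji ZWJ sequences, regional-indicator pairs) are left out.

```python
import unicodedata

ZWJ = "\u200d"
ZWNJ = "\u200c"

def is_extend(ch):
    """Approximate Grapheme_Cluster_Break=Extend (assumption, not the real table)."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return True
    if ch == ZWNJ:
        return True
    # Tag characters, as used in subdivision flag sequences.
    return 0xE0020 <= ord(ch) <= 0xE007F

def breakable(prev, nxt):
    """Return True if a grapheme cluster boundary may fall between prev and nxt."""
    if prev == "\r" and nxt == "\n":
        return False
    if is_extend(nxt) or nxt == ZWJ:
        return False
    return True

def segment(text):
    """Group codepoints into clusters using only the pairwise test."""
    clusters = []
    for ch in text:
        if clusters and not breakable(clusters[-1][-1], ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(segment("a\u0336b\u0336c\u0336"))
```

Each resulting cluster would then map to one grid cell (or two, for wide clusters), which is the deterministic assignment described above.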

har7an commented 2 years ago

Follow up from some team-internal discussion

The main problem with respect to fixing this in our code is that we need to enhance TerminalCharacter to store these additional codepoints. At the moment a single TerminalCharacter is a single unicode char. Multiple ideas come to mind:

1. We could add some Vec to it, but that isn't Copy, which is likely to impact performance.
2. Do 1 but use SmallVec instead, which resides on the stack most of the time.
3. Draw inspiration from alacritty or our Sixel implementation, where such codepoints live outside the actual characters (in extra data structures).

Things to consider when "fixing" this

Keep an eye out on performance: Any addition to handle the additional codepoints should try to impact performance as little as possible.
We must determine whether unicode-width can handle all types of unicode characters, or whether there are exceptions. This is important for line-breaking (i.e. if a unicode char with added codepoints sticks out of the pane, it must be printed in the next line).
When going for method 3 (see above), ensure that the additional codepoints stored outside of TerminalCharacter are invalidated correctly when the respective character is dropped.
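A toy model of method 3 and its invalidation requirement, written in Python only to show the bookkeeping. `Grid`, `put`, `attach` and `render` are hypothetical names, not zellij types, and the sketch says nothing about the Copy or performance concerns.

```python
class Grid:
    """Toy grid: one base char per cell, extra codepoints in a side table."""

    def __init__(self, rows, cols):
        self.cells = [[" "] * cols for _ in range(rows)]
        self.extras = {}  # (row, col) -> string of zero-width codepoints

    def put(self, row, col, ch):
        """Write a base character and invalidate stale extras for that cell."""
        self.cells[row][col] = ch
        self.extras.pop((row, col), None)

    def attach(self, row, col, zero_width):
        """Attach zero-width codepoints to the character already in the cell."""
        self.extras[(row, col)] = self.extras.get((row, col), "") + zero_width

    def render(self, row):
        """Emit each base char followed by its extras, if any."""
        return "".join(
            ch + self.extras.get((row, col), "")
            for col, ch in enumerate(self.cells[row])
        )

grid = Grid(1, 3)
grid.put(0, 0, "a")
grid.attach(0, 0, "\u0336")
before = grid.render(0)
grid.put(0, 0, "x")  # overwriting the cell must also drop its extras
after = grid.render(0)
```

The cell itself stays a single character; the cost is that every path that overwrites, scrolls or clears a cell has to touch the side table as well.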

airone01 commented 1 year ago

It's also worth noting that this applies to more complex setups like here with AstroNVim in play. Just thought I'd drop this here for some inspiration maybe, thanks for your work!

Demo

With asciinema (looks better for me)

What I see (looks worse for me): ![zellij_astronvim](https://github.com/zellij-org/zellij/assets/21955960/0baf552d-033c-46b6-835f-3f3910d7d940)

Wanted result reference: ![image](https://github.com/zellij-org/zellij/assets/21955960/c2c31542-dc56-4f69-991a-2ee1c82c27a4)

Reproducible with vanilla AstroNVim. I use Windows Terminal and FiraCode NF

MoSal commented 8 months ago

You should see how current alacritty_terminal versions do this in their cells. tmux's implementation fails the "England flag test":

🏴󠁧󠁢󠁥󠁮󠁧󠁿

This is a wide char followed by six (yes, six) zero-width chars.
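The sequence can be inspected directly. This sketch spells the flag out with escapes (U+1F3F4 followed by tag characters) and only assumes that Python's `unicodedata` tables are recent enough to name them.

```python
import unicodedata

# WAVING BLACK FLAG, then tag characters spelling "gbeng", then CANCEL TAG.
flag = "\U0001F3F4\U000E0067\U000E0062\U000E0065\U000E006E\U000E0067\U000E007F"

for ch in flag:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "?"))

tags = flag[1:]
# Tag characters mirror ASCII at an offset of 0xE0000, so the payload is readable.
payload = "".join(chr(ord(ch) - 0xE0000) for ch in tags[:-1])
print(len(tags), payload)  # 6 gbeng
```

All six trailing codepoints are format characters (category Cf), so an approach that only attaches combining marks to the previous cell would still need to account for them.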

> We must determine whether unicode-width can handle all types of unicode characters,

It can't. But terminal emulators use it. So using it would match their buggy behavior.

Print this in a terminal emulator to see what I mean:

⸻ A ⸻ A ⸻ A ⸻

This is extra fun because it's a THREE-EM DASH (⸻) instead of the TWO-EM DASH (⸺), so its width is 3. The notion of "wide char" is limited to 2. And both dashes have a Unicode width of 1 anyway ;)

A more extreme example:

﷽

This is one single-width (lol) character.
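For illustration, the East Asian Width property (one of the inputs that width tables are typically derived from) can be queried from Python's stdlib. `is_wide` is a hypothetical helper, and this is not the unicode-width crate itself.

```python
import unicodedata

def is_wide(ch):
    """True if the East Asian Width property marks ch as Wide or Fullwidth."""
    return unicodedata.east_asian_width(ch) in ("W", "F")

samples = ["\u2e3a", "\u2e3b", "\ufdfd", "\u4e2d"]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "?"), is_wide(ch))
```

Neither dash nor the ligature is classified as wide, while a CJK ideograph is, so a table-driven width of 1 for them is consistent with the property data even though the glyphs render much wider.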

cosmic-term renders both cases correctly btw. But doesn't wrap lines correctly atm (working on it), which is the complication you anticipated. But I don't think zellij needs to worry about it, because as I mentioned, (most) terminal emulators do rely on unicode-width. It is maybe buggy, but it's predictable and simple, and apparently serviceable, since there don't appear to be many complaints.

A proper fix would require zellij to be in sync with how every terminal emulator does its rendering, which I don't think is a feasible goal.

MoSal commented 8 months ago

As it turns out, using something more sophisticated than unicode-width would not only require cooperation from the terminal emulator, but also from the app (e.g. a user's shell of choice).

So fancy approaches are definitely out, and relying on unicode-width would be required to keep zellij in sync with both terminal emulators and applications.

christianparpart commented 8 months ago

> So fancy approaches are definitely out

I do not understand what you mean by this. However, I feel like just going with your "But terminal emulators use it. So using it would match their buggy behavior" will make the world stop and fall apart (quite intense correlation, but I hope one can understand it). We will never improve with killer arguments like these. I'd rather prefer the especially relatively young Rust community to fix their stuff and improve their APIs rather than relying on their v0 solutions. :)

MoSal commented 8 months ago

> I'd rather prefer the especially relatively young Rust community to fix their stuff and improve their APIs rather than relying on their v0 solutions

I shared similar enthusiasm when I wrote my first comment... because I didn't know any better.

And I'm not sure what you mean by "their stuff" and "v0 solutions".

Do "v\<larger than 0> solutions" exist anywhere?

For this to work, you basically need a mechanism where the app (say zsh), a possible container terminal emulation app (say zellij), and a fancy (rendering-wise) terminal emulator (say cosmic-term) all have access to the same rendering engine, one that can better guess the real width of characters. This is the imaginary fancy approach I was referring to.

And if such a thing actually existed, would that fully solve the problem?

Here is an extra kicker. It wouldn't. Not if the interface is still char => width-based.

Here is an example where, for a change, rendering width undershoots Unicode width.

The Arabic letter/char ل (sounds like l) is single-width. Unicode and the renderer would agree on this. Ditto for the Arabic letter ا (sounds like a).

But if an ل is followed by an ا, then the combination could possibly be rendered as the single-width glyph of the char ﻻ (la). That char is that combination. It's not a letter.

Think of it as if w was not a letter, but a char representing a glyph for rendering doubles of u's ;)

So even if everyone had access to a fancy char => width interface, they would still get the wrong width of 2 instead of 1 for ﻻ.
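The relationship between the two letters and the ligature char can be checked with the stdlib. `naive_width` is a hypothetical per-codepoint width interface used only to show the mismatch; what a given font actually draws is not modelled.

```python
import unicodedata

lam, alef = "\u0644", "\u0627"
ligature = "\ufefb"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

# The presentation-form char has a compatibility decomposition into lam + alef.
decomposed = unicodedata.normalize("NFKD", ligature)

def naive_width(s):
    """One column per codepoint, zero for combining marks."""
    return sum(0 if unicodedata.category(ch) in ("Mn", "Me") else 1 for ch in s)

print(decomposed == lam + alef)
print(naive_width(lam + alef), naive_width(ligature))
```

A per-codepoint interface reports 2 columns for the pair of letters and 1 for the ligature char, even though a shaping renderer may draw both as the same single-width glyph.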

~~TL;DR~~ Still TL

Terminals were never meant to handle complex text.
Unicode width provides a standard, consistent, and usable improvement over the assumption of "all chars are single width".
Both terminal emulators and applications are making use of Unicode width, which makes the interfacing between them work consistently, and as expected.
Only the rendering side of things suffers when the rendered width doesn't match Unicode width. Both overshooting and undershooting cases exist, and mismatches can be context-dependent.
To anyone who thinks this can easily be improved upon, I would simply say la-la. That's Arabic for no-no. I'm not referring to the other meaning.

christianparpart commented 8 months ago

I'm still trying to understand what your basic take on non-trivial codepoints now is. 🤔

I know that TEs were not meant to handle complex text, and neither was Windows 3.1 made to run the apps that run today. They've moved along with time, for good. Now. Just imagine they didn't.

I'm not trying to convert you here. If you have a stance, that's absolutely okay, but maybe it should be officially stated (probably in the /README.md) then? So that others don't spend time arguing without knowing that they'll never progress. Similar to Alacritty not wanting to do complex unicode nor wanting to have images, let alone Sixels. The tickets over there are not getting closed, which I quite frankly find very sad, but the author there made his stance very clear at least. :-)

> To anyone who thinks this can easily be improved upon, I would simply say la-la. That's Arabic for no-no. I'm not referring to the other meaning

To be honest, then simply just don't write it out to the public. Be straight in your communication. Be clear. This is very important in chat-based communication, because it can be very easily accidentally misunderstood.

MoSal commented 8 months ago

@christianparpart

Apologies if I gave the wrong impression, but neither were my previous comments written in any official capacity, nor do I officially represent any relevant project.

ronisbr commented 6 months ago

Hi!

Is there any news about this issue? Unfortunately it completely breaks my development :(

In Julia, I use a lot of zero-width characters to represent mathematical operations like time-derivative:

(see the dot on top of a which in Julia we can obtain by typing a\dot<TAB>)

If I do the same thing in Zellij, I get:

which is very bad since we cannot see the difference between the variables.

ronisbr commented 6 months ago

Hi!

It turns out that this bug appears in more situations than those mathematical symbols. In macOS, Apple uses decomposed Unicode by default to create accents, for example. Hence, the glyph á in a filename created in Finder is composed of two Unicode code points, where the second has zero width.
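A small sketch of what "decomposed" means here, assuming the filenames are in a canonically decomposed form comparable to NFD (the exact normalization the filesystem applies is not verified in this thread):

```python
import unicodedata

precomposed = "\u00e1"  # á as a single codepoint
decomposed = unicodedata.normalize("NFD", precomposed)

for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

# Recomposing gives back the single codepoint.
recomposed = unicodedata.normalize("NFC", decomposed)
print(len(precomposed), len(decomposed), recomposed == precomposed)
```

The decomposed form is the letter a followed by U+0301 COMBINING ACUTE ACCENT, so dropping zero-width codepoints turns every accented letter in such a filename into its bare base letter.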

Hence, this is the output of ls in a folder containing folders with Brazilian Portuguese words:

This is the same output inside Zellij:

silicakes commented 2 months ago

Not sure this is the right place, but didn't want to issue-spam with something new. Using macOS iterm2 + neovim + treesitter + markdown_inline produces proper strikethrough as seen here:

Running this via zellij produces the following:

Using kitty gives the same result:

Pure terminal:

Zellij:

ronisbr commented 2 months ago

Hi @silicakes !

Yes, you are very likely seeing the same bug I am seeing regarding characters with 0 width.
