Support RTL languages - Githubissues

Tyriar commented 7 years ago

Downstream issue: https://github.com/Microsoft/vscode/issues/28571

When we enforced Unicode character width in https://github.com/sourcelair/xterm.js/issues/467, this broke RTL language characters, as they are now rendered in reverse (LTR). We could revert that for RTL character ranges only, but we should do the right fix and reverse the strings so they're actually on the character grid, as the new selection model relies on all characters lining up perfectly on the grid: https://github.com/sourcelair/xterm.js/pull/670

Ideally line reflow https://github.com/sourcelair/xterm.js/issues/622 would be done before this so it's easier to change the contents of multiple lines.

Terminal.app:

VS Code 1.13 (notice sentences are reversed):

@mostafa69d @CherryDT a little info on the languages in question would be handy:

Where should the strings be flipped? For Hebrew/Arabic/Persian, do I reverse entire continuous sequences of characters in-between ASCII characters?
How are the characters meant to interact with characters like 0-9 or punctuation?

Useful references:

CherryDT commented 7 years ago

It is actually a whole lot more complicated and includes statefulness and even mirroring certain characters. I'd say it's a science of its own. (And I have the deepest respect for those people who wrote robust text rendering libraries that handle all the BiDi issues properly, so I don't have to mess around with it, to be honest.)

See also:
https://en.wikipedia.org/wiki/Bi-directional_text (good overview)
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
https://www.w3.org/International/tutorials/svg-tiny-bidi/ (the initial premise is not related but it explains a few things better than the previous link)
https://github.com/fevangelou/doctype-mirror/tree/master/bidihowto/bidi-support-in-a-ui

EDIT: I think the way the new selection works may actually be unexpected because it is going to behave differently than VSCode itself. For example, given the text "The song מדינת קומבינה makes me think", when I start selecting at "The" and end between the two Hebrew words, I will have selected "The song מדינת", while in the console I will have selected "The song קומבינה".

See example:

However it will still be better than how Sublime Text "works" last time I checked, because there you will see one thing selected but copy another, which is very annoying.

mostafa-drz commented 7 years ago

@Tyriar First of all, I'm going to give you a very brief overview of the Arabic and Persian languages; maybe it will help (I'm not sure whether Hebrew is the same). In Arabic and Persian the letters are, for example, "آ", "ب", "س" and so on. Words are made from these letters (obviously), but by a rule very different from, for example, English. The difference is that some letters have more than one shape, like "س". The first shape is "س", the second is "سـ", another is "ـسـ" and the last is "ـس". What are these shapes for? The shape we use depends on where the letter appears in a word. For example, for the letter "س" we use the shape "سـ" when a word starts with it, as in "سلام".

This is the problem, and the actual difference between a language like English and Persian or Arabic. We form words in these languages by concatenating the different shapes of the letters (in some cases they are joined together). Again, I want to highlight this rule: we form words by concatenating the shapes, not the letters (in English it is always the letters that are concatenated). You can see some examples below. We have the letters "ک" "ن" "ا" "د" "ی", and I make these words using just those letters: نادان, یاد, دکان.

So, to wrap it up and explain what happened in the screenshots I posted: the terminal breaks the words into letters and reverses them (so it's not just about reversing). Take a look at the words I created and the letters I mentioned before; the VS Code terminal now shows them "separated" and "reversed".

Correct format: نادان
Terminal: ن ا د ا ن

Correct format: یاد
Terminal: د ا ی

Correct format: دکان
Terminal: ن ا ک د
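The positional shapes described above can be inspected directly, because Unicode also encodes them as separate compatibility code points (the Arabic Presentation Forms-B block). The following is a minimal Python sketch using only the standard `unicodedata` module; the helper names are made up for illustration, and ordinary text normally stores only the base letter U+0633 and leaves the choice of shape to the renderer.

```python
import unicodedata

SEEN = "\u0633"  # ARABIC LETTER SEEN, the letter as normally stored in text

# Presentation-form code points for the four positional shapes of SEEN.
SEEN_FORMS = {
    "isolated": "\uFEB1",
    "final": "\uFEB2",
    "initial": "\uFEB3",
    "medial": "\uFEB4",
}

def base_letter(ch: str) -> str:
    """Map a presentation form back to the letter it is a shape of."""
    return unicodedata.normalize("NFKC", ch)

def shape_of(ch: str) -> str:
    """Return the positional tag from the decomposition, e.g. 'initial'."""
    decomposition = unicodedata.decomposition(ch)  # e.g. '<initial> 0633'
    if decomposition.startswith("<"):
        return decomposition.split()[0].strip("<>")
    return ""  # ordinary letters carry no positional tag

for position, form in SEEN_FORMS.items():
    print(position, unicodedata.name(form), "->", unicodedata.name(base_letter(form)))
```

All four forms normalize back to the same letter, which is why splitting a word into individually rendered cells loses the joined shapes: each cell only sees one letter and falls back to its isolated form.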

Now your questions:

Where should the strings be flipped? For Hebrew/Arabic/Persian, do I reverse entire continuous sequences of characters in-between ASCII characters?

I don't have any idea about Hebrew, but in Arabic and Persian a sequence of characters should be flipped when it encounters a space character (the word separator is a space), like this: "من در حال نوشتن هستم", but it should still keep the "shapes" and the necessary joining.

How are the characters meant to interact with characters like 0-9 or punctuation?

For numbers and punctuation the rules are the same as in English: the numbers and punctuation marks follow the characters, like this: ?من در سال "۱۳۶۹" به دنیا آمدم. من در سال "1369" به دنیا آمدم. Actually, a sequence of characters containing both RTL and non-RTL characters is a whole different story; if you need more information, I can elaborate on that.

P.S 1: Here is some source code written to solve the same problem in PHP (for old versions, certainly); you can take a look: https://github.com/slashmili/php-gd-persian/blob/master/phpgd/fagd.php

P.S 2: Here is a resource on wikipedia about the Persian characters https://en.wikipedia.org/wiki/Persian_alphabet

P.S 3: Again, I have to mention that in the previous version of VS Code, everything was fine.

P.S 4: About the problem with selecting a word next to some LTR characters, like <p>اینجا را بخوانید</p>, which @CherryDT mentioned: there are some minor bugs, but they don't bother me and I found quick workarounds for them. (Still, if you need some elaboration on those, let me know.)

saeedhei commented 7 years ago

After updating my VS Code, everything is reversed. That is very bad; please solve this problem. I want to downgrade; which version is okay?

amitbeck commented 7 years ago

@mostafa69d Luckily, in Hebrew that barely exists. Hebrew letters stay mostly the same in any position inside a word, except for a few letters: כ, which turns into ך; מ, which turns into ם; נ, which turns into ן; פ, which turns into ף; and finally צ, which turns into ץ. This makes Hebrew easier to format, I guess.

CherryDT commented 7 years ago

However these are still separate characters (in terms of character encoding) and always display the same. They do not change appearance when moved around. (It's the writer's job to use the right letter - sofit or not - at the right position.)
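The point that the final (sofit) letters are independent code points can be checked with a short Python sketch using the standard `unicodedata` module. The pair list and the helper name are illustrative only.

```python
import unicodedata

# (regular, final) pairs from the comment above, written as code points.
PAIRS = [
    ("\u05DB", "\u05DA"),  # kaf / final kaf
    ("\u05DE", "\u05DD"),  # mem / final mem
    ("\u05E0", "\u05DF"),  # nun / final nun
    ("\u05E4", "\u05E3"),  # pe / final pe
    ("\u05E6", "\u05E5"),  # tsadi / final tsadi
]

def is_independent_letter(ch: str) -> bool:
    """True if the code point does not decompose or normalize to another one."""
    return (unicodedata.decomposition(ch) == ""
            and unicodedata.normalize("NFKC", ch) == ch)

for regular, final in PAIRS:
    print(unicodedata.name(regular), "/", unicodedata.name(final),
          is_independent_letter(final))
```

Unlike the Arabic presentation forms, none of these final letters maps back to its regular counterpart, so a renderer never has to pick between them.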

MortadaAK commented 6 years ago

The problem with splitting the characters is that when each one is wrapped in its own span, letters that need to connect are cut apart, which misrepresents their shape (Arabic letters).

To fix the problem, these characters must be kept within one span, or not wrapped at all.

The Unicode ranges for all of these letters are:

Arabic (0600–06FF, 255 characters)
Arabic Supplement (0750–077F, 48 characters)
Arabic Extended-A (08A0–08FF, 73 characters)
Arabic Presentation Forms-A (FB50–FDFF, 611 characters)
Arabic Presentation Forms-B (FE70–FEFF, 141 characters)
Rumi Numeral Symbols (10E60–10E7F, 31 characters)
Arabic Mathematical Alphabetic Symbols (1EE00–1EEFF, 143 characters)
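The ranges listed above translate directly into a membership test. This is a small Python sketch of such a check, not code from xterm.js; the names `ARABIC_BLOCKS` and `in_arabic_block` are made up for illustration.

```python
# Block boundaries copied from the list above (inclusive on both ends).
ARABIC_BLOCKS = [
    (0x0600, 0x06FF),    # Arabic
    (0x0750, 0x077F),    # Arabic Supplement
    (0x08A0, 0x08FF),    # Arabic Extended-A
    (0xFB50, 0xFDFF),    # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),    # Arabic Presentation Forms-B
    (0x10E60, 0x10E7F),  # Rumi Numeral Symbols
    (0x1EE00, 0x1EEFF),  # Arabic Mathematical Alphabetic Symbols
]

def in_arabic_block(ch: str) -> bool:
    """True if the single character falls inside one of the listed blocks."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ARABIC_BLOCKS)

print(in_arabic_block("\u0633"), in_arabic_block("a"))
```

A block test like this is coarse: it also matches unassigned code points and digits inside the blocks, and it does not cover Hebrew or other RTL scripts.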

wis commented 6 years ago

required reading: https://opensource.com/life/16/3/twisted-road-right-left-language-support

from https://github.com/Microsoft/vscode/issues/28571#issuecomment-307991443

do you have an example of another terminal that handles this well?

mlterm seems to be better than the average (non-web-based) terminal. (screenshot: 2018-11-15-023232_577x981_scrot) It is cursive but in some cases cut off; I think that can be solved by changing the font. This paragraph was copied from Wikipedia. The blue characters are the RTL mark; that's how vim is outputting them, and mlterm is rendering them in blue.

Tyriar commented 5 years ago

The character joiner API might be able to solve this, we could probably make all adjacent arabic/hebrew/etc. unicode characters join and be drawn in the same glyph.
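To make the joining idea concrete, here is a Python sketch of the kind of function such a joiner would need: given a line of text, it returns the index ranges of adjacent RTL characters that should be drawn as one unit. It uses the Unicode bidi class from the standard `unicodedata` module. The function name is hypothetical, the (start inclusive, end exclusive) range shape is an assumption about what a joiner handler is expected to return, and Python indexes by code point whereas JavaScript strings index by UTF-16 code unit.

```python
import unicodedata

def rtl_join_ranges(text: str) -> list:
    """Return (start, end) ranges of two or more adjacent RTL characters.

    Characters with bidi class R or AL start or extend a run; non-spacing
    marks (NSM) extend a run that is already open.
    """
    ranges = []
    start = None
    for i, ch in enumerate(text):
        bidi = unicodedata.bidirectional(ch)
        in_run = bidi in ("R", "AL") or (bidi == "NSM" and start is not None)
        if in_run and start is None:
            start = i
        elif not in_run and start is not None:
            if i - start > 1:
                ranges.append((start, i))
            start = None
    if start is not None and len(text) - start > 1:
        ranges.append((start, len(text)))
    return ranges

print(rtl_join_ranges("ls \u0633\u0644\u0627\u0645 ok"))  # [(3, 7)]
```

Spaces end a run here, so each word is joined separately; whether whole phrases should be joined instead is a design choice this sketch does not settle.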

babakks commented 5 years ago

For what it's worth, the debug console works well with RTL texts. This is what I've tried: (screenshot: code) And this is the output on the debug console: (screenshot: debug) But the terminal is still the same:

I'm using VS Code - Insiders v1.31.0.

elieobeid7 commented 5 years ago

@babakks As far as I know, only two terminals on Linux can output RTL correctly: konsole and mlterm. They are available in the repos of all distros.

MortadaAK commented 5 years ago

@elieobeid7 @babakks The macOS Terminal outputs RTL correctly

Tyriar commented 5 years ago

I put out a PR to fix this; if anyone wants to test out the branch, that would be useful, as I don't speak these languages. https://github.com/xtermjs/xterm.js/pull/1899

To test:

git clone https://github.com/Tyriar/xterm.js
cd xterm.js
git checkout 701_rtl_support
yarn
yarn watch

# in another terminal
yarn start

You may need to install some dependencies first: https://github.com/Microsoft/node-pty#dependencies

egmontkob commented 5 years ago

Please hold off for a little bit :)

I've recently been studying and evaluating existing docs and implementations of RTL in terminals, and have come up with a (draft) recommendation. I'll release it real soon now.

It's way more complicated than one would first think. A bit of a spoiler: if you start shuffling the characters around according to the BiDi algorithm, it becomes literally, mathematically provably impossible to have a proper BiDi-aware text editing/viewing experience (e.g. vim, emacs...) on top of that platform. (And to respond to the previous few comments: no, konsole, mlterm and macOS Terminal don't get it right either.)

Tyriar commented 5 years ago

@egmontkob does this take into account the fact that we get to leverage the browser's bidi support? All my change does is force related unicode sequences to be drawn together not as separate characters. This is probably wrong when the cursor is over the character but it seems to work other than that.

babakks commented 5 years ago

@Tyriar Sorry Tyriar, but it's still wrong. I commented under the pull request. https://github.com/xtermjs/xterm.js/pull/1899#issuecomment-455333377

egmontkob commented 5 years ago

The spec defines what the canvas needs to look like after receiving some data. The spec doesn't care what the backend of the terminal emulator is (e.g. a graphical canvas, or a browser (HTML DOM), or another terminal emulator (tmux)); it's the terminal emulator's task to implement the specified behavior by whatever means.

And one aspect of the specified behavior is that in some circumstances the character cells need to be shuffled according to the BiDi algorithm (for display purposes only, not affecting the actual storage), because that's the only reasonable way to get simple utilities like "cat" produce the desired output; and in some other circumstances the cells mustn't be rearranged, because that's the only way vim/emacs/whoever can do their own BiDi. There are escape sequences controlling this behavior. And there's much-much more to the story than this.
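The separation described above (reorder for display only, never in storage, and only when the application has not opted out) can be sketched in a few lines of Python. Everything here is hypothetical: the `Line` class, the `implicit_bidi` flag standing in for whatever the escape sequences toggle, and `reverse_all`, which is only a placeholder for a real BiDi implementation.

```python
from typing import Callable, List

Reorder = Callable[[List[str]], List[str]]

class Line:
    """One terminal line; cells are always stored in logical (received) order."""

    def __init__(self, text: str) -> None:
        self.cells: List[str] = list(text)
        # True: the terminal reorders for display (useful for tools like cat).
        # False: cells are shown as stored, so an editor can do its own BiDi.
        self.implicit_bidi = False

    def display(self, reorder: Reorder) -> List[str]:
        """Return the cells in visual order without touching the storage."""
        if self.implicit_bidi:
            return reorder(list(self.cells))
        return list(self.cells)

def reverse_all(cells: List[str]) -> List[str]:
    """Placeholder for a real BiDi algorithm: simply flips the whole line."""
    return cells[::-1]

line = Line("abc")
print(line.display(reverse_all))  # shown as stored
line.implicit_bidi = True
print(line.display(reverse_all))  # reordered copy; line.cells is unchanged
```

Because `display` always works on a copy, switching the mode back and forth never loses the logical order that selection, copy and editing depend on.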

egmontkob commented 5 years ago

Please see the published draft BiDi specification at https://terminal-wg.pages.freedesktop.org/bidi/ . Comments, improvement ideas etc. are welcome over there in its issue tracker.

roseMix commented 3 years ago

I just had this issue in the VS Code terminal. Is there still no fix for this?

munael commented 3 years ago

Not sure what the current state of this issue is? Some old comments mention PRs fixing it, but it's still active in the latest vsc insiders. :'(

amir-nejad commented 2 years ago

I have the same issue.

This issue is like the problem in Adobe Photoshop. In Photoshop, we can go to the language settings and enable the Middle Eastern features to fix it. But we do not have any such solution in VS Code.

Check this link: https://graphicdesign.stackexchange.com/questions/18005/how-can-i-get-farsi-arabic-text-to-render-correctly-in-photoshop

I have Windows 11 and I do not have any problem when running code manually in CMD or PowerShell. But VS Code is not working correctly.

Please fix this.

eyaler commented 2 years ago

hi. i wanted to migrate from pycharm to vscode, but this is a blocker. in pycharm the terminal works fine with RTL. was anyone able to get the vscode terminal to work for Hebrew?

par3ae commented 2 years ago

@Tyriar Hi dear Daniel, excuse me. I am using the latest version of VS Code and I still have this issue with RTL languages. I read all the messages in this issue, but I don't understand the approach to solving it in the VS Code integrated terminal.

Would you please guide me on how to solve this?

Tyriar commented 2 years ago

@par3ae this needs a bunch of research to figure out how to solve it properly which I haven't had time to do.

starball5 commented 1 year ago

Related question on Stack Overflow: Why doesn't the VS Code integrated terminal support RTL (right-to-left) text?

andjc commented 1 year ago

Arabic and Hebrew script have been mentioned, but there are many more scripts that require bidi support. It is also not just a question of bidirectional text: all writing systems requiring complex rendering seem to be affected. Every South Asian and South East Asian script I tried was broken, as were quite a few African scripts.

jerch commented 1 year ago

@andjc Yes - to put it bluntly, Unicode in terminal emulators is broken when it comes to script systems outside of the Latin/Greek-derived systems. Mainly 3 things are missing in xterm.js:

cluster/grapheme segmentation: While that's quite easy to get done on the codepoint/data level, it raises serious questions about cursor mechanics and how to address/edit perceivable characters later on. There are several IME helpers to get things initially expressed, but nothing specifies how the terminal cursor should move across that or make things editable afterwards.
proper width handling on a grid system: Terminals still stick to the now wonky half vs. full width separation (1 cell vs 2 cells in terminals) based on the East Asian Width property. But Unicode made it pretty clear that it's not the right way to lay out glyphs; instead it depends on more complex rules from clustering and combining, even leading to fractions of cells (+ depending on the font devs' choice). And they don't answer the question of how a strictly monospaced environment, not knowing the glyphs' width in advance, should treat that. You will see the side effects of this underspecification in any monospaced GUI editor, where it suddenly breaks out of the grid system on multiple combined/clustered chars. But a terminal cannot do this the same way.
bidi: @egmontkob did a great job speccing a proposal for how to solve that for terminals (see above); still, there are some surprising side effects when it comes to line progression/cursor advance. All older attempts (even DEC already had an RTL setting) are useless these days, as Unicode made the line progression a codepoint property (the old systems worked as strictly LTR or RTL).

Regarding bidi and xterm.js - since we have no devs with an RTL background, it is unlikely to be adopted soon. Speaking for myself - I have literally no clue about bidi mechs, and would just end up messing with a system I don't know/understand. Ofc PRs in this regard are more than welcome, but to anyone who is up for this - you'd better have a strong affiliation/dedication to scripting system mechs, or things will get really frustrating.
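The 1-cell versus 2-cell rule mentioned above can be sketched in Python with the standard `unicodedata` module. This is a deliberately simplified width function for illustration, not the table xterm.js uses: wide and fullwidth characters take 2 cells, non-spacing and enclosing marks take 0, everything else takes 1.

```python
import unicodedata

def cell_width(ch: str) -> int:
    """Simplified terminal width of one code point, in grid cells."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # combining mark: drawn on top of the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # East Asian wide or fullwidth
    return 1

def line_width(text: str) -> int:
    """Sum of per-code-point widths; knows nothing about clusters or fonts."""
    return sum(cell_width(ch) for ch in text)

print(cell_width("a"), cell_width("\u6F22"), cell_width("\u0301"))
```

The weakness is visible in the signature: width is decided per code point, so a cluster whose rendered width depends on shaping or on the font cannot be measured this way.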

andjc commented 1 year ago

@jerch, assuming a solution to the extended grapheme clusters is found, it will then be necessary to rethink the grid system. East Asian Width property and two cell sizes may not be adequate, some grapheme clusters become quite complex, and if the base character is wide to start with ...

Bidi is one issue, but visual versus logical ordering is another issue:

Take grapheme "ကြွေ" (U+1000 U+103C U+103D U+1031), U+1031 is rendered at the beginning of the cluster but is the fourth character in the string.

Honestly, it quickly becomes extremely complex to implement. Most terminals work best for LCG, and even for LCG they don't necessarily play nice with all input frameworks either.
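The Myanmar example above can be taken apart in Python to show the gap between stored order and what a user perceives as one character. The clustering below is a naive stand-in (a base character plus any following marks), not the Unicode grapheme cluster rules, and the function names are made up for illustration.

```python
import unicodedata

CLUSTER = "\u1000\u103C\u103D\u1031"  # the four code points quoted above

def describe(text: str) -> list:
    """List each code point with its Unicode name, in stored (logical) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def naive_clusters(text: str) -> list:
    """Group each base character with the marks (category M*) that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

for code_point, name in describe(CLUSTER):
    print(code_point, name)
print(len(CLUSTER), "code points,", len(naive_clusters(CLUSTER)), "cluster")
```

The vowel sign is last in storage but drawn first, so a grid that maps code points to cells in stored order cannot place this cluster correctly without shaping it as a whole.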

socketpair commented 1 year ago

one more example:

echo '"qwe \u05e9\u05dc\u05d5\u05dd 123 \u043f\u0440\u0438\u0432\u0435\u0442"' | jq -r

should give:

Note that 123, which comes after the Hebrew word (in byte order), should be rendered to the left of the Hebrew word.

gnome-terminal supports this correctly.
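Why the number ends up to the left of the Hebrew word can be shown with a heavily reduced sketch of the Unicode Bidirectional Algorithm in Python. It assumes an LTR paragraph and only handles letters, digits and neutrals; it ignores explicit embeddings, mirroring, separators and everything else in the real algorithm, so treat it as an illustration of this one example rather than an implementation.

```python
import unicodedata

def visual_order(text: str) -> str:
    """Reorder logical text for display in an LTR paragraph (reduced rules)."""
    n = len(text)
    # Step 1: coarse classes: "L", "R", "NUM" (number) or "N" (neutral).
    classes = []
    last_strong = "L"  # the paragraph direction is assumed to be LTR
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            cls = last_strong = "L"
        elif bidi in ("R", "AL"):
            cls = last_strong = "R"
        elif bidi == "EN":
            # European digits after LTR text behave like LTR text.
            cls = "L" if last_strong == "L" else "NUM"
        elif bidi == "AN":
            cls = "NUM"
        else:
            cls = "N"
        classes.append(cls)
    # Step 2: a run of neutrals becomes R only if both neighbours are R-like
    # (numbers count as R here); otherwise it follows the LTR paragraph.
    i = 0
    while i < n:
        if classes[i] != "N":
            i += 1
            continue
        j = i
        while j < n and classes[j] == "N":
            j += 1
        before = classes[i - 1] if i > 0 else "L"
        after = classes[j] if j < n else "L"
        rtl_sides = before in ("R", "NUM") and after in ("R", "NUM")
        for k in range(i, j):
            classes[k] = "R" if rtl_sides else "L"
        i = j
    # Step 3: embedding levels: LTR text 0, RTL text 1, numbers in RTL text 2.
    levels = [{"L": 0, "R": 1, "NUM": 2}[c] for c in classes]
    # Step 4: reverse runs at level >= 2, then runs at level >= 1.
    order = list(range(n))
    for threshold in (2, 1):
        i = 0
        while i < n:
            if levels[i] < threshold:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= threshold:
                j += 1
            order[i:j] = order[i:j][::-1]
            i = j
    return "".join(text[k] for k in order)

logical = "qwe \u05e9\u05dc\u05d5\u05dd 123 \u043f\u0440\u0438\u0432\u0435\u0442"
print(visual_order(logical))
```

The digits get a higher level than the Hebrew letters around them, so they are reversed twice and keep their internal order while the enclosing RTL run is flipped; that double reversal is what moves 123 to the left of the Hebrew word.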

tabarra commented 9 months ago

I wonder if using a library like bidi-js as a pre-processor can mitigate the issue for now. Anyone managed to come up with a patch for this?

xtermjs / xterm.js

Support RTL languages #701