jonasbb opened this issue 6 years ago. Status: Open.
Thanks for reporting this.
As you mentioned, there are 2 problems. One idea is to split the string into an array of characters (`['ä', ...]`) and then add an extra space where needed.
But I have a question: are those two ä's different in the first and second rows? They look the same to me.
Oh, I just realized that you mentioned 3 kinds of characters. Does the fourth line correspond to the non-BMP character?
The two ä's are indeed different (unless GitHub normalizes the Unicode input), and they should look identical. I have a small JavaScript snippet which shows the problems, especially regarding the `.length` property.
JavaScript code snippet:
s1 = "\u00e4"
s2 = "\u0061\u0308"
s3 = "\u{1F816}"
"String s1: '" + s1 + "'\nLength s1: " + s1.length.toString() +
"\nString s2: '" + s2 + "'\nLength s2: " + s2.length.toString() +
"\nString s3: '" + s3 + "'\nLength s3: " + s3.length.toString()
/*
String s1: 'ä'
Length s1: 1
String s2: 'ä'
Length s2: 2
String s3: '🠖'
Length s3: 2
*/
The fourth line shows a non-BMP character. Non-BMP characters have code points between U+10000 and U+10FFFF. In JavaScript and some other languages they are stored as two UTF-16 code units, so their `.length` is 2, as you can see from the s3 string in the JS snippet.
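For illustration (not part of the original snippet), counting code points instead of UTF-16 code units already gives 1 for a single non-BMP character in plain JavaScript:

```js
const s3 = "\u{1F816}"; // non-BMP character, stored as a surrogate pair

console.log(s3.length);             // 2 -- counts UTF-16 code units
console.log(Array.from(s3).length); // 1 -- Array.from iterates by code point
console.log([...s3].length);        // 1 -- the string iterator does the same
```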
Having a whitelist of combining characters for problem 1 would not work, as they can be combined arbitrarily. The only real solution is to use a library which counts the number of grapheme clusters. This solves parts 1 and 3 of the problem.
Here is a blog post explaining the problem: https://blog.jonnew.com/posts/poo-dot-length-equals-two
Really appreciate your information. I then found a library, grapheme-splitter.
But I think it can only solve the problem of combining characters (e.g. ä). For an emoji ❎ or a non-BMP character 🠖, although we can use the lib to get the right number of grapheme clusters (i.e. 1), the real glyph is not exactly 1 space wide (I guess this is related to the font?).
And there is another convention different from the above cases. Although `"中文".length == 2` is counted correctly (Chinese), we usually use 4 spaces to align it.
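For reference, a minimal sketch of counting grapheme clusters with grapheme-splitter (assuming the package is installed; the example strings are the ones from the snippet above):

```js
const GraphemeSplitter = require('grapheme-splitter');
const splitter = new GraphemeSplitter();

// Combining character: two code points, but one grapheme cluster.
console.log("\u0061\u0308".length);                   // 2
console.log(splitter.countGraphemes("\u0061\u0308")); // 1

// Non-BMP character: two UTF-16 code units, but one grapheme cluster.
console.log("\u{1F816}".length);                      // 2
console.log(splitter.countGraphemes("\u{1F816}"));    // 1
```

This fixes the counting, but as noted above it says nothing about how many columns the glyph actually occupies on screen.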
I basically have no knowledge about East Asian scripts and how they work in Unicode. I think there are halfwidth and fullwidth characters; the fullwidth characters are mainly for East Asian scripts and are printed two columns wide.
During a quick Google search I found this document: http://www.unicode.org/reports/dtr11-02.html, although I don't know how it integrates with grapheme clusters.
The easiest would probably be if VSCode had an API to ask for the width of a string. Alternatively, one could reimplement how VSCode or terminal emulators handle these cases.
I think there are halfwidth and fullwidth characters; the fullwidth characters are mainly for East Asian scripts and are printed two columns wide.
Exactly. For these fullwidth chars, we treat them as two halfwidth chars and then choose a proper monospace font, so this is not a big problem. (vscode#48481, vscode#14589)
The easiest would probably be if VSCode had an API to ask for the width of a string.
VSCode doesn't, and I guess it will never provide such an API. What we can control is the text, in other words, the number of spaces we pad around the text in a certain table cell. If a char is neither halfwidth nor fullwidth, there is not much we can do.
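To make the padding idea concrete, a hypothetical sketch (padCell and the placeholder stringWidth are names made up for this example, not the extension's actual code):

```js
// Placeholder width function: counts code points only. A real implementation
// would also handle fullwidth characters, combining marks, and emoji, as
// discussed later in this thread.
const stringWidth = (s) => Array.from(s).length;

// Pad a table cell with trailing spaces until its display width
// reaches the column width.
function padCell(text, columnWidth) {
  const pad = Math.max(0, columnWidth - stringWidth(text));
  return text + ' '.repeat(pad);
}

console.log('|' + padCell('abc', 5) + '|'); // "|abc  |"
```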
Alternatively, one could reimplement how VSCode or terminal emulators handle these cases.
You are going deep into the rabbit hole 😄. That said, the VSCode integrated terminal does take char width into account more than the VSCode editor does.
So I am going to adopt the above grapheme-splitter lib to address the ä problem, leaving emoji and 🠖 as upstream issues (VSCode or Chromium?).
I found that VSCode itself treats ä (the combining form) as two characters: you need to press the arrow key twice to move past it.
This makes me hesitant to fix this issue before VSCode deals with it.
I am Japanese, so I am interested in Japanese text.
My PR #153 does not solve this problem.
There seem to be three problems with Unicode character width.
1. Chinese/Japanese kanji, hiragana, and some symbols have a width of 2: "abcd".width => 4, "日本".width => 4.
The ranges of the character blocks are listed in Blocks.txt: https://www.unicode.org/Public/10.0.0/ucd/Blocks.txt
More precisely, the width of each code point is listed in EastAsianWidth.txt: https://www.unicode.org/Public/10.0.0/ucd/EastAsianWidth.txt (see also https://www.unicode.org/reports/tr11/).
In my regular expression I treated these ranges from Blocks.txt as fullwidth characters (see the sketch after the list):
3000..303F; CJK Symbols and Punctuation
...
4E00..9FFF; CJK Unified Ideographs
FF00..FFEF; Halfwidth and Fullwidth Forms
In addition to this, you may also need to add pictograms/emoji and CJK additional characters.
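A sketch of that idea, using only the ranges quoted above (the function names are just for illustration, and the elided ranges from Blocks.txt / EastAsianWidth.txt would still need to be filled in):

```js
// Rough width estimate: characters in these (incomplete) ranges count as
// 2 columns, everything else as 1. Note that U+FF61..U+FFDC are actually
// halfwidth forms; a complete table should come from EastAsianWidth.txt.
const fullwidthRegex = /[\u3000-\u303F\u4E00-\u9FFF\uFF00-\uFFEF]/;

function charWidth(ch) {
  return fullwidthRegex.test(ch) ? 2 : 1;
}

function stringWidth(str) {
  // Iterate by code point so surrogate pairs are not split in half.
  let width = 0;
  for (const ch of str) width += charWidth(ch);
  return width;
}

console.log(stringWidth("abcd")); // 4
console.log(stringWidth("日本")); // 4
```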
2. Some symbols combine with the previous character to represent a single character:
ä : U+00E4
ä : U+0061 U+0308
ǖ : U+01D6
ǖ : U+00FC U+0304
ǖ : U+0075 U+0308 U+0304
が : U+304C
が : U+304B U+3099
Each of these should be counted as one letter, including the combining marks.
In EastAsianWidth.txt and UnicodeData.txt these marks have the general category "Mn" (Mark, Nonspacing): https://www.unicode.org/Public/10.0.0/ucd/UnicodeData.txt
These marks can be counted as having width 0.
3. Emoji: http://unicode.org/reports/tr51/
I do not understand the details of these processes, but the symbols involved may also be counted as width 0.
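As an illustration of point 2, modern JavaScript regexes can match the Mn category directly via a Unicode property escape, so nonspacing marks can be given width 0. A minimal sketch (the function name is just for this example, and the fullwidth handling from the previous sketch is left out):

```js
// Nonspacing marks (general category Mn) combine with the preceding
// character and contribute 0 columns to the display width.
const nonspacingMark = /\p{Mn}/u;

function stringWidth(str) {
  let width = 0;
  for (const ch of str) {
    width += nonspacingMark.test(ch) ? 0 : 1;
  }
  return width;
}

console.log(stringWidth("\u00e4"));       // 1 (precomposed ä)
console.log(stringWidth("\u0061\u0308")); // 1 (a + combining diaeresis)
console.log(stringWidth("\u304B\u3099")); // 1 here (2 once fullwidth kana are handled)
```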
Any updates on this topic? Can we work on this using grapheme-splitter, to solve 80% of the cases in the meantime?
We treat all emoji characters as two halfwidth chars, using https://github.com/mathiasbynens/emoji-regex
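A rough sketch of that approach with emoji-regex (assuming the package is installed; the function name is just for illustration and the extension's actual implementation may differ):

```js
const emojiRegex = require('emoji-regex');

// Count each emoji sequence as 2 columns, like a fullwidth character,
// and everything else as 1 column per code point (fullwidth/combining
// handling from the earlier sketches is omitted to keep this short).
function stringWidth(str) {
  const regex = emojiRegex();
  const emojiCount = [...str.matchAll(regex)].length;
  const rest = str.replace(regex, '');
  return emojiCount * 2 + Array.from(rest).length;
}

console.log(stringWidth('abc'));  // 3
console.log(stringWidth('a❎b')); // 4 (the emoji counts as 2)
```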
Chinese quotation marks, commas, and similar characters have the same alignment problem: they should also be two cells wide, but after formatting they are treated as one.
I am using the fixed-width font "Source Han Sans HW" (思源黑体 HW). The markdown content:
| 用户名 | 性别 | 备注 |
| ------ | ---- | -------- |
| A | “男” | 中文引号 |
| B | "女" | 英文引号 |
After formatting, the result is as follows (screenshot omitted):
Chinese quotation marks and English (curly) quotation marks share the same characters, so their width depends on the font you choose. For example, on GitHub (in your comment) they are displayed one character wide.
Running "Format Document" on the following markdown file should align all rightmost
|
characters, but does not. To reproduce run "Format Document" on this file. My Markdown All in One version is 1.2.0There are no error messages in the debug console.
By the looks of it, the problem is how Unicode characters are treated: the size appears to be calculated in UTF-16 code units. This fails for combining characters, as multiple code points combine into a single glyph.
For non-BMP characters (those outside the Basic Multilingual Plane) the problem is that they require two UTF-16 code units, but they are also only a single glyph, so the size is miscounted.
I am not sure there is an easy fix for the emoji problem, as the extension would need to know how the emoji glyph is rendered to determine the correct size.
Edited by @yzhang-gh (2020/01/26).
🠖: no general solution yet