yzhang-gh / vscode-markdown

Markdown All in One
https://marketplace.visualstudio.com/items?itemName=yzhang.markdown-all-in-one
MIT License

Table Formatting does not work properly for combining characters, emoji, and non-BMP characters #151

Open jonasbb opened 6 years ago

jonasbb commented 6 years ago

Running "Format Document" on the following markdown file should align all rightmost | characters, but does not. To reproduce, run "Format Document" on this file. My Markdown All in One version is 1.2.0.

# Autoformat Test

| Unicode Codepoint | Symbol |
| ----------------- | ------ |
| U+0061 + U+0308   | ä     |
| U+00E4            | ä      |
| U+274E            | ❎      |
| U+1F816           | 🠖🠖🠖 |

There are no error messages in the debug console.

By the looks of it, the problem is how Unicode characters are treated. It looks like the size is calculated in number of UTF-16 code units. This fails for combining characters, as multiple of them combine into a single glyph.

For non-BMP characters (those outside the Basic Multilingual Plane), the problem is that they require two UTF-16 code units to represent them but are still rendered as a single glyph, so their size is miscounted.

I am not sure if there is an easy fix for the Emoji problem, as the extension would need to know how the emoji glyph is rendered, to determine the correct size.


Edited by @yzhang-gh (2020/01/26).

yzhang-gh commented 6 years ago

Thanks for reporting this.

As you mentioned, there are 2 problems

  1. Combining characters. One possible solution is to maintain a list of them (e.g. ['ä', ...]) and then add an extra space. But I have a question: are the characters in the first and second rows actually different? They look the same to me.
  2. Emoji. This is hard. Every emoji has a different width, so I don't know whether there is a good solution.

Oh, just realized that you mentioned 3 kinds of characters. Is the fourth line corresponding to the non-BMP character?

jonasbb commented 6 years ago
  1. Yes, the äs are indeed different (unless GitHub normalizes the Unicode input), and they should look identical. I have a small JavaScript snippet which shows the problems, especially regarding the .length property.


JavaScript code snippet:

const s1 = "\u00e4";        // precomposed ä
const s2 = "\u0061\u0308";  // a + combining diaeresis
const s3 = "\u{1F816}";     // non-BMP arrow

// .length counts UTF-16 code units, not glyphs
console.log(
  "String s1: '" + s1 + "'\nLength s1: " + s1.length +
  "\nString s2: '" + s2 + "'\nLength s2: " + s2.length +
  "\nString s3: '" + s3 + "'\nLength s3: " + s3.length
);

/*
String s1: 'ä'
Length s1: 1
String s2: 'ä'
Length s2: 2
String s3: '🠖'
Length s3: 2
*/

The fourth line shows a non-BMP character. Non-BMP characters have code points between U+10000 and U+10FFFF. In JavaScript and some other languages, they have a length of 2, as you can see from the s3 string in the JS snippet.
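As a small illustration of the point above, the following sketch contrasts UTF-16 code units with code points. It relies only on standard string behavior; the variable names are made up for this example. Note that counting code points fixes the non-BMP case but not the combining-character case.

```javascript
// Compare UTF-16 code units with Unicode code points.
const arrow = "\u{1F816}";         // non-BMP character from the table
const combined = "\u0061\u0308";   // "a" followed by a combining diaeresis

// .length counts UTF-16 code units, so the surrogate pair counts as 2.
const arrowCodeUnits = arrow.length;

// The string iterator walks code points, so spreading yields 1 element.
const arrowCodePoints = [...arrow].length;

// The combining sequence is still 2 code points, even though it is 1 glyph.
const combinedCodePoints = [...combined].length;

console.log(arrowCodeUnits, arrowCodePoints, combinedCodePoints); // 2 1 2
```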

Having a whitelist of combining characters for 1. would not work, as they can be combined arbitrarily. The only real solution is to use a library which counts the number of grapheme clusters. This solves part 1 and part 3 of the problem.
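A minimal sketch of counting grapheme clusters follows. It swaps in the built-in Intl.Segmenter rather than a third-party library, which assumes a runtime that ships it (recent Node.js or a modern browser); the helper name countGraphemes is hypothetical and this is not necessarily how the extension does it.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
// Assumes Intl.Segmenter is available in the runtime.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

console.log(countGraphemes("\u00e4"));        // 1 (precomposed ä)
console.log(countGraphemes("\u0061\u0308"));  // 1 (a + combining diaeresis)
console.log(countGraphemes("\u{1F816}"));     // 1 (non-BMP arrow)
```

This gives the number of clusters, not their rendered width, so it does not by itself say how many columns a glyph occupies.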

Here is a blogpost explaining the problem: https://blog.jonnew.com/posts/poo-dot-length-equals-two

yzhang-gh commented 6 years ago

I really appreciate the information. I then found a library, grapheme-splitter.

But I think it can only solve the problem of combining characters. For emoji or a non-BMP character such as 🠖, although we can use the library to get the right number of grapheme clusters (i.e. 1), the real glyph is not exactly one space wide (I guess it depends on the font?).

There is also another convention that differs from the cases above. Although "中文".length == 2 is counted correctly (Chinese), we usually use 4 spaces to align it.

jonasbb commented 6 years ago

I basically have no knowledge about East Asian scripts and how they work in Unicode. I think there are halfwidth and fullwidth characters. The fullwidth characters are mainly for East Asian scripts and are printed as two columns.

During a quick Google search I found this document, http://www.unicode.org/reports/dtr11-02.html, although I don't know how it integrates with grapheme clusters.

The easiest would probably be if VS Code had an API to ask for the width of a string. Alternatively, reimplementing how VS Code or terminal emulators handle these cases.

yzhang-gh commented 6 years ago

I think there are halfwidth and fullwidth characters. The fullwidth characters are mainly for East Asian scripts and are printed as two columns.

Exactly. For these fullwidth chars, we treat them as two halfwidth chars and then choose a proper mono font. So this is not a big problem. (vscode#48481, vscode#14589)

The easiest would probably be if VS Code had an API to ask for the width of a string

VSCode doesn't, and I guess never will, provide such an API. What we can control is the text, in other words, the number of spaces we pad around the text in a given table cell. If a character is neither halfwidth nor fullwidth, there is not much we can do.
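To make the "number of spaces we pad" idea concrete, here is a rough sketch. The helpers padCell and naiveWidth are hypothetical, and the width rule (2 columns for the CJK Unified Ideographs block, 1 otherwise) is an assumption for illustration, not the extension's actual logic.

```javascript
// Hypothetical helper: pad a cell with spaces up to a target column count.
function padCell(text, targetWidth, widthOf) {
  const padding = Math.max(0, targetWidth - widthOf(text));
  return text + " ".repeat(padding);
}

// Assumed width rule: 2 columns for U+4E00..U+9FFF, 1 for everything else.
const naiveWidth = (text) =>
  [...text].reduce((w, ch) => w + (/[\u4E00-\u9FFF]/.test(ch) ? 2 : 1), 0);

const cells = ["abcd", "中文", "a"];
const target = Math.max(...cells.map(naiveWidth)); // widest cell, in columns
const rows = cells.map((cell) => `| ${padCell(cell, target, naiveWidth)} |`);

console.log(rows.join("\n"));
```

With a monospace font that renders fullwidth characters exactly two columns wide, the right-hand pipes line up; with other fonts they may not, which is the limitation described above.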

Alternatively, reimplementing how VS Code or terminal emulators handle these cases.

You are going so deep into the rabbit hole 😄. But the VSCode integrated terminal does have more considerations about char width than the VSCode editors.

So, I am going to adopt the above grapheme-splitter lib to address the problem, leaving emoji and 🠖 as upstream issues (VSCode or Chromium?)

yzhang-gh commented 6 years ago

I found that vscode does treat it as two characters. You need to press the arrow key twice to move past it.


This makes me hesitant to fix this issue before vscode deals with it.

Matsuyanagi commented 6 years ago

I am Japanese, so I am interested in Japanese.

My PR #153 does not solve this problem.

There seem to be three problems with Unicode character width.

half-width, full-width

Chinese/Japanese kanji, hiragana, and some symbols have a width of 2.

"abcd".width => 4
"日本".width => 4

The range of character types is written in "Blocks.txt". https://www.unicode.org/Public/10.0.0/ucd/Blocks.txt

More precisely, the width of each code point is written in "EastAsianWidth.txt". https://www.unicode.org/Public/10.0.0/ucd/EastAsianWidth.txt https://www.unicode.org/reports/tr11/

In my regular expression, I treated these ranges from Blocks.txt as fullwidth characters.

3000..303F; CJK Symbols and Punctuation
...
4E00..9FFF; CJK Unified Ideographs
FF00..FFEF; Halfwidth and Fullwidth Forms

In addition to this, you may also need to add pictograms/emoji and CJK additional characters.
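A minimal sketch of such a range check follows. It covers only the three ranges visible in the list above (the elided entries are deliberately left out), and the names FULLWIDTH and displayWidth are invented for this example rather than taken from the PR.

```javascript
// Only the three ranges shown above; the elided ("...") ranges are omitted.
// No "g" flag, so .test() keeps no state between calls.
const FULLWIDTH = /[\u3000-\u303F\u4E00-\u9FFF\uFF00-\uFFEF]/;

// Hypothetical helper: 2 columns for a fullwidth code point, 1 otherwise.
function displayWidth(text) {
  let width = 0;
  for (const ch of text) {
    width += FULLWIDTH.test(ch) ? 2 : 1;
  }
  return width;
}

console.log(displayWidth("abcd")); // 4
console.log(displayWidth("日本")); // 4
```

Block ranges are coarse: U+FF00..U+FFEF, for instance, also contains halfwidth forms, which is why EastAsianWidth.txt is the more precise source.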

Combining Character Sequence

Some symbols combine with the previous character to represent a single character.

ä : U+00E4
ä : U+0061 U+0308

ǖ : U+01D6
ǖ : U+00FC U+0304
ǖ : U+0075 U+0308 U+0304

が : U+304C
が : U+304B U+3099

Each sequence should be counted as one letter, including its combining symbols.

In UnicodeData.txt, these combining symbols have the general category "Mn" (Mark, Nonspacing). https://www.unicode.org/Public/10.0.0/ucd/UnicodeData.txt

These symbols can be counted as having width 0.
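The width-0 idea can be sketched with a Unicode property escape for the "Mn" category. This assumes a JavaScript engine that supports \p{...} with the u flag; the helper name widthIgnoringMarks is hypothetical, and the sketch counts every other code point as 1 without handling fullwidth characters.

```javascript
// Matches a single nonspacing mark (general category Mn).
const NONSPACING_MARK = /\p{Mn}/u;

// Hypothetical helper: count code points, giving nonspacing marks width 0.
function widthIgnoringMarks(text) {
  let width = 0;
  for (const ch of text) {
    if (!NONSPACING_MARK.test(ch)) {
      width += 1;
    }
  }
  return width;
}

console.log(widthIgnoringMarks("\u00e4"));             // 1
console.log(widthIgnoringMarks("\u0061\u0308"));       // 1
console.log(widthIgnoringMarks("\u0075\u0308\u0304")); // 1
console.log(widthIgnoringMarks("\u304B\u3099"));       // 1
```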

emoji / grapheme / skin tones / zwj / gender / ...

http://unicode.org/reports/tr51/

I do not understand these processes in detail.

These symbols may also be counted as width 0.

wasdee commented 4 years ago

I really appreciate the information. I then found a library, grapheme-splitter.

But I think it can only solve the problem of combining characters. For emoji or a non-BMP character such as 🠖, although we can use the library to get the right number of grapheme clusters (i.e. 1), the real glyph is not exactly one space wide (I guess it depends on the font?).

There is also another convention that differs from the cases above. Although "中文".length == 2 is counted correctly (Chinese), we usually use 4 spaces to align it.

Any updates on this topic? Could we work on this using grapheme-splitter, to solve 80% of the cases in the meantime?

wasdee commented 4 years ago

#602 helps me format Thai text much more nicely.

zidoshare commented 3 years ago

We treat all emoji characters as two halfwidth chars, using https://github.com/mathiasbynens/emoji-regex
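A rough sketch of the "emoji count as two halfwidth chars" rule is shown below. It swaps in the built-in \p{Extended_Pictographic} property escape instead of the emoji-regex package, so the set of matched characters may differ from that library; the names EMOJI_LIKE and widthWithEmoji are hypothetical.

```javascript
// Assumption: Extended_Pictographic approximates "is an emoji".
const EMOJI_LIKE = /\p{Extended_Pictographic}/u;

// Hypothetical helper: 2 columns per emoji-like code point, 1 otherwise.
function widthWithEmoji(text) {
  let width = 0;
  for (const ch of text) {
    width += EMOJI_LIKE.test(ch) ? 2 : 1;
  }
  return width;
}

console.log(widthWithEmoji("abc"));        // 3
console.log(widthWithEmoji("\u274E"));     // 2
console.log(widthWithEmoji("a\u{1F604}")); // 3
```

Because it counts code point by code point, this sketch would overcount sequences joined with zero-width joiners or followed by modifiers; whether two columns matches the rendered glyph still depends on the font.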

tanghuanoo commented 1 year ago

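Alignment of characters such as Chinese quotation marks and commas has a similar problem. They should also be 2 bytes wide, but after formatting they are treated as 1 byte wide.

I am using the wide font "思源黑体 HW" (Source Han Sans HW). Markdown content: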

| 用户名 | 性别 | 备注     |
| ------ | ---- | -------- |
| A      | “男” | 中文引号 |
| B      | "女" | 英文引号 |

The result after formatting is as follows: image

yzhang-gh commented 1 year ago

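Chinese quotation marks and English (curly) quotation marks share the same characters, so the width depends on the font you choose. For example, on GitHub (in your comment) they are displayed as one character wide.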