zesterer / ariadne

A fancy diagnostics & error reporting crate
https://crates.io/crates/ariadne
MIT License
1.61k stars 72 forks source link

Adding support for full-width characters #41

Open saihaze opened 1 year ago

saihaze commented 1 year ago

When the source code full-width characters (e.g. CJK characters), underlining is not proper. They might need too spaces.

zesterer commented 1 year ago

Do you have a screenshot + sample text example of this?

saihaze commented 1 year ago

Let's take examples/simple.rs as an example. It is supposed to be like: 截图 2022-08-02 15-45-58 But when it includes a Chinese character, it looks like this: 截图 2022-08-02 15-46-37

zesterer commented 1 year ago

How much did you increment the spans by when doing so? And does Rust's .chars() count the character as a single character?

saihaze commented 1 year ago

Usually a CJK character (a character used in China, Japan and Korea) takes the space of two ASCII characters. Rust's .chars() does count it as a single character. Perhaps you can use crate unicode_width instead. I've never seen a wider character, though. But maybe you can take them into account, too.

saihaze commented 1 year ago

图片

This is my own diagnostics module. It is simple, but might be an example.

zesterer commented 1 year ago

I'll put it on the list of things to do when I get round to refactoring the crate's internals. Contributions are also welcome of course!

ratmice commented 1 year ago

Going to break my comment in the above PR out here:

The PR above fixed some of these issues, but probably not enough for CJK. I imagine what is needed should be fairly simple, I just don't know the right things to do regarding whitespace on CJK. I think all that is left should be something to the effect of:

  1. Add a flag to Config builder for enabling CJK.
  2. When that flag is enabled call width_cjk instead of width from unicode_width from lib.rs: char_width().
  3. Figure out what the width should be for whitespace where it is currently hard coded to 1?
TRCYX commented 1 year ago

As a Chinese, I have not seen "ambiguous characters" rendering wide for years. The width_cjk thing might not be necessary. From the documentation of width_cjk, I can refer to UAX#11, which says that modern practice tend to render ambiguous characters narrow. The UAX recommends that we only render them wide when we reliably know that the context wants them to be wide. At least from what I see, this does not happen a lot nowadays, especially when related to formal languages where an error could be pointed out.

But the hardcoded width of whitespace is a problem, for there is at least a wide space:  . The current approach renders it as a narrow space. I am not sure about why all whitespace converts to \x20 with width 1 in char_width: https://github.com/zesterer/ariadne/blob/06b17483d8ce1f1decf290935818f55ba6380065/src/lib.rs#L401-L413 Regarding this, I come up with two possible solutions.

  1. We could simply remove this case and make a whitespace character represent itself. Its width would be correctly calculated. But in write.rs: https://github.com/zesterer/ariadne/blob/06b17483d8ce1f1decf290935818f55ba6380065/src/write.rs#L585-L599 we do not render whitespaces multiple times; instead we could test if the original c before calling char_width was \t, or we could repeat only \x20 instead of all whitespaces.
  2. Or we keep converting whitespaces to \x20, but calculate its width instead of filling in 1:
            c if c.is_whitespace() => (' ', c.width().unwrap_or(1)),

Both solutions should work well with  , but I believe a clarification about the point of converting whitespaces to \x20 is needed. I prefer the former, since it actually keeps that wide space. If the terminal shows, for example, a dot, where there is a space, the user could possibly see some differences.

ratmice commented 1 year ago

I prefer the former, since it actually keeps that wide space. If the terminal shows, for example, a dot, where there is a space, the user could possibly see some differences.

Might just be useless trivia that won't ocur in practice, but another corner case/point in favor of doing that is U+1680 Ogham space mark, a character with a visible representation that is_whitespace(). https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=cb4c89643c03d3557d336623b242912f