whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Provide named character entities for invisible and ambiguous Unicode characters #10297

Open r12a opened 7 months ago

r12a commented 7 months ago

What problem are you trying to solve?

It is much easier for content authors to spot and work with invisible Unicode characters if they are coded using named entities. Some users have to deal with many such characters on a regular basis (Arabic authors work with 12 or more regularly), and it is difficult to remember the Unicode code points. Others use these characters only infrequently, and find it equally difficult to remember the appropriate code point value when needed. In addition, invisible characters in the code can be problematic to work with, especially if they affect the display (such as paired directional embeddings in RTL scripts), because they are easily overlooked, duplicated, or miscopied.

What solutions exist today?

Some of these characters have named character entities, but some of the more frequently used ones do not.
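As a rough way to check which characters are already covered, the sketch below looks up code points in the `html5` table that ships with Python's standard `html.entities` module. The helper name `named_entities` is hypothetical, and the table reflects whatever entity list the installed Python version bundles, so treat the result as an approximation of the current HTML Standard rather than an authoritative answer.

```python
from html.entities import html5


def named_entities(code_point: int) -> list[str]:
    """Return the named character references that expand to exactly this code point."""
    ch = chr(code_point)
    # The table also holds legacy names without a trailing semicolon
    # (for example "nbsp"); keep only the semicolon-terminated forms.
    return sorted(
        f"&{name}"
        for name, value in html5.items()
        if value == ch and name.endswith(";")
    )


# U+200B and U+200E already have names; the bidi embedding and isolate
# controls such as U+202A and U+2066 do not.
for cp in (0x200B, 0x200E, 0x202A, 0x2066):
    print(f"U+{cp:04X}", named_entities(cp) or "no named entity")
```
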

How would you solve it?

The W3C i18n WG proposes the following additions. For convenience, the list includes characters for which we already have named entities; these are indicated using ✅. Possible named entities are suggested for the new items; these are derived from standard Unicode abbreviations, where available.

Latin 1 Supplement — Latin-1 punctuation and symbols

Combining Diacritical Marks — Grapheme joiner

Arabic — Format character

Ogham — Space

Mongolian — Format controls

General Punctuation — Spaces

General Punctuation — Format character

We would also like to coin a new &zwsp; entity name, in addition to the too long and complicated &ZeroWidthSpace; for U+200B.

General Punctuation — Separators

General Punctuation — Space

General Punctuation — Invisible operators

CJK Symbols And Punctuation — CJK symbols and punctuation

Emoji Variation Selectors - turns on and off colour

Potential additional candidates

Hangul Jamo — Old initial consonants

Hangul Jamo — Medial vowels

Hangul Compatibility Jamo — Special character

Halfwidth And Fullwidth Forms — Halfwidth Hangul variants

General Punctuation — Invisible operators

Shorthand Format Controls — Shorthand format controls

Musical Symbols — Beams and slurs

Anything else?

There are other invisible characters which probably do not need entities. The list above selects those most likely to be useful. In particular, only 2 of the many, many variation selectors are listed here – these are the two that are regularly used for emojis.

There may also be a need to support Egyptian hieroglyph formatting controls, some of which will come out with Unicode 16 later this year.

annevk commented 7 months ago

Can this be folded into #5121 or vice versa? I'm not sure why we need two issues for this.

Psychpsyo commented 7 months ago

I get that &6msp; for the SIX-PER-EM SPACE might be derived from a standard Unicode abbreviation (I couldn't actually find the relevant standard at the moment), but given that THREE-PER-EM SPACE and FOUR-PER-EM SPACE are already &emsp13; and &emsp14;, shouldn't this one be &emsp16; for consistency? The way it is right now seems confusing.

Similarly, it might make sense to change &nqsp; and &mqsp; to &enqsp; and &emqsp; for consistency with the other em/en related ones as well.

ntounsi commented 7 months ago

Thanks, @r12a, for bringing this up.

Using named entities instead of the digits of the code point and their markup syntax, &#xHHHH;, is very welcome.

As for formatting characters, one can perhaps remember the very common 202A/202B and 202C used to delimit bidi sentences (although you still have to remember that A is for left and B is for right...). But then there are the others: lro/rlo, coded 202D/202E, and lri/rli, fsi, and pdi, coded 2066/2067, 2068, and 2069, which are not in the same range.
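For reference, the code points mentioned above can be checked against the Unicode character database. This is a minimal sketch using Python's standard `unicodedata` module; the `BIDI_CONTROLS` mapping and the `describe` helper are illustrative names, and the abbreviations are the usual Unicode ones, not proposed entity names.

```python
import unicodedata

# Usual Unicode abbreviations for the explicit bidi controls and their code points.
BIDI_CONTROLS = {
    "LRE": 0x202A, "RLE": 0x202B, "PDF": 0x202C,                  # embeddings and their terminator
    "LRO": 0x202D, "RLO": 0x202E,                                 # overrides
    "LRI": 0x2066, "RLI": 0x2067, "FSI": 0x2068, "PDI": 0x2069,   # isolates and their terminator
}


def describe(abbr: str) -> str:
    """Format one control as 'ABBR U+XXXX CHARACTER NAME'."""
    cp = BIDI_CONTROLS[abbr]
    return f"{abbr} U+{cp:04X} {unicodedata.name(chr(cp))}"


for abbr in BIDI_CONTROLS:
    print(describe(abbr))
```
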

This matters all the more because some HTML editors replace numeric entities such as &#xHHHH; with the corresponding Unicode characters U+HHHH, which are invisible in the source, whereas for a named entity they show a visible mark instead.
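Until named entities exist, one workaround is to run source text through a small filter that turns invisible format characters back into something visible. The sketch below is only an illustration: `reveal_invisibles` is a hypothetical helper, and it assumes that "invisible" means Unicode general category Cf (format characters), so invisible spaces such as U+2003 are deliberately left alone.

```python
import unicodedata


def reveal_invisibles(text: str) -> str:
    """Replace every format character (category Cf) with a hexadecimal character reference."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            # For example, U+200E becomes the literal text "&#x200E;".
            out.append(f"&#x{ord(ch):04X};")
        else:
            out.append(ch)
    return "".join(out)


# A left-to-right mark hidden between two letters becomes visible in the output.
print(reveal_invisibles("a\u200eb"))
```
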

(BTW, I've always wondered why &lrm;/&rlm; and not &lre; etc.?)