Provide named character entities for invisible and ambiguous Unicode characters

r12a commented 7 months ago

What problem are you trying to solve?

It is much easier for content authors to spot and work with invisible Unicode characters if they are coded using named entities. Some users have to deal with many such characters on a regular basis (Arabic authors work with 12 or more regularly) and it is difficult to remember the Unicode code points. Others only use these characters infrequently, and it is equally difficult to remember the appropriate code point value when needed. In addition, invisible characters in the code can be problematic to work with, especially if they impact the display (such as paired directional embeddings in RTL scripts), because they are easily overlooked, duplicated, or miscopied.

What solutions exist today?

Some of these characters have named character entities, but some of the more frequently used ones do not.
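As a rough way to check current coverage, the sketch below looks up which named character references already exist for a given code point. It assumes Python's standard `html.entities.html5` table reflects the HTML named character reference list; the helper name `named_entities_for` is made up for illustration.

```python
from html.entities import html5

def named_entities_for(code_point: int) -> list[str]:
    """Return the existing named character references for one code point."""
    ch = chr(code_point)
    # html5 also holds legacy names without the trailing ";", so keep only
    # the ";"-terminated forms to avoid listing the same name twice.
    return sorted(
        f"&{name}"
        for name, value in html5.items()
        if value == ch and name.endswith(";")
    )

for cp in (0x200C, 0x200B, 0x202A, 0x2066):
    print(f"U+{cp:04X}", named_entities_for(cp) or "no named entity")
```

With that table, U+200C and U+200B come back with names, while U+202A and U+2066 come back empty, which is the gap this issue describes.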

How would you solve it?

The W3C i18n WG proposes the following additions. For convenience, the list includes characters for which we already have named entities; these are indicated using ✅. Possible named entities are suggested for the new items; these are derived from standard Unicode abbreviations, where available.

Latin 1 Supplement — Latin-1 punctuation and symbols

✅ U+00A0 NO-BREAK SPACE &nbsp;
✅ U+00AD SOFT HYPHEN &shy;

Combining Diacritical Marks — Grapheme joiner

U+034F COMBINING GRAPHEME JOINER &cgj;

Arabic — Format character

U+061C ARABIC LETTER MARK &alm;

Ogham — Space

U+1680 OGHAM SPACE MARK

Mongolian — Format controls

U+180B MONGOLIAN FREE VARIATION SELECTOR ONE &fvs1;
U+180C MONGOLIAN FREE VARIATION SELECTOR TWO &fvs2;
U+180D MONGOLIAN FREE VARIATION SELECTOR THREE &fvs3;
U+180E MONGOLIAN VOWEL SEPARATOR &mvs;
U+180F MONGOLIAN FREE VARIATION SELECTOR FOUR &fvs4;

General Punctuation — Spaces

U+2000 EN QUAD &nqsp;
U+2001 EM QUAD &mqsp;
✅ U+2002 EN SPACE &ensp;
✅ U+2003 EM SPACE &emsp;
✅ U+2004 THREE-PER-EM SPACE &emsp13;
✅ U+2005 FOUR-PER-EM SPACE &emsp14;
U+2006 SIX-PER-EM SPACE &6msp;
✅ U+2007 FIGURE SPACE &numsp;
✅ U+2008 PUNCTUATION SPACE &puncsp;
✅ U+2009 THIN SPACE &thinsp; AND &ThinSpace;
✅ U+200A HAIR SPACE &hairsp; AND &VeryThinSpace; AND part of &ThickSpace; (U+205F U+200A)

General Punctuation — Format character

✅ U+200B ZERO WIDTH SPACE &ZeroWidthSpace; AND &NegativeMediumSpace; AND &NegativeThickSpace; AND &NegativeThinSpace; AND &NegativeVeryThinSpace;
✅ U+200C ZERO WIDTH NON-JOINER &zwnj;
✅ U+200D ZERO WIDTH JOINER &zwj;
✅ U+200E LEFT-TO-RIGHT MARK &lrm;
✅ U+200F RIGHT-TO-LEFT MARK &rlm;
U+202A LEFT-TO-RIGHT EMBEDDING &lre;
U+202B RIGHT-TO-LEFT EMBEDDING &rle;
U+202C POP DIRECTIONAL FORMATTING &pdf;
U+202D LEFT-TO-RIGHT OVERRIDE &lro;
U+202E RIGHT-TO-LEFT OVERRIDE &rlo;
✅ U+2060 WORD JOINER &NoBreak;
U+2066 LEFT-TO-RIGHT ISOLATE &lri;
U+2067 RIGHT-TO-LEFT ISOLATE &rli;
U+2068 FIRST STRONG ISOLATE &fsi;
U+2069 POP DIRECTIONAL ISOLATE &pdi;

We would also like to coin a new &zwsp; entity name for U+200B, in addition to the overly long and complicated &ZeroWidthSpace;.
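To illustrate how named forms could make bidi controls easier to spot in source, here is a minimal sketch that rewrites the raw characters using the names suggested in the list above. These names are proposals only and are not valid HTML entities today. The helpers `reveal` and `is_balanced` are hypothetical, and the pairing check is a deliberate simplification rather than the full Unicode Bidirectional Algorithm.

```python
# Proposed names from the list above; NOT valid HTML entities at present.
PROPOSED = {
    "\u202A": "&lre;", "\u202B": "&rle;", "\u202C": "&pdf;",
    "\u202D": "&lro;", "\u202E": "&rlo;",
    "\u2066": "&lri;", "\u2067": "&rli;", "\u2068": "&fsi;", "\u2069": "&pdi;",
}

EMBEDDING_OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # closed by PDF
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # closed by PDI

def reveal(text: str) -> str:
    """Replace invisible bidi controls with their proposed visible names."""
    return "".join(PROPOSED.get(ch, ch) for ch in text)

def is_balanced(text: str) -> bool:
    """Simplified check: every opener has a closer and no closer is stray."""
    embeddings = isolates = 0
    for ch in text:
        if ch in EMBEDDING_OPENERS:
            embeddings += 1
        elif ch in ISOLATE_OPENERS:
            isolates += 1
        elif ch == "\u202C":      # POP DIRECTIONAL FORMATTING
            embeddings -= 1
        elif ch == "\u2069":      # POP DIRECTIONAL ISOLATE
            isolates -= 1
        if embeddings < 0 or isolates < 0:
            return False
    return embeddings == 0 and isolates == 0

sample = "\u202Babc\u202C"
print(reveal(sample), is_balanced(sample))
```

A check like this is only a lint-style aid for catching the overlooked or duplicated controls mentioned earlier; it does not model how isolates terminate embeddings.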

General Punctuation — Separators

U+2028 LINE SEPARATOR &lsep;
U+2029 PARAGRAPH SEPARATOR &psep;

General Punctuation — Space

U+202F NARROW NO-BREAK SPACE &nnbsp;
✅ U+205F MEDIUM MATHEMATICAL SPACE &MediumSpace; AND part of &ThickSpace; (U+205F U+200A)

General Punctuation — Invisible operators

✅ U+2061 FUNCTION APPLICATION &af; AND &ApplyFunction;
✅ U+2062 INVISIBLE TIMES &it; AND &InvisibleTimes;
✅ U+2063 INVISIBLE SEPARATOR &ic; AND &InvisibleComma;
U+2064 INVISIBLE PLUS

CJK Symbols And Punctuation — CJK symbols and punctuation

U+3000 IDEOGRAPHIC SPACE &idsp;

Emoji Variation Selectors - turns on and off colour

U+FE0E VARIATION SELECTOR-15 &vs15;
U+FE0F VARIATION SELECTOR-16 &vs16;
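The following sketch shows why these two selectors are hard to see in source: the same base character followed by U+FE0E or U+FE0F forms two different strings that look nearly identical in most editors. The `describe` helper is hypothetical, and whether a renderer actually switches between text and emoji presentation depends on the font and platform.

```python
import unicodedata

VS15 = "\uFE0E"  # requests text (monochrome) presentation
VS16 = "\uFE0F"  # requests emoji (colour) presentation

def describe(text: str) -> list[str]:
    """List each code point in the string with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

heart_text = "\u2764" + VS15
heart_emoji = "\u2764" + VS16

print(describe(heart_text))
print(describe(heart_emoji))
print(heart_text == heart_emoji)  # the strings differ only in the invisible selector
```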

Potential additional candidates

Hangul Jamo — Old initial consonants

U+115F HANGUL CHOSEONG FILLER &hcf;

Hangul Jamo — Medial vowels

U+1160 HANGUL JUNGSEONG FILLER &hjf;

Hangul Compatibility Jamo — Special character

U+3164 HANGUL FILLER &hf;

Halfwidth And Fullwidth Forms — Halfwidth Hangul variants

U+FFA0 HALFWIDTH HANGUL FILLER &hwhf;

General Punctuation — Invisible operators

U+206D ACTIVATE ARABIC FORM SHAPING &aafs;

Shorthand Format Controls — Shorthand format controls

U+1BCA0 SHORTHAND FORMAT LETTER OVERLAP
U+1BCA1 SHORTHAND FORMAT CONTINUING OVERLAP
U+1BCA2 SHORTHAND FORMAT DOWN STEP
U+1BCA3 SHORTHAND FORMAT UP STEP

Musical Symbols — Beams and slurs

U+1D173 MUSICAL SYMBOL BEGIN BEAM
U+1D174 MUSICAL SYMBOL END BEAM
U+1D175 MUSICAL SYMBOL BEGIN TIE
U+1D176 MUSICAL SYMBOL END TIE
U+1D177 MUSICAL SYMBOL BEGIN SLUR
U+1D178 MUSICAL SYMBOL END SLUR
U+1D179 MUSICAL SYMBOL BEGIN PHRASE
U+1D17A MUSICAL SYMBOL END PHRASE

Anything else?

There are other invisible characters which probably do not need entities. The list above selects those most likely to be useful. In particular, only 2 of the many, many variation selectors are listed here – these are the two that are regularly used for emojis.

There may also be a need to support Egyptian hieroglyph formatting controls, some of which will come out with Unicode 16 later this year.

annevk commented 7 months ago

Can this be folded into #5121 or vice versa? I'm not sure why we need two issues for this.

Psychpsyo commented 7 months ago

I get that &6msp; for the SIX-PER-EM SPACE might be derived from a standard Unicode abbreviation (couldn't actually find the relevant standard at the moment) but given that THREE-PER-EM SPACE and FOUR-PER-EM SPACE are already &emsp13; and &emsp14;, shouldn't this one be &emsp16; for consistency? The way it is right now seems confusing.

Similarly, it might make sense to change &nqsp; and &mqsp; to &enqsp; and &emqsp; for consistency with the other em/en related ones as well.

ntounsi commented 7 months ago

Thanks @r12a for bringing this up.

Named entities would be very welcome as an alternative to spelling out the code point digits in the numeric syntax &#xHHHH;.

About formatting characters, one can perhaps remember the very common [202A/202B, 202C] used to delimit bidi sentences (although you still have to remember that A is for left and B is for right...). But then there are the others: lro/rlo coded 202D/202E, and lri/rli, fsi, pdi coded 2066/2067, 2068, 2069, which are not in the same range...

This matters especially because some HTML editors replace the numeric entities &#xHHHH; with the corresponding Unicode characters U+HHHH, which are invisible in the source, whereas for a named entity they show a visible mark instead.
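As a sketch of the round trip described here, the snippet below decodes numeric references into the raw invisible characters and then writes them back out as numeric references so they stay visible in source. The function name `escape_invisibles` and the choice of Unicode general categories (Cf, Zl, Zp, and Zs other than the ordinary space) are assumptions for illustration, not something any particular editor is known to do.

```python
import html
import unicodedata

def escape_invisibles(text: str) -> str:
    """Re-encode format characters, separators and non-ASCII spaces
    as hexadecimal numeric character references."""
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat in {"Cf", "Zl", "Zp"} or (cat == "Zs" and ch != " "):
            out.append(f"&#x{ord(ch):04X};")
        else:
            out.append(ch)
    return "".join(out)

# Decoding turns the references into invisible characters...
decoded = html.unescape("&#x202B;abc&#x202C;")
# ...and re-escaping makes them visible again in the source text.
print(escape_invisibles(decoded))
```

With the proposed named entities, the same step could emit names such as &rle; instead of hex digits, which is the readability gain the comment is asking for.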

(BTW, I've always wondered why &lrm;/&rlm; and not &lre; etc.?)

whatwg / html