Open r12a opened 7 months ago
Can this be folded into #5121 or vice versa? I'm not sure why we need two issues for this.
I get that &6msp; for the SIX-PER-EM SPACE might be derived from a standard Unicode abbreviation (I couldn't find the relevant standard at the moment), but given that THREE-PER-EM SPACE and FOUR-PER-EM SPACE are already &emsp13; and &emsp14;, shouldn't this one be &emsp16; for consistency?
The way it is right now seems confusing.
Similarly, it might make sense to change &nqsp; and &mqsp; to &enqsp; and &emqsp; for consistency with the other em/en-related ones as well.
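For reference, the existing naming pattern can be checked against Python's `html.entities.html5` table, which mirrors the HTML named character references. This is only a minimal sketch; the `em_space_entities` helper name is hypothetical, not something from the proposal.

```python
import unicodedata
from html.entities import html5


def em_space_entities():
    """Map the per-em spaces U+2004..U+2006 to their existing entity names."""
    result = {}
    for cp in (0x2004, 0x2005, 0x2006):
        ch = chr(cp)
        # Keep only the semicolon-terminated names that decode to this character.
        names = sorted(k for k, v in html5.items() if v == ch and k.endswith(";"))
        result[unicodedata.name(ch)] = names
    return result


if __name__ == "__main__":
    for unicode_name, names in em_space_entities().items():
        print(unicode_name, names)
```

Neither of the suggested names for U+2006 (`6msp;` or `emsp16;`) is present in that table today.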
Thanks, @r12a, for bringing this up.
Being able to use named entities instead of the code point digits and their markup syntax &#xHHHH; is very welcome.
As for formatting characters, one can perhaps remember the very common 202A/202B and 202C used to delimit bidi sentences (although you still have to remember that A is for left and B is for right...). But then there are the others: lro/rlo, coded 202D/202E, and lri/rli, fsi and pdi, coded 2066/2067, 2068 and 2069, which are not even in the same range...
This matters especially because some HTML editors replace numeric character references of the form &#xHHHH; with the corresponding Unicode characters U+HHHH, which are invisible in the source, whereas for a named entity they show a visible mark instead.
(BTW, I've always wondered why there are &lrm;/&rlm; but not &lre; etc.?)
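To make the code points in this comment easier to compare, the following sketch lists the bidi controls with their Unicode names and any entity names they already have. It relies only on Python's standard `unicodedata` and `html.entities` modules; `BIDI_CONTROLS` and `bidi_table` are illustrative names, not part of any proposal.

```python
import unicodedata
from html.entities import html5

# Marks, embeddings/overrides and isolates mentioned in this thread.
BIDI_CONTROLS = [
    0x200E, 0x200F,
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
    0x2066, 0x2067, 0x2068, 0x2069,
]


def bidi_table():
    """Return (code point, Unicode name, existing entity names) rows."""
    rows = []
    for cp in BIDI_CONTROLS:
        ch = chr(cp)
        entities = sorted(k for k, v in html5.items() if v == ch and k.endswith(";"))
        rows.append((f"U+{cp:04X}", unicodedata.name(ch), entities))
    return rows


if __name__ == "__main__":
    for code, name, entities in bidi_table():
        print(code, name, entities or "(no named entity)")
```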
What problem are you trying to solve?
It is much easier for content authors to spot and work with invisible Unicode characters if they are coded using named entities. Some users have to deal with many such characters on a regular basis (Arabic authors work with 12 or more regularly) and it is difficult to remember the Unicode code points. Others only use these characters infrequently, and it is equally difficult to remember the appropriate code point value when needed. In addition, invisible characters in the code can be problematic to work with, especially if they impact the display (such as paired directional embeddings, in RTL scripts), because they are overlooked or duplicated, or miscopied.
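As an illustration of the problem, here is a rough sketch of how invisible characters can be surfaced in source text today without named entities. It is an assumption-laden helper, not an existing tool: `reveal_invisibles` is a hypothetical name, and it only covers the Unicode general categories Cf (format) and Zs (space separator), so it misses some characters in the list below, such as the combining grapheme joiner or the variation selectors.

```python
import unicodedata


def reveal_invisibles(text):
    """Replace format (Cf) and space-separator (Zs) characters, other than
    the ordinary space, with a visible <U+XXXX NAME> tag."""
    out = []
    for ch in text:
        if ch != " " and unicodedata.category(ch) in ("Cf", "Zs"):
            out.append(f"<U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}>")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(reveal_invisibles("abc\u200edef\u00a0ghi"))
```

The output is readable but verbose, which is part of the motivation for short, memorable entity names.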
What solutions exist today?
Some of these characters have named character entities, but some of the more frequently used ones do not.
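One way to see the gap is to check which names a current decoder already recognises. The sketch below uses Python's `html.unescape`, which leaves unknown references untouched; `decodes_today` is a hypothetical helper name, and Python's table is assumed to match the HTML specification's list.

```python
from html import unescape


def decodes_today(name):
    """True if '&name;' is replaced by html.unescape, i.e. the entity exists."""
    ref = f"&{name};"
    return unescape(ref) != ref


if __name__ == "__main__":
    for name in ("zwnj", "zwj", "lrm", "rlm", "lre", "rle", "pdf", "lri", "rli", "fsi", "pdi"):
        print(f"&{name};", "exists" if decodes_today(name) else "missing")
```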
How would you solve it?
The W3C i18n WG proposes the following additions. For convenience, the list includes characters for which we already have named entities; these are indicated using ✅. Possible named entities are suggested for the new items; these are derived from standard Unicode abbreviations, where available.
Latin 1 Supplement — Latin-1 punctuation and symbols
✅ &shy;
Combining Diacritical Marks — Grapheme joiner
&cgj;
Arabic — Format character
&alm;
Ogham — Space
Mongolian — Format controls
&fvs1;
&fvs2;
&fvs3;
&mvs;
&fvs4;
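General Punctuation — Spaces
&nqsp;
&mqsp;
✅ &ensp;
✅ &emsp;
✅ &emsp13;
✅ &emsp14;
&6msp;
✅ &numsp;
✅ &puncsp;
✅ &thinsp; AND &ThinSpace;
✅ &hairsp; AND &VeryThinSpace; AND part of &ThickSpace; (U+205F U+200A)
General Punctuation — Format character
✅ &ZeroWidthSpace; AND &NegativeVeryThinSpace; AND &NegativeThinSpace; AND &NegativeMediumSpace; AND &NegativeThickSpace;
✅ &zwnj;
✅ &zwj;
✅ &lrm;
✅ &rlm;
&lre;
&rle;
&pdf;
&lro;
&rlo;
✅ &NoBreak;
&lri;
&rli;
&fsi;
&pdi;
We would also like to coin a new &zwsp; entity name for U+200B, in addition to the overly long and complicated &ZeroWidthSpace;.
General Punctuation — Separators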
&lsep;
&psep;
General Punctuation — Space
&nnbsp;
✅ &MediumSpace; AND part of &ThickSpace; (U+205F U+200A)
General Punctuation — Invisible operators
✅ &af; AND &ApplyFunction;
✅ &it;
✅ &ic; AND &InvisibleComma;
CJK Symbols And Punctuation — CJK symbols and punctuation
&idsp;
Emoji Variation Selectors - turns on and off colour
&vs15;
&vs16;
Potential additional candidates
Hangul Jamo — Old initial consonants
&hcf;
Hangul Jamo — Medial vowels
&hjf;
Hangul Compatibility Jamo — Special character
&hf;
Halfwidth And Fullwidth Forms — Halfwidth Hangul variants
&hwhf;
General Punctuation — Invisible operators
&aafs;
Shorthand Format Controls — Shorthand format controls
Musical Symbols — Beams and slurs
Anything else?
There are other invisible characters which probably do not need entities. The list above selects those most likely to be useful. In particular, only 2 of the many, many variation selectors are listed here – these are the two that are regularly used for emojis.
There may also be a need to support Egyptian hieroglyph formatting controls, some of which will come out with Unicode 16 later this year.