Open adraffy opened 1 year ago
It depends? 😊 I haven't looked into it, probably depends a lot on what it ends up meaning for ToASCII and what the arguments are.
There are 3K+ RGI emoji and 1/3 of them involve ZWJ sequences. CheckJoiners favors a few exotic characters (that can easily be enforced at the registrar level) over 1350 emoji sequences that are used internationally by billions of people.
RFC 5892 is both outdated (2010) and misguided. AFAICT it's trying to allow ZW(N)J for typographical reasons yet I don't think there's any ambiguity with or without a joiner.
If you look across the internet, there are thousands of developer hours wasted on deciding these choices one way or another, but at the end of the day, CheckJoiners is just a convoluted way to disallow 200C and 200D.
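To illustrate why emoji sequences fail the check, here is a minimal sketch of the ZWJ half of ContextJ, assuming the RFC 5892 Appendix A.2 rule (the joiner must directly follow a character whose canonical combining class is Virama). The function name `zwj_passes_contextj` is made up for this example, and the ZWNJ rule, which also has a joining-type clause, is left out.

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class value assigned to virama characters
ZWJ = "\u200D"

def zwj_passes_contextj(label: str, i: int) -> bool:
    """Return True if the ZWJ at index i directly follows a virama.

    Hypothetical helper: a simplified reading of RFC 5892 Appendix A.2,
    not a full CheckJoiners implementation.
    """
    if label[i] != ZWJ:
        raise ValueError("expected a ZWJ at the given index")
    return i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC

devanagari = "\u0915\u094D\u200D\u0937"       # KA, VIRAMA, ZWJ, SSA
technologist = "\U0001F468\u200D\U0001F4BB"   # MAN, ZWJ, PERSONAL COMPUTER

print(zwj_passes_contextj(devanagari, 2))     # joiner follows a virama
print(zwj_passes_contextj(technologist, 1))   # joiner follows an emoji
```

Under this rule every emoji ZWJ sequence is rejected, since no emoji has a virama combining class.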
For a concrete example: 1F468 200D 1F4BB should encode as xn--1ugz855pfha, but with the joiner dropped it becomes xn--qq8hgf, which is wrong: 1F468 1F4BB is not the same as 1F468 200D 1F4BB.
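The two A-labels can be reproduced with a short sketch using Python's built-in `punycode` codec. This is raw Punycode plus the `xn--` prefix only: no UTS-46 mapping, normalization, or validation is applied, and `to_a_label` is a name invented for this example.

```python
def to_a_label(label: str) -> str:
    """Raw Punycode with the ACE prefix; skips all UTS-46 processing."""
    return "xn--" + label.encode("punycode").decode("ascii")

with_zwj = "\U0001F468\u200D\U0001F4BB"   # 1F468 200D 1F4BB
without_zwj = "\U0001F468\U0001F4BB"      # 1F468 1F4BB (joiner dropped)

print(to_a_label(with_zwj))      # xn--1ugz855pfha
print(to_a_label(without_zwj))   # xn--qq8hgf
```

The labels differ, so silently removing the joiner maps two distinct names onto one.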
The simplest solution is that CheckJoiners should be false.
For reference, I recently implemented a normalization standard for the Ethereum Name Service ecosystem. I used a combination of UTS-51 + UTS-46 + a significantly safer character set (banned punctuation, parens, brackets, vocalizations, obsolete, deprecated, ancient, reversed, turned, flipped, many ligatures, etc.) + an intelligent confusable system (one that isn't just a warning system: e.g. rn is a footgun confusable). Demo | Github
From my experience with the Unicode and RFC documentation, the primary source of confusion and bugs is the documentation itself. Many of these rules should be deprecated, and the rest should be clarified and modernized.
I think WHATWG made the correct decision with AllowHyphens and finally broke away from archaic DNS rules. I think they should do the same with CheckJoiners. If the WHATWG really wants to protect end-users, it should recommend UTS-51 RGI pre-processing and outright disallow ZW(N)J outside of emoji.
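A rough sketch of what "RGI pre-processing, then disallow ZW(N)J elsewhere" could look like. Everything here is an assumption for illustration: `RGI_ZWJ_SEQUENCES` is a two-entry stand-in for the real UTS-51 data file, and `joiners_ok` is a hypothetical name, not part of any spec or library.

```python
ZWNJ, ZWJ = "\u200C", "\u200D"

# Stand-in for the full RGI ZWJ sequence list published with UTS-51.
RGI_ZWJ_SEQUENCES = {
    "\U0001F468\u200D\U0001F4BB",  # man technologist
    "\U0001F469\u200D\U0001F4BB",  # woman technologist
}
MAX_LEN = max(len(s) for s in RGI_ZWJ_SEQUENCES)

def joiners_ok(label: str) -> bool:
    """Accept ZWJ/ZWNJ only inside a known RGI emoji sequence."""
    i = 0
    while i < len(label):
        # Try the longest candidate first so whole sequences are consumed.
        for n in range(min(MAX_LEN, len(label) - i), 1, -1):
            if label[i:i + n] in RGI_ZWJ_SEQUENCES:
                i += n
                break
        else:
            # No emoji sequence starts here: a bare joiner is an error.
            if label[i] in (ZWNJ, ZWJ):
                return False
            i += 1
    return True

print(joiners_ok("dev\U0001F468\u200D\U0001F4BB"))  # joiner inside an emoji
print(joiners_ok("a\u200Db"))                       # stray joiner
```

The emoji pass runs first and consumes whole sequences, so any joiner that survives it is by definition outside an emoji and can be rejected without context rules.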
Is CheckJoiners/ContextJ set in stone or can it be debated? If so, I'd like to present some arguments.