Open wlammen opened 4 years ago
Or U+FFFD (�), which I used somewhere recently and like somewhat better; since it can be rendered, it would also be the preferred form.
I am not happy with your suggestion in this particular case. The replacement character serves a special need of applications. If they cannot display a character (because they have not included the latest Unicode version, for example), the only way to indicate this to the user is to use the replacement character. This MUST NEVER be mixed up with regular data, which simply means you SHOULD NOT include � in your text, so that a user can attribute shortcomings to the software without doubt. In fact, I checked the rendered form before my initial post and inserted � into my favourite editor. Fortunately to no avail, which speaks well of this piece of software. I do not recommend using U+FFFD (�), but suggest falling back to U+FFFD REPLACEMENT CHARACTER as the ONLY valid display of this character. If you look more closely at the rendered form, it has a rather fundamental problem: how can you be sure that a web browser which does not support the latest (or even an earlier) Unicode version is able to display the Infra standard properly if you render the character? The universal form using the name always succeeds.
Wolf Lammen
PS To avoid any misconceptions arising from this post: @aphillips is right when he criticises my showcase here. A valid code point not covered by the current code charts of an application should not trigger the replacement character, but a hollow rectangle instead. The interested reader will find more information in section 5.3 of the Unicode Standard. The example can be fixed with little effort, so IMO the underlying arguments continue to hold.
He also points out that replacement characters can replace isolated surrogates. As far as I can see, this is correct, too, but out of the scope of this issue. Maybe I should have provided better context. It's on my TODO list.
@wlammen I'm not sure I understand your comment? Do you mean "don't include a U+FFFD literal in the text of Infra"? Or do you mean "don't replace isolated surrogates with one"?
U+FFFD is exactly the right thing to use for malformed content or encoding conversion errors--such as scalar value string conversion of isolated surrogates. The display of unsupported characters is usually not U+FFFD but rather the empty (or "notdef") glyph. This is often a hollow box (and colloquially this kind of display is called "tofu" for the food product that looks kind of like white squares). Occasionally some platforms use other glyphs such as question marks or boxes with the hex code point in them, decorative gunk for the Unicode block the character is in, etc. But the glyph displayed is hardly ever U+FFFD. U+FFFD says "there was data here, but it is gone now". Programs most definitely should keep any U+FFFD code points they are sent.
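To make the "encoding conversion error" case concrete, here is a minimal Python sketch. The helper name `decode_lossy` is made up for this example and does not come from Infra or Unicode; it only relies on the standard `errors="replace"` mode of `bytes.decode`.

```python
def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for bytes that cannot be decoded.

    Hypothetical helper for illustration only.
    """
    return raw.decode("utf-8", errors="replace")


# 0xFF can never occur in well-formed UTF-8, so the decoder substitutes U+FFFD.
damaged = decode_lossy(b"abc\xffdef")
print(damaged == "abc\ufffddef")

# A U+FFFD that is already present in the data is ordinary content:
# it survives an encode/decode round trip unchanged.
kept = decode_lossy("x\ufffdy".encode("utf-8"))
print(kept == "x\ufffdy")
```

Note that this says nothing about how an unsupported but valid character is drawn; that is up to the font and platform, as described above.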
No, I object to U+FFFD (�), not to U+FFFD or U+FFFD REPLACEMENT CHARACTER (written out). The character � is problematic because browsers, editors and the like use it as an error indicator. If you feed in error characters as regular input, you fool the reader (worst case). This is all about the content of the Infra web page, NOT about the internal data representation of some program. That has to be considered separately.
Added 8 hours later:
Let Unicode speak. https://www.unicode.org/charts/PDF/UFFF0.pdf
• used to replace an incoming character whose value is unknown or unrepresentable in Unicode
So "U+FFFD (�)" in your web content literally says "U+FFFD (" -- oops, something went wrong here during transmission to your browser, sorry, cannot display -- ")". Maybe this is a pedantic or paranoid interpretation, but it is the official one.
I'm not a Unicode expert by any means, but I don't think that's accurate. Clearly the U+FFFD code point must be able to be used in a meta sense, to represent the character itself; otherwise it couldn't appear in Unicode character tables. This usage in Infra is similar.
If this speaks for anything, copy the character from the mentioned Unicode PDF and see that a different encoding is used: FFFDREPLACEMENT CHARACTER Looks like 0xE27D, perhaps some private encoding, is used. I personally feel okay with using the character in a meta sense if the context clearly indicates that nothing else is to be expected. That is the case in a chart table FFFD � REPLACEMENT CHARACTER but not in U+FFFD (�), where an uninformed reader cannot verify this immediately.
I am not a Unicode expert either; I just don't feel happy about this and thought I should voice it somehow.
The normative bit is U+XXXX. Whatever occurs after is always advisory and subject to various issues, though on most modern platforms it'll probably be okay.
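For reference, the glyph-free spelling discussed in this thread ("U+XXXX" optionally followed by the character name) can be generated mechanically. This is only a sketch: the function name `describe` is hypothetical, and it assumes the names in Python's `unicodedata` module are the ones a spec would want to print.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for one code point without rendering its glyph.

    Hypothetical helper; falls back to the bare 'U+XXXX' form when the
    code point has no name in Python's Unicode database (e.g. controls).
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    label = f"U+{ord(ch):04X}"  # at least four uppercase hex digits
    name = unicodedata.name(ch, "")
    return f"{label} {name}" if name else label


print(describe("\ufffd"))  # U+FFFD REPLACEMENT CHARACTER
print(describe(")"))       # U+0029 RIGHT PARENTHESIS
```

Because the output never contains the character itself, it reads the same regardless of which fonts or Unicode version the reader's software supports.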
The Infra standard itself suggests how to proceed in difficult cases in chapter 4.5:
U+0029 can be referred to as "U+0029 RIGHT PARENTHESIS", because even though it renders, this avoids unmatched parentheses.
I think I have already provided most of the points that speak for treating U+FFFD likewise. I know there is some headroom for trading strict conformance with Unicode and other standards for readability. These murky waters are subject to personal perspectives and some overall considerations I might not be aware of. See this as my final post on this topic. I won't repeat my arguments here, as they are readily available in this thread.
Wolf Lammen
In https://infra.spec.whatwg.org/#code-points you suggest.
but you fail to do so, when in https://infra.spec.whatwg.org/#strings you write
It should be
To convert a string into a scalar value string, replace any surrogates with U+FFFD REPLACEMENT CHARACTER.
See https://www.unicode.org/charts/PDF/UFFF0.pdf
Wolf Lammen
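As an aside, the conversion that the sentence in question describes can be sketched in a few lines of Python. This is an illustration under assumptions, not Infra's reference algorithm: the function name is made up, and it relies on Python's `str` being a sequence of code points, so any surrogate found in it is necessarily an isolated one.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def to_scalar_value_string(s: str) -> str:
    """Replace every surrogate code point (U+D800..U+DFFF) with U+FFFD.

    Hypothetical helper for illustration only.
    """
    return "".join(
        REPLACEMENT if 0xD800 <= ord(c) <= 0xDFFF else c
        for c in s
    )


# An isolated lead surrogate is replaced ...
print(to_scalar_value_string("a\ud800b") == "a\ufffdb")
# ... while a supplementary-plane character is a scalar value and is kept.
print(to_scalar_value_string("ok\U0001F600") == "ok\U0001F600")
```

The result can always be encoded as UTF-8, which is the practical point of the conversion.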