whatwg / url

URL Standard
https://url.spec.whatwg.org/

Encourage denoting character-attributable errors by the REPLACEMENT CHARACTER #819

Open hsivonen opened 8 months ago

hsivonen commented 8 months ago

What is the issue with the URL Standard?

The URL Standard gives advice about URL rendering: https://url.spec.whatwg.org/#ref-for-concept-domain-to-unicode%E2%91%A0

In the https://url.spec.whatwg.org/#concept-host-parser section, it also says: "Alternatively UTF-8 decode without BOM or fail can be used, coupled with an early return for failure, as domain to ASCII fails on U+FFFD (�).", which is the opposite of what I'm asking for here.

UTS 46 says: "Implementations may make further modifications to the resulting Unicode string when showing it to the user. For example, it is recommended that disallowed characters be replaced by a U+FFFD to make them visible to the user."

It would be useful for the URL Standard to highlight this technique and to include a Note encouraging implementations to let U+FFFD from UTF-8 decode flow through the processing, and to replace erroneous code points found during UTS 46 processing and forbidden domain code point processing with U+FFFD, so that errors attributable to specific parts of the domain are visualized to the user. Since U+FFFD is itself a disallowed character, this technique preserves the overall failure status of the domain.
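
To make the idea concrete, here is a minimal Python sketch of the display-side technique. It is only an illustration under stated assumptions: `mark_domain_errors` is a hypothetical helper name, the forbidden code point set is transcribed from my reading of the URL Standard and should be checked against the current text, and the UTS 46 step is omitted entirely because the Python standard library has no UTS 46 implementation.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Assumption: transcribed from the URL Standard's "forbidden host code point"
# and "forbidden domain code point" definitions; verify against the spec.
FORBIDDEN_HOST = set("\x00\t\n\r #/:<>?@[\\]^|")
FORBIDDEN_DOMAIN = FORBIDDEN_HOST | {chr(c) for c in range(0x20)} | {"%", "\x7f"}


def mark_domain_errors(raw: bytes) -> Tuple[str, bool]:
    """Hypothetical helper: return (display string, failed?).

    Each character-attributable error is shown as U+FFFD instead of
    aborting at the first failure.
    """
    # Let U+FFFD from UTF-8 decoding flow through rather than failing early.
    text = raw.decode("utf-8", errors="replace")
    # Replace each forbidden domain code point with U+FFFD.
    # (A real implementation would also run UTS 46 processing and replace
    # disallowed characters there; that step is not sketched here.)
    marked = "".join(REPLACEMENT if ch in FORBIDDEN_DOMAIN else ch for ch in text)
    # U+FFFD is itself disallowed, so its presence preserves the failure status.
    return marked, REPLACEMENT in marked
```

For example, `mark_domain_errors(b"exa mple.com")` yields a display string with U+FFFD in place of the space and reports failure, while `mark_domain_errors(b"example.com")` passes through unchanged and reports no failure.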