Open agriffis opened 4 years ago
Good question. I haven't looked into the RFC to see if it supports percent encoding in hostnames. If that's the case, then the existing host function should be renamed to host-raw, and a new host function should be introduced that does percent decoding.
https://url.spec.whatwg.org/#host-parsing
- Let domain be the result of running UTF-8 decode without BOM on the string percent decoding of input.
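To make the order in that step concrete, here is a rough Python sketch (not this library's code). The function name `percent_then_utf8` and the sample hostname are made up for illustration, and the sketch skips everything else the host parser does afterwards, such as the domain-to-ASCII step.

```python
from urllib.parse import unquote_to_bytes


def percent_then_utf8(raw_host: str) -> str:
    """Percent-decode first, then interpret the resulting bytes as UTF-8.

    Hypothetical helper for illustration only.
    """
    # "%C3%A4" becomes the two raw bytes 0xC3 0xA4; literal characters
    # are passed through as their UTF-8 encoding.
    host_bytes = unquote_to_bytes(raw_host)
    # Only now are the bytes turned into text. This sketch decodes
    # strictly and lets UnicodeDecodeError propagate.
    return host_bytes.decode("utf-8")


print(percent_then_utf8("ex%C3%A4mple.com"))
```

The point is that the two escapes only form a character once they sit next to each other as bytes, which is why the UTF-8 step has to come second.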
Yeah, it looks like the grammar in the RFC allows percent encoding in hostnames too.
The WHATWG spec is a little weird, since it seems to assume you have a byte array, and only does UTF-8 decoding after percent decoding.
I think that makes sense though. URL strings shouldn't be UTF-8 decoded directly. A URL is a serialized object where the various components have separate encoding rules and need to be decoded separately from each other. It's wild but it makes sense.
It also means that there are printable URLs that cannot be decoded, because the percent-encoded bytes might combine with other bytes in a way that isn't valid UTF-8.
It does make sense. I should have been clearer about what I meant. The weirdness is trying to follow that spec here. We don't have a byte array; the URL has already been decoded into a Unicode string.
But it shouldn't really cause any problems; it can just throw an exception if the percent decoding emits invalid byte sequences.
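For what it's worth, a minimal Python sketch of that behaviour, starting from a Unicode string rather than a byte array. `decode_host_strict` is a hypothetical name, and raising `ValueError` is just one possible choice of exception, not something the thread settles on.

```python
from urllib.parse import unquote_to_bytes


def decode_host_strict(raw_host: str) -> str:
    """Percent-decode a host that is already a Unicode string.

    Hypothetical helper: raises ValueError when the decoded bytes
    are not valid UTF-8.
    """
    # Literal characters are re-encoded as UTF-8, %XX escapes become
    # raw bytes, so the result is one byte string to validate.
    host_bytes = unquote_to_bytes(raw_host)
    try:
        return host_bytes.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(
            f"host {raw_host!r} does not percent-decode to valid UTF-8"
        ) from err


print(decode_host_strict("ex%C3%A4mple.com"))

try:
    # 0xC3 starts a two-byte sequence, but the next byte is a plain "m".
    decode_host_strict("ex%C3mple.com")
except ValueError as err:
    print(err)
```

The second call is an example of a printable URL component that cannot be decoded: the escaped byte combines with the following literal character into an invalid sequence.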
It seems like this is incorrect:
Shouldn't the host be decoded by the accessor?