wtetzner / exploding-fish

A URI library for Clojure

Host should be decoded? #27

Open agriffis opened 4 years ago

agriffis commented 4 years ago

It seems like this is incorrect:

user=> (fish/host "https://%76imeo.com")
"%76imeo.com"

Shouldn't the host be decoded by the accessor?

wtetzner commented 4 years ago

Good question. I haven't looked into the RFC to see if it supports percent encoding in hostnames. If that's the case, then the existing host function should be renamed to host-raw, and a new host function should be introduced that does percent decoding.

agriffis commented 4 years ago

https://url.spec.whatwg.org/#host-parsing

  1. Let domain be the result of running UTF-8 decode without BOM on the string percent decoding of input.

wtetzner commented 4 years ago

Yeah, it looks like the grammar in the RFC allows percent encoding in hostnames too.

The WHATWG spec is a little weird, since it seems to assume you have a byte array, and only does UTF-8 decoding after percent decoding.
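For reference, the order described there (percent-decode to bytes first, UTF-8 decode second) could be sketched roughly as below. This is only a sketch of that one step, not exploding-fish code and not the full WHATWG host parser; `percent-decode-bytes` and `decode-host` are hypothetical names, and the input is assumed to be a Java string that gets UTF-8 encoded wherever it is not a `%XX` escape.

```clojure
(import '(java.io ByteArrayOutputStream)
        '(java.nio.charset StandardCharsets))

(defn percent-decode-bytes
  "Hypothetical helper (not part of exploding-fish): returns the bytes of s,
   with each %XX escape replaced by the single byte it names and every other
   code point UTF-8 encoded."
  ^bytes [^String s]
  (let [out (ByteArrayOutputStream.)
        n   (.length s)]
    (loop [i 0]
      (if (< i n)
        ;; hex is the two hex digits after a \%, or nil if this is not an escape
        (let [hex (when (and (= \% (.charAt s i)) (<= (+ i 3) n))
                    (re-matches #"[0-9A-Fa-f]{2}" (subs s (+ i 1) (+ i 3))))]
          (if hex
            (do (.write out (Integer/parseInt hex 16))
                (recur (+ i 3)))
            ;; not an escape: copy the whole code point (handles surrogate pairs)
            (let [cp (.codePointAt s i)
                  bs (.getBytes (String. (Character/toChars cp))
                                StandardCharsets/UTF_8)]
              (.write out bs 0 (alength bs))
              (recur (+ i (Character/charCount cp))))))
        (.toByteArray out)))))

(defn decode-host
  "Hypothetical: percent-decode first, then UTF-8 decode the resulting bytes.
   Note that this String constructor is lenient: malformed sequences are
   replaced with U+FFFD rather than rejected."
  [raw-host]
  (String. (percent-decode-bytes raw-host) StandardCharsets/UTF_8))

(decode-host "%76imeo.com") ;; => "vimeo.com"
```

A malformed escape such as `%zz` is passed through unchanged in this sketch; whether that is the right behavior for the library is a separate question.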

agriffis commented 4 years ago

I think that makes sense though. URL strings shouldn't be UTF-8 decoded directly. A URL is a serialized object where the various components have separate encoding rules and need to be decoded separately from each other. It's wild but it makes sense.

It also means that there are printable URLs that cannot be decoded, because the percent-encoded bytes might combine with other bytes in a way that isn't valid UTF-8.
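A small illustration of that, working directly on the bytes a percent decoder would produce (the byte values are chosen by hand for the example; `lenient-utf8` is a hypothetical name). `%C3%A9` decodes to the bytes C3 A9, which are valid UTF-8, while `%C3x` decodes to C3 78, where the lead byte C3 is followed by something that is not a continuation byte:

```clojure
(import '(java.nio.charset StandardCharsets))

;; Bytes for "%C3%A9": a valid two-byte UTF-8 sequence (U+00E9).
(def valid-bytes
  (byte-array [(unchecked-byte 0xC3) (unchecked-byte 0xA9)]))

;; Bytes for "%C3x": lead byte C3 followed by ASCII 'x', not valid UTF-8.
(def invalid-bytes
  (byte-array [(unchecked-byte 0xC3) (unchecked-byte 0x78)]))

(defn lenient-utf8
  "Hypothetical helper: Java's String constructor never throws here; it
   substitutes U+FFFD for malformed input."
  [^bytes bs]
  (String. bs StandardCharsets/UTF_8))

(lenient-utf8 valid-bytes)   ;; decodes cleanly to U+00E9
(lenient-utf8 invalid-bytes) ;; result contains the replacement character U+FFFD
```

Both inputs are perfectly printable as URL text, but only the first has a faithful decoded form.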

wtetzner commented 4 years ago

It does make sense. I should have been clearer about what I meant. The weirdness is in trying to follow that spec here. We don't have a byte array; the URL has already been decoded into a Unicode string.

But it shouldn't really cause any problems. It can just throw an exception if the percent decoding emits invalid byte sequences.
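One way the throwing behavior might look on the JVM, as a sketch only: a `CharsetDecoder` configured with `CodingErrorAction/REPORT` raises a `CharacterCodingException` on malformed bytes instead of substituting U+FFFD. `strict-utf8` is a hypothetical name, and the byte array is assumed to come from an earlier percent-decoding step.

```clojure
(import '(java.nio ByteBuffer)
        '(java.nio.charset StandardCharsets CodingErrorAction
                           CharacterCodingException))

(defn strict-utf8
  "Hypothetical helper: decodes bs as UTF-8 and throws
   CharacterCodingException if the bytes are not valid UTF-8."
  [^bytes bs]
  (let [decoder (-> (.newDecoder StandardCharsets/UTF_8)
                    (.onMalformedInput CodingErrorAction/REPORT)
                    (.onUnmappableCharacter CodingErrorAction/REPORT))]
    ;; .decode returns a CharBuffer; str turns it into a String
    (str (.decode decoder (ByteBuffer/wrap bs)))))

;; Valid input decodes normally.
(strict-utf8 (.getBytes "vimeo.com" StandardCharsets/UTF_8)) ;; => "vimeo.com"

;; 0xFF can never appear in UTF-8, so this throws.
(try
  (strict-utf8 (byte-array [(unchecked-byte 0xFF)]))
  (catch CharacterCodingException e
    (println "invalid host bytes:" (.getMessage e))))
```

A fresh decoder is created per call because `CharsetDecoder` instances are stateful and not safe to share across threads.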