whatwg / url

URL Standard
https://url.spec.whatwg.org/

Proposal: Add a normalization interface #729

Open gibson042 opened 1 year ago

gibson042 commented 1 year ago

As noted in #606 and elsewhere, the URL APIs strongly lean towards preserving input in path and query components, and therefore differentiate URIs that are equivalent per, e.g., https://www.rfc-editor.org/rfc/rfc9110#section-4.2.3. But users need to compare such URIs and/or map them to resources, and doing so robustly requires normalization. I think it therefore makes sense to provide a normalization interface, probably one that is configurable (or can become so in the future) to account for the various levels of the "comparison ladder": generic percent-decoding (with case normalization of the percent-encodings that survive), dot-segment removal, component-sensitive percent-decoding, scheme-based rules, and possibly even higher-order considerations such as full case normalization and/or query parameter ordering, combining, and value normalization.

One possibility would be adding a normalize method to the URL class with reasonable behavior in the absence of any arguments (e.g., as much normalization as possible without conflating URIs that implementations supporting the scheme are permitted to differentiate), such that e.g. new URL("httpS://EXAMPLE.com:443/%7ESMith/./home.html").normalize() === "https://example.com/~SMith/home.html" is true, but so is new URL("http://example.com/data/").normalize() !== new URL("http://example.com/data").normalize() (because a path with a trailing slash and the same path without one are not equivalent at the level of an http-scheme URL).
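To make the proposed behavior concrete, here is a rough userland sketch. The names normalizeUrl and normalizePercentEncoding are hypothetical and not part of the URL Standard or any shipped API. It assumes the default level is limited to RFC 3986 percent-encoding normalization (decode escapes of unreserved characters, uppercase the hex digits of the rest) applied to path and query, on top of what the URL parser already does.

```javascript
// Hypothetical helpers: a sketch only, not a proposed or existing API.
const UNRESERVED = /^[A-Za-z0-9\-._~]$/; // RFC 3986 unreserved characters

function normalizePercentEncoding(component) {
  // Decode %XX only when it encodes an unreserved character;
  // otherwise keep the escape and uppercase its hex digits.
  return component.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const char = String.fromCharCode(parseInt(hex, 16));
    return UNRESERVED.test(char) ? char : `%${hex.toUpperCase()}`;
  });
}

function normalizeUrl(input) {
  // The parser already lowercases scheme and host, drops the default
  // port, and removes dot segments.
  const url = new URL(input);
  url.pathname = normalizePercentEncoding(url.pathname);
  if (url.search !== "") {
    url.search = normalizePercentEncoding(url.search);
  }
  return url.href; // the fragment is left untouched in this sketch
}

console.log(normalizeUrl("httpS://EXAMPLE.com:443/%7ESMith/./home.html"));
// "https://example.com/~SMith/home.html"
console.log(
  normalizeUrl("http://example.com/data/") !== normalizeUrl("http://example.com/data")
); // true
```

The trailing-slash pair stays distinct because nothing here touches path structure; only escapes are rewritten.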

annevk commented 1 year ago

From https://www.rfc-editor.org/rfc/rfc3986.html#section-6.2 I think we would want this method to perform "Case Normalization" (essentially only of the %3a to %3A variety) and "Percent-Encoding Normalization".

The other aspects there are either already handled by the URL parser (e.g., httpS://EXAMPLE.com:443/%7ESMith/./home.html is already normalized to https://example.com/%7ESMith/home.html) or out-of-scope. We wouldn't want to offer scheme-based or protocol-based normalization as that's not tenable and better handled by the standards for those schemes and protocols. HTTP(S) schemes end up being covered anyway, but in general schemes are supposed to build on top of the URL Standard.
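For reference, this is what the existing parser does with that input, using only the standard URL class as found in browsers and Node.js:

```javascript
const parsed = new URL("httpS://EXAMPLE.com:443/%7ESMith/./home.html");

// Scheme and host are lowercased, the default port is dropped, and the
// "." segment is removed, but the %7E escape in the path is preserved.
console.log(parsed.href);     // "https://example.com/%7ESMith/home.html"
console.log(parsed.port);     // ""
console.log(parsed.pathname); // "/%7ESMith/home.html"
```

So the only gap relative to the example in the original proposal is the %7E escape, which falls under percent-encoding normalization.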

Now there are some difficulties with "Percent-Encoding Normalization", e.g., https://test/?%%33a. Decoding %33 to 3 on its own would leave %3a, an escape that was not in the input. So that would have to become https://test/?%253a presumably, but it's not entirely clear as the input is invalid.
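A small sketch of that hazard. The helper names decodeUnreserved and escapeStrayPercents are hypothetical, and escaping stray "%" signs first is just one possible resolution, not something the URL Standard specifies.

```javascript
// Hypothetical helpers illustrating the "%%33a" edge case.
function decodeUnreserved(text) {
  // Decode %XX only for RFC 3986 unreserved characters; uppercase the rest.
  return text.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const char = String.fromCharCode(parseInt(hex, 16));
    return /^[A-Za-z0-9\-._~]$/.test(char) ? char : `%${hex.toUpperCase()}`;
  });
}

function escapeStrayPercents(text) {
  // A "%" that does not start a valid escape becomes "%25".
  return text.replace(/%(?![0-9A-Fa-f]{2})/g, "%25");
}

// The parser keeps the invalid sequence as-is in the query.
const rawQuery = new URL("https://test/?%%33a").search.slice(1);
console.log(rawQuery); // "%%33a"

// Naive pass: "%33" decodes to "3", which fuses with the stray "%"
// into "%3a", an escape for ":" that was never in the input.
console.log(decodeUnreserved(rawQuery)); // "%3a"

// Guarded pass: escape the stray "%" first, then decode.
console.log(decodeUnreserved(escapeStrayPercents(rawQuery))); // "%253a"
```

The guarded result matches the https://test/?%253a outcome suggested above, and running the guarded pass a second time leaves it unchanged.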

And yeah, assuming application/x-www-form-urlencoded for query and normalizing that could make sense to offer as an option, though you could also do this yourself quite easily with url.searchParams.sort().
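For example, two URLs that differ only in query parameter order compare equal after sorting. This uses only the standard URL and URLSearchParams APIs; the example URLs are made up.

```javascript
const first = new URL("https://example.com/search?b=2&a=1&a=0");
const second = new URL("https://example.com/search?a=1&b=2&a=0");

console.log(first.href === second.href); // false

// sort() orders pairs by name and is stable, so the relative order of
// the two "a" values is preserved.
first.searchParams.sort();
second.searchParams.sort();

console.log(first.search);               // "?a=1&a=0&b=2"
console.log(first.href === second.href); // true
```

Note that mutating searchParams re-serializes the whole query as application/x-www-form-urlencoded, so the encoding of individual values can change as a side effect (a %20 in a value comes back as +, for instance).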