whatwg / url

URL Standard
https://url.spec.whatwg.org/
Other
533 stars 139 forks source link

Proposal for new version of parsing spec #778

Open agowa opened 1 year ago

agowa commented 1 year ago

I think we need to extend the PEG to specify the lower-layer protocols explicitly (I.E., chain multiple schemas together). Especially since HTTP can now also be via UDP and more and more stuff uses HTTP as a transport/tunneling protocol. The current parsing spec is just not flexible enough and thereby adds a bunch of redundancies and limitations that lead to non-compatible implementation differences in software already (implementation of UNIX domain sockets or the "http+socket://" notations). And these differences can cause security vulnerabilities when one software gets "glued to another" (different parsing between WAF, frontend, and backend servers).

Like so far, I think this one would encompass all of these (new) challenges and flexibility while still being backward compatible with the current one:

LowestLayer+HigherLayer+EvenHigherLayer://[[username]:[password]@EvenHigherLayerEndpointIdentifier]:[HigherLayerEndpointIdentifier]:[LowerLayerEndPointIdentifier]/resource (with optional square brackets around each attribute and default values for the lower layers if not specified in the URL explicitly, as well as recommendation to offer a strict parsing mode for implementations that will not try to guess anything and only treat URLs with square brackets around every attribute and explicitly provided data (no implied application ports, no implied lower layer protocols, ...), mainly for security, futureproofing and reliability in usages by scripts and automation, as well as for debugability by experts and prosumers). And multiple (chained) endpoint identifiers only being allowed for the verbose version (to avoid parsing bugs and ambiguity), as well as requiring EndpointIdentifiers to match the number of specified lower layers 1:1 (but in reversed order). (And the current username:password@ would explicitly become part of the part that specifies the HTTP endpoint, for example so that each layer can have its own independent login information or additional protocol-specific information, we'd just hand it off to the protocol the schema specified as an opaque blob)

Examples:

Stack example URI Comment
TCP => HTTP tcp+http://[example.com]:[80] HTTP via TCP, no probing and no switching to e.g. UDP
UDP => HTTP udp+http://[example.com]:[80] HTTP via UDP, no probing and no switching to e.g. TCP
TCP => TLS => HTTP tcp+tls+http://[example.com]:[example.com]:[443] So having two schemas specified for "TLS" wrapped version is now no longer necessary as a side effect, but kepting them for backward compatibility for already added/specified ones isn't an issue anyway
TCP => TLS => HTTP tcp+https://[example.com]:[443] Same, but with HTTPS instead of "tls+http"
UDP => HTTP => TCP udp+http+tcp://[48569]:[example.com]:[80] Specifies a raw TCP stream that is tunneled through HTTP which itself is served via a UDP connection
IP => TCP => TLS => HTTP ip+tcp+tls+http://[example.com]:[example2.com]:[443]:[2001:db8::1]/foo This form would mean that an IP connection to 2001:db8::1 is established that contains a TCP connection to port 443, which contains a TLS connection¹. And the HTTP being within it and the SNI header of example.com
HTTP => HTTP http+http://[example.com]:[username:password@example2.com] authenticating against example2.com to use it as HTTP proxy to connect to example.com, also avoids current ambiguity of credentials being for the destination or the proxy
Unix Socket => HTTP socket+http://[example.com]:[/run/foo/bar.sock]/foobar opening a unix socket to /run/foo/bar.sock and sending example.com as the SNI name
FILE => HTTP file+http://[example.com]:[/run/foo/bar.sock]/foobar same, but using file as schema, for an even more generic approach, as technology also other things beside unix sockets that are accessed like files could be used like e.g., a (virtual) serial device

¹: with an explicitly specified hostname example2.com to use for certificate validation. Web browsers should throw a disableable (in the options, not the error message itself) error if this differs from the HTTP SNI, but that's application behavior (shouldn't be part of the PEG), as for CLI tools, debugging and developing or for web proxies like those universities use for off-campus online access to journals etc, it is very much desirable.

This extension (or, admittedly, propose for a new version of the PEG) is my preferred improvement, as it does not break the independence of the different protocols and allows extensibility, debugability, and clarity (no ambiguity and no security vulnerabilities by parsing "trickery"). But if breaking backward compatibility is not an issue (e.g. because we can detect the "parser spec version" easily, then I'd prefer this alternative:

LowestLayer[[LowerLayerEndPointIdentifier]]+HigherLayer[[HigherLayerEndpointIdentifier]]+EvenHigherLayer[[username]:[password]@EndpointIdentifier]://resourcePath

Example: ip[2001:db8::1]+tcp[80]+http[example.com]://index.html (or maybe for backward compatibility via a new schema in the existing style like uri2://ip[2001:db8::1]+tcp[80]+http[example.com]/index.html)

Change the syntax completely to have the endpoint identifiers right after the schema part. Cleaner, simpler to implement a parser for, but a drastic and breaking change to the current one, currently not used (not even in a similar fashion) by any available implementation I'm aware of.

Limitations: Changes to the parsing spec will take a long time to get to clients as it requires every parser to update, but the status quo of "everyone doing their own thing" of working around these limitations is worse than having a cut and a v2, esp. because ways for backward compatibility can be implemented and therefore "switching the parser" should be manageable in almost all cases. Alternatively, the current syntax could be used as "a shorthand" or "auto-detect" mode of the new one, which mainly should be used in automation and prouser use cases anyway (so not visible to the "average end-user" that just wants to browse amazon or Facebook)...

Address:

749

577

...

annevk commented 1 year ago

I'm not sure what PEG is, but it's highly unlikely we'll change the URL parser, especially so drastically. The motivation is also somewhat hard to digest and not necessarily convincing, as browsers have adopted UDP for HTTP without a need for a change in scheme.

https://whatwg.org/faq#adding-new-features might help in presenting this in a clearer way. And https://whatwg.org/working-mode#changes lists the requirements for an actual change.

agowa commented 1 year ago

PEG == Parsing Expression Grammar, basically the abstract way of defining how a parser works. https://en.wikipedia.org/wiki/Parsing_expression_grammar

More things than web browsers exist and the URL schema is not just for browsers. Also just a few days ago someone on twitter asked how CORS is supposed to work with HTTP via Unix Socket, so the limitations are real. And for web browsers/users the change wouldn't be noticeable, as the current syntax would remain as a shorthand notation, where as the newer more explicit one will allow all the usecases that currently aren't standardized and that everyone implements in hundred different ways...