Open agowa opened 1 year ago
I'm not sure what PEG is, but it's highly unlikely we'll change the URL parser, especially so drastically. The motivation is also somewhat hard to digest and not necessarily convincing, as browsers have adopted UDP for HTTP without a need for a change in scheme.
https://whatwg.org/faq#adding-new-features might help in presenting this in a clearer way. And https://whatwg.org/working-mode#changes lists the requirements for an actual change.
PEG == Parsing Expression Grammar, basically the abstract way of defining how a parser works. https://en.wikipedia.org/wiki/Parsing_expression_grammar
More things than web browsers exist and the URL schema is not just for browsers. Also, just a few days ago someone on Twitter asked how CORS is supposed to work with HTTP via Unix socket, so the limitations are real. And for web browsers/users the change wouldn't be noticeable, as the current syntax would remain as a shorthand notation, whereas the newer, more explicit one would allow all the use cases that currently aren't standardized and that everyone implements in a hundred different ways...
I think we need to extend the PEG to specify the lower-layer protocols explicitly (i.e., chain multiple schemas together), especially since HTTP can now also run over UDP and more and more software uses HTTP as a transport/tunneling protocol. The current parsing spec is just not flexible enough, and the resulting redundancies and limitations have already led to incompatible implementations in software (for example the handling of UNIX domain sockets, or the "http+socket://" notations). These differences can cause security vulnerabilities when one piece of software gets "glued to another" (different parsing between WAF, frontend, and backend servers).
So far, I think the following syntax would cover all of these (new) challenges and provide the needed flexibility while still being backward compatible with the current one:
LowestLayer+HigherLayer+EvenHigherLayer://[[username]:[password]@EvenHigherLayerEndpointIdentifier]:[HigherLayerEndpointIdentifier]:[LowerLayerEndPointIdentifier]/resource
The square brackets around each attribute are optional, and the lower layers get default values if they are not specified explicitly in the URL. Implementations are also recommended to offer a strict parsing mode that does not try to guess anything and only accepts URLs with square brackets around every attribute and explicitly provided data (no implied application ports, no implied lower-layer protocols, ...). This is mainly for security, future-proofing, and reliability when used by scripts and automation, as well as for debuggability by experts and prosumers.

Multiple (chained) endpoint identifiers are only allowed in the verbose version (to avoid parsing bugs and ambiguity), and the number of EndpointIdentifiers must match the number of specified layers 1:1 (but in reversed order).

The current username:password@ would explicitly become part of the part that specifies the HTTP endpoint, so that, for example, each layer can have its own independent login information or additional protocol-specific information. We'd just hand it off as an opaque blob to the protocol the schema specified.

Examples:
tcp+http://[example.com]:[80]
udp+http://[example.com]:[80]
tcp+tls+http://[example.com]:[example.com]:[443]
tcp+https://[example.com]:[443]
udp+http+tcp://[48569]:[example.com]:[80]
ip+tcp+tls+http://[example.com]:[example2.com]:[443]:[2001:db8::1]/foo
example.com
http+http://[example.com]:[username:password@example2.com]
socket+http://[example.com]:[/run/foo/bar.sock]/foobar
file+http://[example.com]:[/run/foo/bar.sock]/foobar
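To make the strict (fully bracketed) mode more concrete, here is a minimal sketch of how such a parser could look. This is only an illustration under my reading of the rules above: the names `parse_strict` and `ChainedURL` are made up, nested brackets are tolerated by counting depth, and everything after the last bracketed identifier is treated as the resource.

```python
from typing import NamedTuple


class ChainedURL(NamedTuple):
    # (protocol, endpoint identifier) pairs, lowest layer first
    layers: list
    resource: str


def parse_strict(url: str) -> ChainedURL:
    """Hypothetical strict-mode parser: every identifier must be bracketed."""
    scheme_part, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("missing '://' separator")
    protocols = scheme_part.split("+")
    if not all(protocols):
        raise ValueError("empty protocol name in scheme chain")

    identifiers = []
    pos = 0
    while pos < len(rest) and rest[pos] == "[":
        # Find the bracket that closes this identifier, allowing nesting.
        depth = 0
        for end in range(pos, len(rest)):
            if rest[end] == "[":
                depth += 1
            elif rest[end] == "]":
                depth -= 1
                if depth == 0:
                    break
        else:
            raise ValueError("unbalanced square brackets")
        identifiers.append(rest[pos + 1:end])
        pos = end + 1
        if rest[pos:pos + 2] == ":[":
            pos += 1  # skip the ':' between two bracketed identifiers
        else:
            break

    resource = rest[pos:]
    if resource and not resource.startswith("/"):
        raise ValueError("unexpected text after endpoint identifiers")
    if len(identifiers) != len(protocols):
        raise ValueError("identifier count must match protocol count")
    # Identifiers are written highest layer first, protocols lowest first.
    return ChainedURL(list(zip(protocols, reversed(identifiers))), resource)
```

For example, `parse_strict("socket+http://[example.com]:[/run/foo/bar.sock]/foobar")` pairs `socket` with `/run/foo/bar.sock` and `http` with `example.com`, and a URL with three protocols but only two identifiers is rejected instead of guessed at.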
¹: with an explicitly specified hostname example2.com to use for certificate validation. Web browsers should throw an error if this differs from the HTTP SNI, and that error should only be disableable in the options, not from the error message itself. But that's application behavior and shouldn't be part of the PEG: for CLI tools, debugging and development, or web proxies like those universities use for off-campus online access to journals etc., being able to specify a different hostname is very much desirable.

This extension (or, admittedly, proposal for a new version of the PEG) is my preferred improvement, as it does not break the independence of the different protocols and allows extensibility, debuggability, and clarity (no ambiguity and no security vulnerabilities from parsing "trickery"). But if breaking backward compatibility is not an issue (e.g. because we can detect the "parser spec version" easily), then I'd prefer this alternative:
LowestLayer[[LowerLayerEndPointIdentifier]]+HigherLayer[[HigherLayerEndpointIdentifier]]+EvenHigherLayer[[username]:[password]@EndpointIdentifier]://resourcePath
Example:
ip[2001:db8::1]+tcp[80]+http[example.com]://index.html
(or maybe, for backward compatibility, via a new schema in the existing style, like uri2://ip[2001:db8::1]+tcp[80]+http[example.com]/index.html)

Change the syntax completely to have the endpoint identifiers right after the schema part. Cleaner, simpler to implement a parser for, but a drastic and breaking change to the current one, currently not used (not even in a similar fashion) by any available implementation I'm aware of.
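To illustrate the "simpler to implement a parser for" point, here is a rough sketch for the alternative syntax. It rests on assumptions that are not part of the proposal: the function name `parse_alternative` is made up, identifiers are assumed not to contain "+", "[" or "]" (so the nested [username]:[password]@ form is not handled), and the uri2:// variant is left out.

```python
import re

# One layer looks like name[identifier]; nesting is not handled in this sketch.
_LAYER = re.compile(r"([A-Za-z][A-Za-z0-9.-]*)\[([^\[\]]*)\]")


def parse_alternative(url: str):
    """Hypothetical parser for 'proto[id]+proto[id]://resourcePath'."""
    chain, sep, resource = url.partition("://")
    if not sep:
        raise ValueError("missing '://' separator")
    layers = []
    for part in chain.split("+"):
        match = _LAYER.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed layer: {part!r}")
        # Each protocol carries its own identifier, lowest layer first.
        layers.append((match.group(1), match.group(2)))
    return layers, resource
```

With the example above, `parse_alternative("ip[2001:db8::1]+tcp[80]+http[example.com]://index.html")` yields the three (protocol, identifier) pairs in order plus the resource path `index.html`; no counting or reversing of identifiers is needed because each one sits next to its protocol.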
Limitations: Changes to the parsing spec will take a long time to reach clients, as every parser has to be updated. But the status quo of "everyone doing their own thing" to work around these limitations is worse than making a cut and having a v2, especially because backward compatibility can be implemented, so "switching the parser" should be manageable in almost all cases. Alternatively, the current syntax could be used as "a shorthand" or "auto-detect" mode of the new one, which should mainly be used in automation and power-user use cases anyway (so not visible to the "average end-user" who just wants to browse Amazon or Facebook)...