webspecs / url

The URL specification
https://specs.webplatform.org/url/webspecs/develop/
Other
21 stars 9 forks source link

Document should define ABNF where appropriate #6

Open awwright opened 9 years ago

awwright commented 9 years ago

The current document as published on https://specs.webplatform.org/url/webspecs/develop/ has some really cool railroad diagrams visually showing the parser branching, but as an implementor, I don't see anything that's going to be beneficial for writing an implementation.

The document should normatively define ABNF, and which is used for generating non-normative graphics and the pegjs, and other cool things that can fit into an ABNF pipeline.

The repository includes a pegjs file, if this is normative or not is unclear. This should also be made clear.

As identified in #4, railroad diagrams could be removed (and no grammar provided) where no regular grammar production is possible.

rubys commented 9 years ago

In general, these rules are NOT regular. In general, attempting to use an ABNF pipeline is not going to produce correct results. As an implementor, these railroad diagrams, combined with the subsequent prose, contain the necessary information for a correct implementation.

FYI: bikeshed issue 281 is intended to provide a textual equivalent (likely, but not guaranteed, to be ABNF) for each railroad diagram.

For the sake of discussion, the following is the closest equivalent ABNF grammar, as produced by http://bottlecaps.de/convert/:

/* converted on Sun Nov 30, 2014, 21:32 (UTC-05) by pegjs-to-w3c v0.33.722 which is Copyright (c) 2011-2013 by Gunther Rademacher <grd@gmx.net> */

Url      ::= ( FileUrl | NonRelativeUrl | RelativeUrl ) ( '?' Query )? ( '#' Fragment )?
NonRelativeUrl
         ::= Scheme ':' SchemeData
FileLikeScheme
         ::= 'file'
NonFileRelativeScheme
         ::= 'ftp'
           | 'gopher'
           | 'https'
           | 'http'
           | 'wss'
           | 'ws'
Authority
         ::= ( User ( ':' Password )? '@' )? Host ( ':' Port )?
IPv6Addr ::= '[' ( ( H16 ':' )* ':' )? ( H16 ':' )* ( H16 | LS32 ) ']'
IPv4Addr ::= ( Number '.' ( Number '.' ( Number '.' )? )? )? Number
LS32     ::= DecimalByte '.' DecimalByte '.' DecimalByte '.' DecimalByte

<?TOKENS?>

FileUrl  ::= FileLikeScheme ':' [a-zA-Z] [:|] [\/]? Path
           | '/'* [a-zA-Z] '|' '/'? Path
           | FileLikeScheme ':' ( '/' '/' Host )? '/'* Path
RelativeUrl
         ::= ( NonFileRelativeScheme ':' )? [/\] [/\] Authority ( [/\] Path )?
           | NonFileRelativeScheme ':' [\/]? Authority ( [/\] Path )?
           | NonFileRelativeScheme ':' Path
           | Path
Scheme   ::= [a-zA-Z] [-a-zA-Z+.]*
User     ::= [^/\?#@:]*
Password ::= [^/\?#@]*
Host     ::= IPv6Addr
           | IPv4Addr
           | [^:/\?#]*
Number   ::= '0' ( 'x' | 'X' ) [0-9a-fA-F]+
           | '0' [0-7]+
           | [0-9]+
H16      ::= [0-9A-Fa-f] [0-9A-Fa-f]? [0-9A-Fa-f]? [0-9A-Fa-f]?
DecimalByte
         ::= [0-2]? [0-9]? [0-9]
Port     ::= [^/\?#]*
Path     ::= ( [^/\?#]* [/\] )* [^/\?#]*
SchemeData
         ::= [^?#]*
Query    ::= [^#]*
Fragment ::= .*
setProtocol
         ::= Scheme ':'? .*
setUsername
         ::= .*
setPassword
         ::= .*
setHost  ::= Host ( ':' Port )? ( [/\?#]? .* )?
setHostname
         ::= Host [:/\?#]? .*
setPort  ::= Port [/\?#]? .*
setPathname
         ::= [/\]? Path [/\?#]? .*
setSearch
         ::= '?'? .*
setHash  ::= '#'? .*

Something like the above could be added as a non-normative appendix, but it would need to be preceded with a strongly worded note saying that this input is not suitable for use in an ABNF pipeline.

awwright commented 9 years ago

Noted that the rules are not regular in general. But the overwhelming majority of them are, and stand to benefit from a more accessible definition.

It appears to be necessarily true that the railroad diagrams are always reducible to ABNF (correct?), so in each place it is used, either (1) the usage of a railroad diagram is inappropriate, or (2) an ABNF could be used in place of the railroad diagram, with a non-normative diagram attached, and correct usage would be specified with the subsequent prose.

rubys commented 9 years ago

We agree that a more accessible definition should be provided, and an issue is open on this.

Let's posit for the sake of discussion each railroad diagram could be converted correctly to ABNF, and ABNF could be converted correctly to a railroad diagram. I'm not convinced, and do believe that attempts to use the ABNF provided will cause problems, but for the sake of discussion, lets ignore that.

If both forms are equivalent, how do you come to the conclusion that one is not beneficial and furthermore is inappropriate; and the other (equivalent!) representation is beneficial and appropriate?

awwright commented 9 years ago

If both forms are equivalent except in their accessibility, I think it stands to reason that the more accessible version should be normative. More pragmatically, having to eyeball a diagram is prone to error and wastes implementer time. Mere users of the document, however, can suffice with (and would even prefer) a non-normative diagram generated from the normative text.

I'm eager to see a resolution for the upstream issue, but I'm curious as to what problems might be caused for this particular document. Note that the railroad diagram is just a directed graph, and ABNF can express sets defined in terms of this, therefore it seems possible to do in general. That seems like a reasonable proof, and I can't seem to find any exceptions, at least not specific to this document.

rubys commented 9 years ago

Further reading: Implementing parsers from parsing expression grammars:

Any parsing expression grammar can be converted directly into a recursive descent parser.[3] Due to the unlimited lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time performance in the worst case.

I will also note that the multiple PEG grammar rules may match a given input, and in such cases, the first match would be used. I'm not an expert, but I don't believe this matches ABNF. As such, it would be actively misleading to present ABNF as the normative mechanism, and furthermore, it would be unproductive to lead people to believe that such grammar could be used by an ABNF pipeline.

If you can build an implementation using an ABNF grammar that passes these tests, then we can explore together incorporating the resulting ABNF into the specification, possibly as a wholesale replacement for the railroad diagrams. I've even provided you with a start on what the ABNF might look like. Until then, the assertion that ABNF is the only beneficial and appropriate mechanism for presenting this data is simply unproven.

As to accessibility, the existing railroad diagrams will be made accessible, most likely via a SVG title for each diagram.

awwright commented 9 years ago

Like the railroad diagram, ABNF is merely defining a set of strings (or rather, octets), so whether there's a second match or not is insignificant.

In practice, of course, applications will want to switch behavior depending on which production was matched, this behavior defined in prose. ABNF does define alternatives as ordered, presumably for this end. PEG adds this ability natively due to the built-in logic, railroad diagrams do not. Unless I am mistaken? It does not appear the railroad diagrams are defining any sort of evaluation order.

E.g. RFC3986's URI-reference production can match either a URI or a relative-ref, so it defines that when resolving a URI-reference production to a URI, it tries to match the URI production first, or else the relative-ref production is matched and resolved against a base URI.

Likewise, it notes that it's impossible to construct a path-absolute URI Reference to e.g. http://example.com//index.html (as //index.html) because the "//" authority path-abempty would be matched first, and the resolved URI would instead be http://index.html/

I would be happy to try my hand at generating railroad diagrams from ABNF. Not tonight, though :)

domenic commented 9 years ago

Noted that the rules are not regular in general. But the overwhelming majority of them are, and stand to benefit from a more accessible definition.

This seems patently untrue; almost every rule has additional processing steps.

awwright commented 9 years ago

You'll need to provide an example. The overwhelming majority of patterns described in the document are describable by ABNF, including all the railroad diagrams, and the vast majority of english descriptions of patterns.

A URL must be written as either a relative URL or an absolute URL, optionally followed by "#" and a fragment.

That's a lot of words for:

 URL = ( relative-URL / absolute-URL ) [ "#" fragment ]
domenic commented 9 years ago

For example, in https://specs.webplatform.org/url/webspecs/develop/#url,

If result.scheme has a default port, and if result.port is equal to that default, then delete the port property from result.

How do you express that with ABNF?

https://specs.webplatform.org/url/webspecs/develop/#authority is also full of non-ABNF things

awwright commented 9 years ago

That's not merely matching text, but defining a mapping from an input to an output, which is beyond the scope. It has the same answer as "How do you express it with a railroad diagram?" (You don't.)

The intent of the issue is to replace text-matching/pattern-matching, so the passage you cite would likely remain untouched.

domenic commented 9 years ago

Ah, I misunderstood. In that case I simply agree with @rubys that an ABNF would be harmful in that it would mislead people into thinking you could do something useful with it, similarly to my issue #4.

awwright commented 9 years ago

Agreed with #4 that some patterns should be removed if it's confusing. But that doesn't mean pushing everything into English prose. If we can cut down on the amount of English, that would reduce implementation mistakes and human lines of code. (For instance, if not defined, I'll be writing RegExp or ABNF myself. That can't ever be good, yet that's what we're doing with PEG.)

rubys commented 9 years ago

Acubed: please do investigate. Seriously.

I've provided you with syntactically correct ABNF. I have significant doubts that that grammar, when evaluated as an ABNF grammar would produce the correct results. Meanwhile, I've defined how the railroad diagrams are to be interpreted (and will update that definition based on bug reports), and I have published test results showing that these diagrams when interpreted as I have defined them produce the expected results.

Until we have demonstrably correct ABNF, discussions as to which format is the most readable format are moot.