Provide a grammar for the URL parser

alercah commented 4 years ago

As an occasional standards-user, the lack of a succinct expression of the grammar for valid URL strings is rather frustrating. It makes it rather difficult to follow what's going on and, in particular, to work out whether a given thing is a valid URL. A grammar in EBNF or a similar form would be greatly appreciated and make this spec significantly easier to understand.

domenic commented 4 years ago

Just to be sure, you're saying you prefer something like

url-query-string = url-unit{0,}

(or whatever grammar format) to the spec's current

A URL-query string must be zero or more URL units.

?

masinter commented 4 years ago

I think this is #24 #416

alercah commented 4 years ago

Apologies for the slow reply.

I think that @masinter is right, and that my concern matches the ones discussed there. I skimmed those threads and a few statements concerned me, such as the assertion that a full Turing machine is required to parse URLs: this would very much surprise me; my instinct on reading the grammar is that, once you separate out the different paths for file, special schemes, and non-special schemes in relative URLs, the result is almost certainly context-free. It might even be regular. The fact that the given algorithm mostly does a single pass is a strong indication that complex parsing is not required.

I discussed with a colleague of mine, @tabatkins, and they said that the CSS syntax parser was much improved when it was rewritten from a state-machine based parser into a recursive descent parser. Doing this would effectively require writing the URL grammar out as a context-free grammar, which would make providing a BNF-like specification, even if it's only informative, very easy.

Separately, but not entirely, splitting out the parsing from the semantic functions (checking some validity rules, creating the result URL when parsing a relative URL string) would likely improve the readability of the spec and the simplicity of implementing it. I think this might be better suited for a separate thread, though, as there are some other thoughts I have in this vein as well.

alwinb commented 4 years ago

This might be a more complicated problem than you think (@alercah). I have tried several times, but the scheme dependent behaviour causes a lot of duplicate rules, so you end up with a grammar that is not very concise nor easy to read. And there is a tricky problem with repeated slashes before the host, the handling of which is base URL dependent.

I have some notes on it here. (I eventually went with a hybrid approach of a couple of very simple grammars and some logic rules in between). This ties into a model of URLs that I describe here.

What's the status of this? It really does work. I developed the theory when I tried to write a library that supports relative URLs. I am quite confident that it matches the standard (but not everything is described in the notes); as the library now passes all of the parsing tests.

alercah commented 4 years ago

It's not supported by vanilla BNF, but I would personally be quite satisfied with a grammar taking parameterized rules like you have there. Many modern parser generators can handle them; those that cannot, it is relatively easy to split out the (small number of) parameters here.

On Wed, 14 Oct 2020 at 11:33, Alwin Blok notifications@github.com wrote:

This might be a more complicated problem than you think. I have tried several times, but the scheme dependent behaviour causes a lot of duplicate rules, so you end up with a grammar that is not very concise nor easy to read. And there is a tricky problem with repeated slashes before the host, the handling of which is base URL dependent.

I have some notes on it here https://github.com/alwinb/reurl/blob/master/doc/grammar.md. (I eventually went with a hybrid approach of a couple of very simple grammars and some logic rules in between). This ties into a model of URLs that I describe here https://github.com/alwinb/reurl/blob/master/doc/theory.md.

What's the status of this? It really does work. I developed the theory when I tried to write a library that supports relative URLs. I am quite confident that it matches the standard (but not everything is described in the notes); as the library now passes all of the parsing tests.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/whatwg/url/issues/479#issuecomment-708482325, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE7AOVLPC5M2RVBAW3GK3UDSKXAFHANCNFSM4MQOLQEA .

masinter commented 4 years ago

it would be useful to start with the BNF of RFC 3986 and make changes as necessary. at least to explain the differences and exceptions.

alwinb commented 4 years ago

Interesting, I did not know that there were parser generators that support parameterised rules.

I did consider a more formal presentation with subscripted rules, but then I backed off because I thought it would be less accessible. It makes me think of higher order grammars, and I think that's too heavy. I guess in this case it could result in something quite readable too though.

As for the comparison with RFC 3986, it would be great if this can help to point out the differences. I have not looked into that much, but the good news is that it might not be that different, after all. I couldn't start with the RFC though because I was specifically aiming for the WHATWG standard. That was motivated by an assumption that this is the common URL standard, in part because It does mention obsoleting RFC 3986 and RFC 3987 as a goal.

alwinb commented 4 years ago

Back to the issue, the question is how this could flow back to the WHATWG standard. And I am not really sure how that would work yet. The parser algorithm seems to be the heart of the standard, and I think there is a lot of work behind that. There is of course the section on URL Writing which does look like a grammar in prose style.

To be clear, what I tried to do, and what I suspect people in this thread (and alike) are after is not to give a grammar for valid URL strings (like in the URL writing section), but to give one that describes the language of URLs that is implicitly defined by the parser algorithm – and in such a way that it also describes their internal structure. Then the grammar contains all the information that you need for building a parser. This is indeed possible, but it is a large change from the standard as it is now.

sideshowbarker commented 4 years ago

There is of course the section on URL Writing which does look like a grammar in prose style.

To be clear, what I tried to do, and what I suspect people in this thread (and alike) are after is not to give a grammar for valid URL strings (like in the URL writing section)

I think in fact that there are people who’ve been involved with discussion of this who have actually been hoping for a formal grammar for valid URL strings — in some kind of familiar formalism rather than in the prose style the spec uses. (And to be clear, I’m not personally one of the people who wants that — but from following past discussions around this, I can say I’m certain that’s what at least some people have been asking for.)

but to give one that describes the language of URLs that is implicitly defined by the parser algorithm

I know that’s what some people want but I think as pointed out in https://github.com/whatwg/url/issues/479#issuecomment-708482325 (and #24 and other places) there are some serious challenges in attempting to write such a grammar.

And what I think has also been pointed out in #24 and elsewhere, for anybody who wants that, there’s nothing that prevents them from attempting to write up such a grammar themselves, based on the spec algorithms — but short of that happening, nobody else involved with the development of the spec is volunteering to try to write it up.

alwinb commented 4 years ago

The technical issues are mostly solved. I'm willing to help, and I'm looking for some feedback about how to get started.

I cannot just write a separate section because in one way or another it'll compete with the algorithm for normativity, (ao). That's a big problem and I think it is the main reason to resist this.

It also requires specifying the parse tree and some operations on them. That could use the (internal) URL records, but it will require some changes to them.

My main concern is that these things together will trigger too much resistance, too many changes, and that then the effort will fail.

What I can do is try to sketch out some approaches that could help to prevent that. I'll need some time to figure that out. I'm not sure what else I can do to get this going at the moment. Feedback would be appreciated.

sideshowbarker commented 4 years ago

I cannot just write a separate section because in one way or another it'll compete with the algorithm for normativity, (ao). That's a big problem and I think it is the main reason to resist this.

Yes — and in other cases in WHATWG specs where there’s been discussion about formalisms for similar cases (e.g., some things in the HTML spec), that rationale (people will end up treating the extra formalism as normative) has been a strong argument against including such formalisms in the specs.

For this case, a grammar could be maintained in a separate (non-WHATWG) repo, and published separately — and then the spec could possibly (non-normatively) link to it (not strictly necessary, but just to help provide awareness it exists).

domenic commented 4 years ago

Agreed with @sideshowbarker, generally. If people want to work on personal projects that provide alternative URL parser formalisms, that's great, and I'm glad we've worked on a test suite to help. As seen from this thread, some folks might appreciate some alternatives more than they appreciate the spec, and so it could be helpful to such individuals. But the spec is good as-is.

alwinb commented 4 years ago

There are issues with it that cause further fragmentation right now. I have to say I'm disappointed with this response. I'm trying to help out and solve issues. Not just this one but also #531, #354 amongst others, which cannot be done without a compositional approach. If you do not address that, people come up with ad hoc solutions, creating new corner cases, leading to renewed fragmentation. You can already see this happening in some of the issues. It is also not true that it cannot be done, because I already did it, once for my library and a couple of weeks ago I did a fork of jsdom/whatwg-url over a weekend that uses a modular parser/ resolver based on my notes, has everything in place to start supporting relative URLs as well, and passes all the tests. I didn't post about it, because the changes are too large. Clearly it would not work out. I'm trying to take these concerns into account and work with them. Disregarding that with 'things are fine', I think is a shame.

alercah commented 4 years ago

While I unfortunately do not have the time to contribute to any work on this at the moment, I have a few thoughts.

First, I agree that care should be taken to avoid confusion about normativity. There definitely should be only one normative spec. If a grammar were to go into the spec itself alongside the algorithm, with the algorithm remaining normative, great care would need to be taken to ensure that they remain accurate as disagreement between the two breeds problems.
Second, I believe that you already basically have not one, but two alternate semi-normative specifications anyway: the section on writing URLs, which specifies a sort of a grammar on how to write them out, and the test suite. I don't believe that anyone can state with certainty that the section on writing URLs actually matches the parser, and I think this comment by one of the major contributors to the spec goes to show how the test suite is treated basically as normatively as the spec, if not more.
Third, I am convinced that trying to define a grammar, normative or non-normative, for the spec as it is, is fundamentally a fool's errand.
But I am not of the opinion that this means that it shouldn't be done. I believe that the current parser should be ripped out entirely, or at least moved to an auxiliary specification on how browsers should implement an actual specification.

To elaborate a bit, I very much disagree with the claim that "the spec is good as is". The spec definitely provides an unambiguous specification with enough information to determine whether or not an implementation meets the specification. This is enough to meet the bare minimum requirements and be an adequate technical standard. But it has a number of flaws that make it difficult to use in practice:

It conflates domains. This URL specification is primarily geared towards the web and web standards, as is indicated by a lot of the implicit assumptions it makes (see also #535). But the use of URLs, and RFC 3986, extends far beyond the web and the spec does not make any meaningful attempt to address uses outside the web. Recommendations on displaying URLs to users are explicitly applicable only to browsers. It defines an API applicable only to the web, with no discussion of API design for other environments. It canonically defines file as the default scheme when no scheme is specified, when most clients would likely prefer to make that decision themselves.
The mere fact that the spec is a living standard is not suitable for use in many application domains. It may be acceptable for the web, perhaps, but there are other interchange systems that need a more reliable mechanism.
It contains almost no background or discussion. It contains only a section listing the goals of the document and three sparse paragraphs on security considerations. It does not explain the purpose of a URL or the human meaning of its various components. It explains almost none of its decisions, such as why special schemes are special or why particular different API setters behave the way they do, or why special schemes get a special, elevated place in the spec to have their scheme-specific parsing requirements incorporated into it.
It is poorly organized. For instance, it discusses security considerations in sections 4.8 and 1.3 and does not mention this in section 2.
Most relevantly to the original topic here, it is nearly impossible for a human to reason about whether or not a URL is valid without manually executing the algorithm. It is incredibly opaque. There is no benefit to this. I defer to @sjamaan's excellent comment. I find the suggestion that section 4.3 provides a useful "overview" of the grammar to be ridiculous. It doesn't. It's just as opaque as the rest of the document.
As an additional point, the opacity of the spec makes it nearly impossible to reason about whether a given behaviour is intentional or a bug. The spec is defined by the implementation in pseudocode. Even understanding the spec's behaviour given an input, much less deciding whether or not it is correct, effectively requires debugging the specification.
There is no abstraction of related concepts, and there is bad mixing of technical layers between semantics and syntax. Semantic errors are returned during parsing, rather than during a separate step on the parsed values.

It is worth noting that this specification explicitly intends to obsolete RFC 3986. RFC 3986 is a confusing mix of normative and informative text, and a difficult specification to apply and use. Yet this specification is distant from being able to obsolete it because it is targeted entirely at one application domain

In conclusion, this spec is a PHP Hammer. It is not "good". It is barely adequate in the one domain it chooses to support, and abysmal in any other domain.

If the direction of this standard can't reasonably be changed (assuming there are people willing to put in the effort), and in particular if WhatWG is not interested in addressing other domains in this specification. I would be fully supportive of an effort, likely through IETF's RFC process, to design a specification which actually does replace RFC 3986, and to have the WhatWG spec recognized only as the web standard on the implementation of that domain-agnostic URL specification. I will probably direct any energy I do find myself with to address this spec to that project rather than this one.

sideshowbarker commented 4 years ago

Second, I believe that you already basically have not one, but two alternate semi-normative specifications anyway: the section on writing URLs, which specifies a sort of a grammar on how to write them out

To be clear, there’s nothing semi-normative about that https://url.spec.whatwg.org/#url-writing section. It’s normative.

and the test suite.

And to be clear about that: The test suite is not normative.

I don't believe that anyone can state with certainty that the section on writing URLs actually matches the parser

The section on writing URLs doesn’t claim it matches the parser. Specifically: There are known URL cases that the writing-URLs section defines as non-conforming — as far as documents/authors being prohibited for using them— but which have normative requirements that parsers must follow if documents/authors use them anyway.

and I think this comment by one of the major contributors to the spec goes to show how the test suite is treated basically as normatively as the spec, if not more.

While some people may treat the test suite as authoritative for some purposes, it’s not normative. In the URL spec and other WHATWG specs, normative is a term of art used consistently with an exact unambiguous meaning: it applies only to the spec, and specifically only to the spec language that states actual requirements (e.g., using RFC 2119 must, must not, etc., wording).

The test suite doesn’t state requirements; instead it tests the normative requirements in the spec. And if the test suite were to test something which the spec doesn’t explicitly require, then the test suite would be out of conformance with the spec.

Most relevantly to the original topic here, it is nearly impossible for a human to reason about whether or not a URL is valid without manually executing the algorithm.

The algorithm doesn’t define whether a URL is valid or not; instead the algorithm defines how a URL must be processed, whether or not the https://url.spec.whatwg.org/#url-writing section defines that URL as valid/conforming.

sideshowbarker commented 4 years ago

Note also that the URL spec has multiple conformance classes for which it states normative requirements; its algorithms state one set of requirements for parsers as a conformance class, and separately, the https://url.spec.whatwg.org/#url-writing section states a different set of requirements for documents/authors as a conformance class.

alercah commented 4 years ago

I'm well aware that the test suite is not normative, and that the writing spec is normative, and of the use of "normative" as a term of art. But you said:

Yes — and in other cases in WHATWG specs where there’s been discussion about formalisms for similar cases (e.g., some things in the HTML spec), that rationale (people will end up treating the extra formalism as normative) has been a strong argument against including such formalisms in the specs.

You claimed that people treating the extra formalism as normative is an argument against the inclusion, not that it would create two potentially-contradictory normative texts.

By the same argument, you should remove the URL writing spec, because it risks being treated as normative, and consider retiring the test suite as well because people treat it as normative because the spec itself is incomprehensible.

I don't think that you should remove either of them. I think you should make the spec comprehensible so that people stop being tempted to treat something else as normative.

The section on writing URLs doesn’t claim it matches the parser.

I agree that it does not claim to produce invalid URLs. It does, however, make a claim that the operation of serialization is reversible by the parser:

The URL serializer takes a URL and returns an ASCII string. (If that string is then parsed, the result will equal the URL that was serialized.)

Admittedly, this claim is rather suspect because it then provides many examples of where that is not true. I suspect it is missing some qualifiers, such as that the serialization must succeed and the parsing must be done with no base URL and no encoding override.

Even with those qualifiers added, I challenge you to produce a formal proof that serialization and parsing produces an equal URL.

alwinb commented 4 years ago

Thank you @alercah. I feel validated by the statement that I have been running a fools errand. It is nice that someone understands the issues and the amount of work that it involves.

The only reason I pushed through was because I had made a commitment to myself that I would finish this project.

johnwcowan commented 4 years ago

RFC 3986 is probably what you want.

alwinb commented 4 years ago

No, I want an end to the situation where an essential building block of the internet has two incompatible specifications that cannot be unified because the one that describes the behaviour of web browsers does so by a long stretch of convoluted pseudocode to describe a monolith function that mixes parsing with normalisation, resolution, percent encoding and updates to URL components. Indeed, an update to RFC 3986 to include browser behaviour would be really, really great. Unfortunately that requires reverse engineering this standard.

masinter commented 4 years ago

I want an end to the situation where an essential building block of the internet has two incompatible specifications that cannot be unified

I tried over many years to resolve this issue. @sideshowbarker @rubys @royfielding @duerst can attest, See https://tools.ietf.org/html/draft-ruby-url-problem-01 from 2015

alwinb commented 4 years ago

Work has started. This is going to happen. Stay tuned.

alwinb commented 4 years ago

There is a GitHub project page for a rephrased specification here. It can be viewed online here.

Whilst still incomplete, it is coming along quite nicely. The key section on Reference Resolution is complete. The formal grammars are nearly complete. There is also a reference implementation of the specification here.

It will not be hard to add a normative section on browser behaviour to e.g. RFC 3986/ RFC 3987 once this is finished. The differences are primarily around the character sets and the multiples of slashes before the authority. The latter is taken care of by the forced-resolution as described in that section Reference Resolution.

This also means that it will be possible to add a section to the WHATWG Standard that accurately describes the differences with the RFCs.

alwinb commented 3 years ago

Following up on this:

This also means that it will be possible to add a section to the WHATWG Standard that accurately describes the differences with the RFCs.

I have done more research, esp. around the character sets, making some tools to compute the differences. These are my findings. I will follow up with a post about other, minor grammar changes and reference resolution.

The differences will be very small, after all is said and done. Which is great!

Character Sets

IRI vs WHATWG URL

The codepoints allowed in the components of valid WHATWG URLs are almost the same as in RFC3987 IRIs. There is only one difference:

WHATWG URLs allow more non-ASCII unicode code points in components.

Specifically, the WHATWG Standard allows the additional codepoints:

The Private Use Areas: { u+E000-u+F8FF, u+F0000-u+FFFFD, u+100000-u+10FFFD }.
Specials, minus the non-characters: { u+FFF0-u+FFFD }
Tags and variation selectors, specifically, { u+E0000-u+E0FFF }.

Specials are allowed in the query part of an IRI, not in the other components though.

IRI vs loose-WHATWG URL

Let me call any input that the 'basic url parser' accepts as a single argument, a 'loose-WHATWG URL'.

Note: The IRI grammar does not split the userinfo into a username and password, but RFC3986 (URI) suggests in 3.2.1. that the first : separates the username from the password. So I assume this in what follows. Note though that valid WHATWG URLs do not allow username and password components at all.

To go from IRIs to loose WHATWG URLs, allow any non-ASCII unicode code point in components, and a number of additional ASCII characters as well. Let's define iinvalid:

iinvalid := { u+0-u+1F, `,",<,>,[,],^, <code>`</code>,{,|,}`, u+7F }

Then, for the components:

username: add iinvalid and @ (but remove :).
password: add iinvalid and @.
opaque-host: add a subset of iinvalid: { u+1-u+8, u+B-u+C, u+E-u+1F, ", `, {, }, u+7F }
path component: Add iinvalid.
query: add iinvalid.
fragment: add iinvalid and #.
For non-special loose WHATWG URLs also add \ to all the above except for opaque-host.

The grammar would have to be modified to allow invalid percent escape sequences: a single % followed by zero or one hexdigits, (but not two).

Note that the WHATWG parser removes tabs and newlines { u+9, u+A, u+D } in a preprocessing pass, so you may choose to exclude those from the iinvalid set. Preprocessing also removes leading and trailing sequences of { u+0-u+20 } (aka c0-space), but it's not a good idea to try and express that in the grammar.

masinter commented 3 years ago

I've suggested a BOF session at IETF 111, which will be held online, to consider what changes to IETF specs would, in conjunction with WHATWG specs, would resolve this issue. A BOF is not a working group, but rather a precursor. to evaluate whether there is enough energy to start one. IETF attendance fees can be waved. https://mailarchive.ietf.org/arch/msg/dispatch/i3_t-KjapMhFPCIoQe1N47buZ5M/

alwinb commented 3 years ago

In case a new IETF effort does get started,

I just want to state that I hope, and actually believe that a new/ updated IETF specification and the WHATWG URL standard could complement each other quite well. It will require work and there will be problems, but it is possible and worthwhile.

masinter commented 3 years ago

@alwinb Nothing in IETF will happen unless you show up with friends willing to actually do the work.

alwinb commented 3 years ago

So here's some work to be done.

[ ] Decide if an addendum is enough, or if RFC3986/39876 should be merged (the latter has my preference)
[x] Decide if the full WHATWG parsing/resolution behaviour should be included, or if it is enough to provide the elementary operations that can then be recombined in the WHATWG standard to exactly reproduce their current behaviour (latter one has my preference, then the standards can really be complementary!)
[x] Decide how to include the loose grammar in such a document (my preference: parameterise the character sets)
[ ] Rewrite my 'force' operation into the RFC style and maybe refactor the merge operations from RFC3986 a little, or switch to my model of sequences more whole heartedly.
[ ] Amend or parameterise the 'path merge' to support the WHATWG percent-encoded dotted segments.
[ ] A remaining technical issue: solve #574, and figure out how to incorporate that into the RFC grammar
[ ] Decide what to do with the numbers in the ip-addresses of the loose grammar, esp. how to express their allowed range (ie. on the grammatical level as in RFC3986 or on a semantic level)
[ ] Preferably, find implementations of the existing RFCs, work with them to implement the additions and have them test agains the wpt test suite, to corroborate that the additions can be combined to express the WHATWG behaviour
[ ] Expand the wpt test suite to include validity tests (!!)
[ ] Write about the encoding-normal form, parameterise it by component-dependent character sets, so that the percentEncodeSets of the WHATWG standard can be plugged into the comparison ladder nicely.
[ ] For the WHATWG standard: decide if a precomposed version of the 'basic-url-parser' should be kept or if it should be split up. It may be possible to automatically generate a precomposed version from an implementation of the elementary operations, and to also automatically generate the pseudocode from that.

Let's get started!

masinter commented 3 years ago

I responded on the "dispatch" list: what's the minimum amount of work that will improve the lack of clarity of what spec is what? (MNot's suggestion). Second, what is the minimum to resolve the differences in normative specifications? Once the specs are aligned normatively, you can do everything else. Step 0: host a BOF at IETF 111 with stakeholders. (Get people to show up and agree to do work.)

alwinb commented 3 years ago

@masinter Thank you. I think that is a good strategy, but with an aside. It is dangerous to apply the IETF level of accuracy and exactness to this too soon. Rather it has been tearing apart, digesting, recomposing/ refactoring what the WHATWG has produced, and now trying to relate it to what was there before.

what is the minimum to resolve the differences in normative specifications? Once the specs are aligned normatively, you can do everything else.

I got carried away, but some of the things I mentioned do need to be done, otherwise you cannot make that comparison. Or perhaps, those items are my answer to that question.

Have you studied my reverse specification? Have you checked it against the WHATWG standard and your knowledge of the RFCs? Do you have comments or ideas? Doing so should enable you to answer this question as well.

Step 0: host a BOF at IETF 111 with stakeholders. (Get people to show up and agree to do work.)

I'm a bit intimidated, but, it sounds good.

Oh except the work thing. I was taken aback, because I've done so much work on this already, and still, and I am kind of tired. Also, I find the political situation unpleasant. I don't want to pick sides. I just want to solve the situation.

masinter commented 3 years ago

Given the response from IETF Dispatch (mainly cautions) I'd suggest trying to do the minimum you can do without needing much help from anyone else. Aim for an Informational RFC (not standards track) which states what you believe the current status. Include your grammar as an appendix. If anyone wants something more formal and normative, they can speak up, but this would at least lead to acknowledging the current status.

alwinb commented 3 years ago

Thank you. It would be a step, at least.

masinter commented 3 years ago

To comment on this list:

[ ] Decide if an addendum is enough, or if RFC3986/39876 should be merged (the latter has my preference) You don't need to decide this in a "rough consensus" way: people will disagree what is or isn't "enough".

Do the minimum; if that isn't enough then you can do more.

[ ] Decide if the full WHATWG parsing/resolution behaviour should be included, or if it is enough to provide the elementary operations that can then be recombined in the WHATWG standard to exactly reproduce their current behaviour (latter one has my preference, then the standards can really be complementary!)

Again, start with the minimum.

[ ] Decide how to include the loose grammar in such a document (my preference: parameterise the character sets)

what about leaving out the "loose" grammar -- if that's what people want, they should look at WHATWG's URL.

Don't understand these:

[ ] Rewrite my 'force' operation into the RFC style and maybe refactor the merge operations from RFC3986 a little, or switch to my model of sequences more whole heartedly.
[ ] Amend or parameterise the 'path merge' to support the WHATWG percent-encoded dotted segments.
[ ] A remaining technical issue: solve #574, and figure out how to incorporate that into the RFC grammar
[ ] Decide what to do with the numbers in the ip-addresses of the loose grammar, esp. how to express their allowed range (ie. on the grammatical level as in RFC3986 or on a semantic level)

Is this necessary if there isn't a "loose grammar"?

[ ] Preferably, find implementations of the existing RFCs, work with them to implement the additions and have them test agains the wpt test suite, to corroborate that the additions can be combined to express the WHATWG behaviour
[ ] Expand the wpt test suite to include validity tests (!!)

These seem more useful than writing more specs

[ ] Write about the encoding-normal form, parameterise it by component-dependent character sets, so that the percentEncodeSets of the WHATWG standard can be plugged into the comparison ladder nicely.

To what end? For what audience? Going forward, shouldn't we aim toward UTF-8?

[ ] For the WHATWG standard: decide if a precomposed version of the 'basic-url-parser' should be kept or if it should be split up. It may be possible to automatically generate a precomposed version from an implementation of the elementary operations, and to also automatically generate the pseudocode from that.

What's the minimum? I wouldn't count on a sudden welling of support.

Let's get started!

I hope you take my comments as constructive.

alercah commented 3 years ago

what about leaving out the "loose" grammar -- if that's what people want, they should look at WHATWG's URL.

WHATWG does not provide a grammar, it provides a specification of an algorithm. That's what started this conversation in the first place.

alwinb commented 3 years ago

Yeah, the bare minimum would be a grammar for valid WHATWG URLs. But that's very simple, and not very useful. The main goal of the WHATWG is to specify error recovery. My goal is to rephrase that, to reconcile it with the RFCs.

So. It needs two grammars, a rewritten chapter on Reference Resolution including the 'force' op, and...

Write about the encoding-normal form, parameterise it by component-dependent character sets, so that the percentEncodeSets of the WHATWG standard can be plugged into the comparison ladder nicely. To what end? For what audience? Going forward, shouldn't we aim toward UTF-8?

The WHATWG standard specifies different character sets that must be encoded per component(-type). So this is about a section analogous to the RFC3986 Percent Encoding Normalisation. The WHATWG behaviour is much more complex.

I hope you take my comments as constructive.

I do, and I need guidance, especially when it comes to process and organisation. Thank you.

karwa commented 3 years ago

IMO, it's premature to talk about building more formal, "more final" specifications based on this standard.

Currently, there is only 1 major browser that passes the test suite (Safari), and that's a very, very recent development. Chrome has by far the largest market share, but only passes ~70% of the tests. That number doesn't really capture the gravity of the situation, though; really, Chrome doesn't behave anything like this standard suggests it should.

To be clear: this project, to unify and document how browsers actually parse and manipulate URLs, is still very much ongoing. The IETF does not help its credibility by rushing to copy this standard and slap its own name on it.

Also, after reading the 2012 email thread, it seems to me that the IETF is a toxic organisation. The WHATWG process is far more open, more welcoming, and overall much better-natured than what I see at the IETF. For that reason, while I'm happy to participate in the WHATWG process, I won't be participating in any future IETF effort.

alwinb commented 3 years ago

I want to make this very clear.

The IETF does not help its credibility by rushing to copy this standard and slap its own name on it.

For me personally this is in no way whatsoever about slapping anyones name onto anything.

And it is not a copy, thank you very much. An immense amount of work went into this and it goes quite beyond what the WHATWG standard offers.

IMO, it's premature to talk about building more formal, "more final" specifications based on this standard.

Currently, there is only 1 major browser that passes the test suite (Safari), and that's a very, very recent development. Chrome has by far the largest market share, but only passes ~70% of the tests. That number doesn't really capture the gravity of the situation, though; really, Chrome doesn't behave anything like this standard suggests it should.

It won't be more "final". In general though, this is very good input and @karwa, I also fully respect your position, I really do.

karwa commented 3 years ago

And it is not a copy, thank you very much. An immense amount of work went into this and it goes quite beyond what the WHATWG standard offers.

Oh, I don't mean you personally; my comment was directed at the IETF as an organisation. Their opposition to this standard (and I think it's fair to say there is widespread opposition among its members) seems to be largely based around where it lives and who maintains it. Their primary motivation seems to be to have something with the IETF's name on it.

It won't be more "final"

AFAIK, the IETF doesn't do living standards. Meanwhile, this standard is essentially in its initial testing phase in a single browser. We'll need to see what issues Safari (and other WebKit-based browsers) users encounter, and that will be some sort of validation of this standard, and then wait for adoption by other browsers, see what issues their users encounter, etc. This standard is likely at least a couple of years away from meeting its goal of describing how browsers behave, IMO.

Anyway, that's all I have to say on it. I just wanted to clarify that it wasn't an attack at you personally, @alwinb

masinter commented 3 years ago

@karwa There are many organizations involved in web standards including IETF for HTTP and QUIC, W3C, ISO, Unicode consortium and many others. Each has its own way of managing changes but few "do" living standards. And if the "primary goal" of the WHATWG spec is to describe "how browsers behave", that leaves out the myriad of non-browser applications that process URLs. The concern about "where it lives" and "who maintains it" is a concern about the narrowness of focus to "browsers" alone. I was happy to see recent consideration of the needs of Node.js and curl.

There are very few undocumented overlaps besides URLs. It might be premature to include @alwinb 's grammar in a standards-track RFC but I think Experimental might be appropriate.

alercah commented 3 years ago

There are already multiple standards-track RFCs. Any overlap lays entirely at the hands of WhatWG and shouldn't be an excuse not to update the IETF standards.

aucampia commented 2 years ago

Any updates on this?

alwinb commented 2 years ago

Any updates on this?

@aucampia, it is a solved problem. The differences between whatwg URL and IRI are minor, but they are very subtle and tricky to get right. Feel free to send me a message directly if you are interested.

Changes to the Whatwg standard are unlikely to be made because the differences cannot be harmoniously incorporated without, essentially rewriting it.

Specifically, at its very inception there has been a deliberate decision to ignore some solid advice by David Sheets (this was in 2012) and base the standard on a single precomposed parse-resolve-and-normalise algorithm, rather than more carefully separating the syntactic and semantic concerns.

Part of that decision, and the frustrations around these issues can be seen as an outflow of the cultural schism between the IETF and the W3C at the time and the WHATWG, where the former traditionally have taken a more academic approach with a stronger focus on cs theory and the latter has taken a more pragmatic and post-modern approach which (in the case of html, I believe) was sorely needed to break out of an impasse.

Of course both approaches are valuable and the only way to move on from here is to reintegrate them. In contrast, what is happening now is that the Whatwg is increasingly crippling itself by a distaste for anything and anyone that reminds them of the more formal or theoretic. A formal grammar, of course is one of those things :D

In addition, if you appear to have any such inclination, you are not likely to be well received here, no matter the effort you put in to try and remain constructive.

alercah commented 2 years ago

On reading @karwa's last comment...

Meanwhile, this standard is essentially in its initial testing phase in a single browser. We'll need to see what issues Safari (and other WebKit-based browsers) users encounter, and that will be some sort of validation of this standard, and then wait for adoption by other browsers, see what issues their users encounter, etc. This standard is likely at least a couple of years away from meeting its goal of describing how browsers behave, IMO.

I think this fundamentally shows a misalignment of goals between most of the people asking for a succinct goal and folks who appear to be defending the current WhatWG approach. The WhatWG approach appears to be to attempt to document the behaviour of one browser, write tests, then attempt to converge browsers to the resulting specification. Although the goal is to eventually have a standard that browsers are expected to follow, the quoted paragraph tells me that this document is not yet a standard even in the browser world. It is merely documentation.

In my opinion, the specification should have loftier goals, but I would like to turn the conversation in a different direction and make the argument that the existing document is counterproductive to WhatWG's own goal of converging browser implementations

The complaint at the root of this thread is that the specification is extremely difficult to implement. It is so complicated that, given a disagreement between an attempted implementation and the specification, serious consideration needs to be put into the possibility that the specification has a defect. (And I speak from personal experience, having found such a defect.)

Or to put it more plainly, in the 30% of test cases where Chrome disagrees with Safari, there is no way to tell who is right and who is wrong. Should Chrome change to match Safari or vice versa? A standard should provide an answer to that question. But the current document seems to me incapable of doing that, and I don't see a clear path to getting it to that state. Ideally, a standard for something like this should be able to explain clearly and concisely why each of its edge cases are the way they are. But from the sounds of it, in many cases, that reasoning is merely "because Safari does this" which is exactly the opposite of good design practice.

tl:dr I personally believe that the WhatWG likely stands to benefit, even just in its goal of browser convergence, from adopting Alwin's draft as the basis of the specification going forward. I don't think this needs to be a turf war over territory.

karwa commented 2 years ago

@alwinb

Please don't question the motivations of anybody contributing to this standard. This kind of hostile language ("the Whatwg is increasingly crippling itself by a distaste for anything and anyone that reminds them of the more formal or theoretic") really isn't appropriate.

I mentioned previously that the IETF email thread was shockingly hostile, and it put me off wanting anything to do with that process. Please don't bring the same attitude here.

@alercah

The WhatWG approach appears to be to attempt to document the behaviour of one browser, write tests, then attempt to converge browsers to the resulting specification. Although the goal is to eventually have a standard that browsers are expected to follow, the quoted paragraph tells me that this document is not yet a standard even in the browser world. It is merely documentation.

Or to put it more plainly, in the 30% of test cases where Chrome disagrees with Safari, there is no way to tell who is right and who is wrong. Should Chrome change to match Safari or vice versa?

But from the sounds of it, in many cases, that reasoning is merely "because Safari does this" which is exactly the opposite of good design practice.

To be clear: I don't speak for the WHATWG or the editors.

But I would like to refute the suggestion that this standard just documents WebKit's behaviour; WebKit rewrote its URL implementation to match the standard.

When differences are found between the browsers, generally the editors ask for a survey of what the current behaviours are, and the browser reps agree on which behaviour should be converged upon. Some parts have converged on WebKit's prior behaviour, some parts have converged on Firefox, some on Chrome, etc - it's all openly available, you can check the discussions by checking the relevant PRs. So while each piece has its own precedent, there wasn't one browser that implemented all of the various pieces in one place (until recently).

Even for the parts that Chrome/Firefox haven't implemented yet, in principle they've agreed with that text being part of the standard. That being said, it's a living standard, and perhaps they (or others) will encounter compatibility issues which WebKit users didn't experience - that's why I say we need broader adoption in order to truly claim that this is how URLs work on the web platform, and that writing it in stone this early is likely to be unwise.

ghost commented 2 years ago

@alwinb: I believe your work is splendid, but please stop trying to villainize the WHATWG. I feel like your points would be taken much more carefully if you worded them less aggressively.

As I said in another thread, I feel like people should work together to solve the issues that they deem relevant. There might be disagreement in what issues are deemed relevant (and there is room to converse about that), but you seem to just want to impose what you believe.

Just because people don’t acknowledge the issues you are trying to solve as legitimate, it doesn’t mean they aren’t taking you seriously, it just means they disagree with you. If anything, your defensive (to avoid saying “hostile”) behavior when talking about your work is what is actually making people take you less seriously.

The approaches taken by the WHATWG spec and your spec are vastly different. I believe there is value to both approaches. And at any rate, if they are indeed equivalent, it doesn’t feel compelling for people to wish to change either way.

I really hope you can stop acting that way, so that people (including you) can actually cooperate towards something together.

(Edit: To avoid confusion, I wanted to note that I personally hid this comment myself, as it is indeed off‐topic for the given issue.)

mnot commented 2 years ago

Apparently villianising the IETF is still cool. @karwa could you please provide a link so we know what you're referring to?

alwinb commented 2 years ago

Yes, I have become increasingly frustrated and hurt.

We can talk about it endlessly. Meanwhile nothing is changing and the issue remains open.

It is feasible to combine and update the IETF documents. The changes are tricky but not large. This would be significantly less work than changing the WHATWG standard to include a formal grammar and cover all of what the RFCs provide.

The WHATWG can recognise that effort and publish an improved algorithm in pseudocode, a reference implementation, an API and tests in such a way that they agree. Both documents should openly acknowledge the other one and ensure that they are in sync.

I think one can be pragmatic about them being a “living standard” or not. Pick some versioning scheme and decide on marking it final (or not) later.

I don’t want to be the person to do all that for you. I’m happy with my own document with the little tree and I have other things to do. I’m willing to help with the content itself and give advise if someone else wants to take on a significant amount of work and coordinate the effort. I’d be honoured if it is based directly on my document, but anything can be made to work.

Meanwhile, if nobody else wants to take action, the editors may as well explicitly mark this issue as wontfix and close it.

@mnot there is a link to the mailinglist in one of the comments above.

karwa commented 2 years ago

@mnot There was a link previously to a 2012 email thread when I believe this standard was first created. I'd rather not repeat it because it isn't important, but it seemed to me that Anne received a very unfair amount of pushback, primarily due to the choice of venue (WHATWG vs IETF). It was relatively heated, I would say. Especially after reading Larry Masinter's summary of URL standards history (from 2015) and the myriad IETF initiatives that simply fizzled due to lack of interest, the overall impression I get is of an organisation more interested in complaining and putting its name on things than actually fixing problems. The interest was not so much in solving the problems but rather ensuring that, if anybody does solve it, that solution says "IETF" on the cover.

It may be that things have changed (in fairness, an entire decade has passed), or that it wasn't a fair impression, or perhaps it is only true of certain members. Regardless, it means that when some new IETF effort gets proposed (as it was here), my immediate reaction is "no, thanks". That's what happens when the level of discourse is reduced to pointless bickering - who would choose to get involved with that?

IMO, matters like venue or formal vs algorithmic expression are distractions. We can switch between them at any point in the future, or even offer both at the same time, with no functional differences. Not one computer system in the entire universe would become more/less capable, more/less reliable, or more/less secure -- which means it's not engineering; it's politics.

aucampia commented 2 years ago

@aucampia, it is a solved problem. The differences between whatwg URL and IRI are minor, but they are very subtle and tricky to get right.

To me it is not solved really, it may be solvable, but if I want to figure out what to put in a grammar to describe a URL I need to have a grammar for a URL, and given there is no such grammar it is not solved. Of course this could be mostly my problem and WHATWG can see it outside of their scope, but it is none the less something I want and I'm not sure it is an unreasonable desire. I'm not an academic, I don't even have a degree, I'm just an engineer, but a grammar makes reasoning about things much simpler than the natural language algorithm in this spec.

However I appreciate the explanation.

Meanwhile, if nobody else wants to take action, the editors may as well explicitly mark this issue as wontfix and close it.

I agree with this, I don't think it is helpful to maintain an expectation if there is no interest in meeting it.

IMO, matters like venue or formal vs algorithmic expression are distractions. We can switch between them at any point in the future, or even offer both at the same time, with no functional differences. Not one computer system in the entire universe would become more/less capable, more/less reliable, or more/less secure -- which means it's not engineering; it's politics.

I disagree here. Having a defined grammar does present for options of increasing reliability of computer systems. By my estimation a defined grammar (i.e. defined in ABNF) can be translated into code much easier and with less errors than basic URL parser from this spec which is written in English and that needs a human to execute it.

And just to be clear, I don't care if WHATWG or IETF provides the grammar, I just want a grammar that I know with no ambiguity is the grammar for a WHATWG URL.

alwinb commented 2 years ago

@aucampia there is.

I’ve worked on a formal grammar and added support for relative references in a way that is compatible with the WHATWG URL standard. I put an immense amount of time and effort into that. I did so because of the statement in the standard that it obsoletes the RFCs and I wanted to be thorough and up to date. I didn’t know that there was a conflict with the IETF.

It doesn’t matter but my account is, let’s put it this way, that I inadvertently touched upon a sensitive issue with it, and I experienced the response to my attempts and also to others’ such as @bagder here to be so profoundly off the mark and personally hurtful that I saw no other possibility but to collect my notes and write a new specification, in order to transcend my emotions.

The document is here: URL Specification and it includes a formal grammar. It is not normative, but descriptive of the WHATWG standard. If you find a difference then that is a bug. The only exception is that I’ve already included a solution to an issue about drive letters that is still open here.

I had to name the document so boldly, it hurt me to do so, but I found that the editors of the WHATWG standard really had to be called to attention.

Evidently, doing so doesn’t create an environment of trust, so it’s been very hard for me to work together on issues here since.

FWIW I also read the email thread that @karwa (whose work here I respect a lot) mentions, but I did have a different assessment of it. I found that there was hostility, but also that it was understandable on both sides.

Content wise there is one significant thing that I’d change right now, which is to handle opaque paths in the grammar. I’d say that this is the most significant grammatical difference between the RFC and the WHATWG standard. I’m too tired to get into that now though.

whatwg / url