whatwg / url

URL Standard
https://url.spec.whatwg.org/

It's not immediately clear that "URL syntax" and "URL parser" conflict #118

Closed domenic closed 7 years ago

domenic commented 8 years ago

URL syntax is a model for valid URLs---basically "authoring requirements". The URL parser section allows parsing URLs which do not follow URL syntax.

An easy example is https://////example.com, which is disallowed because the portion after "https:" contradicts the following requirement:

A scheme-relative URL must be "//", followed by a host, optionally followed by ":" and a port, optionally followed by a path-absolute URL.

/cc @bagder via some Twitter confusion this morning.
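
For illustration, a minimal sketch of that split (assuming the URL API defined by this standard, as exposed by browsers and Node.js):

    // Sketch: the input violates URL syntax, but the parser skips the extra
    // slashes instead of returning failure.
    const url = new URL("https://////example.com");
    console.log(url.href); // "https://example.com/"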

duanyao commented 8 years ago

What about this trade-off: specify that single or 3+ slashes are allowed only in HTTP URLs and deprecate them, and wait several years to see if we can forbid them entirely (we are transitioning to HTTPS anyway). Software other than major browsers is free to implement this quirk or not.

SEAPUNK commented 8 years ago

I am for starting off by being fairly permissive with the URL spec to fit in best with the current popular implementations, and then start working on making the URL spec increasingly stricter until it becomes "what it should be". Other than people not updating their broken URLs (akin to ancient Netscape-age websites), is there a reason this isn't possible or reasonable?

domenic commented 8 years ago

No, that's very reasonable, and is in fact the plan outlined in https://github.com/whatwg/url/issues/118#issuecomment-218323801

ghost commented 8 years ago

@magcius You're directly contradicting the spec. The spec claims to "[define] URLs, domains, IP addresses, the application/x-www-form-urlencoded format, and their API." It also claims its goal is to "[align] RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process."

If the WHATWG is to obsolete these RFCs and to provide a replacement definition of URLs, it must consider more than just web browsers (and especially more than just Chrome; I'm not convinced that the Chrome bias is fake). That the WHATWG is specifically about web applications is just one reason the specification should have a much, much narrower scope than it asks for.

Regarding Safari, iOS and Internet Explorer moving to implement what this standard says, this could just as well be turned the other way by not allowing multiple slashes (which is the solution favored by the RFCs, simplicity, and correctness, remember) and expecting Chrome and Firefox to adjust. But of course we expect Safari, iOS and Internet Explorer to adjust instead, as well as all the other programs that parse URLs that don't get mentioned simply because only browsers seem to be cared about. Why is that?

Can't we just admit that browsers are being permissive because this allows them to break less pages and thus gain more market share, and that doing this should be frowned upon and certainly not made into the standard?

I seriously don't think the WHATWG should be defining concepts such as URLs, domains or IP addresses. This specification is only useful for web browsers (even then I'm not too sure about that...), so its scope should be restricted to those.

@SEAPUNK, @domenic That plan is not going to happen, and is therefore not reasonable, because any reason we could have to be permissive right now is only going to become even stronger once a spec says it's okay to be permissive. If making the spec increasingly stricter makes sense, making it stricter right now would make even more sense. If anyone is going to update their broken URLs (doubtful), it's only going to be less likely to happen once a spec says they are not broken and all browsers support them. So the only thing I see happening is the spec becoming more permissive, not stricter. And this is obviously a problem; I hope nobody wants URL parsing to become as complicated and full of heuristics (even if these heuristics are well-defined in a spec, they're still heuristics) as parsing of HTML5 has become.

SEAPUNK commented 8 years ago

@mark-otaris

While generally I am for "fixing" something in one swift action regardless of what it would break, the reason I am okay with incremental changes is to only incrementally break websites. Obviously there will be people who don't fix their URLs, but those are the people we increasingly start to care less about (an analogy can be drawn to the Windows XP diehards). There are stages to deprecation (soft, then hard), and it's in many people's best interest to start with the smallest breakages and work up over time, so people start noticing that the WHATWG is serious about deprecating and implementing these breaking changes and start to get a move on themselves.

If WHATWG went ahead and implemented the perfect spec with a massive amount of breaking changes, it would be up to the browser developers to figure out how they want to do the incremental breaking changes, so while WHATWG is responsible for writing a standard that Does It Right™, they should also be responsible for standardizing how implementers make the gradual changes, so everything is done in more-or-less sync.

ghost commented 8 years ago

@SEAPUNK Agreed! (Almost.) If the purpose of the specification is to sync web browser implementations in incrementally becoming less permissive, then:

So the spec would be about coordinating the efforts for making Firefox, Chrome and Edge progressively stricter and getting them as close to the RFCs as possible. (How close it is possible to bring them would depend on how many web pages have incorrect URLs.)

Now, that'd be an initiative I can agree with, and a plan I would find reasonable.

annevk commented 8 years ago

The purpose of the specification is to get URLs interoperable, in the same way we got HTML parsing and other things interoperable. By focusing on the implementations that are the least likely to change, and making it easier for others to align with them. The specification is aimed at cURL & co and it's a shame they'd rather implement no standard at all, but that's up to them.

bagder commented 8 years ago

The specification is aimed at cURL & co and it's a shame they'd rather implement no standard at all, but that's up to them

I'm here (representing "cURL & Co" to some degree), discussing in this issue among others, because we find the whatwg "URL standard" to be inferior.

"cURL & co" intend to continue to support URL standards. We've already supported them for a very long time. But as my recent blog post on the subject points out, there is no URL standard. Well, apart from perhaps RFC3986 that browsers don't care much for anyway.

The whatwg "URL standard" has not been developed with proper consideration of the entire ecosystem, sort of in your corner of the world and it is written in a really funny way that makes it really hard to understand what is supported.

annevk commented 8 years ago

Browsers don't implement all this in a consistent fashion yet either. And I don't think many libraries have taken notice yet, though there are quite a few implementations now, including in C, JavaScript, Java, PHP, etc. It takes a lot of time. For HTML it took about a decade. I suspect this might take longer given that it's more fundamental and subtle changes can have grave consequences.

Lukasa commented 8 years ago

By focusing on the implementations that are the least likely to change, and making it easier for others to align with them.

@annevk this reads to me like the WHATWG rewards intransigence: that is, if the Chrome or Firefox team rocked up and said "no we will not change because we don't want to", the WHATWG would throw its hands up in the air and say "welp, guess that's spec-standard then". Is my understanding of that right?

The specification is aimed at cURL & co and it's a shame they rather implement no standard at all, but that's up to them.

I'm here representing "& co", and while we're talking I should point out that in shazow/urllib3#859 I proposed moving our tooling to using the parsing algorithm provided here. So it's not like we're sitting here ignoring you.

But the attitude of the WHATWG is rubbing me the wrong way. We have had a specification plopped down in front of us that we were not consulted on, that we have not helped to develop, and which gives the impression of not really considering us first-class citizens. This thread repeatedly talks about "major implementations" and about "the implementations that are least likely to change". The overwhelming sentiment I read from those messages is that this specification is about doing what big browsers do, rather than considering what the wider ecosystem does. I get the strong impression that "the implementations that are more likely to change" boils down to "anyone who isn't a browser", and that's a pretty insulting way to talk about our work.

You cannot be surprised, then, that we don't feel particularly warm and fuzzy about a specification that doesn't consider us or our use cases to be important enough to be part of the story. This specification feels to me like it exists to beat us into line, as the product of an organisation that, unlike the IETF, doesn't think our implementations matter very much.

And, yeah, that's fine, I'll try to do it anyway because I care enough about my users that I'm prepared to work in the broken world that the WHATWG is happy to spec on rather than press for people to fix their broken implementations. But the high-and-mighty benevolent and somewhat patronising tone taken in this thread is really quite insulting.

annevk commented 8 years ago

Understood and I realize it's not exactly pretty. The problem is that browsers are extremely hard to change. That one bug report cURL got is often enough for them to not be willing to touch code. Especially code that's been working okay for two decades and is deployed to hundreds of millions of users.

So if we want to change the slashes situation, which existed way before I wrote the URL Standard by the way, we will need a concrete reason why it's bad and why browsers should change. Might be best as a separate issue on this repository at this point. Then we file bugs against the various browsers saying that them not returning failure (as they all do for, e.g., http://test:test/) is problematic, and then hopefully they invest the engineering time in changing that. And it seems from @sleevi's response we might be able to get Chrome to run some experiments for us there.
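
For reference, a minimal sketch of that failure case (assuming the URL API, where parser failure surfaces as a thrown TypeError):

    // Sketch: "test" is not a valid port, so a spec-compliant parser fails.
    try {
      new URL("http://test:test/");
    } catch (e) {
      console.log(e instanceof TypeError); // true
    }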

It might help to understand that the WHATWG was created out of this browser inertia. The W3C wanted to forget about the mess that was HTML and browsers and move on to what they saw as greener pastures. The WHATWG was formed by individuals from browser vendors and web developers who recognized the inertia and were willing to figure it out, lay it bare, and then improve upon it. That is what I see us doing here too. It's not pretty, it's not fun, and the power balance is a little skewed towards the mega clients, but this is how we think we can improve the web and I think we've done a pretty good job of that thus far across various efforts.

Anyway, I am truly sorry for the tone. It's always a bit of a balancing act when there's a bunch of fresh ideas entering your community and I'm usually pretty bad at handling them. But we are definitely willing to work with you all and appreciate your input on what should be clarified and how. And what parts should maybe be changed since they are actively harmful to allow.

bagder commented 8 years ago

Is the WHATWG URL standard document:

  1. Describing how the browsers' URL parsers work
  2. Describing how a browser's URL parser SHOULD work to be compliant

?

Sometimes there's talk of (1) and sometimes there's talk of (2). I can see the value of a (1), but I think a spec like (2) could also say that parts of (1) aren't desirable.

annevk commented 8 years ago

It's a mix of those, trying to pave the path towards convergence, while keeping in mind what we know about "inertia" as I described it earlier.

Lukasa commented 8 years ago

@annevk I want to be clear as well that my frustration is largely not directed at you. It's frustration borne of discovering that a whole lot of very powerful entities seem to have a totally different idea of what the world is supposed to look like, and relatively little interest in hearing a dissenting opinion. Not your fault, of course, and I appreciate the apology. =)

rubys commented 8 years ago

Perhaps instead of trying to impute motives, people could, you know, actually look at test results? Here's some slightly dated but probably mostly accurate data including both browser and non-browser behavior: https://url.spec.whatwg.org/interop/test-results/

ghost commented 8 years ago

@annevk

Browsers aren't extremely hard to change; they just really don't want to lose their market share, and see no problem with sacrificing the open web if that means they get more users and thus more influence. That's also why they shouldn't get to dictate standards.

The WHATWG's standards are about major browser vendors and web application developers. They are never about users, and are rarely about web page authors.

The W3C wanted to move the web forward with XHTML, SVG, XForms and other well-thought-out technologies. Google wanted to make web applications to have as much control as possible over users' computing. Browsers wanted market share.

The semantic web project failed because it didn't have the most influence, even if it was the better idea. Now we have HTML5: a hundred JavaScript APIs for everything, very slow web browsers, DRM as a web standard, specifications that tell browsers (how) they must support broken web pages; and the web is more of a mess than ever. All these JavaScript APIs result in a neverending stream of vulnerabilities, but it is the users who care about that, not the web developers or the browser vendors.

The web application developers are happy: they have enough JavaScript APIs to make their web applications that give them control over users' computing and access to all of their data. The major browser vendors are very happy: they get to keep their market share while still being able to say they follow standards, and the web is complicated enough now that it is impossible for any new browser to come in and compete with them.

But this does not make the web win. It just makes the people who have the most influence win. This is one among an infinity of good reasons why standards should never be defined by implementations.

You ask for a concrete reason why it's bad and why browsers should change, but there is no reason that the browsers will accept, because being stricter will always mean more pages possibly being broken, and therefore users possibly moving to another browser. The question that should really be asked is "Why did browsers change to allow all these invalid URIs in the first place?" The concrete reason they had to do that is their market share.

Extensible Resource Identifiers are the best example of a standard, related to URIs, that actually improves the web, in this case by adding many useful features. What does the WHATWG's current URL spec actually bring that would make it worthy of replacing RFC 3986? Making dubious terminology changes about favoring "URL" even though "URI" has been used consistently for 15 years by the W3C and the IETF, and is the more logical term. Supposedly making "solid" something that has already been very clearly defined for a long time and that browsers have just decided not to follow. The problem is the behavior of the browsers, not the RFCs.

JohnMH commented 8 years ago

On Fri, 2016-05-13 at 16:42 -0700, Sam Ruby wrote:

Perhaps instead of trying to impute motives, people could, you know, actually look at test results? Here's some slightly dated but probably mostly accurate data including both browser and non-browser behavior: https://url.spec.whatwg.org/interop/test-results/

It is unclear what most of those tests are actually talking about, as they just list languages and not what API was used in the test, and just show "that there is a difference" in most cases, not what the difference is.

rubys commented 8 years ago

@JohnMHarrisJr You can drill down by clicking on a line. You will see how each client interprets the URL.

JohnMH commented 8 years ago

On Sat, 2016-05-14 at 17:22 -0700, Sam Ruby wrote:

@JohnMHarrisJr You can drill down by clicking on a line.  You will see how each client interprets the URL.

Oh, gotcha. It does have the differences, but it's just hard to use. That still doesn't explain what it means by "python" or "ruby".

JohnMH commented 8 years ago

Or perl, csharp, nodejs (although there I suppose it makes more sense, it should definitely say whether it's talking about the standard functions or a library).

In addition, let's look at https://url.spec.whatwg.org/interop/test-results/63fe456f89

It seems that "ruby" is the only one that parses it correctly?

rubys commented 8 years ago

Description of the APIs used, including links to the actual source code:

https://github.com/webspecs/url/tree/develop/evaluate#evaluation-programs-and-results

The term "correctly" implies a value judgment; I'll decline to go there. At the moment there are specs that are underspecified; specs that don't match reality; and implementations that don't match one another. The latter remains true even if one limits oneself to only browser implementations, actively maintained browser implementations; or even actively maintained browser implementations. Or alternately if you limit yourself to implementations that purport to be faithful implementations of the standards.

Also might be of some interest: what in my opinion is a more readable version of the URL standard, as of nearly a year and a half ago: https://specs.webplatform.org/url/webspecs/develop/

The diagrams in that spec are generated from executable code that was meant to be the reference implementation. My hope at the time was to get a set of implementors together to reduce observable differences in their output.

JohnMH commented 8 years ago

On Sat, 2016-05-14 at 18:04 -0700, Sam Ruby wrote:

Description of the APIs used, including links to the actual source code:

https://github.com/webspecs/url/tree/develop/evaluate#evaluation-programs-and-results

Instead of "nodejs", it should be "nodejs-url" or similar, to show that it is using the url "package".

The "ruby" one is now even more odd, because "addressable" is also listed. Isn't "ruby" "addressable"?

Or is "ruby" the standard library URI, as you'd expect?

"perl" is also confusing, unless you mean the "URI" module (which I'd suspect, since there is a dead link to it on CPAN)

That should definitely be "perl-URI" or "perl-uri".

In addition, I do not believe that galimatias should be listed; rather, you should be testing against the Java URL and URI classes. These are the most common and most actively developed URL and URI parsing classes, and they are also closest to the RFCs which define what URLs and URIs are.

The term "correctly" implies a value judgment; I'll decline to go there.  At the moment there are specs that are underspecified; specs that don't match reality; and implementations that don't match one another.  The latter remains true even if one limits oneself to only browser implementations, actively maintained browser implementations; or even actively maintained browser implementations.  Or alternately if you limit yourself to implementations that purport to be faithful implementations of the standards.

In this case, "correctly" refers to "ruby" using port 80 when no port number is specified.

I also don't know what you mean by "specs that don't match reality", because specs define what implementations should do. They don't "match reality".

FagnerMartinsBrack commented 8 years ago

The concern about focusing only on browsers is real, and there is evidence of problems that happened because of that. Last year I found out that the IETF created an RFC for cookies in 2011 documenting the allowed characters in a cookie name without considering the internals of server-side technologies such as PHP. That created a big interoperability issue, because PHP's decoding implementation is coupled to a historic syntax ($_COOKIE) that is also older than many browsers out there in the wild and cannot change. This heavily restricted the default interoperability between client- and server-side transmission of Unicode characters using UTF-8, just because the spec was built using browsers as a baseline.

JohnMH commented 8 years ago

@FagnerMartinsBrack I don't see how the use of $_COOKIE has anything to do with the ability or inability of PHP to handle a given cookie.

FagnerMartinsBrack commented 8 years ago

@JohnMHarrisJr I want to be careful not to go too off-topic here since there is already a considerable amount of comments.

I don't see how the use of $_COOKIE has anything to do with the ability or inability of PHP to handle a given cookie.

The summary of the problem is in the link above. For more context, see here.

The point is that focusing all the efforts of a spec on only the browsers can have unintended side effects. That specific cookie case probably happened because nobody went deeper into researching the current state of support for each ASCII character outside browser-land at that time. If we don't consider other clients when handling URLs, there is the potential for something similar happening again.

Thiez commented 8 years ago

I find the URL test results very confusing: what is different here? And here?

rubys commented 8 years ago

OK, admittedly that doesn't show up very clearly. The difference is string "0" vs number 0.

Details:

https://url.spec.whatwg.org/interop/useragent-results/refimpl https://url.spec.whatwg.org/interop/useragent-results/rusturl

robertlagrant commented 8 years ago

The strategy of gathering various browser implementations into a descriptive standard, and then keeping on refining the standard until it hits the actual end goal seems very nebulous. What is the end goal?

Why bother doing things this way, as opposed to the normal practice of defining the actual standard we want, and then documenting variances (e.g. spec and variance)? Why define anything other than the intended result? Browsers are very capable of setting their own timelines for spec conformance; why does a separate body need to create a timeline for everyone to follow? Browser vendors have massive funding (relatively speaking) and can constantly keep up with spec changes, but it seems crazy to offer lots of little spec changes over time to gradually shepherd software that uses URLs. This is an obvious recipe for disaster, as it could lead to a situation where there are many URL parsing schemes, each with their own little differences, and we have to do content negotiation to figure out which parser to use! Presumably that negotiation would also need a spec :)

I hope that's at least slightly convincing regarding why planning to change the spec over time is a bad idea.

Here's an alternative that perhaps describes this standard better: why not convert it into a standard for parsing malformed URLs? This has the advantage of not superseding useful, prescriptive standards for URLs (so people won't think that writing software that generates a URL that starts with http://////////// is ever a good idea, which this standard is in massive danger of enabling) and also keeps whatever benefits people are claiming would arise from standardising permissive URL parsing.

annevk commented 8 years ago

The end goal is convergence between implementations. Convergence usually requires tweaks over time as nothing is perfect. This is true for all standards efforts.

It sounds like you have not read the OP, which indicates that the URL standard has separate sections for syntax and parsing, which indeed have different requirements as to what is conforming and as to what works. It just needs to be clarified and to get some examples.

ghost commented 8 years ago

@annevk If convergence requires tweaks over time, why can't web browsers adjust over time their behavior to make it closer to the sane behavior defined by the RFCs?

Again, the answer is market share and power. That convergence on this requires tweaks over time is not true; it is very easy to just remove all the URL parsing code that handles incorrect URLs. Writing a proper URL parser that follows the RFCs is not difficult, and if all the web browsers did it they would already converge. But it is not what browsers want, because then some URLs might not work, which might lead browsers to lose some of their precious market share. Instead, browsers want to work with as many URLs as possible, and specifications like this one are created not to make the web browsers work toward a sane end goal (which is never going to happen; everyone knows very well the spec is not going to become stricter with time), but to make the other, already-correct implementations move toward some lax behavior. So that the browsers can get away with their implementation choices that actually harm the web. This is nothing new; the same thing happened with HTML5 versus XHTML.

The obstacle to convergence isn't that we don't have a spec (we already do have one: the RFCs), it's that the interests of browser vendors do not align with the correct behavior, so they do not want to follow that spec. Instead of converging toward the long-established correct behavior, the browser vendors create a new specification that establishes as a standard a new behavior, inferior from a technical point of view, but that does not conflict with their interests.

robertlagrant commented 8 years ago

@annevk I have read the spec and this thread; not sure if there's something else I should've read; apologies if that's the case.

My point about a different spec is that URLs are not generated by browsers, really. So the parsing spec is the one that defines what most URLs will look like, not a theoretical URL syntax. That's why it needs to be strict, because it is the de facto syntax spec, and not constantly adjusting to encompass an ever-widening Overton window of URL brokenness which cannot be kept up with by any tool that isn't well funded.

Having said that, breaking out a separate malformed parsing standard, so that people can detect malformed URLs and fix them, might be useful.

I think fundamentally this is a question of who owns the Web. If the answer is browser vendors (they will keep up with the spec, mostly by having it codify their products' behaviours), then this spec makes sense. If it's no one (anyone can write a simple tool that easily parses URLs), then it makes no sense.

bagder commented 7 years ago

So how can a URL be defined to only have two slashes but you mandate that parsers should handle thousands? The first is pointless with the second.

nox commented 7 years ago

A URL can be defined to only have two slashes but the spec mandates that parsers should handle thousands, the same way the IETF standard can say that URL producers should not create percent-encoded octets for the ALPHA range, but URL consumers should decode them nonetheless.

https://tools.ietf.org/html/rfc3986#section-2.3

For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.

I'll concede that it just says "should not" and not "SHOULD NOT", but that's pretty similar in spirit in my book.
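
To make that producer/consumer split concrete, a small sketch in plain JavaScript (the encode/decode functions here are just an illustration of the principle, not something either spec mandates):

    // Sketch: a producer should not generate "%41" for "A", but a consumer
    // or normalizer should treat the two forms as equivalent.
    console.log(encodeURIComponent("A"));            // "A", not "%41"
    console.log(decodeURIComponent("%41") === "A");  // true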

annevk commented 7 years ago

@bagder it's the robustness principle more or less. It also keeps things simpler for those only producing URLs and not consuming them, and for those sticking to valid URLs. Even if we parse URLs with a space in them, we wouldn't want to make those valid, as that would make it harder to work with URLs in plain text over the long term.

bagder commented 7 years ago

But then the syntax spec is pretty useless as it doesn't restrict URL generators (since they will output what parsers manage to parse) and parsers are encouraged to be "robust" and accept things that aren't in the syntax spec. A downward spiral into a world of tears.

So why is there a syntax spec? To explain what an ideal URL would look like?

annevk commented 7 years ago

I think it's still useful: if you type https:///example.com/ it's likely a typo. That the parser accepts it is needed for compatibility, but that does not mean there is no value in stricter syntax, warnings, etc.

bagder commented 7 years ago

What exactly is that value?

I can see one distinct value right now: the "polished" version is more compatible with RFC 3986 URLs, so copying the cleaned-up version to external tools has a much higher chance of working.

I however would like a spec that is clear on exactly what is allowed and what isn't allowed. What is allowed is handled by the (compliant) parsers; what isn't allowed is rejected. Everything that is silently handled for the robustness principle becomes accepted and is then considered valid.

If a spec-compliant parser accepts and works with a given syntax, then that syntax is valid. We can't separate allowed by parser and allowed by syntax.

annevk commented 7 years ago

What exactly is that value?

You don't see value in catching typos?

We can't separate allowed by parser and allowed by syntax.

We have been doing this in a number of areas quite successfully, so I don't think that claim is true.

bagder commented 7 years ago

We have been doing this in a number of areas quite successfully

Please educate me then. How do we/you/everyone benefit from this split?

Lukasa commented 7 years ago

As a particular note, if something MUST be tolerated and has one and only one defined meaning when parsed, that is effectively allowed by the specification.

Put another way: if implementations are required to parse any number of slashes to be defined as compliant, then the spec allows any number of slashes, regardless of the preamble text.

nox commented 7 years ago

Can't we just accept that the separation is similar to the thing I quoted from the IETF RFC and move on?

bagder commented 7 years ago

@nox: that text in the RFC is also rather useless and is mostly there for humans to understand a little what the intention is: readable characters are better than percent-encoded ones since they may be read by humans. "How to write a readable URL", sort of.

Nobody would say that a URL cannot contain such codes. They are legal entities of a URL.

I think this is an important issue. I've kept harping on this number-of-slashes issue and I've been told at least twice that the URL spec says it should be two slashes, when in fact the URL spec says a parser should handle an infinite[*] amount of them.

For me personally, this (unorthodox) split is causing some of the confusion I experience when reading this spec. Maybe just maybe I'm not alone. And I have read a spec or two before.

[*] = it actually doesn't specify an exact amount

magcius commented 7 years ago

The URL spec is about the construction of a valid, bona fide URL. It also says something like "a URL is composed of ASCII characters". Turns out, if you venture out beyond the English web, you'll find things like <a href="http://facebook.com/fútbol"> in just raw UTF-8. You will also find http:///facebook.com/fútbol

The URL parsing spec is about the ability to handle these URLs that are found naturally in the wild that do not conform to any stricter URL specification. To create a URL, read the RFC. To parse a URL, read the URL parsing spec.
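
A minimal sketch of that division of labour (assuming the URL API, which implements the parsing side):

    // Sketch: raw UTF-8 in the path is not valid per the stricter syntax,
    // but the parser accepts it and percent-encodes it as UTF-8.
    const url = new URL("http://facebook.com/fútbol");
    console.log(url.pathname); // "/f%C3%BAtbol"
    console.log(url.href);     // "http://facebook.com/f%C3%BAtbol"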

ghost commented 7 years ago

@magcius It is not obvious that URL parsers should be able to handle URLs that are invalid, that this is desirable, that, if it is, the behavior should be codified in a spec, or that, if invalid URLs are accepted, they will not become valid URLs in practice. My answer to all of these is "no".

bagder commented 7 years ago

@magcius: that's a completely broken argument. This effort is about writing an updated URL spec that can replace RFC 3986/3987 (as per the goal section). Thus referring back to one of those RFCs for actual content would be counter to that goal. (And no, RFC 3986/3987 actually cannot be followed to create proper TWUS (The WHATWG URL Specification) URLs; they're not that compatible. I'm primarily thinking of IDN and non-ASCII here.)

zcorpan commented 7 years ago

The model of being stricter in what is allowed than what parses into something (or is processed in some defined way) is used widely in HTML, and this section discusses the whys:

https://html.spec.whatwg.org/#conformance-requirements-for-authors

Some of the syntax ones seem like they could apply to URLs as well:

Conformance checking URLs to catch these classes of mistakes seems worthwhile to me. Making the parser return failure for all of these cases would not be web compatible and so a web browser would not be able to do that.

However there is another bullet in HTML's list that currently doesn't apply to URLs, but could in theory:

The URL standard could allow a failure to be returned for any syntax violation.

bagder commented 7 years ago

@zcorpan I don't think we argue (here) about why the parser should be forgiving and accept all sorts of weirdo input; this discussion is more about searching for a motivation for the "syntax spec" when the "parser spec" is really the one that dictates the truly accepted syntax.

zcorpan commented 7 years ago

OK, I've struck the part of my comment suggesting optional failure for syntax violations. What remains in my comment above tries to give motivation for the "syntax spec".

bagder commented 7 years ago

Unintuitive error-handling behavior

How does the syntax spec help with this? It doesn't describe the syntax the parser accepts!

Errors involving known interoperability problems in legacy user agents

How does the syntax spec help with this? It doesn't describe the syntax the parser accepts!

Errors that risk exposing authors to security attacks

If the syntax spec describes this, why on earth does the parser spec allow things that put users at risk?

... and so on.

zcorpan commented 7 years ago

Unintuitive error-handling behavior

It might be intuitive that 3 slashes in http:///example.org/ would result in failure, but for web compat the parser needs to not return failure. If the syntax section were to match what the parser accepts, it would mean that 3 slashes is not a syntax violation, and so conformance checkers would not give any error message when people accidentally use too many slashes.

Errors involving known interoperability problems in legacy user agents

Converting backslashes to forward slashes in URLs is an example where there wasn't interoperability in legacy user agents, but doing so gives better compatibility with web content. If the syntax section were to match what the parser accepts, conformance checkers would not give any error message when people accidentally use backward slashes, and the URLs would not work in some legacy user agents.
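
As a concrete sketch of the backslash case (assuming the URL API; the input string here is a made-up example):

    // Sketch: backslashes after a special scheme are treated like slashes by
    // the parser, even though they are a syntax violation a checker can flag.
    const url = new URL("http:\\\\example.org\\foo"); // i.e. http:\\example.org\foo
    console.log(url.href); // "http://example.org/foo"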

And so on. Making them different is the entire point; the difference is required for the syntax section to have any effect at all.