whatwg / url

URL Standard
https://url.spec.whatwg.org/

It's not immediately clear that "URL syntax" and "URL parser" conflict #118

Closed: domenic closed this issue 7 years ago

domenic commented 8 years ago

URL syntax is a model for valid URLs---basically "authoring requirements". The URL parser section allows parsing URLs which do not follow URL syntax.

An easy example is https://////example.com, which is disallowed because the portion after https: contradicts this rule:

A scheme-relative URL must be "//", followed by a host, optionally followed by ":" and a port, optionally followed by a path-absolute URL.
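For illustration, here is what the parser section nevertheless accepts (output per a conforming implementation of the URL API, e.g. a modern browser or Node.js):

    // The parser skips any number of slashes after a special scheme,
    // recording a validation error but still succeeding:
    new URL("https://////example.com").href; // => "https://example.com/"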

/cc @bagder via some Twitter confusion this morning.

bagder commented 8 years ago

Is there really any reason for accepting more than two slashes for non-file: URLs? I mean apart from this spec saying that the parser should accept them.

domenic commented 8 years ago

The fact that all browsers do.

bagder commented 8 years ago

The fact that all browsers do.

I tested Safari on a recent OS X version and it doesn't even accept three slashes. Not in the address bar and not in Location: headers in a redirect. It handles one or two slashes, no more. So I refute your claim.

The fact that all browsers do

That's exactly the sort of mindset that will prevent the WHATWG URL spec from ever becoming the universal URL spec. URLs need to be defined to work in more contexts than browsers.

annevk commented 8 years ago

URLs need to be defined to work in more contexts than browsers.

They are, no? Whether or not to handle multiple slashes seems orthogonal to that. If Safari does not do it there might be wiggle room, or Safari might hit compatibility issues similar to curl's.

bagder commented 8 years ago

That's a large question and too big of a subject for me to address fully here.

URLs in the loose sense of the term are used all over the place.

URLs by the WHATWG definition are probably not used by much other than a handful of browsers, no. In my view (wearing my curl goggles), there are several reasons why we can't expect that to change much short-term either, as this slash issue shows.

I would love a truly universal and agreed URL syntax, but in my view we've never been further away from that than today.

domenic commented 8 years ago

I'm sorry for the imprecision. We often use "all browsers" to mean "the consensus browser behavior, modulo minor deviations and bugs."

The URL Standard defines URLs for software that wants to be compatible with browsers, and participate in the ecosystem of content which produces and consumes URLs meant for browsers. If cURL does not want to be part of that ecosystem, then yes, the URL Standard is probably not a good fit for cURL. But we've found over time that most software (e.g. servers which wish to interact with browsers, or scraping tools which wish to scrape the same sites as browsers visit) wants to converge on those rules.

bagder commented 8 years ago

We often use "all browsers" to mean "the consensus browser behavior, modulo minor deviations and bugs."

This made me also go and check IE11 on win7, and you know what? It doesn't support three slashes either.

To me, this is important. It shows you've added a requirement to the spec that a notable share of browsers don't support. When I ask why (because it really makes no sense to me), you give a circular answer and say you did this because "all browsers" act like this. Which we now know isn't true. It's just backwards on so many levels.

If cURL does not want to be part of that ecosystem

Being part of that ecosystem does not mean that I blindly just suck up what the WHATWG says a URL is without me questioning and asking for clarification and reasoning. Being here, asking questions, responding, complaining, is part of being in the ecosystem.

curl already is, and has long been, part of the ecosystem. Deeply, firmly and actively - we have supported and worked with URLs since back when they were still truly standard "URLs" (RFC 1738). I'm here, writing this, because I want an interoperable world where we pass URLs back and forth and agree on what they mean.

When you actively decide to break RFC 3986, and by extension RFC 7231 for the Location: header, I would prefer that you explain why. If you want to be a part of the ecosystem.

the URL Standard is probably not a good fit for cURL

I wish we worked on a URL standard, then I'd participate and voice my opinions like I do with some other standards work. A URL standard is very much a good idea for curl and for the entire world.

A URL works in browsers and outside of browsers. URLs can be printed on posters, parsed and highlighted by terminal emulators or IRC clients, parsed by scripts, and read out loud over the phone by kids to their grandparents. URLs are, or at least could be, truly universal. Limiting the scope to "all browsers" limits their usability. It fragments what a URL is and how it works (or not) in different places and for different uses.

If you want a URL standard, you must look beyond "all browsers".

domenic commented 8 years ago

This made me also go and check IE11 on win7, and you know what? It doesn't support three slashes either.

Edge does, however: http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4182

In general, Edge has made changes like this to be compatible with the wider ecosystem of web content. I can't speak for their engineers, but this shows clear convergence.


It's good to hear you're interested in participating. That wasn't my impression from your earlier comments, and I welcome the correction.

JohnMH commented 8 years ago

Why should malformed URLs be parsed? Surely the solution is to simply tell people who are using malformed URLs to... stop using malformed URLs?

jyasskin commented 8 years ago

In the interest of looking for ways forward, instead of just saying "no", per https://twitter.com/yoavweiss/status/730173495464894465, it might make sense to collect usage data and see if browsers can simplify the URL grammar.

JohnMH commented 8 years ago

It may be best to ignore that browsers even use URLs, because there are definitely other pieces of software that use URLs. Consider the following URL: irc://network:port/#channel

domenic commented 8 years ago

I'd suggest the following plan to any browsers interested in tightening the URL syntax they accept:

  • Look through https://github.com/w3c/web-platform-tests/blob/master/url/urltestdata.json and find all non-error results that you wish became error results.
  • (Optionally, wait until I finish my long-delayed project to expand that file to give 100% coverage of the spec. Currently it gives around 60%.)
  • Instrument all URL parsing to figure out how often these undesirable patterns occur.
  • When the numbers come back, decide what percentage of users or pages you are willing to break, and pick the subset of algorithm changes you can make up to that percentage.
  • (Optionally, weigh the percentage of users broken against the corresponding spec or implementation complexity reduction.)
  • Start coordinating with other vendors to see which of your changes they're interested in. (Some cases might already behave differently in those other vendors' browsers, which could help!)
  • Ship, preferably in a coordinated fashion with appropriate devrel support.

JohnMH commented 8 years ago

Browsers are not the only applications that use URLs.

domenic commented 8 years ago

@JohnMHarrisJr your comment seems irrelevant to my plan for "any browsers interested in tightening the URL syntax they accept".

ghost commented 8 years ago

The syntax for URIs is such that the authority component (user:password@host:port) is always separated from the scheme by two slashes, except for some schemes that do not require it. The path may only begin with // if the authority component is present, and in that case the path must begin with a slash. So there is no possible case where there would be more than three slashes after the colon following the URI scheme.(1) HTTP in particular requires the two slashes between the URI scheme and the authority component, so there should always be exactly two slashes between the http URI scheme and the authority component.(2)

In other words, http://user:password@host:port/path is valid, http://user:password@host:port//path might be valid, but anything else is definitely not valid.

URLs being a subset of URIs, it would only make sense to follow the standards that have already been established for a long time—especially since they make sense.
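As a rough sketch of that stricter RFC 3986 view (the helper name and regex here are illustrative only, not taken from any spec):

    // Illustrative strict check: exactly two slashes after "http(s):",
    // followed by a non-empty authority component.
    function isStrictHttpUrl(input: string): boolean {
      return /^https?:\/\/[^\/?#]/.test(input);
    }

    isStrictHttpUrl("http://example.com/path");   // true
    isStrictHttpUrl("http:///example.com/path");  // false: empty authority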

jyasskin commented 8 years ago

The main difference between the WHATWG and some other standards organizations is that the WHATWG attempts to describe the world as it is, rather than as folks would like it to be. That means that if a major implementation of URLs implements them in one way, the WHATWG specification needs to allow that implementation. So, it doesn't help to "prove" that browsers are wrong by citing earlier specifications.

That said, it seems like there's a good argument to have the URL spec allow implementations to reject too many slashes, since at least one recent browser and several other tools do reject them.

domenic commented 8 years ago

Well, we want interoperable behavior, so it's either accept or reject. There's room for "accept with a console log message telling you you're being bad" (the spec already has this concept), but it would take some cross-browser agreement to move toward "reject".

ghost commented 8 years ago

@domenic That would mean web browsers are not allowed to reject something that is pointless and has been considered invalid since URIs were a thing; they may only accept it and give a console log message. Which again considers only web browsers. Other applications that use URLs probably don't have a console for logging such messages.

@jyasskin If the goal is interoperability, standardizing the behavior of the major (read: popular) implementations isn't the best option. (For WHATWG that usually means "let's standardize whatever Google Chrome does; other browsers don't matter as much, and anything that isn't a web browser or isn't the HTTP protocol doesn't matter at all".) These implementations are usually the most actively developed and the ones that care most about those standards, so if the standards decide something other than what these implementations do, it's more likely that they would change their behavior than that the other implementations would. Of course, this sounds like a bad argument, but that's only because it relies on a wrong premise, which is that defining things based on the major implementations is a good idea.

This isn't surprising given that many people interested in WHATWG use Chrome, have Gmail email addresses, or are Google employees. The others are with Mozilla, probably use Firefox, and probably use Gmail email addresses.

This approach of standards as a popularity contest is harming the web. It tries to make tools like curl, which already do URL parsing correctly and very well, behave like the popular web browsers for "interoperability". And the popular web browsers behave like they do only in order to support every unreasonable thing that can be found on web pages, because their market share depends on supporting as many web pages as possible so that users don't switch to another browser. And then other browsers, and tools like curl, are expected to do the same because a spec says to!

domenic commented 8 years ago

Your claims about the WHATWG having a Chrome bias are false. Please be respectful and make on-topic, evidence-based comments, and not ad hominem attacks, or moderation will be required.

ghost commented 8 years ago

@domenic I admit my comment is the result of frustration, and neither on-topic, nor evidence-based, nor respectful.

domenic commented 8 years ago

Thanks for that. We can hopefully keep things more productive moving forward.

At this point I think the thread's original action item (from my OP) still stands, to clarify the authoring conformance requirements for "valid" URLs, versus the user agent requirements for parsing URLs.

Besides that, there seems to be some interest from some people in getting browsers (and other software in the ecosystem that wishes to be compatible with browsers) to move toward stricter URL parsing. I think my plan at https://github.com/whatwg/url/issues/118#issuecomment-218323801 still represents the best path there.

As for the particular issue of more than two slashes, I have very little hope that this can be changed to restrict to two slashes, since software like cURL is already encountering compatibility issues, and we can at least guess that the change from IE11 to Edge to support 3+ slashes might also be compatibility motivated. (Does anyone want to try http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4182 in Safari tech preview to see if they've changed as well?)

But, of course, more data could be brought to bear, as I outlined in my plan. I personally don't think three slashes is the worst kind of URL possible out of all the weird URLs in https://github.com/w3c/web-platform-tests/blob/master/url/urltestdata.json (my vote goes to maybe h\tt\nt\rp://h\to\ns\rt:9\t0\n0\r0/p\ta\nt\rh?q\tu\ne\rry#f\tr\na\rg) but it seems like a lot of people care about that, so maybe browsers will want to invest effort in measuring that particular use case.
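For what it's worth, that example is less scary than it looks: the parser strips all ASCII tabs and newlines from the input up front, so under a conforming implementation of the URL API it comes out perfectly ordinary:

    // Tab, LF, and CR are removed from the input before parsing:
    new URL("h\tt\nt\rp://h\to\ns\rt:9\t0\n0\r0/p\ta\nt\rh?q\tu\ne\rry#f\tr\na\rg").href;
    // => "http://host:9000/path?query#frag"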

justjanne commented 8 years ago

@domenic Considering how many people write software using existing URL libraries, wouldn’t it be more useful to define URL as "whatever the majority of tools support" (like the URL libraries in just about every language, framework, command line tool, server framework, etc.)?

Sure, what users input should be accepted, but the fact that browsers' input bars will happily accept any text (and, if search is off, prepend an http://www. and append a .com/) is already a sign that maybe the definition here is wrong.

Maybe we need a defined spec for a single correct storage format for an identifier, and additionally an algorithm for deriving this identifier from user input.

"Google.com" is not a URL, although users think it is one – seperating the actual identifier and the user-visible representations might be helpful here (especially for people writing tools, as they can then declare "we accept only the globally defined format", and let you use other libraries for transforming user-input into that format).

annevk commented 8 years ago

@justjanne the URL Standard does not concern itself with the address bar. That is a misconception. It concerns itself with URLs found in <a>, Location headers, URL API, <form>, XMLHttpRequest, etc.

annevk commented 8 years ago

To be crystal clear, the browser address bar is UX and accepts all kinds of input that is not standardized. And it is totally up to each browser how they want to design that. They could even make it not accept textual input if they wanted to. That code has no influence on how URLs are parsed.

justjanne commented 8 years ago

@annevk It also concerns itself with URLs used for cross-app communication in Android, for IPC in several situations, etc. (Android uses a URL format for intents and for cross-app communication, and doesn’t accept more than one or two slashes either.)

It also concerns itself with address bars.

What I was suggesting is that maybe we should split it into one specific, defined representation (which libraries, tools, Android, cURL, etc. could accept), and one additional definition for how to parse input/Location headers/etc. into that format.

Because obviously browsers have to accept a lot more malformed input than other tools, but it’s also obvious that not every tool should include a way to try and fix malformed input itself.

annevk commented 8 years ago

That is basically how it is structured today. There's a parser that parses input into a URL record. And that URL record can then be serialized. The URL record and its serialization are much more constrained than the input.
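That structure is observable through the URL API: lenient parsing in, constrained serialization out (output shown is what a conforming implementation produces):

    // Lenient parsing in: mixed case, extra slashes, dot segments...
    const record = new URL("HTTPS:////EXAMPLE.com/a/./../b");
    // ...constrained serialization out of the resulting URL record:
    record.href; // => "https://example.com/b"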

I think that cURL basically wants the same parsing library in the end. It has already adopted many parts of it. I'm sure as it encounters more content it will gradually start to do the same thing. There's some disagreement on judgment calls such as whether or not to allow multiple slashes, but given that browsers are converging there (see Edge) and there is content that relies on it (see cURL's experience), I'm not sure why we'd change that.

justjanne commented 8 years ago

@annevk That’d be cURL – but do you suggest every tool that handles URLs be rewritten based on this single implementation of a parser? Do you actually want Android to accept single or triple slashes in cross-app communication? You’d add a lot more ambiguity, complexity, and performance overhead to any software working with URLs.

There are use cases for when you want to accept malformed input (for example, when it comes from a user), and there are use cases where you don’t. The definition of URL should be what you call "serialization of a URL record".

(And, IMO, for cURL it would be better to split the parsing into a URL record into a separate tool, and do url-parse "http:/google.com/" | curl. The UNIX principle applies in many places, including here.)

annevk commented 8 years ago

@justjanne I think it makes sense to have a single parser, yes. Perhaps the parser should have a strict mode where it bails on the first syntax violation rather than simply recording an error, but I'm not sure if that matters much. I'm still open to that idea, though probably best discussed in its own issue. Overall it seems better to validate input at the URL record level.

bagder commented 8 years ago

The curl project is 18 years old. We have received exactly one report of a URL using three slashes...

dxlbnl commented 8 years ago

Why is the fact that most browsers accept an arbitrary number of slashes good enough reason for a spec which allows that? Why is it better to open up the spec to allow all possible options? Hell, most browsers accept any kind of input and will happily trigger a search engine. But is it wise to include a default search engine in the spec, and say that any valid string is a valid URL, and if no host is found, search on this search engine?

That browsers accept an arbitrary number of slashes does not mean those slashes are sent to the server. It must mean that the browser parses any string and tries to create a valid URL from it.

jyasskin commented 8 years ago

Be sure to read https://github.com/whatwg/url/issues/118#issuecomment-218432532.

dxlbnl commented 8 years ago

Thanks for that. But the same point applies (bar the search engine). The browser generates a valid URL from whatever is found. That does not mean it should be in the spec.

annevk commented 8 years ago

@neoel how would you write a new browser without such a standard? How would you write a good search engine without such a standard?

magcius commented 8 years ago

Having implemented a web scraper, I have actually seen one-slash and triple-slash URLs in the wild. (A lot of the time they're from posts on social media, but we've seen news content from popular sources contain such URLs as well. Part of the reason you never notice is that browsers simply handle this silently.) Additionally, there is plenty of popular syntax not supported by most URL parsing libraries, like the "//example.com/" syntax to keep a relative scheme. We've also encountered a lot of links that use Unicode characters directly inline, and, again, there's no standard way to handle that.
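For instance, a scheme-relative URL only resolves against a base URL; with the URL API that looks roughly like this (example values made up):

    // "//example.com/..." inherits the scheme of the base it is
    // resolved against; standalone, it does not parse as an absolute URL.
    new URL("//example.com/img.png", "https://news.example/article").href;
    // => "https://example.com/img.png"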

A specification to implement the web as seen by browsers is invaluable to us, since otherwise it's been hacks and guesswork to get right.

JohnMH commented 8 years ago

Anything in a non-standard format should not be accepted at all, and the user should be presented with an error message telling them why the URL is invalid, if possible. (And there is no reason for it not to be possible.) The issue with large news sources can be fixed with server-side code, and browsers should definitely join in shaming those who do not fix that kind of obvious issue. Promoting laziness and inefficient design isn't helping anyone.

sleevi commented 8 years ago

@JohnMHarrisJr I appreciate your enthusiasm, but please keep in mind: This Standard reflects the way the world is, not the way we want it to be. As @magcius mentioned, it's important to capture how the world actually is - and where it's diverged from the idealistic view of the RFCs - so that we can accurately know the bugs, quirks, and behaviours that exist.

Arguing on ideological purity ("It shouldn't be in the spec") accomplishes nothing, serves nothing, because at the end of the day, code is running today that accepts it, servers depend on that/exploit that, and the Standard accurately reflects that.

I don't think there's opposition to making a "better world, tomorrow" - and @domenic's comment in https://github.com/whatwg/url/issues/118#issuecomment-218323801 suggests how we can go about making that world. But the Standard exists to reflect what code today is doing - warts and all - so that there's consistency - warts and all. Once we have consistency, we can also make a concerted, collaborative effort to move to the world we want.

I think you're confusing documenting the state of the world with promoting it, and that's not the case, no more than a study of controversial topics (say, a study of racism, or police violence, or global warming) should be seen as promoting or endorsing them. The Standard is not saying it's good; it's saying "This is the world we live in, and this is how to work within that world today". Once we understand how bad it is, we can collaborate to make it better. And https://github.com/whatwg/url/issues/118#issuecomment-218323801 captures how we can do that, with browsers in particular having a significant influence not because of any innate primacy or superiority, but simply because they're the way most users will interact with the content, and the first and only means by which authors will know something works or doesn't work.

JohnMH commented 8 years ago

So this is not a "Standard", this is simply documentation of what is already done? Surely you see that that is an issue, and that there is no sense in calling this a standard? Saying that "This is the world we live in" is much different from saying "this is how to work within that world today." Browsers are not the "first a nd only means that authors will know something works or doesn't work.", in fact you could say the opposite. Yes, you know that those browsers following this will accept whatever URL is in question, but not how other software will deal with this issue. Instead of relying on that, they could learn how to properly format URLs so that this nonsense isn't done. It would be trivial to parse URLs on the server side so that when URLs are used, they are always exactly what the server wants, and not what your browser says it probably is. That is where standards come into play, and why this shouldn't be based on what browsers are doing at all, if it is to be called a "Standard".

sleevi commented 8 years ago

@JohnMHarrisJr While you may object to the notion, Standards that do not reflect running code are not valuable. That's why the IETF standardization effort recognizes not just "rough consensus", but "running code". Standards which are purely idealistic and ignore the world as it is often have zero traction, while those that accurately capture the world as it is provide an invaluable service.

In any event, it sounds like you have broader disagreements with documenting "the world as it is," and those concerns don't apply just to this - it applies to things like HTML, CSS, or any of the other number of documents that reflect the "rough consensus and running code". It's almost certain that this bug is not the one to express your unhappiness with that, but hopefully it explains why you might have been confused.

abritinthebay commented 8 years ago

Standards that do not reflect running code are not valuable

Running code will often have bugs, bad behavior, and plain illogical choices. The purpose of a standard is to specify what the good behavior actually is so that those bugs and bad behaviors can be fixed.

Writing a standard that accepts everything means it's not a standard at all. It's just throwing its hands up and saying "well, they do what they want anyhow, may as well say it's ok."

It's especially frustrating in this specific case because that's not even a good justification (as illustrated: several major browsers, and the largest mobile browser, specifically don't allow this behavior).

JohnMH commented 8 years ago

Code should reflect standards, not the other way around. The point of standardization is to provide instructions on what to implement, not to base standards on implementation. Yes, I understand that standards are often based on implementations, but not to the degree that we change standards to reflect major implementations.

magcius commented 8 years ago

Sure. Standards like this shouldn't represent code, they should represent data. Terabytes of data in the wild that doesn't adhere to more strict standards needs an interpretation, and starting with the interpretation supplied by the browsers seems sane to me.

For a stricter interpretation, you are welcome to use the IETF RFCs. They aren't deprecated -- in your own closed systems, you can reject data that doesn't adhere to them.

The web doesn't have that luxury, unfortunately.

abritinthebay commented 8 years ago

Terabytes of data in the wild that doesn't adhere to more strict standards needs an interpretation

Except, as pointed out repeatedly, that data does adhere to the stricter standard, because many major browsers don't support what is proposed here.

For a stricter interpretation, you are welcome to use the IETF RFCs

Well yes, I mean if the point of this standard is to just exist with no-one using it or caring about it... that would be a great position to take.

If, on the other hand, you want people to actually respect it... then being so obtuse about things is not a great way to go about it.

magcius commented 8 years ago

This standard is supposed to provide an interpretation for URLs found in the wild on the web. For people writing software that interfaces with the web, I find the specification valuable.

If you don't believe me that I've found URLs with single or triple slashes in the wild on the web, well, I'm not sure what to say. These URLs might break in some older or niche browsers like IE and Safari, but the failure case is that the user simply gets an error page and then goes back to what they were doing before.

As mentioned, Firefox, Chrome and Edge all have behavior to support them now.

JohnMH commented 8 years ago

I assume by "the web" you're referring only to HTTP? URLs are used much more widely by protocols other than HTTP.

abritinthebay commented 8 years ago

Firefox, Chrome and Edge all have behavior to support them now.

And that's a tiny fraction of the overall ecosystem for URLs. The focus on just the latest versions of just browsers is myopic, and if continued throughout this standard's development it will doom it to be just another standard on the pile.

I think WHATWG is better than that and the standard deserves better than that. It's a good idea to do this, but this is a bad approach.

magcius commented 8 years ago

The WHATWG URL spec is designed to standardize URL parsing in HTML tags and HTTP headers, to provide semantics and interpretation to data in the wild that does not adhere to RFC 3986. It is not designed to replace RFC 3986 and the broader ecosystem of applications that use URIs.

Web content in the wild is already broadly not spec-compliant. If you do not have to interact with this kind of content, you don't have to care about the WHATWG URL spec.

abritinthebay commented 8 years ago

Web content in the wild is already broadly not spec-compliant

While generally true, for this particular issue such URLs wouldn't work in Safari or on iOS at all; since content in the wild has to work there, that shows the claim to be untrue in this particular instance.

Also given that as you say....

The WHATWG URL spec is designed to standardize URL parsing in HTML tags and HTTP headers

Then thinking about it from a browser-only perspective (as the comments here appear to) is specifically against the design of the spec, as many other things consume HTTP headers. Not just browsers.

JohnMH commented 8 years ago

There is also the issue that, if this is about HTTP headers alone, it shouldn't be a "URL Standard"; it should be an "HTTP URL Standard" or "Browser URL Standard", both of which show that this has very little to do with URLs and more to do with how browsers handle them.

magcius commented 8 years ago

WHATWG stands for the "Web Hypertext Application Technology Working Group". It's implicit that any standards released by them are primarily in the context of the web, especially people implementing web browser technologies.

Explicit language to make this clearer in the spec is, of course, welcome, especially now that more eyeballs are on the spec given Daniel's blog post today. But, again, this spec is about how invalid-by-RFC-terms URLs found on the web should be parsed and the interpretations they should have. Outside of web contexts, especially when generating new URLs in your own systems, you are welcome and recommended to use RFC 3986.

As for "Safari and iOS don't support it", the obvious answer is "yet". WebKit has bugs open about implementing the URL spec, and for web compatibility they will eventually move to the new spec'd parsing algorithm. That said, the error case for someone using an (arguably) niche browser is that they get an error message which is easy to dismiss and pass off as "oh, the internet isn't working today".

JohnMH commented 8 years ago

I don't know what your definition of "the web" is, but the web goes far beyond the HTTP protocol.
