whatwg / url

URL Standard
https://url.spec.whatwg.org/

Should file URLs have opaque hostnames? #599

Open karwa opened 3 years ago

karwa commented 3 years ago

I've been trying to port Chromium's file path <-> file URL utilities to a project conforming to the latest standard.

As far as I have been able to tell (it's a large codebase and I'm not at all familiar with it), Chromium turns Windows UNC paths into file URLs with hostnames, and those hostnames may include percent-encoding (e.g. \\some computer\foo\bar.txt becomes file://some%20computer/foo/bar.txt).

That is not allowed by this standard: the hostnames of file URLs are domains (they are even encoded with IDNA), and may not contain raw spaces or percent-encoding.
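A quick way to check how a given parser treats such hostnames is a small probe around the `URL` constructor. This is only a sketch: `tryParse` is a hypothetical helper name, and the result for the last two inputs depends on the engine it runs in (the thread reports a parse failure under the standard and acceptance in Chromium).

```javascript
// Returns the serialized URL, or null if the platform's parser rejects it.
// Relies only on the global WHATWG `URL` constructor (browsers, Node.js).
function tryParse(input) {
  try {
    return new URL(input).href;
  } catch (err) {
    return null; // the URL constructor throws a TypeError on parse failure
  }
}

// A plain domain host is accepted in a file URL.
console.log(tryParse('file://example.com/foo/bar.txt'));

// Percent-encoding or a raw space in the host: engine-dependent, which is
// exactly the divergence this issue is about.
console.log(tryParse('file://some%20computer/foo/bar.txt'));
console.log(tryParse('file://some computer/foo/bar.txt'));
```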

I don't think it is expected that a file URL's host must only be a domain or IP address; it might be better for them to have opaque hostnames, so they can contain percent-encoding and other characters as allowed by whatever mechanism resolves them.

annevk commented 3 years ago

cc @TimothyGu @achristensen07

TimothyGu commented 3 years ago

Node.js's url.pathToFileURL function, when run on Windows, calls domainToASCII on the provided NetBIOS machine name, which would return the empty string if the NetBIOS name has a space. That's probably not the correct behavior, though no one seems to have complained so far.

Would someone be able to test the Windows UrlCreateFromPath function? I'd be curious to know how it treats \\some computer\foo\bar.txt.

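This should do.

```c++
#include <windows.h>
#include <shlwapi.h>   // UrlCreateFromPathA; link with Shlwapi.lib
#include <wininet.h>   // INTERNET_MAX_URL_LENGTH
#include <iostream>

int main() {
  char out[INTERNET_MAX_URL_LENGTH + 1];
  DWORD out_len = INTERNET_MAX_URL_LENGTH;
  HRESULT res = UrlCreateFromPathA("\\\\some computer\\vol\\file", out, &out_len, NULL);
  // Parentheses are required: << binds tighter than ==.
  std::cout << (res == S_OK) << std::endl;
  out[out_len] = '\0';
  std::cout << out << std::endl;
}
```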

RFC 8089 provides some useful background information on file URLs, but unfortunately doesn't consider the question of unusual characters in hostnames or IDNs. RFC 8089 does use the grammar in RFC 3986 to describe file hosts, though, and the reg-name production does not allow literal spaces but does allow percent-encoded bytes.

I'd be inclined to allow some sort of opaque hostnames for file URLs as they already have a lot of exceptions in the spec, and hostname in file URLs is basically a legacy feature to support Windows anyway.

TimothyGu commented 3 years ago

I also found Microsoft's naming conventions for NetBIOS computer names. Unfortunately it doesn't say exactly whether spaces are allowed (a space is not one of the explicitly allowed alphanumeric characters, but it also doesn't appear in the list of disallowed characters). This ambiguity is also documented on Microsoft's page NetBIOS Name Syntax.

However, RFC 1002 has an example for the NetBIOS name "FRED " with the space at the end, which suggests that spaces are indeed allowed in the protocol at least.
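To illustrate why a space is representable at the protocol level, here is a minimal sketch of the first-level name encoding described in RFC 1001, in which a name is padded to 16 bytes with spaces and each byte is split into two nibbles that are each added to 'A'. The function name `encodeNetbiosName` is made up for this example, and the sketch assumes single-byte characters only.

```javascript
// First-level NetBIOS name encoding (sketch): pad to 16 bytes with spaces,
// then emit two letters per byte, one for each 4-bit half.
function encodeNetbiosName(name) {
  if (name.length > 16) {
    throw new RangeError('NetBIOS names are at most 16 bytes');
  }
  const padded = name.padEnd(16, ' ');
  let out = '';
  for (let i = 0; i < padded.length; i++) {
    const byte = padded.charCodeAt(i);
    if (byte > 0xff) {
      throw new RangeError('this sketch only handles single-byte characters');
    }
    out += String.fromCharCode(0x41 + (byte >> 4), 0x41 + (byte & 0x0f));
  }
  return out;
}

console.log(encodeNetbiosName('FRED')); // EGFCEFEECACACACACACACACACACACACA
```

Every space (0x20) becomes the pair "CA", so padding and embedded spaces are encoded the same way as any other byte.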

TimothyGu commented 3 years ago

Interestingly, Chrome actually doesn't treat file URLs differently from http/https. That is, it allows both percent-encoding and IDNs in URLs:

new URL('file://félicit ations.fr/test.txt').href // ⇒ file://xn--flicit%20ations-bnb.fr/test.txt
new URL('https://exam ple.com').href              // ⇒ https://exam%20ple.com/

That's certainly one way around giving file URLs special handling, but no one other than Chrome seems to support this…

karwa commented 3 years ago

So I finally had time to set up a Windows VM, and I've found that UrlCreateFromPathA does indeed produce URLs with percent-encoding in the hostname:

\\some computer\vol\file -> file://some%20computer/vol/file
\\?\vol\file -> file://%3F/vol/file
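The two conversions above can be approximated with a short function. This is a sketch under stated assumptions, not a reimplementation of the Windows API: `uncPathToFileUrl` is a hypothetical name, and using `encodeURIComponent` for both the host and the path segments is an illustrative choice that happens to reproduce these two outputs; the real function's escaping rules may differ for other characters.

```javascript
// Convert a UNC path (\\host\share\path) to a file URL, keeping the host
// as given and percent-encoding anything outside the unreserved set.
function uncPathToFileUrl(uncPath) {
  if (!uncPath.startsWith('\\\\')) {
    throw new TypeError('expected a UNC path starting with two backslashes');
  }
  const [host, ...segments] = uncPath.slice(2).split('\\');
  if (host === '') {
    throw new TypeError('UNC path has no host component');
  }
  // No IDNA and no lowercasing: the host is treated as opaque text.
  const encodedHost = encodeURIComponent(host);
  const encodedPath = segments.map(encodeURIComponent).join('/');
  return `file://${encodedHost}/${encodedPath}`;
}

console.log(uncPathToFileUrl('\\\\some computer\\vol\\file')); // file://some%20computer/vol/file
console.log(uncPathToFileUrl('\\\\?\\vol\\file'));             // file://%3F/vol/file
```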

Unfortunately, any browser or other application which conforms to this standard would fail to parse these URLs.

Since NetBIOS hostnames are documented as being case-sensitive, we need to be conservative and preserve the hostname as it was given (i.e. no IDNA). It might be okay to detect and canonicalise IP addresses, but I'm not sure, and in the worst case we can just preserve it in the path and let the system deal with it.

So yes, I'm quite sure that file URLs must support opaque hostnames. If we can find a reliable way to decide that it should be interpreted as an IP address or domain, that would be a nice improvement, but they at least need to support percent-encoding and uppercase characters.
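For comparison, this is what the standard's existing opaque-host handling does for a non-special scheme, next to the domain processing currently applied to file URLs. It is a sketch that assumes a parser implementing the standard's rules for non-special hosts (Node.js is one; older Chromium versions are known to differ), and `foo` is an arbitrary made-up scheme.

```javascript
// file: is a special scheme, so its host goes through domain processing,
// which lowercases it.
const fileHost = new URL('file://EXAMPLE.com/share/a.txt').hostname;

// "foo" is non-special, so its host is parsed as an opaque host:
// case is preserved and percent-encoded bytes are allowed.
const opaqueHost = new URL('foo://EXAMPLE.com/share/a.txt').hostname;
const opaqueEncoded = new URL('foo://some%20computer/share/a.txt').hostname;

console.log(fileHost);      // example.com
console.log(opaqueHost);    // EXAMPLE.com
console.log(opaqueEncoded); // some%20computer
```

Applying the opaque rules to file URLs would give the last two behaviours, at the cost of losing IDNA and IP address canonicalisation.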

alwinb commented 3 years ago

Just as an aside, the standard does ask to parse file hosts as opaque (non-special) in step 3.1. of the file host state. But there are some tests that disagree.

karwa commented 3 years ago

@alwinb I don't think that is true - whilst it does say "Let host be the result of host parsing buffer with url is not special.", that is the same formulation used in the regular host/hostname state. "url is not special" is a link, or a computed property of url; it does not mean that you should parse the hostname as though the URL were not special. For further confirmation, see the scheme/host matrix.
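Read as code, the quoted sentence passes a boolean computed from the URL's scheme into the host parser, rather than telling the parser to pretend the URL is non-special. A rough sketch follows; the special-scheme list comes from the standard, while the function names are invented for this illustration.

```javascript
// The standard's special schemes.
const SPECIAL_SCHEMES = new Set(['ftp', 'file', 'http', 'https', 'ws', 'wss']);

// "url is special" is a property computed from the URL's scheme.
function isSpecial(scheme) {
  return SPECIAL_SCHEMES.has(scheme.toLowerCase());
}

// "host parsing buffer with url is not special" means: call the host parser
// with the flag set to the value of (url is not special).
function hostParserOpaqueFlag(scheme) {
  return !isSpecial(scheme);
}

console.log(hostParserOpaqueFlag('file')); // false: file hosts are parsed as domains
console.log(hostParserOpaqueFlag('foo'));  // true: hosts of non-special URLs are opaque
```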

When I first tried to implement the spec, it took me far longer to parse that line than I'd like to admit. It's very awkwardly worded.

alwinb commented 3 years ago

Aah now I see! That makes sense, thank you.

dandclark commented 3 months ago

I am doing some work in the Chromium project to try to bring the handling of space characters more in line with the URL parsing standard, and to complete that it would be very helpful if we had an idea of the way this issue is likely to be resolved.

As @TimothyGu noted here, Chromium allows spaces in hostnames for both http/https and in file URLs. Ideally we would change the Chromium behavior to match other browsers by banning spaces for all special URLs. However, this is risky to do for file because of the issue raised in this thread of potential spaces in NetBIOS names; and if this issue is resolved to allow file URLs to have opaque hostnames, then changing Chromium to treat space as error in file URLs could be a wrong step.

I've been trying to work around this by drafting a Chromium change that only disallows spaces for non-file special URLs. However the complexity of scoping the change in that way is higher than we'd like, especially since we're not sure how file URLs will end up being handled based on the resolution of this issue.

So to decide how we should proceed in bringing Chromium closer to standards compliance, it would be helpful if this issue could be moved towards resolution. Would it be possible for the URL standards experts to comment on which way this should go?

domenic commented 3 months ago

I want to be clear I am speaking as neither a URL standard editor nor in my capacity as a Chromium engineer. So I don't have any relevant decision power for this. But maybe I am a URL standard expert...

I personally find the arguments in this thread compelling that the current standard is not good. I know Chromium engineers (/cc @ricea @hayatoito) have already been reluctant to change file: URL handling (e.g. excluding it from Interop 202X efforts) and I suspect this sort of issue would not help them overcome that reluctance.

It's less clear to me what the path forward is. If I understand the issue correctly, we have a few options:

I'm hearing that on balance @karwa believes that treating file: URL hostnames as opaque is the best option, even if that means we don't get canonicalization for IP addresses. Do others agree with that?

@annevk, how do you feel about this issue, especially given WebKit's position as a browser that has successfully shipped the URL Standard's parsing?

karwa commented 2 months ago

When it comes to host canonicalisation in file URLs, it's worth pointing out that this is only relevant on Windows, and while I expect it could be useful for IPv4, the canonicalisation by the standard isn't even correct on Windows. I can give two examples:

  1. Stripping localhost in a file URL can be destructive. See #618
  2. Windows also allows IPv6 addresses to be specified using a DNS-compatible syntax because their path syntax does not allow the colons in IPv6 addresses. See: https://en.wikipedia.org/wiki/IPv6_address#Literal_IPv6_addresses_in_UNC_path_names

With regards to the latter, file://fe80::1ff:fe23:4567:890a/... and file://fe80--1ff-fe23-4567-890a.ipv6-literal.net/... point to the same host according to the OS, and your URL may get converted into the latter form by converting to/from a file path. If applications want to robustly handle IPv6 addresses as hostnames in file paths on Windows, they will need some custom logic to parse/normalise this anyway.
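As a rough idea of what that custom logic involves, the mapping between the two forms can be sketched as below. The rules used here (colons become hyphens, a zone-index `%` becomes `s`, and the suffix `.ipv6-literal.net` is appended) are taken from the linked Wikipedia section; the function names are hypothetical, and the sketch does not validate the address or handle embedded dotted IPv4 notation.

```javascript
const IPV6_LITERAL_SUFFIX = '.ipv6-literal.net';

// fe80::1ff:fe23:4567:890a -> fe80--1ff-fe23-4567-890a.ipv6-literal.net
function ipv6ToUncHost(address) {
  return address.replace(/:/g, '-').replace(/%/g, 's') + IPV6_LITERAL_SUFFIX;
}

// Reverse mapping; returns null if the host is not in the ipv6-literal.net form.
function uncHostToIpv6(host) {
  const lower = host.toLowerCase();
  if (!lower.endsWith(IPV6_LITERAL_SUFFIX)) {
    return null;
  }
  const label = lower.slice(0, -IPV6_LITERAL_SUFFIX.length);
  // 's' is not a hex digit, so its first occurrence marks the zone index.
  const zoneAt = label.indexOf('s');
  const addressPart = zoneAt === -1 ? label : label.slice(0, zoneAt);
  const zonePart = zoneAt === -1 ? '' : '%' + label.slice(zoneAt + 1);
  return addressPart.replace(/-/g, ':') + zonePart;
}

console.log(ipv6ToUncHost('fe80::1ff:fe23:4567:890a'));
// fe80--1ff-fe23-4567-890a.ipv6-literal.net
console.log(uncHostToIpv6('fe80--1ff-fe23-4567-890a.ipv6-literal.net'));
// fe80::1ff:fe23:4567:890a
```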

So yes, I think it's better to just say these things are opaque from the URL standard's perspective, which allows for this kind of implementation-defined/platform-specific logic.

I also think we should document how browsers and other applications are supposed to convert file URLs <-> paths on the major operating systems, since there's lots of divergence there as well. That's filed as https://github.com/whatwg/fetch/issues/1338

> Treat file: URL hostnames the same as http:/https:, per the comment on https://github.com/whatwg/url/issues/599#issuecomment-846372280. This will canonicalize IP addresses and allow spaces, but also do punycode stuff and case-folding, which I think is bad?

This would not allow spaces. https://example com fails to parse.

(The comment showed that happening in Chrome because Chrome's handling of spaces has yet to be aligned with this standard. It fails on the live viewer.)

achristensen07 commented 2 months ago

I wouldn't mind changing new URL("file://host with spaces/path") from a parsing failure to returning file://host%20with%20spaces/path.

valenting commented 2 months ago

> I wouldn't mind changing new URL("file://host with spaces/path") from a parsing failure to returning file://host%20with%20spaces/path.

Percent-encoding spaces (and potentially other characters) in file hostnames is something that appeals to me. We haven't shipped file hostnames in Firefox yet, but we'll be watching this issue as we make progress on the implementation.

dandclark commented 2 months ago

> @annevk, how do you feel about this issue, especially given WebKit's position as a browser that has successfully shipped the URL Standard's parsing?

@annevk friendly ping, did you have any thoughts on this?

annevk commented 1 month ago

As Alex said above it might be okay to change this. My main worry is that we still have quite a few file: URL issues outstanding and I'd rather tackle them comprehensively in one go. Especially if that came with some guarantee that we'd then all try to align on those changes and never revisit it again (modulo deployment fallout).