whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Define behavior for `file://` documents' origin. #3099

Open mikewest opened 7 years ago

mikewest commented 7 years ago

The text at https://html.spec.whatwg.org/#sandboxOrigin defines a document's origin in the case that "the Document's URL's scheme is a network scheme" and for data: schemes, but declines to define behavior for non-network schemes like file:. Unsurprisingly, different browsers have made different choices here. When a document is loaded from file:///directory/file.html:

I wonder if we could get more alignment if we talked about it a bit. There seems to be general agreement that the page should have an opaque origin, but a little bit of disagreement about what that should mean. I'd kinda like to keep Chrome's behavior for DOM access and Fetch, for instance, as it protects against scanning the entire disk or a user's downloads directory. I'm less enthusiastic about Chrome's localStorage behavior. I'd prefer Safari's, I think, but could live with something less draconian if there's good reason to.

@annevk, @travisleithead, @johnwilander: Would y'all mind looping in relevant folks (or having opinions yourselves? :) )?

shhnjk commented 7 years ago

Maybe worth discussing the behavior of document.cookie in file URLs too.

mikewest commented 7 years ago

@shhnjk: Yeah. I didn't do exhaustive tests, but I assume that the localStorage behavior is indicative of what the browsers are doing with other storage mechanisms (cookies, IndexedDB, etc). Ideally, we'd align all of them.

mikewest commented 6 years ago

/cc @whatwg/security for thoughts.

bzbarsky commented 6 years ago

The actual Firefox behavior is more or less like so:

1) Each file:// URL gets its own origin. It's not a unique origin, in the sense that if you load the same file:// thing you get the same origin, but it's different from all other file:// URL origins.
2) A file:// load from a file:// origin that represents a file in the same directory or an ancestor directory inherits the loading origin (just like data: does in most cases, and used to in all cases in Firefox). This explains the localStorage behavior, the fetch behavior, etc.

There are some implications from this not captured by the discussion above. Specifically, if file:///A/test.html loads an iframe from file:///A/B/subframe.html which then loads a subframe from file:///A/subsubframe.html, then all three documents have the same origin, and that origin is the origin of file:///A/test.html. But if you started off by loading file:///A/B/subframe.html from the URL bar, then it and the subsubframe it loads would have different origins, because it would be loading something from an ancestor directory.

The fundamental reason for the Firefox behavior was to allow things like HTML help systems and whatnot to work. There are some drawbacks, of course. There's the problem of scanning the download directory. There's some weirdness around interacting with symlinks (see https://bugzilla.mozilla.org/show_bug.cgi?id=670514). That sort of thing.

In addition to the document origin question, there's the question of subresources. If I have a document at file:///A/test.html that loads an image from file:///B/test.png and draws it to a canvas, can it getImageData? Can it access the CSSOM of stylesheets from file:///A/test.css? Does it get sane error reporting for errors from a script loaded from file:///A/test.js?

mikewest commented 6 years ago

Thanks, @bzbarsky! That's helpful context!

  1. A file:// load from a file:// origin that represents a file in the same directory or an ancestor directory inherits the loading origin

Based on the file:///A/B/subframe.html example below, I think you meant "child directory" here? Is that right?

The fundamental reason for the Firefox behavior was to allow things like HTML help systems and whatnot to work.

I can see this as a real concern. But Chrome's been shipping tighter behaviors than Firefox for some time now, and the anecdotes I know about personally are positive. For example, my partner often gets HTMLized schoolbooks on CD from which they can print out worksheets, etc. for their classes. Thus far, Chrome hasn't frustrated that effort. *shrug* Things seem to just work without DOM or XHR access.

I grant that this might not be the case for more complex material, but (again anecdotally) I haven't seen any bugs filed on the issue. It might not be a large use case? Or perhaps everyone who needed it has migrated to Firefox? Tough to tell from metrics alone...

There's the problem of scanning the download directory.

This does seem to be a real problem. More so for Edge than for Firefox, though, given that Edge doesn't seem to do directory-based scoping.

Is this a problem that Firefox would be interested in poking at?

If I have a document at file:///A/test.html that loads an image from file:///B/test.png and draws it to a canvas, can it getImageData? Can it access the CSSOM of stylesheets from file:///A/test.css? Does it get sane error reporting for errors from a script loaded from file:///A/test.js?

I'd hope that each of these would be explained by the origin question above. If we treat file: as having a unique origin, then it seems reasonable that we'd taint the canvas in the first example, block CSSOM access to the stylesheet in the second, and mute errors for the third.
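A minimal sketch of the reasoning above, assuming the simplest possible model: an opaque origin is same-origin only with itself, and each of the three subresource decisions follows directly from one same-origin check. The names `OpaqueOrigin`, `SubresourcePolicy`, and `policy_for` are hypothetical and exist only for illustration; real browsers also take CORS and other state into account, which this ignores.

```python
import itertools
from dataclasses import dataclass, field

# Hypothetical counter used to make every opaque origin distinct.
_next_id = itertools.count()


@dataclass(frozen=True)
class OpaqueOrigin:
    """An opaque origin: equal only to itself (illustrative model)."""
    ident: int = field(default_factory=lambda: next(_next_id))


@dataclass(frozen=True)
class SubresourcePolicy:
    canvas_tainted: bool   # drawing the image taints the canvas
    cssom_blocked: bool    # stylesheet rules are not readable via CSSOM
    errors_muted: bool     # script errors are reported without detail


def same_origin(a: OpaqueOrigin, b: OpaqueOrigin) -> bool:
    return a == b


def policy_for(document: OpaqueOrigin, resource: OpaqueOrigin) -> SubresourcePolicy:
    """Derive all three decisions from a single same-origin check."""
    cross_origin = not same_origin(document, resource)
    return SubresourcePolicy(cross_origin, cross_origin, cross_origin)


# Assumption: every file: document and subresource gets its own opaque origin.
doc = OpaqueOrigin()
image = OpaqueOrigin()
print(policy_for(doc, image))
```

Under that assumption, every file: subresource is cross-origin to the document that loads it, so all three flags come out true.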

bzbarsky commented 6 years ago

I think you meant "child directory" here? Is that right?

No, I meant what I said, though the antecedents may not have been very clear. The "represents" bit was talking about "a file:// origin", not the URL being loaded. Maybe a clearer phrasing:

  1. When a file:// origin representing file X loads a file:// URL representing file Y, the resulting thing gets the origin of X if the parent directory of X is an ancestor directory of Y.
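The rephrased rule can be sketched as a small path check. This is only an illustration of the rule as worded in this thread, not Firefox's actual code: the function names are hypothetical, only POSIX-style paths are handled, and symlinks (mentioned earlier as a source of weirdness) are ignored.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def file_path(url: str) -> PurePosixPath:
    """Extract the path from a file: URL (POSIX-style paths only)."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError(f"not a file: URL: {url}")
    return PurePosixPath(unquote(parts.path))


def inherits_origin(origin_file_url: str, target_url: str) -> bool:
    """True if loading target_url (file Y) from the origin that represents
    origin_file_url (file X) keeps X's origin under the rule above."""
    origin_dir = file_path(origin_file_url).parent
    target_dir = file_path(target_url).parent
    # Y is in X's directory, or X's directory is an ancestor of Y's directory.
    return target_dir == origin_dir or origin_dir in target_dir.parents


def resulting_origin(origin_file_url: str, target_url: str) -> str:
    """Return the file URL whose origin the newly loaded document gets."""
    if inherits_origin(origin_file_url, target_url):
        return origin_file_url
    return target_url


# The three-frame example from earlier in the thread:
top = "file:///A/test.html"
sub = resulting_origin(top, "file:///A/B/subframe.html")
subsub = resulting_origin(sub, "file:///A/subsubframe.html")
print(sub, subsub)
```

Starting from `file:///A/B/subframe.html` instead, the same call for `file:///A/subsubframe.html` does not inherit, because `/A/B` is not an ancestor of that file.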

Or perhaps everyone who needed it has migrated to Firefox?

Or to IE/Edge, yes? On Windows, Chrome is the only browser that doesn't support this use case.

Is this a problem that Firefox would be interested in poking at?

Yes. We've been trying to figure out sane ways to restrict this case without breaking too many users for a while.

Maybe we could try gathering some telemetry about how much breakage users would actually encounter... It's hard to say with some of the corporate-firewall dark matter out there. :(

I'd hope that each of these would be explained by the origin question above.

Right, but the question is what browsers do right now.

Note that Chrome, for example, doesn't enforce CSSOM origin checks the way the spec says it should (see https://bugs.chromium.org/p/chromium/issues/detail?id=650534 and https://bugs.chromium.org/p/chromium/issues/detail?id=775525), which means there are cases that would work in all browsers right now but stop working if Firefox stops inheriting file:// origins into stylesheets but keeps correctly enforcing the CSSOM security checks.

mikewest commented 6 years ago

Maybe a clearer phrasing

Got it, thanks!

Or to IE/Edge, yes? On Windows, Chrome is the only browser that doesn't support this use case.

A very fair point.

Maybe we could try gathering some telemetry about how much breakage users would actually encounter... It's hard to say with some of the corporate-firewall dark matter out there. :(

Yup. That's a real problem.

What metrics would be helpful to add? I could imagine adding something along the lines of "How many pageviews are on file:?", along with "How many pageviews are on file: and block access to some other file:?". Since Chrome does block those requests, though, I can imagine it wouldn't be representative of usage in other browsers.

Note that Chrome, for example, doesn't enforce CSSOM origin checks the way the spec says it should (see https://bugs.chromium.org/p/chromium/issues/detail?id=650534 and https://bugs.chromium.org/p/chromium/issues/detail?id=775525), which means there are cases that would work in all browsers right now but stop working if Firefox stops inheriting file:// origins into stylesheets but keeps correctly enforcing the CSSOM security checks.

Thanks for the poke. I've pinged the bug again, let's see if we can get more alignment.

mikewest commented 6 years ago

FWIW, we can probably approximate "How many pageviews are file:?" by looking at Chrome's navigation metrics: ~1.98% of "different-page" (e.g. non-fragment, non-pushState) navigations in the last ~month were to file:, which is larger than I'd expected.

tigt commented 6 years ago

As a user, I prefer Firefox's behavior for DOM access/localStorage, because it works better when saving complete Web pages with <iframe>s (like any Tumblr page with a photoset) or localStorage.

As a developer, I often make little utilities for friends and family by giving them an .html file they can use, and localStorage is my go-to for persisting data that way. I don't mind if that storage is unsharable with anything else, but I'd really like to keep using it, scoped to that particular file or such.

[EDIT]: these utilities are often specifically to get around corporate restrictions, so they can use a small tool at work that is considered safe by the system and doesn't require network access.

Example: a color picker for a friend who wanted a particular behavior (the colors mixed in some application-specific way) that output a pasteable snippet for the software they were using. It remembers their previous combinations for ease-of-use, since localStorage isn't guaranteed long-term, but I feel that's still a fairly important use-case.

thw0rted commented 5 years ago

I know this issue hasn't seen any activity for a while but if anybody comes back here to talk about local file access, it might be helpful to add a data point about XHR behavior, since Chrome behaves differently between XHR and fetch.

FWIW, I strongly agree with @tigt that, while you may not see huge numbers of people relying on advanced features from a file: context, it would really be a shame to rely on that data to justify restricting what can be done with "fully offline" code. Today, a browser with some clever HTML/JS is a powerful tool for implementing simple applications in a locked-down corporate environment. Since I've been working in such environments for almost 20 years now, I'd hate to see those powers hobbled just because "nobody really uses them".

mozfreddyb commented 5 years ago

Firefox 68 and newer treat files as unique origins (https://bugzilla.mozilla.org/show_bug.cgi?id=1500453). Maybe it's time to make the spec change.

bzbarsky commented 5 years ago

Note that when we did this we discovered various places where Chrome does NOT in fact treat different files as different origins. That has been pretty frustrating, with the whole "now we have to reverse-engineer this stuff" business...

MattMenke2 commented 4 years ago

Chrome seems to have a single global "file://" origin used by a lot of origin code. There's at least one place in the renderer where a file origin is replaced by an opaque origin, but a lot of code uses origins that were not parsed by that code. So localStorage for Chrome URLs is global. If you enable network state partitioning in Chrome, all file URLs use a single "file" origin partition, etc.

Also, it looks to me like the Firefox and Safari descriptions here are inaccurate, at least with respect to localStorage (which is all I tested). Firefox looks to have a per-directory localStorage. Safari looks to have a global file localStorage (I suspect it just deletes it every so often, but didn't test that).
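To make the difference between these localStorage observations concrete, here is a toy model of how storage could be keyed for file: URLs. It is a sketch of the scoping options discussed in this thread (global, per-directory, and the per-file scoping requested above), not a description of any browser's implementation; `storage_key` and the model names are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def storage_key(url: str, model: str) -> str:
    """Return the key under which localStorage would be partitioned
    for a file: URL, for a given (hypothetical) scoping model."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError(f"not a file: URL: {url}")
    path = PurePosixPath(unquote(parts.path))
    if model == "global":
        # One shared bucket for every file: URL.
        return "file://"
    if model == "per-directory":
        # Files in the same directory share a bucket.
        return f"file://{path.parent}"
    if model == "per-file":
        # Each file has its own bucket.
        return f"file://{path}"
    raise ValueError(f"unknown model: {model}")


a = "file:///home/user/Downloads/a.html"
b = "file:///home/user/Downloads/b.html"
for model in ("global", "per-directory", "per-file"):
    shared = storage_key(a, model) == storage_key(b, model)
    print(model, "shares storage between a.html and b.html:", shared)
```

Only the per-file model keeps two files in the same downloads directory from seeing each other's data.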

anforowicz commented 3 years ago

Let me share some thoughts I had after reading about the cross-directory behavior described below:

There are some implications from this not captured by the discussion above. Specifically, if file:///A/test.html loads an iframe from file:///A/B/subframe.html which then loads a subframe from file:///A/subsubframe.html, then all three documents have the same origin, and that origin is the origin of file:///A/test.html. But if you started off by loading file:///A/B/subframe.html from the URL bar, then it and the subsubframe it loads would have different origins, because it would be loading something from an ancestor directory.

I assume that we want most security decisions to be based on the origin that initiates a navigation or a subresource fetch. Therefore I assume that the directory information would need to be somehow encoded in the origin (in specs + in implementations). Q1: Is this a fair assumption?

I think that the assumption leads to 2 additional questions:

Q2: What is / should be the algorithm for calculating the origin of a document?

At a high level, the algorithm used by Chromium only looks at the document's URL in most cases. The only major exceptions are "about:blank", "about:srcdoc", and initial empty documents, which may inherit their origin from the navigation initiator ("about:blank", "about:blank#blah", etc.), from the parent frame ("about:srcdoc", etc.) or from the creator/owner (initial empty document).

It seems that inheriting directory-based-restrictions might require changes (in the quoted example the initiator origin might matter when navigating to file:///A/subsubframe.html even though it is not an about:blank, about:srcdoc, etc.)
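A rough sketch of the high-level algorithm as described in this comment, to show why directory-based inheritance would not fit it as-is. This is not Chromium's code: the function names are hypothetical, origins are modeled as plain strings, and every opaque origin is collapsed into the single placeholder `"opaque"` (real opaque origins are distinct from one another).

```python
from typing import Optional
from urllib.parse import urlsplit

OPAQUE = "opaque"  # placeholder; real opaque origins are unique


def origin_from_url(url: str) -> str:
    """URL-only origin: a scheme/host/port string for http(s), else opaque."""
    parts = urlsplit(url)
    if parts.scheme in ("http", "https"):
        return f"{parts.scheme}://{parts.netloc}"
    return OPAQUE


def document_origin(
    url: str,
    *,
    initiator: Optional[str] = None,
    parent: Optional[str] = None,
    creator: Optional[str] = None,
    initial_empty: bool = False,
) -> str:
    """Origin of a new document under the described high-level rules."""
    parts = urlsplit(url)
    if initial_empty:
        return creator or OPAQUE      # inherits from creator/owner
    if parts.scheme == "about" and parts.path == "srcdoc":
        return parent or OPAQUE       # inherits from parent frame
    if parts.scheme == "about" and parts.path == "blank":
        return initiator or OPAQUE    # inherits from navigation initiator
    return origin_from_url(url)       # everything else: URL only


# The initiator is ignored for file: URLs in this model:
print(document_origin("file:///A/subsubframe.html", initiator="file-origin-of-A-test"))
```

In this model the initiator is consulted only for about:blank, so supporting the quoted file: example would mean adding a new inheritance case for file: navigations.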

AFAIK, Chromium's algorithm is implemented in the following places:

Q3: How should the directory information be represented in an origin?

Do we want an explicit top_level_directory_path or a similar field in the specs and implementations?

How crazy would it be to encode the directory information into the hostname? It seems rather icky, but is it off-the-table-icky?

FWIW, I assume that WindowOrWorkerGlobalScope.origin wouldn't change?