More CORB-protected MIME types - safelist-based approach

annevk commented 6 years ago

@anforowicz @csreis did you do an exhaustive evaluation of MIME types? E.g., any reason text/webvtt isn't there? We also blocked text/csv for scripts (standardized and in Firefox at least, not sure if Chrome has fixed that yet), can we block that MIME type for CORB?

anforowicz commented 6 years ago

No good reason for not covering other MIME types - covering only HTML/XML/JSON was an outcome of trying to protect the most resources likely to contain sensitive data while at the same time minimizing the risk of breaking existing websites. I very much agree that CORB should be extended to cover more resources, although I think we'd like to wait with this until HTML/XML/JSON implementation from Chromium reaches Chrome stable channel and doesn't uncover any worrisome issues.

I think we have the following options for extending CORB to other MIME types:

Explicitly adding MIME types one-by-one (csv, pdf, zip, etc.).
- It seems that mislabeling of images/audio/video/scripts with these MIME types is unlikely, so maybe CORB confirmation sniffing is not necessary for them
- This doesn't scale and won't help protect 1) many existing types or 2) a hypothetical new JSON-or-XML-like-format-from-the-future
Changing CORB so that instead of using a blacklist (HTML/XML/JSON currently) it uses a whitelist (everything other than javascript/css/images/audio/video/fonts/... - see the explainer section here)
- This seems to be much cleaner/pure design, but it also seems riskier than other approaches - it requires enumerating all the legacy types that cross-origin-safe resources are served with (for example the application/octet-stream case that burned Firefox back in the day).
Using an opt-in header (like maybe the From-Origin header)

PS. I am not sure I understand the text/webvtt example. Isn't this an example of a resource that (like video/audio/images) can be embedded cross-origin?

annevk commented 6 years ago

@anforowicz text/webvtt cannot be fetched cross-origin without CORS, unless I'm missing something.

jakearchibald commented 6 years ago

I think the safelist approach is the way forward. That way, new APIs that wanted to use no-cors requests would need to make a change to the fetch spec. Through review, we could make sure those responses are distinguishable from content that's likely to contain user data (or encourage them to use CORS instead).

annevk commented 6 years ago

I think we should forbid no-cors for new APIs (we already held the line for module scripts). Safelist would be great, but it's a little unclear to me how feasible that is.

anforowicz commented 6 years ago

@anforowicz text/webvtt cannot be fetched cross-origin without CORS, unless I'm missing something.

TIL :-) In that case I don't have concerns with including text/webvtt on the blacklist, next to HTML/XML/JSON/etc.

I think the safelist approach is the way forward.

I agree. From Chrome side we can probably use https://crbug.com/802836 for tracking. Before we proceed I think we need to:

Decide how to handle a responses with a missing Content-Type header
Flesh out the contents of the blacklist. I think so far we have:
- JavaScript MIME type like application/javascript or text/jscript
- text/css
- image types like types matching image/*
- audio or video types like audio/*, video/* or application/ogg
- font/* or one of legacy font types
- multipart/*
- Other MIME types like application/octet-stream
- I am also not sure what special treatment (if any) application/x-shockwave-flash should get

Also - if (for whatever reason) we decide to stick with the blacklist approach, then how can we decide what MIME types to cover in addition to HTML/XML/JSON? So far I think I've heard:

text/webvtt (and also text/vtt? how likely is it that sensitive/origin-bound data is present here?)
application/pdf (example of sensitive data: bank statement)
application/zip (example of sensitive data: downloading a directory from Google Drive / Dropbox / etc.)

jakearchibald commented 6 years ago

A good start on the safelist approach would be to list all the APIs that can consume opaque responses.

annevk commented 6 years ago

text/vtt is the correct MIME type. text/webvtt is something I made up. I don't think it's likely to contain private information, but if we stick to a blocklist we should block as many MIME types as we can. I think these are all the no-cors consumers:

Plugins
Images
Media (except for subtitles, those are CORS-bound)
CSS (this requires a correct MIME type cross-origin fortunately so only need to safelist text/css for this not to break)
Classic JavaScript

I think fonts are CORS-bound and therefore do not need to be considered, but perhaps not in Safari still.

However, all of the above are pretty notorious for being served with the wrong MIME type. Perhaps it's slightly better if require X-Content-Type-Options as well as allow some of the wrong MIME types (such as application/octet-stream), but then the question becomes how many sites specify X-Content-Type-Options. (Since otherwise sites still need to perform work in order to get protection.)

jakearchibald commented 6 years ago

So the safelist approach would be:

If destination is style, and mime type is not text/css, return blocked.
If the response has opted out of sniffing, and the mime-type is not one of the allowed no-cors types, return blocked.
Use the destination of the response to filter the set of sniffing algorithms to try (so if the destination is 'image', only keep the image sniffing algorithms. This is a performance optimisation, nothing to do with security).
Try each of the sniffing algorithms. If any match, return allowed.
Return blocked.

In cases where the destination doesn't filter the sniffing algorithms (eg fetch(url, {mode: 'no-cors'}), it might be slow due to the amount of checking. We either just take the hit here, or find a way to keep an opaque response's body in another process until we can perform a specific sniff.

I think fonts are CORS-bound and therefore do not need to be considered, but perhaps not in Safari still.

Yeah they're CORS requests according to the spec. If Safari's still violating the spec there, they'll probably have to violate/extend the spec here too, to add font mime type checking and content sniffing.

anforowicz commented 6 years ago

If destination is style, and mime type is not text/css, return blocked.

"destination" might not always be available (not sure if the "network service" in Chromium will be aware of it - I think it tries to shy away from any content::ResourceType knowledge) or trustworthy (the renderer can just lie). Therefore, I'd rather avoid using the "destination" for a decision here.

[...] keep the image sniffing algorithms [...]

Currently CORB only sniffs for HTML/XML/JSON to confirm that blocking is okay (sniffing for blocked resources). I think this is a fine approach going forward as well (we can discuss whether similar confirmation sniffing is needed for some other types like PDF or ZIP - I think the answer is "no - not needed" because mislabeling of an image or script as PDF should be hopefully rare).

I don't understand why we'd switch this approach to sniffing for allowed resources:

This might be problematic if "destination" is not available.
It is difficult to properly cover a hypothetical image/future-format in this kind of sniffing

fonts are CORS-bound

Good to know. This possibly makes things slightly easier.

However, all of the above are pretty notorious for being served with the wrong MIME type.

Yes :-(

The proposed restriction of safelist-based blocking to nosniff responses might help, but I do note that it didn't help in the application/octetstream case encountered earlier by Firefox. There are probably other similar cases like this :-/

annevk commented 6 years ago

I think it's fine to block future (image) formats. CORS can be used for those.

jakearchibald commented 6 years ago

@anforowicz

"destination" might not always be available (not sure if the "network service" in Chromium will be aware of it - I think it tries to shy away from any content::ResourceType knowledge) or trustworthy (the renderer can just lie). Therefore, I'd rather avoid using the "destination" for a decision here.

That step's more of an early-exit. It'll also be handled in the later steps in case of fetch(url, { mode: 'no-cors' }).

I don't understand why we'd switch this approach to sniffing for allowed resources:

Isn't it better to further limit the amount of no-cors data that can end up in the process? It just seems like no-cors is a source of so many security issues, so trying to restrict it as much as possible seems like a good thing.

anforowicz commented 6 years ago

I thought that I'd mention that @arturjanc and @mikewest published https://www.arturjanc.com/cross-origin-infoleaks.pdf where they list "doesn't protect all resources" as one of major cons of the current CORB algorithm.

anforowicz commented 6 years ago

I don't understand why we'd switch this approach to sniffing for allowed resources:

Isn't it better to further limit the amount of no-cors data that can end up in the process? It just seems like no-cors is a source of so many security issues, so trying to restrict it as much as possible seems like a good thing.

I agree that it is better to limit the about of no-cors data that can end up in a cross-origin process. I don't understand why above you are proposing to use "image sniffing algorithms" rather than continue using the current CORB approach of only doing confirmation sniffing. In other words, I don't understand what is wrong with the following sniffing scheme for CORB:

Content-Type: text/html => sniff to confirm that the response really contains html (and not a html/js polyglot or an image for example). block only if 100% confident that the response really contains html.
Content-Type: text/json, Content-Type: application/xml => sniff similarily to the above (i.e. only attempt to confirm the declared Content-Type)
New Content-Type (e.g. PDF/ZIP/etc.) - block outright without any CORB sniffing (assuming that mislabeling an image as PDF is unlikely / behaving as-if nosniff was present in response headers).

annevk commented 6 years ago

@anforowicz if you want to switch to a safelist-based approach, you wouldn't want to confirm something is HTML (which is not allowed anyway), but rather you'd want to confirm something is an image (which is allowed).

I was thinking of splitting up this issue earlier. Create a new issue for how to go about switching to a safelist. And keep this issue for extending the current approach. That might help.

anforowicz commented 6 years ago

@anforowicz if you want to switch to a safelist-based approach, you wouldn't want to confirm something is HTML (which is not allowed anyway), but rather you'd want to confirm something is an image (which is allowed).

I guess there might have been a misunderstanding of what each of us means by a "safelist-based approach" - possibly I didn't look carefully enough at @jakearchibald's earlier comment with a more detailed proposal of what is meant by "safelist-based approach".

I think there are 2 options of introducing a "safelist-based approach":

Option #1: add a safelist-based decision to the current CORB algorithm:
- HTML/XML/JSON and text/plain: current behavior of html/json/xml confirmation sniffing (unless nosniff or 206)
- image/audio/javascript/other-safelisted-type: allow
- any other type: always block without any sniffing
Option #2: replace the current CORB algorithm with a safelist-based decision:
- image/audio/javascript/other-safelisted-type: allow
- any other type: sniff to check if the response is an image/audio/javascript/other-safelisted-type and only block if it is not

I think @annevk and @jakearchibald are proposing Option #2. I think that Option #1 might be easier to implement (with a well-defined, small set of formats to sniff for).

I was thinking of splitting up this issue earlier. Create a new issue for how to go about switching to a safelist. And keep this issue for extending the current approach. That might help.

I am not opposed, although I do note that we've accumulated quite a bit of safelist discussion here already...

jakearchibald commented 6 years ago

Yeah, I'm suggesting option 2.

anforowicz commented 5 years ago

I agree that a "safelist"-based approach is correct for the long-term, but it may take some time to implement (considering the non-negligible risk of breaking existing websites). So, I've opened https://github.com/whatwg/fetch/issues/860 to discuss if in the short-term it might make sense to add extra individual MIME types to the list of types protected by CORB.

@annevk - I wonder if you could maybe change the subject of the current issue (https://github.com/whatwg/fetch/issues/721) to something like "More CORB-protected MIME types - 'safelist'-based approach"? I couldn't figure out how to change the subject myself (maybe I cannot do it as I am not the reporter?)... :-/

anforowicz commented 5 years ago

@csreis and me discussed this yesterday and I'd like to share some more thoughts about the "safelist"-based approach (having CORB block everything other than known safe types like images/audio/video/javascript/stylesheets). These are just more notes about options 1 and 2 from https://github.com/whatwg/fetch/issues/721#issuecomment-390263197:

Option 1: Look at Content-Type header to detect if response is a safe type
- This is easy to implement
- This seems likely to break existing websites (e.g. see the application/octet-stream trouble here). To prevent breaking existing websites, the current html/xml/json confirmation sniffing would probably need to be retained.
- This should cover future image/audio/video types without any implementation changes (assuming they are not xml-based...)
Option 2: Sniff to detect if response is a safe type
- Sniffing of Javascript and CSS seems difficult to implement in practice (?)
- Robust sniffing seems unlikely to break existing websites
- It is unclear if/how this might extend to future image/audio/video types (would they need to pretend that they sniff as one of earlier types so that CORB allows them? would we force CORS for them (not sure if this is doable for img tag)).

annevk commented 5 years ago

See also #870. There's many Content-Type values that'll be happily parsed as script. For option 2:

I think we should figure out to what extent cross-origin style sheets are a problem. It would be interesting to know how many Chrome fetches that lack a Content-Type header or have a Content-Type header value that cannot be parsed. (Otherwise a strict match for text/css is required.) https://bugzilla.mozilla.org/show_bug.cgi?id=1531405 tracks this idea in Firefox.
For the subset of JavaScript fetches that lack or have an improper Content-Type header it might be feasible to "validate" it in a separate process by running a parser. This would slow them down and a valid MIME type could be used to get out of that.
I think requiring CORS for new "types" is very reasonable and we already successfully did that for JavaScript modules.

anforowicz commented 5 years ago

For option 2 (sniffing) @acolwell also points out that sniffing is challenging for range requests (which have to be supported for media content). FWIW, the current CORB implementation treats responses to range requests as if the X-Content-Type-Options: nosniff response header was present.

And I would also like to point out that sniffing should ideally be based on the first X bytes of the response (right not Chromium sniffing only looks at most at the first 1024 bytes to avoid increasing the latency).

annevk commented 5 years ago

What effect does X-Content-Type-Options: nosniff have for media requests? It's not defined to have any. If we can require Content-Type for range requests that'd be ideal though, I don't really see how we'd handle those otherwise indeed. (Agreed on a 1024 limit.)

annevk commented 5 years ago

I chatted with @jakearchibald and I'm no longer convinced range requests are much of an issue. It seems we first always determine the type of a media resource and then any range requests will go to a fixed URL, meaning it'll only potentially "leak" content of a resource identified to be a media resource. See also https://github.com/whatwg/fetch/issues/144#issuecomment-368040980. The main thing remaining here is an exact list of file signatures we would want to safelist. And not allow extensions of that list (use CORS).

@lukewagner also considers a basic JavaScript-validating function to be doable (if you don't want the performance hit, use a valid Content-Type or CORS).

Which leaves CSS, getting telemetry on that would be great.

jakearchibald commented 5 years ago

any range requests will go to a fixed URL

The URL can change as long as all data is visible. Some research on this https://github.com/whatwg/fetch/issues/145#issuecomment-370757789.

But yeah, response URLs cannot be mixed if data is opaque. And opaque data cannot be mixed with visible data. That's what casued https://jakearchibald.com/2018/i-discovered-a-browser-bug/.

anforowicz commented 5 years ago

So, is your proposal to add some state to CORB, so that range requests are handled like this:

If CORB sees a 206 response then
- If this is a same-origin response, then return allowed
- If this is a cross-origin response, then check if 1) non-range response was received for the identical URL in the past and 2) whether the past response sniffed as one of safelist content types (e.g. as media / js / css).

Did I get that right?

annevk commented 5 years ago

Yeah, due to range responses it's not a clean filter, but a filter and some state about prior actions.

jakearchibald commented 5 years ago

In order to 'unlock' opaque responses that start at a range other than 0, Fetch could require the caller to provide the response for the first chunk so it can be validated. Kinda similar to https://wicg.github.io/background-fetch/#validate-partial-response-algorithm.

annevk commented 5 years ago

A safelisted MIME type in a Content-Type header should work as well. Those would generally bypass the filter. Perhaps we should sketch out the high-level algorithm so folks can poke it at more easily.

anforowicz commented 5 years ago

One thing that might need some clarification is how the state's lifetime is managed. If all previous sniff results need to be stored, then there will be unbounded memory growth. OTOH, if we discard previous sniff results, then we risk that a range request issued after waking up from a long sleep will be incorrectly blocked by CORB.

anforowicz commented 5 years ago

Actually, I guess that we can have separate CORB state for each execution context (frame, service worker, etc.) and discard CORB state when the frame goes away. So maybe the memory growth concerns are not a big isssue. And some other heuristics can also help in a non-malicious case (e.g. only storing state for range-requestable resources - for audio and video, but maybe not for javascript, stylesheets and images). I am not sure if there is something that can be done to prevent unbounded memory growth in a malicious case - maybe we can keep at most X entries in the cache and purge least recently used ones when overflowing the threshold.

annevk commented 5 years ago

Range requests are only a problem for audio/video at the moment I think, but there is some model where we could tie things to the element doing the requests that would minimize unbounded memory growth. I suspect tying it to the lifetime of the fetch group is more reasonable though. I sketched out the algorithm and necessary state below:

A user agent has a no-CORS-safelisted requesters set.

A request has an associated no-CORS media identifier (null or an opaque identifier). Null unless explicitly stated otherwise.

[The idea here is that the no-CORS media identifier is owned by the media element (audio/video only; I'm assuming we won't do range for images without at least requiring MIME types at this point). As part of the element being GC'd, it would send a message to get all the relevant entries from the user agent's no-CORS-safelisted requesters set removed. There might be better strategies available here and it's not clear to me to what extent we need to specify this, but it's probably good to have a model that does not leak memory forever so the set needs to be keyed to something. The fetch group might also be reasonable.]

A no-CORS-safelisted media MIME type is ... (audio, video).

A no-CORS-safelisted non-media MIME type is ... (image, text/css, JavaScript MIME types).

To determine whether to allow response response to a request request, run these steps:

Let mimeType be the result of parsing Content-Type of response.
If mimeType is a no-CORS-safelisted media MIME type, then:
1. Append (request's no-CORS media identifier, request's current URL) to the user agent's no-CORS-safelisted requesters set.
2. Return true.
If mimeType is a no-CORS-safelisted non-media MIME type, then return true.
If the user agent's no-CORS-safelisted requesters set contains (request's no-CORS media identifier, request's current URL), then return true.
Wait for 1024 bytes of response or end-of-file, whichever comes first and let bytes be those bytes.
If the image type pattern matching algorithm given bytes does not return undefined, then return true.
If the audio or video type pattern matching algorithm given bytes does not return undefined, then:
1. Append (request's no-CORS media identifier, request's current URL) to the user agent's no-CORS-safelisted requesters set.
2. Return true.
If the JavaScript pattern matching algorithm does not return undefined, then return true.
Return false.

Note: user agents are encouraged to optimize the above algorithm, by taking into account the context where the request originated from.

acolwell commented 5 years ago

I'm not sure I completely understand all the issues here so I apologize if I am missing something. I just wanted to point out that range requests are not only limited to requests initiated by a audio/video tag. Web applications that use Media Source Extensions (MSE) typically use XHR and fetch to get media data and will use range requests to fetch media segments from the middle of a file. I think in most sane scenarios @jakearchibald suggestion of referencing a request that accesses the beginning of the file could work because these applications usually access the beginning of the file to get the "initialization segment" for initializing codecs.

There are some "advanced" scenarios where this initialization data might be passed down embedded in the HTML/JS to "save a round trip" and in that case it is likely that an XHR or fetch with a Content-Range would occur on a URL w/o accessing the beginning of the resource. I understand we probably can't accommodate all scenarios but I just wanted to point this out as a thing applications might do.

Another thing I noticed while trying to follow this discussion is that the "audio or video type pattern matching algorithm" might not properly detect Basic Media Segments(http://wiki.webmproject.org/adaptive-streaming/webm-dash-specification) as media because they don't contain the EBML header that the sniffing logic is looking for. These types of segment are used to avoid the overhead of resending the initialization segment in each segment request. IIUC though this sniffing only occurs if the Content-Type isn't set correctly so I guess it is ok if this fails. It might encourage sites to fix their Content-Type.

jakearchibald commented 5 years ago

@annevk

but there is some model where we could tie things to the element doing the requests that would minimize unbounded memory growth

I think this is the right approach. In the case of opaque data, all responses need to come from the same place. The only thing that can assert this is the API that processes the data (eg the <audio>). It's the only thing smart enough to know that if the src changes, it's requesting a 'new' resource.

Since <audio> would need to keep the original response url (and maybe headers) around anyway, it may as well be used for CORB too.

annevk commented 5 years ago

@acolwell in the case of MSE via fetch() or XMLHttpRequest there's no opaque responses aiui, so those responses would not end up being filtered at all by this algorithm. Those responses need to be same-origin or use CORS.

annevk commented 4 years ago

I created https://github.com/annevk/orb to flush out the idea above more concretely. The challenges to me seem:

Range requests. I'm not confident the current model works with redirects and such. Or are they not allowed? Maintaining a requesters set is also not great.
One of the last steps:

If response's body parses as JavaScript and does not parse as JSON, then return true.

Perhaps waiting for and parsing the complete response is fine as it only affects a small number of responses (and maybe there are ways to further bring that number down), but I'm not sure if that's acceptable to all.

Thoughts on those more than appreciated!

annevk commented 2 years ago

There is now a PR for this in #1442.

whatwg / fetch

More CORB-protected MIME types - safelist-based approach #721