Handling Partial Content / 206

mnot commented 9 years ago

Discussed in #97 - breaking out into a separate issue.

mnot commented 9 years ago

Some discussion at TPAC. There are many use cases for range requests, initial approach might be to identify bare minimum and only support those. E.g.,

Pass through range requests to the network
Serve range requests from a complete cache entry
Cache partial content when requested
Synthesise range requests to complete partial content (for full or partial requests)
...

(0) and (1) are a straw-man target for now. Thoughts?

annevk commented 8 years ago

So from those:

Already happens.
Is something for the Cache API to decide.
Already happens in step 8 of https://fetch.spec.whatwg.org/#concept-http-network-fetch as far as I can tell.
This does not currently happen.

Here are some things I'm wondering about:

Should we change https://fetch.spec.whatwg.org/#concept-http-network-or-cache-fetch step 15 to talk about one or more partial responses?
Should step 17.1 explicitly talk about a complete response?
Is there any freshening of headers when you get a 206 response? https://tools.ietf.org/html/rfc7234#section-4.3.4 seems to suggest there isn't. HTTP Range Requests also does not talk about this as far as I can tell.
Is therefore the only thing we need to introduce a 206 clause (similar to step 17 for 304) when cache mode is either "default" or "no-cache" that does recombining?

mnot commented 8 years ago

You can take the approach that every 206 is combined into one cache entry as soon as it occurs, or you can alternatively store multiple 206s and combine them at some later time. HTTP doesn't care which, as long as the effect is the same; see http://httpwg.github.io/specs/rfc7234.html#combining.responses
Step 14 already requires a complete response.
see http://httpwg.github.io/specs/rfc7233.html#combining.byte.ranges
maybe, yeah...

mnot commented 8 years ago

Is something for the Cache API to decide.

I mean the HTTP cache, not the SW cache. Or do you mean it's effectively an implementation decision?

annevk commented 8 years ago

I mean the HTTP cache, not the SW cache.

Ah I see. Maybe we should talk about that too.

code-tree commented 8 years ago

Serve range requests from a complete cache entry

Any hope that this will be implemented, in terms of being able to do it from a SW cache? It certainly would help with a problem I currently have with the Audio element and range headers

jakearchibald commented 8 years ago

I need to do some research to see what browsers do today vs the spec & decide if the service worker spec needs changes, or browsers need to fix bugs.

In the mean time, check out https://samdutton.github.io/samples/service-worker/prefetch-video/ - it's a bit hacky, but it works. It constructs ranged responses on the fly.

code-tree commented 8 years ago

Thanks for the link, the workaround worked for my audio problem too.

jakearchibald commented 8 years ago

So, I did the research on range requests. It's a mess. https://docs.google.com/document/d/1SphP-WNxqzZrSv_6ApC9_FpM-m_tLzm57oL3SNGg-40/edit#heading=h.1k8r6xdc6vfo

I think fetch should say something about the minimum browsers are required to support here.

jakearchibald commented 8 years ago

Also, unless I'm missing something, fetch should also create a network error if a 206 is returned when no range was requested.

annevk commented 8 years ago

@jakearchibald if you have time to work on this (or can find someone else) that would be great. I'm kinda swamped.

jakearchibald commented 8 years ago

I'll talk to our media teams and see what their expectations are and get as far as I can.

mnot commented 7 years ago

FWIW, I've written some basic tests to see if the HTTP (!SW) cache that Fetch uses supports partial content (i.e., range requests).

AFAICT, the only browser that does anything is Safari TP, which will store a partial response (i.e., one whose request had a Range header) and serve that response from cache if a request with the same Range header is made.

It will not serve a subset of an already-cached response to a request containing Range; nor will it complete an existing stored partial response by creating its own Range headers when it goes forward.

Firefox and Chrome don't even handle the simple case (again, according to my test; I talked to @mcmanus about this this AM, and he was a bit surprised to hear that). Edge doesn't support the HTTP cache from Fetch, AFAICT.

jakearchibald commented 7 years ago

I've been thinking about this again, here's a brain dump:

I should run my tests again, but using a wav file, which would eliminate weirdness due to metadata being at the end of the file. (thanks @foolip!)
Research which APIs follow the validation rules for combining ranges https://tools.ietf.org/html/rfc7233#section-4.3 - @foolip suggests that media elements don't care, but what about other APIs? It would be pretty bad security-wise if some APIs didn't validate.
It's down to individual APIs to say if they can accept ranged requests, what type of validation they require, and what URL they're prepared to accept a response from. Eg, for the first part of a response, they may accept any url, but for further parts they may only accept responses from that url.

annevk commented 7 years ago

I don't think we do range requests for anything but media elements, though that might have changed recently for image elements now that I think of it?

jakearchibald commented 7 years ago

There's downloads, which I guess are outside of spec land. We want to use them in background fetch.

annevk commented 7 years ago

Downloads are somewhat defined, as they interact with navigation and <a download> and such. Also, https://github.com/whatwg/html/issues/954.

jakearchibald commented 7 years ago

I pitched the following to our security team:

Add "Range" to safelist headers, allowing it to be sent to any server. Perhaps restrict value to "bytes=[number]-[optional number]" if that offers any protection.
Return a network error if any of the following is true:
- The response has status 206 and request does not have a range header.
- The response has status 206, the response is opaque, and the request url is not the first entry in the response url list.

My assumption is that APIs that range requests are only used by media elements and downloads (which I'll need to verify). This means you can't interpret a portion of a resource as script/css/etc.

Additionally, APIs consuming ranged responses should ensure all parts of a range have the same first entry in the response url list for a given resource.

However, there's still a worry that this new capability carries significant risk, and that we should look for another way forward.

The alternative solution is to find a way to mark a request as "allowed privileged headers", and allow the Range header in that case. Modifying the request in any way would remove the "allowed privileged headers" flag, meaning you couldn't take an internally-created Range request & change the URL & make the request, but you could do fetch(fetchEvent.request) if it had a Range header.

This means new Request(request) would copy requests "allowed privileged headers" flag, although modifying request.headers will unset it. new Request(request, init) would return a request with "allowed privileged headers" unset.

In addition to this, we should still:

Return a network error if any of the following is true:
- The response has status 206 and request does not have a range header.
- The response has status 206, the response is opaque, and the request url is not the first entry in the response url list.

Additionally, APIs consuming ranged responses should ensure all parts of a range have the same first entry in the response url list for a given resource.

DanielBaulig commented 7 years ago

Just a note here on

Return a network error if any of the following is true:

The response has status 206 and request does not have a range header.

We (Facebook) are actually using this currently on XHR as a feature to circumvent CORS preflight requests requirements for some cross origin requests. We add the range as query string parameters and have the cross origin server interpret them equivalent to range headers.

We are currently in transitioning some of these things to fetch and breaking this would be a big problem for us. Adding Range to the header safe-list however, would probably meet our requirements and in fact would likely be the superior solution anyway, since it will allow us to fully remain within HTTP semantics and allow for better caching at various levels.

jakearchibald commented 7 years ago

@DanielBaulig thanks! If we shipped what I proposed it'd have broken XHR too, since XHR uses fetch.

Is the server sending a 206 status in this case? I guess you're using this in a super-safe way that doesn't allow the original content to be interpreted as script?

DanielBaulig commented 7 years ago

@jakearchibald I thought it did, but I just double checked and it actually doesn't return a 206, but a 200, so there shouldn't be any acute breakage.

We are only applying this to media content fetched through XHR/fetch and currently always have well aligned requests to the same byte ranges, meaning the byte ranges of two requests will always either be equal or excluding each other, there will never be partial overlaps, so we shouldn't have any cacheability regressions.

That said, the reason I stumbled on this issue thread in the first place is that we are looking into changing this and using less fixed and well aligned byte ranges that could end up (partially) overlapping. Since that would break caching (the query string parameters are included in the browsers cache keys, if they are not identical, there won't be a cache hit), we were looking into implementing caching for these requests in SW by breaking the byte range query parameters out of the cache key for these requests, which then lead me to this thread and also to my reply on this W3C SW thread regarding Cache API supporting Range requests and responses.

Since we are not actually returning 206 though, we should not be seeing any problems from this proposal being implemented. My bad.

annevk commented 7 years ago

Still, I think we should test what browsers actually do when the server returns a 206. If browsers just return it as-is I don't think we should change the behavior into returning a network error. It's not worth the risk.

sirdarckcat commented 7 years ago

@DanielBaulig would you mind to show an example of such artificial range parameter? How could I test the behavior you described?

jakearchibald commented 7 years ago

Seconded, it'd be interesting to know which Facebook URLs support this.

jakearchibald commented 7 years ago

All browsers seem to allow a 206 partial response for script elements. As in, it will execute the body of the response, as if it was a 200.

Chrome security were a little worried about this, and were keen on making this an error.

But yeah, we'd need to do a lot of testing.

DanielBaulig commented 7 years ago

@sirdarckcat We use bytestart, byteend query string parameters for video playback on Facebook to deal with CORS restrictions. If we added normal Range headers we would have to do preflight requests to resources served from our CDN origin.

jakearchibald commented 6 years ago

I'm going to take another swing at this.

These cases should work:

A media element makes a series of ranged requests. Each request arrives in the service worker, is fetched, and the response returned. This works without a service worker, so it should work in a pass-through case.
A media element makes a series of ranged requests. The service worker uses a mixture of custom responses, and non-opaque responses. All of these responses are non-opaque, so there's no issue.
A script element makes a request. The request arrives in the service worker, it's fetched, and the response is partial. The script element accepts and executes the partial response. It's super weird, but browsers already allow this without service worker. As @annevk said, we shouldn't break that as part of this effort. If we want to break this, we'll look at it separately.

Since the service worker can rewrite fetches, it opens up the following attacks:

Attack 1

A media element makes two requests:

Request: Resource A. No-cors. Cross-origin. Byte range 0-5000. Response: JS-constructed. Valid media container data, stopping before the bit that defines the duration of the media.

Request: Resource A. No-cors. Cross-origin. Byte range 200-5000. Response: fetch(event.request). Response is opaque.

In this case, resource A isn't a valid media resource, but its 200th byte is now leaked as mediaElement.duration.

Solution: The media element must not allow a mixture of opaque and non-opaque responses for a given piece of media.

Attack 2

A media element makes two requests:

Request: Resource A. No-cors. Cross-origin. Byte range 0-5000. Response: A fetch to resource B, which responds with opaque valid media container data, stopping before the bit that defines the duration of the media.

Request: Resource A. No-cors. Cross-origin. Byte range 200-5000. Response: fetch(event.request). Response is opaque.

Again, the 200th byte is leaked.

Solution: If the media element receives opaque data, the last URL in each response's URL list must be identical.

Looking at https://github.com/whatwg/fetch/issues/145, it seems Chrome is fine as long as the responses are all the same origin. However, I'm worried about origin A redirecting/rewriting to lots of different places in origin B.

I need to spec where a media element goes for the second part of some media, if the first part results in a redirect. Browsers behave differently here.

Attack 3

A media element makes two requests:

Request: Resource A. No-cors. Cross-origin. Byte range 0-5000. Response: A previously cached response, part of resource A, byte range 8000-8200, which just happens to be opaque valid media container data, stopping before the bit that defines the duration of the media.

Request: Resource A. No-cors. Cross-origin. Byte range 200-5000. Response: fetch(event.request). Response is opaque.

Again, the 200th byte is leaked.

Solution: In the first fetch, the start of the byte range returned does not match the start of the byte range requested. This should be rejected.

We could reject this in the fetch spec, but I don't think we should block it for manual fetch() calls.

This should be blocked by the media element, but we could have a helper in the fetch spec to make this easier.

Attack 4

A script element makes a request:

Request: Resource A. No-cors. Cross-origin. Response: An opaque partial response of resource A that just happens to contain gender = 'female', which is private user data.

In this case resource A is an html resource like:

<p>Foo</p>
<script>const gender = 'female';</script>
<p>Bar</p>

…and the browser has been previously tricked (perhaps using a media element) into making a request for the range that contains gender = 'female'.

Given that the script element accepts partial responses, this is a tricky one. It seems that we want to continue to support the case where the server has, unprompted by the range header, returned a partial response. The difference in this case is the server was promted.

Solution: The response needs to know if its associated request had a Range header. Fetch should reject if the original request did not have a range header, but the service worker provides a response that is opaque, partial, and was requested with a range header.

@annevk I'm interested in your thoughts on the solution for attack 4.

jakearchibald commented 6 years ago

(I chatted about this with @annevk on IRC & he's happy with the attack 4 solution)

annevk commented 6 years ago

I'm a little concerned with

If the media element consumes opaque data, the first URL in each response's URL list must be identical.

from Attack 2. In particular if that's what we want to do in the face of redirects. If anything we probably want to compare the last URL. (Safer would be the whole list, but unfortunately no-cors cross-origin to same-origin is already non-opaque.)

jakearchibald commented 6 years ago

My intent is to ensure that all the requests have gone to the same place, as in, the service worker hasn't tried to combine multiple sources.

If the server wants to redirect each request, that seems weird but fine. Although I'll test what browsers actually do in this case.

jakearchibald commented 6 years ago

@annevk ahh yes, I get it now. I've updated the solution to Attack 2 to be last url in the list.

ricea commented 6 years ago

@jakearchibald Would it be possible to copy-and-paste your security analysis in https://github.com/whatwg/fetch/issues/144#issuecomment-368040980 to a more official-sounding URL? It would be nice if it was linked from the Fetch standard to make it discoverable.

It would be nice if it said something about the threat model, but that might be infeasible if it came down to "the threat model of the whole web platform".

jakearchibald commented 6 years ago

@ricea I was going to leave notes in the spec next to the solutions. Does that work?

The solution for Attack 4 will be in the fetch spec, but the others will be in the HTML spec.

Is this what you meant? If not, I'm happy to put the text somewhere else, I'm just not sure where.

ricea commented 6 years ago

@jakearchibald SGTM. My concern is just to make sure it is discoverable and gets the widest possible review.

noamr commented 2 years ago

See new PR: https://github.com/whatwg/html/pull/7655. Needs some more work and lots of WPT but ready for initial feedback.

zcorpan commented 2 months ago

https://github.com/whatwg/html/pull/7655 was merged but this is still open. Should this be closed?

annevk commented 2 months ago

I think that did solve what browsers do with media, but overall range support is still bad so we should probably keep this open to keep us honest.

whatwg / fetch