Specify restriction for requests with keepalive set

yutakahirano commented 6 years ago

Problem

We, Chrome, shipped the keepalive flag in fetch API, with a non-interoperable restrictions:

If the renderer process is processing more than 9 requests with keepalive set, we reject a new request with keepalive set initiated by fetch().
If the renderer process is processing more than 19 requests with keepalive set, we reject a new request with keepalive set.
- Note: The difference between fetch() + keepalive vs. SendBeacon is from a historical reason and we are going to have a unified restriction.
If Chrome is processing more than 255 requests with keepalive set, we reject a new request with keepalive set.

These restrictions are very much visible to web developers (making 10 fetch requests is very easy), so we would like to replace them with an interoperable ones, if possible.

What’s special about keepalive?

Chrome is a multi-tab applications, and many resources (CPU, memory, network, …) are associated with tab. For example, it is easy to create a non-responding web application like this.

<html>
<head>
<script>while(true) {}</script>
</html>

But a user can free up the CPU resource by simply closing the tab without impacting other web pages (or tabs) negatively. That is also true for network resources. Web pages can issue many network requests, but they will be cancelled when the page is unloaded or killed forcibly.

Requests with keepalive set is an exception. As stated in the fetch spec, requests with keepalive set survive page unload. This is a kind of resource leak from the above POV; the request is still ongoing while the tab is closed, and we don’t have a means to abort it except for shutting down the entire browser.

Mitigations

Chrome has some mitigations for the problem.

Upload payload size [specced]

As written in the spec, we only allow 64KiB payload per fetch group at a time. Unfortunately, this restriction is easy to escape.

Timeout [chrome only]

Requests with keepalive set will be cancelled when 30 seconds passed since the associated context (mostly frame) is destroyed.

Number of requests per process [chrome only]

The number of requests with keepalive set per render process is restricted in Chrome. This restriction is hard to escape but we cannot have is in the spec because “renderer process” is Chrome-only concept. We also have some uncertainty because sometimes a renderer process can be shared.

Number of requests per browser [chrome only]

The number of requests with keepalive set is restricted in Chrome. This restriction is impossible to escape as long as you are using the browser.

Abort request after the response is received [chrome only]

Chrome aborts a request with keepalive when both of the following hold: The associated context has already been destroyed. The response has already been received. This prevents mass download after the tab is closed.

Interoperability

I think having interoperability here is particularly good from two reasons.

One is to have a unified policy. Given the leaky nature of the flag, each implementer has to make a balanced decision between developers’ convenience and users’ expectation. Having diverse policies will confuse both developers and users.

The other is the difficulty to handle errors. keepalive flag is expected to be used when the page is about to unload. In such a circumstance developers are not likely to be able to detect and handle errors correctly. Having a rigid, interoperable restrictions will be developers’ benefit.

Proposal

The policy we would have should have the following properties:

It is difficult for (evil) web developers to circumvent the restrictions without being noticed by end users.
It should be aligned with reasonable end user expectations.

Here I propose the following:

Introduce KeepaliveContext which keeps track of the inflight requests with keepalive set.
- A unit of related browsing contexts share a KeepaliveContext.
- A dedicated worker shares the KeepaliveContext with its responsible browsing context.
- A shared Worker shares the KeepaliveContext with its responsible browsing context.
- Service worker: TBD; see below.
Restrict the number of inflight request with keepalive set asynchronously in a KeepaliveContext. Put a number that all implementations should allow.
Restrict the total number of inflight body size with keepalive set asynchronously in a KeepaliveContext. Note that this changes https://w3c.github.io/beacon/#sec-processing-model which contains a synchronous check.
Cancel a request with keepalive set when a certain time period passed from the fetch group termination.
Cancel a request with keepalive when both of the following hold:
- The associated fetch group has already been terminated.
- The response has already been received.

Service Worker

A web developer can create multiple service workers by using iframes, so assign one KeepaliveContext for each service worker is not an option. Here are some ideas:

Have a global KeepaliveContext used by all service workers

This should work, but it will be too restrictive.

Use the first service worker client’s KeepaliveContext

This can be leaky in some pathological cases. This will be a bit unintuitive.

A service worker intercepted request keeps KeepAliveContext

(proposed by @ricea )

A requested initiated by a service worker is not protected, but a request coming from a page is protected by the KeepAliveContext for the original request.

Do nothing; Let the service worker keep itself alive

The spec says:

A user agent may terminate service workers at any time it:

Has no event to handle.

Detects abnormal operation: such as infinite loops and tasks exceeding imposed time limits (if any) while handling the events.

When detecting a abnormal operation, it makes sense to cancel the request even with keepalive set. That means, If we are interested in requests with keepalive set from the page, they should be protected by the service worker. This option disables keepalive protection in service workers.

I (@yutakahirano) like the last option.

yutakahirano commented 6 years ago

@annevk @yoavweiss @igrigorik @sleevi @davidben @wanderview @youennf @toddreifsteck

annevk commented 6 years ago

KeepaliveContext is just a concept, not an API, right?

yutakahirano commented 6 years ago

KeepaliveContext is just a concept, not an API, right?

It's just a concept.

yutakahirano commented 6 years ago

Does anyone have any feedback?

yutakahirano commented 6 years ago

Does anyone have any feedback?

ricea commented 6 years ago

So, from Chrome's perspective, a KeepAliveContext would replace the per-renderer restrictions?

Does this mean doing away with the 30 second timeout in Chrome? Could we at least apply an equivalent timeout to a ServiceWorker, so a page can't create a keepalive request that lives longer than a ServiceWorker could perform the equivalent fetch?

I also like the "keepalive does nothing in ServiceWorkers" approach, although it might confuse developers. Presumably if a keepalive is forwarded through a ServiceWorker it gets added to the KeepAliveContext for the page it came from?

youennf commented 6 years ago

The number of request per process might be difficult to handle for developers and might never be interoperable anyway. Number of requests per browser might be interoperable but one can argue that the actual limitation might depend on the actual configuration of the device.

It makes sense to me to limit keep alive requests according their clients/workers origin and not in terms of context or process. Clients/workers can coordinate if they do not want to hit the limit. For WebKit, the context could be double-keyed for privacy reasons.

For keep alive requests, WebKit is defining a long timeout after which request is aborted. We could try to be consistent across browsers.

Or we could have no timeout at all for these requests as long as the context of the request exists. We would then use an abort timeout when the context of the keep alive request is being destroyed. That way, a user could expect that closing a tab will eventually kill the related keep alive requests after some grace period. Not sure the timeout actually needs to be specified.

FWIW, this timeout-on-destruction is close to how WebKit is handling service workers. When there is no more clients that a service worker can talk to, the service workers get killed after a grace period.

yutakahirano commented 6 years ago

So, from Chrome's perspective, a KeepAliveContext would replace the per-renderer restrictions?

Yes.

Does this mean doing away with the 30 second timeout in Chrome? Could we at least apply an equivalent timeout to a ServiceWorker, so a page can't create a keepalive request that lives longer than a ServiceWorker could perform the equivalent fetch?

No, it should be covered by "Cancel a request with keepalive set when a certain time period passed from the fetch group termination."

I also like the "keepalive does nothing in ServiceWorkers" approach, although it might confuse developers. Presumably if a keepalive is forwarded through a ServiceWorker it gets added to the KeepAliveContext for the page it came from?

Could you tell me why it's confusing? Your proposal is very interesting. I'll add it to the options.

yutakahirano commented 6 years ago

The number of request per process might be difficult to handle for developers and might never be interoperable anyway.

Agreed.

Number of requests per browser might be interoperable but one can argue that the actual limitation might depend on the actual configuration of the device.

I'd add one more reason against this option: giving an ability to exhaust entire limit to each evil developer is not desirable.

It makes sense to me to limit keep alive requests according their clients/workers origin and not in terms of context or process. Clients/workers can coordinate if they do not want to hit the limit. For WebKit, the context could be double-keyed for privacy reasons.

It's very easy to create a cross-origin iframe (e.g., www.example.com and www2.example.com) and that's why I'd like to use a top-level frame as a basic unit instead of origin.

For keep alive requests, WebKit is defining a long timeout after which request is aborted. We could try to be consistent across browsers.

Thank you, that's really encouraging.

Or we could have no timeout at all for these requests as long as the context of the request exists. We would then use an abort timeout when the context of the keep alive request is being destroyed. That way, a user could expect that closing a tab will eventually kill the related keep alive requests after some grace period. Not sure the timeout actually needs to be specified.

That's what Chrome is doing. I'm not sure if the specific timeout value should be specified either, but describing a mechanism will be good.

FWIW, this timeout-on-destruction is close to how WebKit is handling service workers. When there is no more clients that a service worker can talk to, the service workers get killed after a grace period.

ricea commented 6 years ago

I also like the "keepalive does nothing in ServiceWorkers" approach, although it might confuse developers.

Could you tell me why it's confusing?

I'm concerned that if it appears to work but sometimes doesn't, developers might be surprised.

However, I haven't come up with any case where this really matters. I don't think we need to worry about it.

yutakahirano commented 6 years ago

I proposed a session for this topic at https://www.w3.org/wiki/TPAC/2018/SessionIdeas#Session:_restriction_for_fetch_keepalive. I'm looking forward to see you there.

yutakahirano commented 6 years ago

https://docs.google.com/document/d/1OBubpZkvxB9WioVAnfK_Fn5up_8E9o6hqX7NN0LJENY/edit

mfalken commented 6 years ago

(we talked about this offline but I wanted to have a record) It would not be good for keepalive to have no effect in service workers, especially for requests that are initiated by the worker. See https://github.com/w3c/ServiceWorker/issues/1336 for a use case.

In the notes from https://github.com/whatwg/fetch/issues/679#issuecomment-432704656 I believe the summary wrt to service workers is:

Keepalive requests from a service worker should be respected in some way.
Fetch requests with keepalive should not skip the controller service worker (a bit orthogonal to the question of whether keepalive from service worker should be respected)

yutakahirano commented 6 years ago

Based on what we discussed at TPAC, I update the proposal as following:

Introduce KeepaliveContext which keeps track of the inflight requests with keepalive set.
Each origin has its own KeepaliveContext.
Each browsing context has its associated KeepaliveContext.
If a browsing context is a top-level browsing context, its KeepaliveContext is set to its origin's KeepaliveContext.
If a browsing context is a nested browsing context, its KeepaliveContext is set to its parent browsing context's KeepaliveContext.
A dedicated worker shares the KeepaliveContext with its responsible browsing context.
All service workers and shared workers share one KeepaliveContext.
A request has an associated KeepaliveContext. It is set to
- the original request's KeepaliveContext for the service worker interception case, and
- client's KeepaliveContext otherwise.
When a fetch group is terminated, for each request r with keepalive set:
1. Let context be r's associated KeepaliveContext.
2. If the size of context doesn't exceed a certain limit, then add r to context.
3. Otherwise, abort r.

Makes sense?

youennf commented 6 years ago

I think this captures where we want to go.

We should be able to update the spec so that:

keep alive requests attached to a browsing context do not have restriction
keep alive requests that are no longer attached to a browsing context may be restricted.

I am hesitant to describe more than that given that:

implementations might not be able to agree on the definition of the KeepAlive context.
the user agent might be more restrictive in some cases like low power mode or memory pressure.

mfalken commented 6 years ago

https://github.com/whatwg/fetch/issues/679#issuecomment-433827856 makes sense to me to implement. Like Youenn, I'm not quite convinced the specification needs this level of detail, but I'll leave it for you.

The updated proposal doesn't seem to include the 30 second timeout. Is it being removed?

By the way, don't these restrictions conflict with the current Beacon spec? Beacon is specified using keepalive and says:

[Other than payload size], the sendBeacon() API is not subject to any additional restrictions. The user agent ought not skip or throttle processing of sendBeacon() calls, as they can contain critical application state, events, and analytics data.

yutakahirano commented 6 years ago

@youennf I think it's valuable to have a policy in the spec, as said in the original proposal.

I think having interoperability here is particularly good from two reasons.

One is to have a unified policy. Given the leaky nature of the flag, each implementer has to make a balanced decision between developers’ convenience and users’ expectation. Having diverse policies will confuse both developers and users.

The other is the difficulty to handle errors. keepalive flag is expected to be used when the page is about to unload. In such a circumstance developers are not likely to be able to detect and handle errors correctly. Having a rigid, interoperable restrictions will be developers’ benefit.

I think it's good to leave actual numbers hand-wavy. We can replace 9 in https://github.com/whatwg/fetch/issues/679#issuecomment-433827856 with something like:

When a fetch group is terminated, for each request r with keepalive set:

Let context be r's associated KeepaliveContext.
Add r to context.
User agent may abort r based on the number of requests in context whose done flag is not set.
User agent may abort r based on the number of requests in all KeepaliveContexts whose done flag is not set.
User agent may abort r after certain time passed.

youennf commented 6 years ago

A side note: if we decide to remove the 64KB restriction for attached keep alive requests, we might still want to warn developers that long keep alive uploads have higher chance to fail. Hence advise to keep these requests as small as possible.

MayyaSunil commented 3 months ago

Hello @yutakahirano, @mfalken

We’re excited to share that Firefox is rolling out a new feature soon, and we’re aiming to include similar resource restrictions.


    - If the renderer process is processing more than 9 requests with keepalive set, we reject a new request with keepalive set initiated by fetch().
    - If the renderer process is processing more than 19 requests with keepalive set, we reject a new request with keepalive set.
        Note: The difference between fetch() + keepalive vs. SendBeacon is from a historical reason and we are going to have a unified restriction.
    - If Chrome is processing more than 255 requests with keepalive set, we reject a new request with keepalive set.

Could you let me know if these restrictions have changed? While we’re taking a slightly different route by limiting them based on origin and restricting requests per browser instance, we want to keep our limits comparable to Chrome’s.

By the way, what’s the plan for addressing the spec issue?

Thanks a million!

ricea commented 3 months ago

Neither @yutakahirano nor @mfalken work on Chrome any more. Pinging @mingyc who has been touching keepalive as part of them deferred fetch work.

mingyc commented 6 days ago

The current keepaive-specific limits have been updated for a while:

Maximum pending fetch keeaplive requests across Chrome: 2048
Maxium pending fetch keepalive requests for a top-level frame (a Chrome-specific concept that usually maps to a top-level document in a different process, see also RenderFrameHost): 256
Maxium allowed accumulated fetch keepalive request size within a top-level frame: 512 * 1024.

The Chromium source is here.

whatwg / fetch