whatwg/html (HTML Standard)
https://html.spec.whatwg.org/multipage/

BroadcastChannel spec is vague about asynchronous nature of global list management #7267

Open · wanderview opened 2 years ago

wanderview commented 2 years ago

Currently the BroadcastChannel spec is written such that the constructor is just a couple of synchronous steps:

https://html.spec.whatwg.org/#dom-broadcastchannel

And then the postMessage() algorithm claims to synchronously get a list of all BroadcastChannel objects throughout the browser. See step 4 here:

https://html.spec.whatwg.org/#dom-broadcastchannel-postmessage

This is rather impractical, since multi-process browser architectures pretty much require asynchronous operations to get this list (or performance-killing synchronous IPC).
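For reference, here is roughly what the API under discussion looks like from script (a minimal sketch; the channel name "foo" is just an example):

```ts
// In one document or worker of an origin:
const receiver = new BroadcastChannel("foo");
receiver.onmessage = (e: MessageEvent) => console.log("received:", e.data);

// In any other same-origin document or worker:
const sender = new BroadcastChannel("foo");
// Per the spec, this walks the global list of "foo" channels synchronously
// and queues a task to fire a message event at each of them.
sender.postMessage("hello");
```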

What's more, browsers actually perform an asynchronous registration step in the constructor.

I think it's fair to say browsers do this for performance. Collecting the list for every postMessage() would involve messaging every process in the browser for each message. So instead, browsers register BC objects in a central place up front using IPC.
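Conceptually, the difference between the two approaches looks something like the sketch below. Everything here (`Ipc`, `Destination`, the function names) is a hypothetical stand-in for internal browser machinery, not a real API:

```ts
// Hypothetical stand-ins for internal browser machinery, for illustration only.
interface Destination { deliver(msg: unknown): void; }
interface Ipc {
  syncQueryAllProcesses(name: string): Destination[];   // blocks on a reply from every process
  asyncRegister(name: string, dest: Destination): void; // fire-and-forget registration
}
declare const ipc: Ipc;

// Spec as written: gather the global list inside every postMessage() call.
function postMessageSpecStyle(name: string, msg: unknown): void {
  for (const dest of ipc.syncQueryAllProcesses(name)) {
    dest.deliver(msg);
  }
}

// What implementations do: register once, asynchronously, at construction time;
// postMessage() then only needs to talk to the central registry.
function constructChannelBrowserStyle(name: string, dest: Destination): void {
  ipc.asyncRegister(name, dest); // may still be in flight when someone else posts
}
```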

There are some subtle differences between the spec and the "register asynchronously upfront" approach.

Consider:

  1. Worker thread A creates BC1 with channel name "foo".
  2. Worker thread A calls self.postMessage() to tell its parent frame that it's ready.
  3. Parent frame B creates BC2 with channel name "foo".
  4. Parent frame B calls BC2.postMessage().

Is BC1 guaranteed to receive the message?
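In code, the scenario looks roughly like this (a sketch; the file name worker.js is just illustrative):

```ts
// worker.js (worker thread A)
const bc1 = new BroadcastChannel("foo");           // step 1
bc1.onmessage = (e: MessageEvent) => console.log("BC1 received:", e.data);
postMessage("ready");                              // step 2: direct pipe to the parent

// parent frame B
const workerA = new Worker("worker.js");
workerA.onmessage = () => {                        // "ready" arrived over the worker pipe
  const bc2 = new BroadcastChannel("foo");         // step 3
  bc2.postMessage("hello?");                       // step 4: does BC1 ever see this?
};
```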

According to the spec, yes, because BC1 was constructed before BC2.postMessage() was invoked.

In Chrome, however, separate threads message the browser process to register in the central BC list over separate IPC message queues. That means BC2 can actually be constructed and post its message before BC1 has registered in the list, so there is a race condition that determines whether the message is received.

We have not observed this race in Firefox, though. I believe that is likely because its PBackground IPC mechanism uses a single pipe for all messages from one process to another. I believe the race would still exist if worker thread A were a SharedWorker or ServiceWorker in a separate process.

What is the intended behavior?

The spec says one thing, but it's wildly different from implementations, and it's unclear whether its implications were intended. Similarly, it's unclear whether Firefox's implementation is based on this nuance or is just a side effect of its IPC implementation, and it likely doesn't provide the guarantee in multi-process scenarios.

Chrome's implementation is based on the assertion that BC ordering is only guaranteed within a single thread; e.g. a synchronously scriptable set of frames, a single worker thread, etc.

Are we attempting to force guarantees across threads? And if so, how do we do that when thread A and thread B are in different processes? We can't use the same IPC pipe in that case, and the spec's style of implementation would be terrible for performance.

I would like to suggest we not require cross-thread ordering of setup guarantees.

In the meantime we are likely to land some WPT tests to deflake some of the cases that hit this condition. I hope no one objects to that.

Thoughts? @mkruisselbrink @domenic @annevk @asutherland @cdumez

wanderview commented 2 years ago

cc @recvfrom, who did all the hard work to track down this race condition.

asutherland commented 2 years ago

> I would like to suggest we not require cross-thread ordering of setup guarantees.

Yes, this seems like the only sane course of action. (And any accidental invariants Firefox's current IPC implementation is providing are potentially going to change in the medium-term future.)

annevk commented 2 years ago

See also #5327.

cc @gterzian

wanderview commented 2 years ago

> > I would like to suggest we not require cross-thread ordering of setup guarantees.
>
> Yes, this seems like the only sane course of action. (And any accidental invariants Firefox's current IPC implementation is providing are potentially going to change in the medium-term future.)

Another option we could consider (although I don't like it) would be to use sync IPC for the setup of BroadcastChannel.

This would be a bit nicer for web developers because they wouldn't have to reason about this kind of setup race. Of course, the downside is that sync IPC is bad for performance, but at least it would only happen once, at construction time.

annevk commented 2 years ago

What would be bad about using the browser process for setup coordination? How does this work with shared workers?

wanderview commented 2 years ago

> What would be bad about using the browser process for setup coordination?

Do you mean using sync IPC? Or are you talking about something else? I'm not sure how you could do it without pausing the JS runtime thread while you message the other process.

We haven't tested shared workers, but I expect them to suffer from the same problem. In that case, though, we potentially have multiple separate renderer processes trying to coordinate. I think sync IPC during setup would be the only way to remove the setup race across multiple renderer processes.

annevk commented 2 years ago

I was thinking you'd message the browser process and buffer the messages until it pings you back with the relevant information.

wanderview commented 2 years ago

I believe all implementations do buffer outgoing messages. The problem is messages coming from other BC instances before your setup is complete. The central browser process doesn't yet know that it should delay things or route messages to the new instance.
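A rough sketch of why buffering outgoing messages doesn't help here (`Broker` and `BufferedChannel` are hypothetical, standing in for the browser-process side and the renderer-side object):

```ts
// Hypothetical browser-process side, for illustration only.
interface Broker {
  register(name: string, deliver: (msg: unknown) => void, onAck: () => void): void;
  route(name: string, msg: unknown): void;
}
declare const broker: Broker;

class BufferedChannel {
  private pending: unknown[] = [];
  private registered = false;

  constructor(private name: string, onmessage: (msg: unknown) => void) {
    broker.register(name, onmessage, () => {
      // Registration acknowledged: flush anything we queued locally.
      this.registered = true;
      for (const msg of this.pending) broker.route(this.name, msg);
      this.pending.length = 0;
    });
  }

  postMessage(msg: unknown): void {
    // Outgoing messages can simply wait for registration to complete...
    if (this.registered) broker.route(this.name, msg);
    else this.pending.push(msg);
  }
}

// ...but a message that some *other* channel routes before the broker has
// learned about this instance is never delivered here: the broker has no
// entry to route it to and no way to know it should wait.
```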

The reason this is noticeable is that there are other communication paths (parent.postMessage() and MessageChannel) that communicate between threads without round-tripping through the central browser process.

So you can have two threads coordinating with direct communication that expect both BC instances to be set up, but one of them might still be processing the setup.

Honestly, I'm not sure if this is really a problem for anyone other than BC tests. It seems a bit weird to try to coordinate using two different communication methods.
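For what it's worth, a test can sidestep the race by doing the handshake over the broadcast channel itself rather than over a second communication path. Something like the following illustrative pattern (not something from this thread; `waitForPeer` is my own name):

```ts
// The impatient side retries a ping on the channel itself until the other
// side answers on that same channel, so no cross-channel ordering is assumed.
function waitForPeer(name: string, intervalMs = 50): Promise<void> {
  return new Promise((resolve) => {
    const bc = new BroadcastChannel(name);
    const timer = setInterval(() => bc.postMessage("ping"), intervalMs);
    bc.onmessage = (e: MessageEvent) => {
      if (e.data === "pong") {
        clearInterval(timer);
        bc.close();
        resolve();
      }
    };
  });
}

// The other side just answers once its own channel exists:
//   const bc = new BroadcastChannel("foo");
//   bc.onmessage = (e) => { if (e.data === "ping") bc.postMessage("pong"); };
```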

asutherland commented 2 years ago

I think the good news is that if the BC tests are having coordination problems, https://github.com/web-platform-tests/rfcs/blob/remote_channel/rfcs/remote_channel.md (prototype PR at https://github.com/web-platform-tests/wpt/pull/2980) can probably help.

I'm strongly opposed to adding a (de facto) requirement to use sync IPC without an extremely good reason, which I don't think we have. And even then, I think we'd want to weigh whatever that reason is against consultation with security experts on whether exposing a sync IPC on Workers that provides a storage-key-wide total ordering would be a problem, since it might somehow be used as a cross-agent-cluster timing gadget. While I think in many browsers Blob URL creation may already provide a means of inducing a sync IPC, the resulting Blob URL isn't something that can be communicated over in a way that would help agents A and B establish an ordering.

(Note that I'm not saying there's definitely a viable security attack that could result. I'm just saying that the potential makes my head hurt and feels like it could result in a lot of interesting security papers that would in turn make my head hurt in the future.)

wanderview commented 2 years ago

Ok. Happy to leave ordering undefined across threads.