Add onclose event to MessagePort #1766

Closed: isonmad closed this issue 10 months ago

isonmad commented 8 years ago

There's currently no reliable way to detect when a MessagePort becomes disentangled, whether due to purposefully calling .close(), or the tab being unexpectedly killed.

An onclose event was proposed and discussed several years ago, but it looks like the discussion petered out after that.

Any hope of reviving it?
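
To make the gap concrete, here's a minimal sketch of the status quo (otherWindow and handle are illustrative names, not part of any proposal):

const { port1, port2 } = new MessageChannel();
otherWindow.postMessage('init', '*', [port2]);
port1.onmessage = (e) => handle(e.data);
// If the other tab crashes, navigates away, or calls port2.close(),
// port1 simply goes silent; no event fires on this side.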

annevk commented 8 years ago

It doesn't really work with the MessageChannel model as discussed in that email. Do you have a specific proposal that doesn't expose GC?

isonmad commented 8 years ago

Reading over the thread and the spec again, that particular email's claim that there's ambiguity about what a port's 'owner' is turns out to be wrong. The spec does define which settings object owns the port, and sharing a reference doesn't change it. It looks like the pass-the-port-around case is already basically broken.

What about an event that fires:

domenic commented 7 years ago

I'll dupe https://www.w3.org/Bugs/Public/show_bug.cgi?id=28813 against this.

annevk commented 7 years ago

@isonmad the problem with that is that rather than exposing GC, you end up creating a memory leak by keeping the objects alive for the duration of the global. I think service workers ended up doing this without much thought about it.

isonmad commented 7 years ago

That was in the thread too, and it turned out there was already a leak of MessagePorts, at least in Chrome in 2013; the spec language was written around the fact that developers should be careful regardless and not depend on GC.

But is there a technical reason the MessagePort object itself actually has to be kept alive in its entirety in implementations of onclose? It seems like there just needs to be an event in Context A, where the onclose handler was set, that fires whenever the other end of the port, Context B, goes away. Which context 'Context B' refers to can change over the lifetime of the MessagePort as it's passed around, but at the point where the MessagePort in B has no more references to it, and the leak might occur, can't it just be GC'd and replaced with a simpler, equivalent internal listener, where A just subscribes to and listens for B's demise? There must be some centralized tracking of processes/contexts, right?

nornagon commented 5 years ago

I'd like to see this revived now that WeakRef is available, and in light of similar APIs which do offer the 'close' event. It seems like this event would be the only way to know when the other end of the channel goes away—especially for unexpected failures such as crashes or OOMs.

(My interest specifically comes from using MessagePorts in Electron as a general-purpose IPC tool)

Gaubee commented 4 years ago

I use LockManager to shim web-worker onclose.

/// worker: hold a uniquely named lock for as long as this agent is alive
const lockReqId = 'process-live-' + Date.now() + Math.random();
navigator.locks.request(lockReqId, () => new Promise(() => {})); // never settles, so the lock is held until the worker dies
postMessage(lockReqId);

/// master: a request for the same lock is granted only once the worker's agent is gone
worker.addEventListener('message', (me) => {
  if (typeof me.data === 'string' && me.data.startsWith('process-live-')) {
    navigator.locks.request(me.data, () => {
      worker.dispatchEvent(new CloseEvent('close'));
    });
  }
});

caniuse: Web Locks API

asutherland commented 3 years ago

I'd like to revisit this; in Firefox we're implementing Web Locks, and it seems like, absent a MessagePort close event or a newly introduced disconnect event on SharedWorkerGlobalScope, the Web Locks shim in the comment above will end up becoming a necessity.

I too re-read, from the start, the original mailing list thread linked from the top comment in this issue, and it seemed like the primary discussion points were:

With Web Locks (which is moving to the WebApps WG) exposing same-storage-key agent termination, and ServiceWorkers exposing same-storage-key client enumeration via matchAll, there already exist same-origin mechanisms to surface lifecycle information.

I think the main risk with the "close" event proposal here is that it could potentially expose cross-origin life-cycles in an adversarial manner if not constrained. Specifically, if WindowProxy.postMessage(..., "*", ...) is used to transfer a MessagePort into a different origin, would this allow the sender of the message to monitor when that agent terminates with no opt-in by the recipient agent? Presumably this could be addressed by requiring the (cross-origin) recipient to take some affirmative action indicating it wants to use the MessagePort by adding an event listener on it or invoking postMessage on it or something like that.

nornagon commented 2 years ago

@asutherland I don't think that would expose recipient agent lifetime without explicit action on part of the recipient. The sender would not be able to distinguish between "the other end of the message channel was dropped due to GC" and "the recipient agent was terminated". If the recipient agent does not register any event listeners on the port it receives, and does not save a reference to it, then the port will be garbage collected and the close event would be emitted on the sender side, revealing no information about the lifecycle of the recipient.

asutherland commented 2 years ago

@nornagon I agree that if we generate the close event on GC then we side-step the problem. I was unclear about which specific proposal I was discussing, which was the proposal comment in this thread at https://github.com/whatwg/html/issues/1766#issuecomment-247507345. I read that proposal as explicitly not letting it be known when GC happens, but instead when the global is terminated, which means that the message port post message steps (which explicitly deserialize the port, and thereby run the transfer-receiving steps, regardless of whether there are any listeners) would inherently need to tie the MessagePort to the global in a way that goes out of its way to ignore GC. @annevk pointed out issues related to this in his comment at https://github.com/whatwg/html/issues/1766#issuecomment-266814601.

@domenic @annevk Do either of you have a feeling about what the current state of things is in terms of concerns about exposing GC? I know in Firefox that the existence of Web Locks and the (ServiceWorkers) Clients API means that it is conceptually possible to detect when Firefox GC's a Worker, for example.

mhofman commented 2 years ago

I think there are two kinds of collection we need to differentiate: collection of an object within an agent, and collection of the agent itself.

As has been expressed in the email threads, I would be worried about exposing the collection of objects inside an agent (a MessagePort) to another agent. Even with the existence of WeakRef and FinalizationRegistry, the capability to observe an object's lifecycle requires direct access both to these APIs and to the object to observe. As such, tying the observable lifecycle of a remote MessagePort to program actions, or to the remote agent's overall lifecycle, seems more appropriate.

From what I understand, automatically tying the observable remote MessagePort lifecycle to the receiving agent creates its own problems. As @asutherland mentions, the solution is to tie the observation capability to explicit action of the program when receiving a MessagePort.

More specifically, a sending agent should not be informed of a remote MessagePort disconnection unless the program running in the receiving agent starts the port (registers a message listener). At that point the sending agent would receive a close notification when either the receiving agent terminates or the program explicitly closes the received port.

That is different from what @nornagon suggests: a sending agent cannot be made aware of the disconnection of the MessagePort if the receiving program ignores the received port.
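
A sketch of that opt-in model on the receiving side (the semantics here are the hypothetical ones under discussion, not shipped behavior; handle is illustrative):

onmessage = (e) => {
  const port = e.ports[0];
  // Until this program starts the port, the sender learns nothing.
  port.onmessage = (m) => handle(m.data); // assigning onmessage also starts the port
  // From here on, under this model, the sender's end would get a close
  // notification when this agent terminates or when port.close() is called.
};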

domenic commented 2 years ago

@domenic @annevk Do either of you have a feeling about what the current state of things is in terms of concerns about exposing GC?

I think since the introduction of WeakRefs the boundary has been broken a good deal, and anything which is equivalent to WeakRefs in power is allowable. It's still not something you want to do lightly, as minimizing the number of ways in which an application can be GC-dependent is important. But if it's useful (and I think this instance qualifies; we've been hearing this request for over a decade now) then it's probably reasonable to do.

It sounds like we have a similar precedent with Web Locks and ServiceWorker for whole-agent GC, so I would extend the same reasoning to that case.
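
For context, a minimal sketch of how script can already observe GC via WeakRef's companion, FinalizationRegistry:

const registry = new FinalizationRegistry((label) => {
  console.log(label, 'was garbage-collected');
});
let obj = { buf: new ArrayBuffer(1024 * 1024) };
registry.register(obj, 'obj');
obj = null; // some time after this, the callback above may run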

asutherland commented 2 years ago

For the "close" event itself, CloseEvent is already a thing and is defined in the web-sockets chapter of HTML. Would we just reuse that?

It has three attributes, defined like so (quoting from the spec):

  1. The wasClean attribute must return the value it was initialized to. It represents whether the connection closed cleanly or not.
  2. The code attribute must return the value it was initialized to. It represents the WebSocket connection close code provided by the server.
  3. The reason attribute must return the value it was initialized to. It represents the WebSocket connection close reason provided by the server.

The defaults per CloseEventInit are false, 0, and the empty string. wasClean seems like it could potentially be reused usefully here with appropriate semantics, but the others would presumably be left at their defaults?
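
Those defaults are observable directly from the constructor:

const e = new CloseEvent('close');
console.log(e.wasClean, e.code, e.reason); // false 0 ""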

domenic commented 2 years ago

I don't think we should use CloseEvent for this, as it seems rather WebSocket-specific. We can just use Event.

christianliebel commented 2 years ago

@domenic I'd also like to add my two cents here. For a client, I am building a micro-frontend application where users can dynamically add and remove additional apps hosted in iframes loaded from various origins. We're connecting to those third-party browsing contexts via channel messaging, which works perfectly fine.

However, we must do some cleanup work after an iframe is removed, the site is unloaded, has crashed, etc. Unfortunately, due to the lack of an onclose event, we currently have to poll all connected client apps to determine whether they are still connected. This heartbeat mechanism is quite inefficient, though, and the detection of a closed client can be significantly delayed.
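
A minimal sketch of such a heartbeat (assuming each client echoes 'ping' with 'pong'; all names here are illustrative):

function watchClient(port, onDisconnect, intervalMs = 5000) {
  let alive = true;
  port.addEventListener('message', (e) => {
    if (e.data === 'pong') alive = true;
  });
  port.start();
  const timer = setInterval(() => {
    if (!alive) {           // no pong since the last ping: assume it's gone
      clearInterval(timer);
      onDisconnect();
      return;
    }
    alive = false;
    port.postMessage('ping');
  }, intervalMs);
}

Worst-case detection latency is roughly two intervals, which is exactly the kind of delay being described.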

For our use case, it would also be okay to opt in to exposing lifecycle events to third-party browsing contexts (as @asutherland mentioned), as the client apps are under our control. For us, it would also be fine if the event were dispatched some time after the message port has actually been closed.

All in all, it would be really cool to have this event. :)

annevk commented 2 years ago

When we discussed this at TPAC in 2022, others did share the concern around running JS on shutdown, so it might still be better to only do this at deterministic times for now, i.e., when the specification dictates shutdown.

asutherland commented 2 years ago

When we discussed this at TPAC in 2022, others did share the concern around running JS on shutdown, so it might still be better to only do this at deterministic times for now, i.e., when the specification dictates shutdown.

@annevk Can you clarify what you mean here? I understand the central decision here is about when to notify that a close has occurred for an opted-in MessagePort (opted in by starting it after it's been deserialized in the given global) in situations other than the obvious invocation of close(). Those times are:

  1. When it's no longer possible to send a message over the MessagePort because it has become eligible for GC, by virtue of there no longer being any strong references to it and there being no event handlers for its "message" event (which exposes a reference to the MessagePort on its source property).
  2. When the agent terminates; or, if all browsers already implement MessagePorts such that we close the MessagePort when the document is cleaned up (I believe Firefox does), then we can spec document cleanup + agent termination. Currently the Web Locks language on termination runs for windows/documents at document cleanup time, and for workers uses the agent. But as I read the spec right now, I don't think we have explicit closure at document cleanup? And so for windows/iframes, although a MessagePort that was entangled into a detached iframe probably can't process received messages (because the port message queue is aimed at the not fully active Document), I think the agent may still be able to transfer the port somewhere else to make it usable again according to the spec. Although hopefully I'm missing some shutdown verbiage.

Specifically, which of the two above is compatible with the TPAC 2022 discussion? If the answer is neither, how do those concerns square with what Web Locks has already enabled? Thanks!

annevk commented 2 years ago

I think it indeed tied back to @isonmad's original proposal of embracing the memory leak in order not to expose GC. I also realize now that the script wouldn't run in the agent being shut down, as presumably we'd only dispatch the close event on the other side, which helps. (Though what if both sides go down?)

I'm not sure what the model with Web Locks is and whether that can still change. It does seem like that would indeed end up exposing similar things.

(The discussion at TPAC was quite brief and I somewhat regret bringing it up now, but on the other hand I learned a few things due to your reply...)

raegen commented 1 year ago

...the other side, which helps. (Though what if both sides go down?)

I'd say in that particular case we could probably delegate the handling to the service tailored exactly for such cases: NOOP, since with two contexts gone, out of a total of (at most) two, there is a noticeable lack of, well, anything. Jokes aside, if the agent (subscriber, consumer, context, whatever) is the one shutting down, handling of any sort as a concern falls to that agent.

fergald commented 1 year ago

I've written an explainer that covers the problem, the background, a proposal and a discussion of alternatives.

TL;DR: the proposal is to send a close event as soon as the ports become disentangled (explicit close(), crash, destruction of the JS context, or an unreferenced port getting GCed). This makes no new information available vs. the status quo, but it does make the information available in a more timely manner (less dependency on GC). The justification for this choice vs. the alternatives is in the doc. If there's a reason why that's problematic, or if you think I've missed some important part of the discussion, please let me know. In particular, @asutherland, you raised the issue of exposing navigation. This is already exposed, but this would make it timely. Did you have a specific attack in mind, or was it the usual avoidance of unnecessary cross-origin exposure?
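
In usage terms, the proposal boils down to something like this sketch (port and cleanUp are illustrative):

port.addEventListener('close', () => {
  // The other end called close(), its context crashed or was destroyed,
  // or its unreferenced port was GCed.
  cleanUp();
});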

asutherland commented 1 year ago

The explainer proposal is great, thank you. I'm on board with implementing this in Firefox.

Re: exposing navigation, the below is just me restating why this proposal works in relation to things I've previously said; I'm not raising any new issues:

My concern was a more abstract one about creating non-trivial new side channels, but that was only a concern if we were tying the MessagePort to the global in an intentional memory leak to avoid exposing GC, in a way that would let navigation be detected in the future without an opt-in from the recipient global. In particular, with that capability, being able to "ship" the MessagePort outside the agent cluster would be a new capability beyond what having access to the WindowProxy would mean (or other similar browsing context hierarchy shenanigans).

This proposal does not create a non-trivial new side channel, because an ignored MessagePort postMessaged to a WindowProxy that doesn't care will be subject to GC. As the proposal explains in great detail (and consistent with @domenic's comment), this potential exposure of GC timing is fine because WeakRefs already expose GC. As the explainer also points out, there is a trivial new side channel where, because the MessagePort will currently become entangled with the global[1], a close event could be delivered as a result of a very rapid navigation or something like that; but that's a very short timeline, and we would expect the WindowProxy to still be available on that timeline, etc.

1: Per spec, at least, the deserialization in step 7.3 of the message port post message steps is unconditional. I believe Firefox unconditionally does this, but I think I remember seeing a bug filed at one point suggesting that other engines may perform the deserialization lazily, so that it only happens on demand.

asutherland commented 1 year ago

Edit: Skip this comment; I was confused about when "source" was set, but @rakina's comment below cleared things up for me, and you can probably skip my follow-up comment explaining my confusion as well.

A related change we could make when introducing the close event (one that came up while talking with @smaug----) is that 9.4.5 Ports and garbage collection normatively says:

When a MessagePort object o is entangled, user agents must either act as if o's entangled MessagePort object has a strong reference to o, or as if o's relevant global object has a strong reference to o.

Without having done the historical investigation, my impression of this text is that it was an effort to prevent GC/CC from being observable. But given the changes here, it would seem to make sense to either remove this text or alter it to make clear that a strong reference should be conditioned on both 1) being entangled and 2) having "message" or "messageerror" listeners, which could obtain a new strong reference to the MessagePort via the source property on the events. (Note that the explainer does explicitly cover GC, but as far as I'm aware it doesn't mention relaxing this limitation on GC.)

Note that while the streams spec uses MessagePorts for transferred streams, its hookup explicitly adds the above event listeners and explicitly disentangles the ports without removing the event listeners, so the proposed change above shouldn't affect streams, just manually created MessageChannels/MessagePorts.

jjkola-eskosystems commented 1 year ago

I would love to see this implemented. My case is that I have a shared web worker which may have more than one window connecting to it.

Because there is no close event when one of the windows goes away, I have to use a combination of WeakRef (on the worker side) and polling, alongside explicit window close events in the cases where I can get those on the window side, so that I can try to guess when a window is no longer available.

To handle the worker going away, I have added a polling timeout on the window side (if the timeout fires, I try to poll the worker one last time before giving up). I have also added an explicit worker close event to the worker, in case an event happens to be processed during worker close, i.e. while the closing flag is true (although I'm not sure in what situation that can happen other than the last connection dropping). With these combined, I should be able to know when a window needs to reconnect to the shared web worker.

All in all, this is quite a lot of code just to see whether the connection is still alive. Also, because of the polling, there is quite a lot of extra processing involved (I have to poll quite often to ensure that I can detect a missing window in a reasonable timeframe).

In my implementation, the shared web worker is used as a communication platform between the windows. It serves as a secure side channel alongside window-to-window communication, since the window-to-window communication is not secure by design (at least in my case, as I'm routing most of the messages through third parties). Because the shared web worker knows all the connecting windows, it is imperative that it knows which windows are still available, to prevent unnecessary wait time (for the user) and/or unwanted behaviour (in cases where we think a window has gone away but it was just bogged down) when multiple windows are affected.

rakina commented 12 months ago

FYI, Chromium has started implementing the proposal from the explainer. Please let us know if there are still unresolved questions that haven't been answered in the explainer. We'll be sending out a spec PR soon too, but I have a question about @asutherland's comment above regarding GC:

A related change we could make when introducing the close event (one that came up while talking with @smaug----) is that 9.4.5 Ports and garbage collection normatively says:

When a MessagePort object o is entangled, user agents must either act as if o's entangled MessagePort object has a strong reference to o, or as if o's relevant global object has a strong reference to o.

Without having done the historical investigation, my impression of this text is that it was an effort to prevent GC/CC from being observable. But given the changes here, it would seem to make sense to either remove this text or alter it to make clear that a strong reference should be conditioned on both 1) being entangled and 2) having "message" or "messageerror" listeners, which could obtain a new strong reference to the MessagePort via the source property on the events. (Note that the explainer does explicitly cover GC, but as far as I'm aware it doesn't mention relaxing this limitation on GC.)

Can you elaborate more on why we would need this part to change? From the note in the spec, this seems to be there to allow the message event listener for the port to continue to run even when there are no references to the port itself. Is the intent behind changing that to allow the port to be GCed more often, or something else? How does that relate to the close event?

Also, even if we need to change it, I don't think point #2 is relevant, since the source property isn't actually set on message or messageerror events. Per the spec, only connect events on shared workers have the property set to a MessagePort. What would be a reasonable alternative if we need to change the text?

(Thanks!)

asutherland commented 12 months ago

Can you elaborate more on why we would need this part to change? From the note in the spec, this seems to be there to allow the message event listener for the port to continue to run even when there are no references to the port itself. Is the intent behind changing that to allow the port to be GCed more often, or something else? How does that relate to the close event?

Also, even if we need to change it, I don't think point #2 is relevant, since the source property isn't actually set on message or messageerror events. Per the spec, only connect events on shared workers have the property set to a MessagePort. What would be a reasonable alternative if we need to change the text?

Aha, I had definitely gotten confused and thought that in the general MessagePort case we would set the source. Thank you for correcting my misunderstanding! I believe this was a combination of 1) the similar but unrelated ServiceWorker.postMessage behavior, and 2) some extra flexibility in Gecko plumbing code that supports passing the MessagePort as a source; but we definitely do not pass it in this case.

And then I think this clears up my confusion about 9.4.5 Ports and Garbage Collection, which you link. The bit about how, if we have globals gA and gB with entangled ports pA and pB in gA and gB respectively, it's saying either...

...makes more sense to me now, as I don't need to worry about the potential to create new strong references, so GCing a port and thereby automatically closing it is much more straightforward.

I'll edit my comment above to indicate it should be skipped to save at least future me headaches :)

Thanks again!

fergald commented 8 months ago

Unfortunately, Chrome's plan to implement this has been blocked by our security review. Their response is available here and quoted below.

About the feature itself: this reveals the GC activity of the process holding the other end of the pipe. It does so even if this document isn't willing to process incoming messages from unknown origins. Here is an example: https://message-port-close-learning.glitch.me/

GC events might correlate with various other events. For instance, it will leak navigations. Those could be used to learn which users are connected to some websites.

We could ship this feature if there was some opt-in from the cross-origin website before sending the "close" event. You can potentially consider one of:

  • Requiring the MessagePort to be "used" before sending a close event. That is to say, using one of: MessagePort.start(), MessagePort.postMessage(), or registering a MessagePort.onmessage / MessagePort.onmessageerror handler.
  • A new dedicated API to register a close event "beacon".

Unfortunately, the argument that this information is already leaked was not persuasive. There's a demo here (source). On Chrome and Safari this demo shows that the info is already leaked. On Firefox, it seems that GC of MessagePorts was avoided precisely to prevent this information from leaking.

I would rather not make sending the close event conditional on some kind of opt-in, because it leaves a bunch of cases uncovered, but it might be the best we can do.
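
Concretely, under the security team's first option, any one of these (illustrative) actions would count as "using" the port:

port.start();                            // explicit start
port.onmessage = (e) => handle(e.data);  // registering a handler (also starts the port)
port.postMessage('ready');               // sending a message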

robertknight commented 8 months ago

A challenge with an opt-in is that a frame may navigate or be unloaded before it has had a chance to call the API on the MessagePort instance that enables sending of the close event to the other end of the connection. In Hypothesis's case the MessagePort pair is created in a parent frame, and the two ports are sent to different child frames which subsequently communicate.

Assuming the act of posting a message opts in, the frame that listens for the close event could work around this by waiting for either an initial message or a timeout to expire before it considers the MessagePort properly connected (and thus expected to send a close event).
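
A sketch of that wait-for-first-message approach (names and timeout are illustrative):

function awaitConnected(port, timeoutMs = 2000) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error('no initial message')), timeoutMs);
    port.addEventListener('message', () => {
      clearTimeout(timer);
      resolve(port); // from here on, a close event is expected eventually
    }, { once: true });
    port.start();
  });
}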

asutherland commented 8 months ago

How about opt-in, but with same-origin automatically opted in? Because ports can be infinitely re-shipped, this might be most sanely modeled as a new flag, "has been shipped cross-origin". The flag gets set when entangling, based on the environment, so once a port goes across origins it becomes opt-in even if it's subsequently re-shipped to its original origin.

annevk commented 8 months ago

@fergald @robertknight @asutherland can one of you please ensure this is tracked in a new issue? Also, if it's not resolved in some timely manner (say, a couple of weeks), we should revert the specification change lest it accidentally get implemented as-is.

fergald commented 8 months ago

I've filed #10201. Let's move the conversation over there.

James-E-A commented 1 week ago

A workaround, if you're able to re-jigger your use case to use Streams instead of a MessagePort:

let { readable, writable } = new TransformStream(); // fundamental stream pair

// Hack to detect when the fundamental stream pair is closed; workaround for
// https://github.com/whatwg/html/issues/1766
const canary_transform = new TransformStream();
const canary = readable.pipeTo(canary_transform.writable); // settles when the pipe closes or errors
readable = canary_transform.readable;

TARGET.postMessage({ replyStream: writable }, { transfer: [writable] });
canary.finally(CLEANUP_HOOK); // CLEANUP_HOOK runs once the stream closes or errors
USE_STREAM(readable);

fergald commented 1 week ago

@James-E-A I guess the intention here is that canary's finally will be called when TARGET is closed, but I had a quick go at that and couldn't get it to work. Do you have a working demo somewhere?

James-E-A commented 1 week ago

@fergald Sure, here's a quick demo: https://webpages.uah.edu/~je0029/tmp/demo_stream_canary/

The stream-pair workaround as described above is on line 33 of helper.html; it prints “Stream cleanup handler called.” to the debug console from a finally handler on the canary transform that's injected before transferring out the stream pair.

I found this to trigger (at least on Firefox) regardless of whether the stream was ended from the writable end (my test transfers that to its parent frame, and I confirmed it works even when that parent frame is on a different origin, such as http://localhost) or from the readable end (my test transfers that to a service worker).

James-E-A commented 1 week ago

Actually, that's my mistake. Further testing shows that this does not actually work to detect a transfer target being ungracefully killed; it only works if the target is able to either close() or abort()/error() the stream. (However, using an onunload hook to ensure that the stream is closed seems to work reasonably well, if TARGET is cooperative.)
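
A sketch of that cooperative cleanup, in the receiving context (replyStream is the transferred WritableStream from the earlier snippet; assumes the stream isn't locked elsewhere):

const writer = replyStream.getWriter();
addEventListener('unload', () => {
  writer.abort(); // errors the pipe, so the sender's canary promise settles
});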

jjkola-eskosystems commented 5 days ago

But that requires that the unload event is fired, which may not always be the case...