whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Serialise with implicit transfer (for Streams) #3684

Closed ricea closed 4 years ago

ricea commented 6 years ago

I am working on a way to make Streams transferable, i.e. make it possible to transfer them between the main page, workers and other contexts with postMessage(). Once a stream has been transferred, objects put into the stream in one context will come out in the other context, meaning that serialisation / deserialisation is performed.

For many use cases, it is critical that this happen efficiently. Specifically, copies must be minimised.

Unlike with postMessage(foo, [foo.data]), there is no way to supply out-of-band information saying how an object is to be transferred.

The changes to the Streams Standard are being discussed at https://github.com/whatwg/streams/issues/244; however, it will need hooks in the StructuredSerializeWithTransfer() algorithm, and it might be useful to specify a more general mechanism that could be used by other parts of the platform. So I am raising this issue here to discuss the transferring part in isolation.

Here are some options under consideration:

1. Top-level only greedy transfer

If the object itself was an ArrayBuffer, it would be transferred rather than copied. However, if it was { data: ArrayBuffer } then data would be copied, not transferred. This is sufficient for some use cases, but very poor for others. For example, a video frame might contain the frame bitmap as an ArrayBuffer embedded in an object with other metadata. Requiring a copy of the bitmap to be taken would probably render the API unusable for this use case.
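
To make the contrast concrete, here is a minimal sketch of the transfer-list computation option 1 implies. This is illustrative only: it treats ArrayBuffer as the representative transferable and ignores the rest of structured serialisation.

function transferListForOption1(chunk) {
  // Only the top-level value is inspected; nested transferables are ignored.
  return chunk instanceof ArrayBuffer ? [chunk] : [];
}

// transferListForOption1(frameBuffer)            -> [frameBuffer]  (transferred)
// transferListForOption1({ data: frameBuffer })  -> []             (data is copied)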

2. Deep recursive greedy transfer

The algorithm would transfer any transferable object it found while traversing the object, to any depth. In this case, with { data: ArrayBuffer }, data would be transferred. This is a good match for Stream semantics: putting an object into a stream implies passing ownership. However, there are a few foreseeable problems:

  1. It may be hard to reuse for other parts of the platform that don't have such clear-cut ownership semantics.
  2. It would be hard to make the algorithm less greedy in the future, since that would mean adding an "opt-out" from greediness.
  3. People would inevitably be confused when the object they intended to transfer contained an incidental reference to some other object they didn't intend to transfer.
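
For concreteness, here is a minimal sketch of the kind of walk option 2 implies. This is illustrative only; ArrayBuffer again stands in for any transferable, and a real algorithm would run inside structured serialisation rather than as a separate pass.

function collectTransferablesGreedily(value, found = new Set(), seen = new Set()) {
  if (value === null || typeof value !== 'object' || seen.has(value)) return found;
  seen.add(value);                                           // guard against cycles
  if (value instanceof ArrayBuffer) {
    found.add(value);                                        // transfer anything transferable...
    return found;
  }
  for (const key of Object.keys(value)) {
    collectTransferablesGreedily(value[key], found, seen);   // ...at any depth
  }
  return found;
}

// With { data: buffer }, buffer ends up in the transfer list, including any
// "incidental" references the author never intended to give up (problem 3 above).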

3. Object transfer meta-protocol

An object would contain metadata indicating how it would like to be transferred. For example, in the above example, data would not be automatically transferred, but changing it to { data: ArrayBuffer, [Symbol.transferKeys]: ['data'] } would cause data to be transferred.

Each transferKeys would only apply to the current level, but it could mark deeper levels that should be recursed into. For example:

{
  value1: {
    data: ArrayBuffer,
    [Symbol.transferKeys]: ['data']
  },
  value2: {
    data: ArrayBuffer,
    [Symbol.transferKeys]: ['data']
  },
  [Symbol.transferKeys]: ['value1']
}

Here, value1.data would be transferred, but value2.data would not, because value2 itself was not selected for transfer.

(@domenic pointed out that we can't actually use Symbol.transferKeys here, as Symbol is defined in ECMAScript. But I think for strawman purposes it makes it clear what is going on.)
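
To illustrate how a serialiser might interpret the strawman metadata, here is a sketch only; transferKeys below is an ordinary symbol standing in for whatever mechanism would actually be specified, and ArrayBuffer again stands in for any transferable.

const transferKeys = Symbol('transferKeys');    // stand-in for the strawman key

function collectTransferKeys(value, found = []) {
  if (value === null || typeof value !== 'object') return found;
  for (const key of value[transferKeys] || []) {
    const selected = value[key];
    if (selected instanceof ArrayBuffer) {
      found.push(selected);                     // leaf transferable: transfer it
    } else {
      collectTransferKeys(selected, found);     // recurse only into selected subtrees
    }
  }
  return found;
}

// For the example above this yields [value1.data] but not value2.data,
// because value2 itself was never selected for transfer.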

annevk commented 6 years ago

For the meta-protocol, why do you need to mark deeper levels? You don't want to check for the existence of Symbol.transferKeys on each object? Why is it okay to do it on the top-level object?

Presumably this meta-protocol would work with postMessage() as well? If so, we need to be careful about the interaction with transferList.

(I like putting it on Symbol since serialization and transfer are somewhat part of JavaScript.)

cc @wanderview @bakulf @surma

ricea commented 6 years ago

For the meta-protocol, why do you need to mark deeper-levels?

I want authors to be able to embed a third-party object into their own object without knowing about its implementation details. And also have the option of not transferring something even if it supports it.

Why is it okay to do it on the top-level object?

I don't have a good justification for this. My idea is just that streams default to "transfer if supported" for the top-level object and postMessage() defaults to "don't".

wanderview commented 6 years ago

@ricea Have you considered an API where the objects are "transferred" into the stream? So make the controller passed to the underlying source take a transfer-list argument in its enqueue() method. Something like:

  let r = new ReadableStream({
    start(controller) {
      // push some ArrayBuffers into the stream
      controller.enqueue(buffer, [buffer]);
    }
  });

This would immediately detach the transferred buffer from the current context when it's enqueued. This would be more predictable than trying to transfer only at some later time when the stream is drained into the postMessage operation.

The stream can just keep track of transferrables for each chunk.

Would this help at all? Sorry if it was already considered in the other issue. I've only been skimming that.

ricea commented 6 years ago

@ricea Have you considered an API where the objects are "transferred" into the stream? So make the controller passed to the underlying source take a transfer-list argument in its enqueue() method.

I haven't given it much thought. It seems okay to put the burden on the underlying source for a ReadableStream, but more problematic for a WritableStream:

const writer = writable.getWriter({transferring: true});
writer.write(buffer, [buffer]);

I think it's undesirable that the customer of a stream should have to know whether it was transferred or not.

It's even more difficult for pipeTo():

readable.pipeTo(transferredWritable, ?);

It's also not consistent with the one "transferred stream" that already exists in the platform: the body of a Response that is passed to event.respondWith() in a ServiceWorker.

My goal is that "almost all" streams will be transferable. Nothing useful will happen if the chunk type is something that cannot be cloned, but hopefully that will be intuitive to users. This aligns with the principle of composability: combining streams authored by different people together in novel ways should "just work".

wanderview commented 6 years ago

writer.write(buffer, [buffer]);

This seems good to me.

I think it's undesirable that the customer of a stream should have to know whether it was transferred or not.

I'm not sure what you mean by this. What I am suggesting makes it explicit whether the data passed to the stream is transferred or not. It is not dependent on what happens to the stream later on. From the perspective of the data passed into the enqueue() or write() method, a transfer (potentially to the same realm) occurs immediately. Then the stream may internally transfer again if it has to cross realms.

It's even more difficult for pipeTo():

I don't understand the difficulty you are trying to describe here. The stream just needs to track which objects it has transferred for each chunk. If it does a pipeTo() to another stream, then internally it passes that transfer list on, just as is done for an internal source enqueue().

This seems somewhat similar to what you are proposing, no? I'm just suggesting the stream implementation could store the metadata about what to transfer based on API calls, instead of the chunks having implicit metadata stuck on them. Or maybe I don't understand your proposal.

It's also not consistent with the one "transferred stream" that already exists in the platform: the body of a Response that is passed to event.respondWith() in a ServiceWorker.

How is that a transferred stream? The ArrayBuffer is not transferred to another js context. It is consumed by native code. I don't think whether the buffer was transferred into the stream changes how native code would consume it.

domenic commented 6 years ago

I think this thread has gotten a little confused because @ricea tried to tightly scope it in the OP, but then mentioned other considerations that he didn't port over, mainly about transferring streams themselves.

Still, I think @wanderview's instincts are a good conservative starting point and we should assess them as a fourth option. Let me try to put it together:

4. Transfer lists in all streams APIs

Concretely, this would be

let r = new ReadableStream({
  start(controller) {
    // push some ArrayBuffers into the stream
    controller.enqueue(buffer, [buffer]);
  }
});
writer.write(buffer, [buffer]);
r.pipeTo(w, { transfer: true }); // or maybe transferIfPossible

In option (4), how does transferring a stream work? We have a few cases:

With this in mind, I think the concern is especially in the wPrime case: it's unfortunate that you need to know what kind of writable stream you're consuming, to know whether you should call write(buffer) or write(buffer, [buffer]).

The r case is also a bit tricky: you need the cooperation of the creator of r before you can usefully transfer r.

At the other extreme, contrast this with option (2). In option (2) if you write to a stream that's in "transfer mode" (e.g. because it is a proxy for a stream transferred on another side, or just because it wants to be efficient with how it consumes your data), your chunks will get transferred implicitly. You won't need to worry about whether your stream is in transfer mode or not; you just write to it. And similarly, as a creator of a readable stream you can just call enqueue(), and if someone transfers you, it'll just work; you didn't need to account for such transferring-consumers.
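
To make that concrete, here is a rough illustration of the hypothetical behaviour under option (2), not spec text: the producer's code is the same whether or not anyone ever transfers the stream.

const readable = new ReadableStream({
  start(controller) {
    const frame = new ArrayBuffer(1024);
    // The producer never mentions transfer at all.
    controller.enqueue({ meta: { timestamp: 0 }, data: frame });
  }
});

// Consumed in this realm: chunks arrive as ordinary objects.
// Transferred with postMessage(readable, [readable]): the same chunks are
// serialised with every nested transferable (frame) transferred implicitly.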

So one way of framing the question is about whose responsibility it is to write code in a way that is friendly to future transfers:

wanderview commented 6 years ago

If the creator of r calls the one-arg enqueue(), it errors

Why would this need to error? Objects in chunks that were not transferred would be structured-cloned instead. It would only become a problem for something that is transferable but not structured-clonable. I guess stream types would be in that category. Is there anything else?

domenic commented 6 years ago

Today if you try to transfer something that is only cloneable, not transferrable, it errors. I assumed we'd want to keep that invariant for streams, i.e. if you try to transfer a stream which does not allow transferring its chunks, that should not start creating copies for you.

wanderview commented 6 years ago

Ok, but I think you are conflating "transfer a stream" with "transfer a stream and every chunk within that stream now and into the future". I don't think those have to be equivalent.

It seems perfectly reasonable to consider a case where some chunks are transferred and others copied when passing a stream to postMessage(). Perhaps the client wants to continue to read from one of those objects, etc. Adding a strict requirement that all chunks are transferred or the stream cannot be transferred seems unnecessary.

Conceptually I am proposing we split the transfer of the stream from the transfer of the chunks. So:

  1. A stream can be transferred. Indeed, it must be transferred, since it can't be copied. In this case it is marked disturbed and drained, and the chunks are sent in the best way possible: transferred if they were marked for transfer, copied otherwise. If neither is possible then an error occurs.
  2. Chunks can either be added to a stream normally (so that js can still reference it and use it) or they can be transferred into the stream (and detached from the js context). Transferring into the stream is strictly better for memory management since it allows the browser to more aggressively consume the chunk in some cases. It also enables the chunk transfer in (1).

In theory these could be implemented completely independently of one another. Without (2), you would only get chunk transfer with postMessage in (1) for natively produced streams (which presumably mark their chunks as transferred). Without (1), transferring into the stream for (2) would still be useful for avoiding some GC when the stream is being consumed by a native C++ API.
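
A very rough sketch of the bookkeeping that split implies (illustrative only; the two-argument enqueue() is the hypothetical API from above, not anything that exists today): the stream remembers a transfer list next to each queued chunk and only consults it if and when the stream itself crosses a realm boundary.

class ChunkQueue {
  constructor() {
    this.entries = [];
  }
  // Hypothetical two-argument enqueue: the transfer list is recorded per chunk.
  enqueue(chunk, transferList = []) {
    this.entries.push({ chunk, transferList });
  }
  // If the stream is transferred, each chunk is forwarded with its own list;
  // chunks enqueued without one are structured-cloned by postMessage instead.
  forwardTo(port) {
    for (const { chunk, transferList } of this.entries) {
      port.postMessage(chunk, transferList);
    }
    this.entries.length = 0;
  }
}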

ricea commented 6 years ago

I think this thread has gotten a little confused because @ricea tried to tightly scope it in the OP, but then is mentioning other considerations which he didn't port over. Mainly about transferring streams themselves.

Sorry for the scope creep. Since the discussion over here is productive, I feel we may as well continue.

It's also not consistent with the one "transferred stream" that already exists in the platform: the body of a Response that is passed to event.respondWith() in a ServiceWorker.

How is that a transferred stream? The ArrayBuffer is not transferred to another js context. It is consumed by native code. I don't think whether the buffer was transferred into the stream changes how native code would consume it.

What I meant was that it resembles a transferred stream in the sense that you put a ReadableStream in one end and get a ReadableStream with the same data out on the other side. It would be nice if we could eventually explain it with the same mechanism, but that may be a pipe dream.

I don't understand the difficulty you are trying to describe here. The stream just needs to track which objects it has transferred for each chunk.

I think the difference in perception stems from my mental model of pipeTo() as a loop of read() -> write() calls. Once I switched to viewing it as a cog in the streams machine, I could see your point. From this point of view it doesn't really make sense to talk about "transferring" and "non-transferring" versions of pipeTo, as the chunk has already been transferred when it entered the machine.

The same has to apply to TransformStream. It also has to preserve the transfer list.

This creates a problem for people trying to implement transform streams that are not TransformStream. Since they can't see the transferList that was attached to the chunk, they can't pass it back into the machine.

In general, I am not enthusiastic about this approach.

wanderview commented 6 years ago

It seems perfectly reasonable to consider a case where some chunks are transferred and others copied when passing a stream to postMessage().

It occurred to me today that this case pretty much exactly matches the existing streaming transfer primitive in the web platform. You can transfer a MessagePort and then either copy or transfer individual objects through that MessagePort.

I think the parallels to MessagePort are pretty compelling from an API point of view.

Also, I would argue that the decision about whether to transfer or not is really something that has to take place at the interface boundary. It makes sense to consider the question when you pass ownership of an object to a black box like a stream or a MessagePort. It's very explicit and locks in an immutable decision. I personally like these characteristics.

For example, with option 3 it's not clear to me whether code could continue to change the metadata after an object is passed to a stream. I think we should avoid that sort of thing if we can. (Yes, it could be frozen or something, etc.)

This creates a problem for people trying to implement transform streams that are not TransformStream. Since they can't see the transferList that was attached to the chunk, they can't pass it back into the machine.

I agree that is not ideal, but we require strict brand-checked ReadableStream and WritableStream in other places for full optimization AFAIK. I think requiring a real TransformStream for full optimization would be consistent and reasonable.

I don't want to impose the overhead of pinning a transferList to every chunk on use cases that never transfer anything.

I'm not sure I see that the overhead would be that great. Once the object is within the boundaries of the stream the engine could process the transfer list immediately and flip a private bit on the object to flag it for transfer. It would just need to clear the bit if the object is removed without actually being transferred away. This doesn't seem that onerous to me.

I find the postMessage transferList API confusing, and I would prefer something that preserves encapsulation.

That's a little subjective, but ok. I agree the ergonomics could be improved. But that kind of effort does not need to be tied to stream transfer. We can make streams consistent with existing APIs that use transfer lists for now and add an implicit transfer mechanism later. That would essentially make transfer list optional for all APIs equally, not just streams. These seem like orthogonal features to me.

ricea commented 6 years ago

I'm not sure I see that the overhead would be that great. Once the object is within the boundaries of the stream the engine could process the transfer list immediately and flip a private bit on the object to flag it for transfer. It would just need to clear the bit if the object is removed without actually being transferred away. This doesn't seem that onerous to me.

I don't understand how you can process the transfer list without actually performing the serialisation.

ricea commented 6 years ago

I've been investigating this and it appears that the write(chunk, [chunk]), enqueue(chunk, [chunk]) syntax is doable. It will have some performance overhead, even when you're not using it, but the only way to find out how much is to implement it.

I'm working on making transferable streams work with always-clone semantics as the short-term goal. This is useful by itself, and is a cleanly isolated chunk of work. Then in a few months I will get back to doing transfer.

I think if we're not going to do implicit transfer specifically for streams, then we should close this issue, possibly opening new issues for the general questions of transfer ergonomics and how to extend transferability to user-created objects.

annevk commented 6 years ago

User-created objects? We don't even know how to do cloning for those.

annevk commented 4 years ago

It seems this got resolved by https://github.com/whatwg/streams/pull/1053. Let me know if I misread that.