whatwg / streams

Streams Standard
https://streams.spec.whatwg.org/
Other
1.35k stars 161 forks source link

Custom tee function #401

Open martinthomson opened 9 years ago

martinthomson commented 9 years ago

At TPAC, we discussed the use of streams in fetch. The problem of having to rely on the default implementation of tee() is that this introduces the potential need for an infinite buffer. If the JS were able to provide a tee implementation that didn't depend on creating a clone, that would be significantly more performant.

annevk commented 9 years ago

(The only potential problem in using this for fetch would be that we might have to provide a consistent body across redirects. That hasn't yet been discussed with the security teams. But we could nevertheless use it for clone().)

martinthomson commented 9 years ago

Having the custom tee() invoked on clone(), but not on the clones used for redirect might be necessary, yes. The important point that @annevk raises is that we might get the performance gain in the places where it is safe (calling clone() on a Request or Response) but we have a separate choice to make about how to deal with the internal cloning needs.

yutakahirano commented 9 years ago

https://github.com/whatwg/streams/issues/271

wanderview commented 9 years ago

Note, the current tee() implementation does not actually duplicate the contents of the buffer chunks. We're really just talking about the memory savings of the array holding the pointers to the buffer chunks. This seems like pretty minimal savings to me for such added complexity to the API.

martinthomson commented 9 years ago

@wanderview, I assume that you refer to the Firefox implementation? What happens when you have vastly different consumption rates for the two streams? If one reader is 1 gigabyte ahead of the other, do you exert back pressure? The view was that the actual chunks for the stream could be garbage collected once both teed streams have read them.

Oh, and the requirement for SameObject on the chunks would need to be relaxed to accomplish this. That assumes of course that the SameObject requirement wasn't needed for a reason that we couldn't think of during the meeting.

tyoshino commented 9 years ago

This complicates pipeTo optimization. When an IdentityTransformStreams (Pipe) are involved, we need to propagate the tee() signal to ask the first data producer in the pipeTo chain, and have all the members in the chain convey data flow for both slow and fast consumer. Needs careful investigation.

wanderview commented 9 years ago

In my previous comment I meant the current tee() spec.

From you original post in this thread:

If the JS were able to provide a tee implementation that didn't depend on creating a clone

Which is already stated in the stream spec. See:

Note that the chunks seen in each branch will be the same object. If the chunks are not immutable, this could allow interference between the two branches. (Let us know if you think we should add an option to tee that creates structured clones of the chunks for each branch.)

The cloning issue is different from what you mention here, though:

What happens when you have vastly different consumption rates for the two streams? If one reader is 1 gigabyte ahead of the other, do you exert back pressure?

This is really the crux of the problem and AFAICT not optimizable in the general js case.

@martinthomson can you explain under what scenario this can actually be optimized not to consume memory equal to the difference in consumption? Particularly for a js defined stream (which you seemed focused on in your original post).

The only optimization that seems possible is if its a dom produced stream reading from a file. But I don't see how js defined custom tee() would help with this.

the requirement for SameObject on the chunks would need to be relaxed to accomplish this

I'm really confused what you want. The initial comment was asking for no clone, but this seems to be suggesting you do want a clone.

I do agree allowing the tee's to be different objects would help with the from-disk-optimization, at the expense of penalizing all other tee's. Perhaps some kind of copy-on-write chunk mechanism could support both.

martinthomson commented 9 years ago

If one reader is 1 gigabyte ahead of the other, do you exert back pressure?

This is really the crux of the problem and AFAICT not optimizable in the general js case.

The point is that it is not generally optimizable, but that there were cases where massive state commitment could be avoided. If, for example, you could quite reasonably produce the same stream again, then you could implement tee() by producing two different streams that happen to have the same content.

wanderview commented 9 years ago

Ok. I'll wait to see a more concrete proposal. I'm happy with what the current stream spec says for tee() for a v1 implementation.

martinthomson commented 9 years ago

Understood. I think that the general view was that tee() would be sufficient for the needs of fetch(), just not necessarily perfectly optimal.

tyoshino commented 9 years ago

Hmm, I see. When saving a fetch result in a SW into cache and responding to the controlled page, the source is network which is not replayable. It's possible that we want to issue a POST with a file on a disk as body, but I don't quickly come up with any situation where we tee() the data into two consumer in a SW. Issuing two or more requests to multiple destinations to distribute data for reliability (disaster), for example?

Then, ideally all the identity transforms and communication between SW and the page should be optimized (skipped), and each of the requests would read from the file individually.

wanderview commented 9 years ago

I think it would be nice to remove the same object requirement for tee chunks. We probably won't be able to relax that in the future once code starts depending on shared state between the streams.

Removing same object on the chunks creates a memory penalty with the simple implementation, but it seems like copy-on-write is a simple optimization that could be applied here. (I say "simple", but I'm not a JS engine guy...)

martinthomson commented 9 years ago

@wanderview, if we relax the SameObject requirement that doesn't mean that it has to be a different object. If it makes sense to return the same object, then you can.

wanderview commented 9 years ago

I think we should spec same object or different object. Otherwise we will end up with compat issues between different stream implementations.

martinthomson commented 9 years ago

As long as it is same value, why does it matter if the objects are same or different?

wanderview commented 9 years ago

Because sites will do things like chunk.processed = true to communicate between the two sides of the tee() if we return same object.

If we don't spec same object or different object, a browser returning same object may still lead to sites doing this. Which then breaks the site in other browsers. I think its better to be explicit.

martinthomson commented 9 years ago

Ugh. Yes, I can accept that reasoning. Of course, we can't make the same guarantee for custom tee() functions, but the default behaviour will be the important thing.

yutakahirano commented 9 years ago

Can we throttle fetch stream-reading from a request by N bytes in the fetch algorithm until the http status code arrives (N will be given via the RequestInit parameter)? Then we don't have to modify tee function, I think.

annevk commented 9 years ago

Pretty sure that in the normal case we will not get the status code until the server receives all bytes, so I don't think that works, unless I'm missing something.

yutakahirano commented 9 years ago

@annevk

fetch(new Request(url, {body: rs}))

Sorry I mixed these two issues.

For the latter, I would set shouldClone parameter for tee(). If ReadableByteStream (or ReadableStream for bytes) allows an implementer to coalesce buffers, the implementer can optimize the operation without an observable behavior change, I think.

annevk commented 9 years ago

A user might want to limit the number of in-flight bytes until the response status arrives (especially for infinite streams).

How does that work in HTTP? That requires Expect: 100-continue, but even then I don't think we are guaranteed to not get a redirect?

yutakahirano commented 9 years ago

If limiting the number of bytes is impossible, it looks dangerous to issue a fetch request with a stream body and "follow" redirect mode.