whatwg / streams

Streams Standard
https://streams.spec.whatwg.org/

ReadableStream.prototype.arrayBuffer() #1019

Open ricea opened 5 years ago

ricea commented 5 years ago

It would be good to have a convenience method to concatenate the contents of a ReadableStream into a (promise for a) Uint8Array.

It is unclear what should happen if the stream contains chunks that are not ArrayBuffer or ArrayBufferViews.

Other similar methods we might consider are .blob() and .text().

domenic commented 5 years ago

A more generic method might be stream.toArray() which returns a promise for an array of chunks. (If https://github.com/tc39/proposal-iterator-helpers lands this would basically become an alias for stream.values().toArray(), although potentially easier to optimize since it's more direct.)

If we had that, then we'd just need a way of turning an array of Uint8Arrays into a single Uint8Array. Surprisingly I can't find any easy way to do this, especially not performantly. So, maybe that is not the right path.
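For what it's worth, the usual hand-rolled version is something like this sketch (a length pass, one allocation, then a copy of every chunk):

function concatChunks(chunks) {
  // Sum the lengths, allocate once, then copy each chunk in.
  let total = 0;
  for (const chunk of chunks) total += chunk.byteLength;
  const result = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    result.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return result;
}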

annevk commented 4 years ago

Probably also want to consider an equivalent API where the buffer is already allocated (similar to encodeInto(), BYOB).

MattiasBuelens commented 4 years ago

Probably also want to consider an equivalent API where the buffer is already allocated (similar to encodeInto(), BYOB).

I like the idea of a method that fills up an entire ArrayBuffer (or ArrayBufferView). I think it would work better as a reader method, for example ReadableBYOBReader.readFully(view) (taken from Java's DataInput.readFully()). Optionally, we could also add a shorthand method on ReadableStream, similar to how ReadableStream.cancel() is a shorthand for ReadableStreamDefaultReader.cancel(). The readInto() function in this example seems like a good starting point.

It would be different from ReadableStream.arrayBuffer() and/or ReadableStream.toArray() though. Those two methods would always read the entire contents of the stream, whereas a readFully() method may leave some bytes on the stream if the given buffer is not large enough to fit the entire contents.
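A rough sketch of what readFully() could look like on top of today's BYOB reader (names and error handling are illustrative, not a spec; note that read(view) transfers the buffer, so it must be reacquired from the returned view on every iteration):

async function readFully(stream, buffer) {
  const reader = stream.getReader({ mode: 'byob' });
  let offset = 0;
  while (offset < buffer.byteLength) {
    const { done, value } = await reader.read(
      new Uint8Array(buffer, offset, buffer.byteLength - offset));
    if (value) {
      buffer = value.buffer;   // the buffer was transferred; take it back
      offset += value.byteLength;
    }
    if (done) break;           // stream ended before the buffer was full
  }
  reader.releaseLock();
  return new Uint8Array(buffer, 0, offset); // the bytes actually read
}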

jakearchibald commented 4 years ago

How about:

Maybe we can look at text() later, but (await stream.array()).join('') doesn't seem too bad.

jimmywarting commented 3 years ago

The new Response(blob).body hack was the reason we got blob.arrayBuffer(), blob.text() & blob.stream(). You can use the same hack here: new Response(stream).blob(). Can you imagine what fetch would look like if we had done response.body.blob() instead? Then we probably wouldn't have response.blob() in the first place.

It would be nice to have all of array/blob/text/arrayBuffer. Hopefully Node streams would follow suit; then you could use Express.js to do stuff like

app.post('/route', async (req, res) => {
  await req.arrayBuffer()
})

Rejects and cancels stream if any chunks are not BlobPart.

Not sure what you mean by this

The Blob constructor can take any kind of object you throw at it. The default fallback behavior is to cast anything it can't handle into a string: if you do new Blob([{}]), it will dump "[object Object]" as the content. Maybe this should do the same thing.
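For illustration, this is today's actual Blob behavior:

// Anything that isn't a BufferSource or Blob gets stringified:
const blob = new Blob([{}]);
console.log(await blob.text()); // "[object Object]"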

jakearchibald commented 3 years ago

Can you imagine what fetch would look like if we had done response.body.blob() instead? Then we probably wouldn't have response.blob() in the first place.

That would have delayed shipping fetch, and therefore service workers, by many years. Also, the methods on Response are still useful.

The Blob constructor can take any kind of object you throw at it. The default fallback behavior is to cast anything it can't handle into a string: if you do new Blob([{}]), it will dump "[object Object]" as the content. Maybe this should do the same thing.

Yeah, agreed. My intention was: chunks can be anything that can appear in the sequence passed to the Blob constructor. As you say, almost everything is acceptable, with a few exceptions like detached buffers.

taralx commented 3 years ago

FWIW, +1 to a readFully method on BYOB readers. Maybe a new issue for it?

ricea commented 3 years ago

An interesting use of .blob() that I hadn't thought of is to efficiently handle streams that would use excessive amounts of memory to load completely, and so would ideally be stored on disk. The platform doesn't provide a primitive to stream data into a Blob, but it appears implementations do have this as part of their implementation of the blob() method on Response.

See https://stackoverflow.com/questions/67362823/progress-for-a-fetch-blob-javascript/67364010#67364010 for an example we could make much cleaner.

o-t-w commented 3 years ago

Looks like Node has just implemented this for arrayBuffer, blob, json, and text, but the API is a bit different. https://nodejs.org/api/webstreams.html#webstreams_utility_consumers
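For comparison, Node's utility consumers are standalone functions rather than methods on the stream, and they accept web ReadableStreams as well as Node streams:

// Node's variant: free functions from node:stream/consumers
import { arrayBuffer, blob, json, text } from 'node:stream/consumers';

const data = await json(someReadableStream);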

wffurr commented 1 year ago

Any chance of this ever landing? It's amazing to me that there's no "stream.collect" equivalent.

jimmywarting commented 1 year ago

Just figured: if this could be part of something like iterator helpers, since streams are iterable, then this should be essentially free. It would then also work for Node streams and for any kind of iterator that could yield any form of ArrayBufferView.

bakkot commented 1 year ago

Just figured: if this could be part of something like iterator helpers

As mentioned above, iterator helpers have a .toArray method, and async iterator helpers will have the same. The helper will let you do await stream.values().toArray(), which is nicer than what you currently do - but getting all the bytes into a single buffer after is still going to be awkward, and even with a helper like this you'd still have a choice between either copying every chunk after reading all of them or giving up locality. This use case really wants a more specific "put all the bytes from a readable byte stream into a single buffer" helper.
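Concretely, that would look something like this (toArray() is from the still-proposal-stage async iterator helpers, and concatChunks is the copying helper sketched earlier in this thread):

const chunks = await stream.values().toArray(); // reify every chunk
const bytes = concatChunks(chunks);             // then copy them all again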

And iterator helpers are intended as fully general utilities, so they're probably not the right place to have a method specifically for producers which yield chunks of bytes. That's a pretty uncommon thing to do with iterators anyway; the ReadableStream machinery is much better suited to handle such producers efficiently.

datner commented 1 year ago

I believe that if the native streaming instruments had a better API they would be used more, which should not be a controversial opinion. Even if we are collectively 'stuck' on this issue, letting it rot will not move things forward. Streaming is a great concept, and it is becoming more and more relevant in the browser. Do we not think this should be given another inspection?

jasnell commented 7 months ago

Related: https://github.com/whatwg/streams/issues/1238

Jarred-Sumner commented 7 months ago

We would like to implement this in Bun, along with ReadableStream.prototype.text, ReadableStream.prototype.blob, and ReadableStream.prototype.json.

We would also be interested in implementing ReadableStream.prototype.array and ReadableStream.prototype.formData, but feel less strongly about those. formData() could accept an optional Content-Type argument: if provided, parse as a multipart form; otherwise, parse as a URL-encoded form, rejecting the promise (and canceling the stream) if invalid data is passed.

bakkot commented 7 months ago

Can I suggest instead having a .bytes() method which gives you a Uint8Array? See https://github.com/w3ctag/design-principles/issues/463, https://github.com/whatwg/fetch/issues/1724#issuecomment-1814626681, etc. My proposal for base64 support in JS also operates on Uint8Arrays.

jasnell commented 7 months ago

I'm happy with this approach also (in contrast to what I suggest in #1238).

A key challenge I have with putting these methods directly on ReadableStream is that they are not useful with ReadableStreams that are not byte-oriented. Putting these on a reader instead makes it a bit easier to limit them to just byte-oriented streams, similar to how ReadableStreamBYOBReader does. If we do add these directly to ReadableStream.prototype then they should only work on byte-oriented streams.

ReadableStream.prototype.array() really isn't necessary given that Array.fromAsync(...) (https://github.com/tc39/proposal-array-from-async) is coming.

const chunks = await Array.fromAsync(readable);

ReadableStream.prototype.text() is a bit tricky if we can't specify the text encoding. With response.text() we can use the content-type to identify the encoding, and at that point there's a bit of overlap with TextDecoderStream that we'd need to reconcile.

Honestly, the more I think about it, the more I'd really prefer the ...fromAsync(...) pattern for multiple things...

const u8 = await Uint8Array.fromAsync(iterable);  // iterable must provide BufferSources
const str = await String.fromAsync(iterable);  // iterable must provide stringifiable chunks
const val = await JSON.fromAsync(iterable);  // iterable must provide stringifiable chunks
// etc

For text encoding, the pattern is pretty natural:

const str = await String.fromAsync(readable.pipeThrough(new TextDecoderStream("...")));

Jarred-Sumner commented 7 months ago

If we do add these directly to ReadableStream.prototype then they should only work on byte-oriented streams.

If the streams are passing strings, that should be fine too for json and text, right? Particularly if they're internally buffering rather than returning the next chunk of valid data.

Honestly, the more I think about it, the more I'd really prefer the ...fromAsync(...) pattern for multiple things...

From an implementer perspective, it sounds nice.

From a JS developer perspective, I have trouble imagining developers prefer this over .text(), .bytes(), .json().

My gut here is that most JS developers would prefer the idea of both, but probably 98% of all usage would be one of .text(), .json(), .arrayBuffer(), or .bytes() rather than the Constructor.fromAsync(iterable) variant.

People are used to the fetch-like .text() rather than fromAsync, and fromAsync is more complicated to explain.

This chain of reasoning is more complicated:

  1. it's a ReadableStream
  2. therefore it's an async iterable
  3. therefore to get the value as a String you call String.fromAsync

Compared to:

  1. call .text() on the ReadableStream

if we can't specify the text encoding

Isn't there precedent for defaulting to UTF-8? Or is that Windows-1252?

For text encoding, the pattern is pretty natural: const str = await String.fromAsync(readable.pipeThrough(new TextDecoderStream("...")));

I think this pattern composes well, but it would only feel natural to spec authors and to developers who already understand streams deeply. It would feel complicated to less experienced developers.

bakkot commented 7 months ago

I would guess that something like Uint8Array.fromAsync would be unlikely to be optimized well: that is, it would probably end up actually allocating and reifying each chunk of bytes as a JS object. It's possible to optimize that away under some circumstances, but it would require special cases for each producer/consumer pair, and there are a lot of fiddly details which would need to get implemented. Whereas something like ReadableStream.prototype.bytes() could trivially avoid those intermediate allocations.

annevk commented 7 months ago

There is precedent for enforcing UTF-8 for text() and I'd object to doing anything else here. Folks can write their own wrapper if they can't be bothered to fix the legacy content.

ricea commented 7 months ago

It looks like we have a plan here, and we just need someone to do the work. I don't have time to commit to this in the near future, but if someone has time to draft a PR, I'd be happy to look at it.

I believe we have rough consensus on the following issues:

  1. Chunks should be of the same types as accepted by the Blob constructor. As an implementer, I'm a little unhappy about accepting Blob chunks, because reading them is async and will make my life hard, but I can live with it. I also have some ergonomics concerns that accepting USVString will result in a lot of [object Object] in the output (anything gets coerced to a string), but people have been living with that with Blob and it isn't the end of the world.
  2. We should only handle UTF-8 for input and output. TextDecoderStream already exists for people who need more encoding support.
  3. text(), arrayBuffer() and blob() seem uncontroversial (sketched below). json() is easy to do if we spec it as parsing at the end, like fetch does, but maybe there isn't much demand? Other methods seem to require further discussion.
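For concreteness, the surface in item 3 would be used roughly like this (hypothetical until specced):

// Hypothetical usage; each method reads the stream to completion:
const text = await readable.text();          // always UTF-8
const buffer = await readable.arrayBuffer();
const blob = await readable.blob();
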
annevk commented 7 months ago

I think it's worth considering doing bytes() (Uint8Array) instead of arrayBuffer() as @bakkot suggested. We should probably do bytes() on the other objects that offer such methods.

ricea commented 7 months ago

I like bytes() but I thought it might be controversial.

Jarred-Sumner commented 7 months ago

bytes() seems nice. I think it's unusual for developers to want to create an ArrayBuffer; they're almost always going to either pass it somewhere else or create a Uint8Array wrapping it. In JSC, IIRC, creating an ArrayBuffer ends up being more expensive since it becomes a WastefulTypedArray internally. As a side note, are there plans for BodyMixin / Response / Request & Blob to get a bytes()?

annevk commented 7 months ago

No concrete plans, but PRs will be accepted.

jasnell commented 7 months ago

All SGTM!

bakkot commented 7 months ago

As a sidenote, are there plans for BodyMixin / Response / Request & Blob to get a bytes()?

I have https://github.com/whatwg/fetch/issues/1732 but there's no expressed implementer interest yet.

saschanaz commented 7 months ago

I believe we have rough consensus on the following issues:

  1. Chunks should be of the same types as accepted by the Blob constructor

Maybe I'm missing something but I don't see this part in this thread, has there been discussion about this somewhere else? new Response(stream).arrayBuffer() throws if the chunk is not Uint8Array, so this is a bit surprising to me.
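To make the comparison concrete, this is the Response behavior being referred to (per the fetch spec, consuming a body requires Uint8Array chunks):

// A stream of non-byte chunks makes Response consumption reject:
const rs = new ReadableStream({
  start(controller) {
    controller.enqueue('not bytes'); // a string chunk
    controller.close();
  },
});
await new Response(rs).arrayBuffer(); // rejects with a TypeError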

ricea commented 7 months ago

Maybe I'm missing something but I don't see this part in this thread, has there been discussion about this somewhere else? new Response(stream).arrayBuffer() throws if the chunk is not Uint8Array, so this is a bit surprising to me.

This was based on @jimmywarting and @jakearchibald's comments: https://github.com/whatwg/streams/issues/1019#issuecomment-587358358

However, on re-reading I realised that @jakearchibald was only saying this should work for the blob() method. This comes for free if you create a Blob by converting the ReadableStream to an array and then calling the Blob(array) constructor, but I would like to stream data into the Blob if at all possible.

Anyway, it looks like we don't have consensus on this item, so let's discuss it a bit more.

First question: should all of text(), blob() and bytes() accept the same chunk types?

Arguments for:

Arguments against:

Anything else?

saschanaz commented 7 months ago

First question: should all of text(), blob() and bytes() accept the same chunk types?

I vote yes.

My second question: do we really want to accept everything possible in non-byte ReadableStream instead of simply requiring ArrayBufferView as ReadableByteStreamControllerEnqueue does? If yes, should we also accept everything possible in the byte stream controller?

taralx commented 7 months ago

Is there an equivalent for getting the endings option if we allow strings?

If I had a vote I'd say "yes, but only buffer sources". Allowing Blobs and strings in a constructor is not the same as allowing them in a stream.

jasnell commented 7 months ago

I've made a first attempt at drafting some spec around this in #1311 ... it intentionally limits to just text, Uint8Array, and Blob, ignoring json() for now. Still likely a number of rough edges but it's a start

ricea commented 7 months ago

#1311 allows cancellation of blob(), text() and bytes() with an AbortSignal. It cancels the ReadableStream if abort is signalled. An alternative would be to leave data in the stream. Personally I prefer the cancellation approach since it is more likely to be consistent between implementations, but if anyone feels differently please speak up.
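For reference, usage would look roughly like this (a hypothetical sketch; the exact option name is an assumption, pending #1311):

// Hypothetical: aborting a consumption method, per the #1311 draft
const controller = new AbortController();
const promise = readable.bytes({ signal: controller.signal });
controller.abort(); // rejects the promise and cancels the stream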

jasnell commented 7 months ago

Yeah, definitely worth talking about that point. I had considered making it so that it would only cancel the read loop and not the underlying stream. A preventCancel option, like with pipeTo/pipeThrough, might be worthwhile?