whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Add modern binary encoding APIs #6811

Open lucacasonato opened 3 years ago

lucacasonato commented 3 years ago

Many protocols, APIs, and algorithms require binary data (a byte array) to be serialized into a string that represents it losslessly. Common formats for this include base64 and hex encoding. Often the reverse, deserializing the string back into the original data, is required too.

Here are some (common) use cases that require base64 or hex encoding / decoding some binary data:

The web platform does not provide a fast and easy way to base64 or hex encode and decode binary data. Because of this, web developers have often resorted to slow and inefficient encoders built on atob/btoa, Number#toString, and parseInt.
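For illustration, here is a minimal sketch of the kind of hand-rolled helpers this refers to. The function names are made up for this example and are not part of any standard API; only atob, btoa, Number#toString, and parseInt are existing platform features.

```javascript
// Base64-encode bytes by first building a "binary string" for btoa.
function bytesToBase64(bytes) {
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

// Decode base64 by turning atob's binary string back into bytes.
function base64ToBytes(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Hex-encode one byte at a time with Number#toString.
function bytesToHex(bytes) {
  let hex = "";
  for (const byte of bytes) {
    hex += byte.toString(16).padStart(2, "0");
  }
  return hex;
}

// Hex-decode two characters at a time with parseInt.
function hexToBytes(hex) {
  if (hex.length % 2 !== 0 || /[^0-9a-fA-F]/.test(hex)) {
    throw new TypeError("invalid hex string");
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}
```

Every helper goes through an intermediate string, one character or byte at a time, which is where the inefficiency comes from.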

I propose to add a new modern binary encoding API as explained in the proposal-binary-encoding repository. This would add a new BinaryEncoder and BinaryDecoder interface (akin to TextEncoder and TextDecoder). The proposed API is simple enough to fit comfortably into section 8.3, "Base64 utility methods" of this spec. The section should likely be renamed to Binary encoding in case this makes it.

For more background reading with usage examples and prior art, a preliminary spec text, and a polyfill, please see the repository linked above.

Previous discussions in this repository related to binary encoders: https://github.com/whatwg/html/issues/351 and https://github.com/whatwg/html/issues/6779, specifically this comment: https://github.com/whatwg/html/issues/351#issuecomment-158973197.

A TC39 proposal for a similar issue with a narrower scope was opened a few days ago too: https://github.com/bakkot/proposal-arraybuffer-base64. For discussion on the question of "language vs platform feature" and scope, view this issue: https://github.com/bakkot/proposal-arraybuffer-base64/issues/4.

Kaiido commented 3 years ago

note: I'm here only as a web-dev, and this is only my own opinion.

Wouldn't a more "modern" approach here be to update whatever requires this format to accept streams of binary data directly?

A few notes about the use cases exposed here:

  • Encoding a png image into a data URL (base64 encoding the png)

Why would you do that? If it's to pass that data: URL to a consumer with the goal to display it in an HTML page, then the correct approach is to keep a Blob and pass it through a .srcObject (in the future) or for now through a blob: URL. data: URLs are inefficient in many aspects and I believe they should stop getting love from us.

  • Creating a hex string from a cryptographic digest (hash)
  • Generating a random ID from crypto.getRandomValues

Why should this require a fast or performant path? Cryptographic hashes are generally in the order of a few hundred bytes only, getRandomValues() max length is 65,536 bytes; modern hardware can perform hex-dump on this in no time even with the "slow"(?) paths currently available.

  • Send binary data over transports that only support string values (base64 {de/en}coding)

IMM, these transports should get updated, sending b64 over the network means that 33% of the original data is sent for nothing. If the backend really needs b64, then the conversion should be done there. I appreciate that the explainer also envisions a streaming version, but until then you'd soon enough face string size limit issues (currently around 512MB in V8 64bits).
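The 33% figure can be checked directly: base64 maps every 3 input bytes to 4 output characters, so a padded encoding of n bytes is 4 * ceil(n / 3) characters long. A small sketch (the helper name is made up for illustration):

```javascript
// Length in characters of the padded base64 encoding of `byteLength` bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

const original = 3000000;                         // 3 MB of binary data
const encoded = base64Length(original);           // characters after encoding
const overhead = (encoded - original) / original; // fraction of extra data sent

console.log(encoded, overhead.toFixed(2));
```

For large payloads the overhead therefore approaches one third of the original size.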

  • Parsing PEM files (binary data is stored as base64 encoded strings)

I am really not an expert so excuse any foolishness from me, but if that's such a common use-case, wouldn't it be better to solve it by adding this format to what can be imported through WebCrypto instead?
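For context, this is roughly what parsing a PEM file by hand looks like today. It is a minimal sketch, assuming a single well-formed block whose label contains no hyphens; pemToBytes is a hypothetical helper name, not an existing API.

```javascript
// Extract the label and the decoded bytes from one PEM block.
// Assumes "-----BEGIN <LABEL>-----", base64 lines, "-----END <LABEL>-----".
function pemToBytes(pem) {
  const match = /-----BEGIN ([^-]+)-----([\s\S]*?)-----END \1-----/.exec(pem);
  if (match === null) {
    throw new TypeError("no PEM block found");
  }
  // The body is base64 wrapped across lines; strip all whitespace first.
  const base64 = match[2].replace(/\s+/g, "");
  // atob yields a binary string, which still has to be copied into bytes.
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return { label: match[1], bytes };
}
```

The base64 step in the middle is the part that a built-in binary decoder would replace.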


All in all, I am not against this API, I too think it's too bad atob/btoa work on “bit strings” instead of on ArrayBuffers directly, but I don't think we should promote most of these use-cases either.

lucacasonato commented 3 years ago

Hey @Kaiido

Very valid point regarding data URLs. This probably should not have been my first talking point 😅

Why should this require a fast or performant path?

The point is not just the speed of the actual encoders, but also the peripheral cost of shipping extra bytes of JS to the client for the encoders / decoders. Bundlephobia reports nearly 2 KB gzipped + minified for js-base64, the most popular base64 encoding package on NPM (5.8 million weekly downloads). That is already 50% of preact, a fully fledged reactive UI framework.

Having this primitive built into the platform could save GBs of network transfers on a daily basis. The developer experience is arguably also a lot better for builtin APIs than having to pull in third party packages, or handroll something.

IMM, these transports should get updated, sending b64 over the network means that 33% of the original data is sent for nothing.

I do agree, but sometimes this is simply out of developers' hands. Many public APIs with large user bases require that certain payloads are encoded using base64. For example:

I am really not an expert so excuse any foolishness from me, but if that's such a common use-case, wouldn't it be better to solve it by adding this format to what can be imported through WebCrypto instead?

This might not be an exceptionally common use case, but it is one I have run into several times. The example was meant more generally to represent file types that require base64 encoding.

I too think it's too bad atob/btoa work on “bit strings” instead of on ArrayBuffers directly

Me too - too late to change now though.

pshaughn commented 3 years ago

I don't think it's too late to change atob and btoa. atob would need an extra argument to tell it to output an ArrayBuffer, but adding arguments like that has been done before. btoa might not even need an extra argument, unless there's some page in the wild that's been base64-encoding the default stringification of an ArrayBuffer and would break if it base64-encoded the content instead.

lucacasonato commented 3 years ago

@pshaughn That wouldn't be extensible to hex, base64url, or other binary encodings though.

pshaughn commented 3 years ago

@lucacasonato Options dicts could specify different encodings, but that makes the simple cases harder to write. It does seem better to have a namespace with one method pair per encoding like your proposal, instead of a couple of very overloaded methods.
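To make the two shapes concrete, here is a rough sketch of both. Everything in it is hypothetical: encodeBinary, BinaryEncoding, and the helper functions are names invented for this comparison, and neither shape is claimed to match the actual proposal text.

```javascript
// Small helpers built on existing platform features (fine for short inputs;
// the spread in toBase64 would hit argument limits on very large arrays).
const toHex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
const fromHex = (hex) =>
  Uint8Array.from(hex.match(/../g) ?? [], (pair) => parseInt(pair, 16));
const toBase64 = (bytes) => btoa(String.fromCharCode(...bytes));
const fromBase64 = (text) =>
  Uint8Array.from(atob(text), (ch) => ch.charCodeAt(0));

// Shape A (hypothetical): one overloaded function driven by an options dict.
function encodeBinary(bytes, options = {}) {
  const encoding = options.encoding ?? "base64";
  switch (encoding) {
    case "base64":
      return toBase64(bytes);
    case "hex":
      return toHex(bytes);
    default:
      throw new TypeError(`unsupported encoding: ${encoding}`);
  }
}

// Shape B (hypothetical): a namespace with one method pair per encoding.
const BinaryEncoding = {
  base64: { encode: toBase64, decode: fromBase64 },
  hex: { encode: toHex, decode: fromHex },
};

const bytes = new Uint8Array([222, 173, 190, 239]);
console.log(encodeBinary(bytes, { encoding: "hex" })); // shape A call site
console.log(BinaryEncoding.hex.encode(bytes));         // shape B call site
```

With shape B the simple case needs no options object, and adding an encoding adds a new pair of methods rather than another branch inside an existing one.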