whatwg / encoding

Encoding Standard
https://encoding.spec.whatwg.org/
Other
278 stars 77 forks source link

Add a static decode and encode method to `TextEncoder` and `TextDecoder` #267

Open lucacasonato opened 3 years ago

lucacasonato commented 3 years ago

The large majority of users of TextEncoder and TextDecoder do not make use of streaming capabilities. Instead they just need to decode a single chunk, or encode a single chunk. We should consider adding a static utility method to TextEncoder / TextDecoder to cater to this use case.

// current
const data = new TextDecoder().decode(new Uint8Array(/** data here */));

// proposed
const data = TextDecoder.decode(new Uint8Array(/** data here */));

I propose the following API be added:

dictionary TextDecoderOptionsWithLabel : TextDecoderOptions {
  DOMString label = "utf-8";
};

interface TextDecoder {
  static USVString decode(optional [AllowShared] BufferSource input, optional TextDecoderOptionsWithLabel options = {});
}

interface TextEncoder {
  static Uint8Array encode(optional USVString input = "");
  static TextEncoderEncodeIntoResult encodeInto(USVString source, [AllowShared] Uint8Array destination);
}

The exact name of the method should be something different though - the static method and prototype method would name clash in this example which is confusing.

Internally this would behave exactly like a single use non streaming encoder.

domenic commented 3 years ago

What about encodeInto()?

The exact name of the method should be something different though - the static method and prototype method would name clash in this example which is confusing.

Why would that be confusing?

lucacasonato commented 3 years ago

What about encodeInto()?

Added - forgot it existed.

Why would that be confusing?

Just checked spec - wouldn't just be confusing but also against webidl spec:

"The identifier of a static operation also must not be the same as the identifier of a regular operation defined on the same interface." - https://heycam.github.io/webidl/#idl-operations

As to why it would be confusing: "just call TextDecoder.decode" would have two meanings now. Also what would for example the MDN link be? https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/decode is already taken for the prototype method.

domenic commented 3 years ago

We can fix Web IDL.

As for MDN's problem, they've had a big issue with bad documentation for prototype methods for a long time. They need to fix it. TextEncoder.decode() has always meant a static method (which hasn't existed); TextEncoder.prototype.decode() or textEncoder.decode() are how you write the instance methods.

inexorabletash commented 3 years ago

It's unclear what the benefit of the proposed change is. It doesn't save significant typing (just new); in cases where you want to avoid allocating a new instance, the encoder/decoder can be cached and used statelessly.

Given a time machine to redesign the API now that streams are a thing, this makes more sense for non-streaming encodes/decodes, but is it worth adding to the platform?

hsivonen commented 3 years ago

If we add API surface that doesn't work in old browsers, I'd like to know if users of the potential API surface care about non-UTF-8 encodings. If not, I'd prefer adding new API surface to JavaScript String itself instead of TextDecoder, since the WebIDL layer is a perf bottle neck.

annevk commented 3 years ago

cc @bakkot @littledan

lucacasonato commented 2 years ago

https://github.com/whatwg/webidl/pull/1098 should technically unblock this. Are any browser vendors objected to this? If not, I'd like to start work on this.

lucacasonato commented 2 years ago

since the WebIDL layer is a perf bottle neck

@hsivonen How so? In Deno we have an optimized path in our WebIDL bindings for this specific API that bypasses the slow USVString conversion completely. This should be possible in other implementations too.

During encode you can convert the first argument to a DOMString (which is fast), and then just perform a lenient UTF-8 encoding on that string, which will already perform lone surrogate conversion. https://github.com/denoland/deno/blob/main/ext/web/08_text_encoding.js#L166-L175

hsivonen commented 2 years ago

since the WebIDL layer is a perf bottle neck

@hsivonen How so? In Deno we have an optimized path in our WebIDL bindings for this specific API that bypasses the slow USVString conversion completely. This should be possible in other implementations too.

Gecko doesn't actually use USVString in TextEncoder WebIDL, either. However, the WebIDL layer still has enough overhead that e.g. wasm-bindgen conditionally avoids going through the WebIDL layer.

lucacasonato commented 2 years ago

Gecko doesn't actually use USVString in TextEncoder WebIDL, either. However, the WebIDL layer still has enough overhead that e.g. wasm-bindgen conditionally avoids going through the WebIDL layer.

Mh, that seems very weird to me. I very much doubt that WebIDL inherently needs to be a slowdown here. Currently the overhead likely stems from most engines slowing down when calling from JS to native code. This is a performance issue that can be fixed. In V8 based runtimes, TextEncoder.encodeInto can use V8's "fast API" for better JIT inlined performance for example.

The only WebIDL conversion taking place here is the DOMString conversion, which really is just a call to ToString which is a 0 cost op if the call is already a string. A native API in ES would also need to do some runtime type checking (likely the exact same checks), so I really don't see why performance should be a reason to have a native JS API.

TLDR: I don't think spec decisions should be driven by current performance issues in engines, as I don't think that there are any technical reasons for why TextEncoder#encodeInto must be slower than String#encodeIntoUTF8. It's just a matter of optimizing the codepaths as far as I can tell.

Please correct me if I missed something obvious here.