whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Allow objects to customize serialization / deserialization for structured clone #7428

Open jasnell opened 2 years ago

jasnell commented 2 years ago

In Node.js we have implemented a Node.js-specific ability for various built-in objects to provide their own serialization/deserialization for cloning or transfer. The mechanism works by placing a platform host object into the JavaScript object's prototype chain, then attaching specific symbols to the JavaScript object that implement the serialize and deserialize functions.

For instance,

const {
  JSTransferable,
  kClone,
  kDeserialize
} = require('...');

class Foo extends JSTransferable {
  // ...

  [kClone]() {
    return {
      data: { a: 1, b: 2 },
      deserializationInfo: '{module specifier}:Foo'
    };
  }

  [kDeserialize]({ a, b }) {
    // ...
  }
}

The JSTransferable here is a host object implemented in C++. Our value serializer delegate understands that to clone any object that extends from JSTransferable, it simply needs to look for its [kClone] method and serialize the returned data in the object's place.

The value deserializer delegate is a bit trickier. Currently, only Node.js core objects can extend from JSTransferable because the deserializer has to be sure it can locate and resolve the definition of the Foo class in order to create it and deserialize it properly. Essentially the process is: use the deserializationInfo to resolve the class, then, once resolved, pass the data into the kDeserialize method.
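As a rough sketch of that process (the registry, the helper name, and the 'node:internal/foo:Foo' specifier below are illustrative, not actual Node.js internals):

```javascript
// Hypothetical sketch of the deserializer delegate's resolution step.
// kDeserialize stands in for Node's internal symbol; the registry maps
// deserializationInfo strings to the classes they identify.
const kDeserialize = Symbol('kDeserialize');
const registry = new Map();

function resolveAndDeserialize(deserializationInfo, data) {
  const Class = registry.get(deserializationInfo);
  if (Class === undefined) {
    throw new Error(`cannot resolve ${deserializationInfo}`);
  }
  // Create the instance without running its constructor, then let the
  // class's own [kDeserialize] method rehydrate it from the data.
  const instance = Object.create(Class.prototype);
  instance[kDeserialize](data);
  return instance;
}

class Foo {
  [kDeserialize]({ a, b }) { this.a = a; this.b = b; }
}
registry.set('node:internal/foo:Foo', Foo);

const foo = resolveAndDeserialize('node:internal/foo:Foo', { a: 1, b: 2 });
// foo is a Foo instance with foo.a === 1 and foo.b === 2
```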

The mechanism works well but is definitely limited because it (a) only works in Node.js and (b) only works with Node.js core objects. We'd like to be able to extend the capability to user defined objects.

There are three key pieces here that would need to be standardized:

  1. The symbols that are to be attached to the objects. We currently define three: kClone, kTransfer, and kDeserialize.
  2. The structure of the intermediate object that is returned by kClone and kTransfer to feed into the serializer.
  3. The mechanism for resolving the deserialization implementation on the receiving side.

This mechanism does allow for transferring native host objects but that's more specialized and I'm not asking for that here.

For the receiving side, one possible way of accomplishing the resolution of the deserializer is to use a registry in the form of an Event on the MessagePort:

const mc = new MessageChannel();
mc.port1.addEventListener('deserialize', ({ serialized, done }) => {
  done(deserialize(serialized));
});
mc.port1.onmessage = () => {}

For structuredClone(), the same basic pattern can be used, passing in an EventTarget as a "deserialization controller"

const deserializer = new EventTarget();
deserializer.addEventListener('deserialize', ({ serialized, done }) => {
  done(deserialize(serialized));
});
structuredClone(new Foo(), deserializer);

If a deserializer is not provided or fails, an appropriate DOMException is reported.

domenic commented 2 years ago

This feels like a case where we need to step back and deal with use cases, per https://whatwg.org/faq#adding-new-features .

In particular, it's not clear to me what use cases this proposal covers which can't be covered by

const serializable = customPreSerialize(data);
postMessage(serializable);

and

onmessage = e => {
  const data = customPostDeserialize(e.data);
};
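Spelled out with a concrete (made-up) class, and using structuredClone to stand in for the postMessage round trip, that pattern looks like:

```javascript
// Hypothetical user class; customPreSerialize / customPostDeserialize
// are the app-level helpers from the comment above, sketched here.
class Foo {
  constructor(a, b) { this.a = a; this.b = b; }
  sum() { return this.a + this.b; }
}

function customPreSerialize(foo) {
  // reduce to plain, structured-clone-compatible data
  return { type: 'Foo', a: foo.a, b: foo.b };
}

function customPostDeserialize(data) {
  if (data.type === 'Foo') return new Foo(data.a, data.b);
  throw new TypeError(`unknown type: ${data.type}`);
}

// structuredClone stands in for the postMessage round trip:
const received = customPostDeserialize(
  structuredClone(customPreSerialize(new Foo(1, 2))));
// received is a working Foo again: received.sum() === 3
```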
bathos commented 2 years ago

one possible way of accomplishing the resolution of the deserializer is to use a registry in the form of an Event on the MessagePort

In the browser, structured clone is used for IndexedDB and History state in addition to postMessage. [Serializable] platform objects are supported by all of them (if not entirely consistently across browsers). Is this idea about extending the message channel API only or is the goal to extend structured clone in general?

Side question: In HTML, “transfer” is a special case of “clone” where “[the object is] not just cloned, [but also becomes] no longer usable on the sending side”. Is this the same meaning intended by JSTransferable / kTransfer?

jasnell commented 2 years ago

Node.js itself presents a solid set of use cases here. Take AbortSignal, for instance. It already inherits from EventTarget. In order to make it cloneable, as has been discussed in another issue over in the dom repo, we also have to put JSTransferable in its prototype chain. We accomplish this by creating the AbortSignal first, then creating an instance of JSTransferable and setting its prototype to the AbortSignal. It ends up being largely transparent to the user, but it's really a bit of a hack. Given that many of the Web Platform APIs are implemented in JavaScript in Node.js, that becomes the only way we currently have to make them cloneable or transferable. Because we're creating a native object, there's also a performance penalty. It would be great if we could avoid that.

The other use case is the known issue that JS class instances can't currently be cloned. Sure, we could require that every application come up with its own intermediary format for any JS class object it might want to clone, but it would be nicer if we made it a bit easier for them, and did so in a way that would work consistently across multiple JavaScript runtimes by baking it into the standard. Basically, I'd like to be able to create a cloneable JavaScript class that works with structuredClone and postMessage no matter what platform the code is run in.

domenic commented 2 years ago

I don't understand the applicability of your first paragraph to the HTML and DOM Standards. How Node.js chooses to implement those specs has nothing to do with whether the specs make these things clonable, and in general implementation limitations or choices should have no bearing on standardization or use case discussions.

So it sounds like this reduces to making something about user classes nicer. Can you state that in the form of use cases that have come up concretely, like the example in the FAQ entry I linked to?

annevk commented 2 years ago

Note that there seems to be some demand for this on the TC39 side so proxy objects can be serialized, which seems like a legitimate use case.

jasnell commented 2 years ago

Proxies are a good case, yes. I would argue that making it easier for instances of a class is also quite valid, as would the ability to allow an object to include private state in the clone.

The other use case is deeply nested object graphs or maps, where you don't really know what's there in advance and can't really know if it's even possible to extract a serializable representation.

Consider a case such as:

class Foo {
  #bar = undefined;
  constructor(a) {
    this.#bar = a;
  }
  doSomething() {
    if (this.#bar === 1) { /* do something */ }
    else { /* do something else */ }
  }
}

const map = new Map();
map.set("abc", new Foo(1));
map.set("xyz", new Foo(2));

postMessage(map);

Or, the case where the above Foo instance is deeply nested into some complex object graph.

Using the postMessage(getSerializableRep(obj)) pattern, I would have to walk the entire tree and build a new one that is guaranteed to contain only serializable objects, then walk that same graph on the deserialization side to get the proper object types back. That's extremely cumbersome when it could be done while the original graph is being serialized/deserialized.
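That manual walk might look something like the sketch below (walkAndReplace and the tagged wire format are invented for illustration):

```javascript
// Toy class with private state, as in the example above.
class Foo {
  #bar;
  constructor(a) { this.#bar = a; }
  get bar() { return this.#bar; }
}

// Recursively rebuild the graph, replacing Foo instances with plain
// tagged objects. A real version would also handle cycles, Sets,
// typed arrays, and so on.
function walkAndReplace(value) {
  if (value instanceof Foo) return { $type: 'Foo', bar: value.bar };
  if (value instanceof Map) {
    return new Map(
      [...value].map(([k, v]) => [walkAndReplace(k), walkAndReplace(v)]));
  }
  if (Array.isArray(value)) return value.map(walkAndReplace);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, walkAndReplace(v)]));
  }
  return value;
}

const map = new Map([['abc', new Foo(1)], ['xyz', new Foo(2)]]);
const wire = walkAndReplace(map);
// wire.get('abc') → { $type: 'Foo', bar: 1 }, now safe to postMessage
```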

bakkot commented 2 years ago

Note that there seems to be some demand for this on the TC39 side so proxy objects can be serialized, which seems like a legitimate use case.

I think Mark in that thread would probably prefer that Proxies be treated like other objects (i.e. iterate the properties and serialize the values, plus a little complexity around arrays and stuff), rather than explicit support for serializing them in any special way. (That's also my preference, full disclosure.) So I don't think this should count as a use case for the purposes of this thread, necessarily.

bathos commented 2 years ago

@bakkot That is also what I’d like to see re: proxies, both to close the Proxy-exotic-object-status observability hole it creates and because there are already user-code-invoking paths (including getter invocation) so it seemingly(?) isn’t accomplishing anything useful. (Likewise “value is an Array exotic object” → IsArray(value)).

Regarding the proposed idea itself, I’m still pretty curious if the concept would be specific to messaging. We currently use a “descriptor wrappers” pattern towards these ends sometimes with History. A well-known-symbol contract sounds like an appealing alternative, but I’m not sure how (or if) the premise could really work in that context.

annevk commented 2 years ago

@bakkot wouldn't that mean you can still sniff out proxies from sets or platform objects and such?

bakkot commented 2 years ago

@annevk A Proxy for a Set already does not behave like a Set: Set.prototype.has.call(new Proxy(new Set, {})) throws. (Similarly for platform objects like Image or whatever.) That's not really a problem.

Rather, the concern is whether you can distinguish between a Proxy for a regular, non-platform object and a bare such object.

Edit: on discussing with @erights, he's also concerned about a Proxy for a Set being practically usable like a Set, so the sketch above wouldn't entirely satisfy him.

annevk commented 2 years ago

I see, that seems like a relatively straightforward change then. Nice.

Ginden commented 2 years ago

Personally, I would like to see API like:

structuredClone.register(Symbol.for('Foo'), {
  deserialize(v /* Serializable */) { /* return a Foo */ },
  serialize(v /* Foo */) { /* return a Serializable */ }
});

class Foo {
  [Symbol.structuredCloneIdentifier] = Symbol.for('Foo');
}

If an object with a custom Symbol.structuredCloneIdentifier is passed and there is no deserializer on the receiving side, an error is thrown.
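A userland approximation of this registry idea might look like the following sketch (cloneIdentifier, register, and cloneWithRegistry are all made up here; it also glosses over the fact that in the real proposal the deserializer lives on the receiving side):

```javascript
// cloneIdentifier stands in for the proposed
// Symbol.structuredCloneIdentifier; handlers stands in for the
// proposed structuredClone.register machinery.
const cloneIdentifier = Symbol('structuredCloneIdentifier');
const handlers = new Map();

function register(id, handler) {
  handlers.set(id, handler);
}

function cloneWithRegistry(value) {
  const id = value?.[cloneIdentifier];
  if (id !== undefined) {
    const handler = handlers.get(id);
    if (handler === undefined) throw new Error('no deserializer registered');
    // serialize to plain data, clone that, then rebuild the instance
    return handler.deserialize(structuredClone(handler.serialize(value)));
  }
  return structuredClone(value);
}

class Foo {
  constructor(data) { this.data = data; }
  [cloneIdentifier] = Symbol.for('Foo');
}

register(Symbol.for('Foo'), {
  serialize: (foo) => foo.data,
  deserialize: (data) => new Foo(data),
});

const copy = cloneWithRegistry(new Foo({ a: 1 }));
// copy is a Foo instance again, with copy.data.a === 1
```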

jimmywarting commented 2 years ago

I just found myself in need of cloning a custom-built class that I wish to save in IndexedDB.

Jamesernator commented 2 years ago

One idea for deserialization of custom objects across realms could be to utilize a feature like module blocks to provide the deserialization steps.

i.e. Suppose we have some class we want to make serializable/deserializable:

// just a toy example to demonstrate the API
class Point {
    #x;
    #y;

    constructor(x, y) {
        this.#x = x;
        this.#y = y;
    }

    get x() {
        return this.#x;
    }

    get y() {
        return this.#y;
    }
}

Then serialization would be pretty trivial by just providing some method:

class Point {
    // ...
    [structuredClone.serialize]() {
        return { x: this.#x, y: this.#y };
    }
}

However, because Point may not in general exist in a given worker or other context we send an object to, a simple [structuredClone.deserialize] can't work. Something like a module block, however, would:

class Point {
    // ...
    static [structuredClone.deserializerModule] = module {
         // Actually import the Point class, this way
         // we can create the Point objects in any
         // worker/etc that has this module block
         import Point from "./Point.js";

         // The actual deserializer function
         export function deserialize({ x, y }) {
             return new Point(x, y);
         }
    }
}

This would work with something like worker.postMessage: when serializing a point, say Point(3, 4), a reference to the deserializerModule is also captured and passed as part of the serialization. That is, the custom object would really be serialized to something like:

{
    [[Type]]: "custom",
    // The serialized data returned by [structuredClone.serialize]
    [[Data]]: { x: 3, y: 4 },
    // The deserializer [structuredClone.deserializerModule]
    [[Deserializer]]: module { ... }
}

During deserialization, when [[Type]]: "custom" is seen, the module is imported into the worker/etc. (if it's already been imported this is idempotent, as that's how module caching works). It then calls the resulting module's deserialize(...) export with [[Data]] to produce the result.
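Module blocks aren't implemented anywhere yet, so here is only a rough sketch of that dispatch step, with the module-block import replaced by a lookup in a prepopulated table (all names are illustrative):

```javascript
// Stand-in for module caching: looking the same "module" up twice is
// idempotent because the Map returns the same object both times.
const moduleCache = new Map([
  ['point-deserializer', {
    deserialize: ({ x, y }) => ({ kind: 'Point', x, y }),
  }],
]);

// Dispatch step: plain records pass through; "custom" records are fed
// to their deserializer module's exported deserialize() function.
function deserializeRecord(record) {
  if (record.type !== 'custom') return record.data;
  const module = moduleCache.get(record.deserializer);
  return module.deserialize(record.data);
}

const point = deserializeRecord({
  type: 'custom',
  data: { x: 3, y: 4 },
  deserializer: 'point-deserializer',
});
// point → { kind: 'Point', x: 3, y: 4 }
```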

Now there is one caveat here: because import(...) is asynchronous, this would be okay for passing asynchronously cross-thread, but structuredClone is sync. For this we could just have separate properties for local-thread vs. cross-thread use:

class Point {
    // thread local deserializer
    static [structuredClone.deserialize]({ x, y }) {
        return new Point(x, y);
    }

    // cross-thread deserializer
    static [structuredClone.deserializeModule] = module {
        import Point from "./Point.js";

        export function deserialize(data) {
            return Point[structuredClone.deserialize](data);
        }
    }
}

The actual API shape is fairly immaterial, but it shows the idea that we can transfer a deserializer across threads to perform deserialization. Technically the dependency on module blocks isn't strictly necessary either; they could be replaced by just providing a deserializer URL (although module blocks definitely solve issues regarding CSP and such):

class Point {
    static [structuredClone.deserializeModule]
         = new URL("./Point_deserializer.js", import.meta.url).href;
}

We could even imagine something a bit less dynamic if that would help implementations by having an explicit register step (as previously suggested by @Ginden) akin to how customElements.define works:

i.e.

class Point {
    // ...rest of impl

    static [structuredClone.serialize](point) {
        return { x: point.#x, y: point.#y };
    }

    static [structuredClone.deserialize]({ x, y }) {
        return new Point(x, y);
    }

    static [structuredClone.deserializeModule] = module {
        import Point from "./Point.js";

        export function deserialize(data) {
            return Point[structuredClone.deserialize](data);
        }
    }
}

// Capture the initial value of [structuredClone.serialize], [structuredClone.deserialize]
// and [structuredClone.deserializeModule] similar to how customElements.define captures
// the initial values of connectedCallback and stuff so it can optimize them more easily
structuredClone.register(Point);
jimmywarting commented 2 years ago

Think it would be handy if we could 1) clone something into a (Shared)ArrayBuffer or Blob, 2) send it via some API to Node.js / Deno / Bun / a WebRTC peer, and 3) deserialize it back from that binary data, as a way to replace JSON, which loses information when you, for example, convert a Date into a string: JSON.parse(JSON.stringify(new Date())) -> "not the same thing".

JSON can't handle binary data very well, and it lacks support for things like circular references, Blob, File, Set, Map, BigInt, TypedArrays, ArrayBuffer, Date, and everything else that structuredClone supports.

JSON is fairly limited in what you can do with it. I think it's time to replace the old legacy JSON API with something newer that doesn't need to convert images to base64, and that also has the potential to decrease the payload with something more compact.
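The Date example is easy to demonstrate side by side with structuredClone (this runs as-is in modern browsers and in Node.js 17+):

```javascript
const d = new Date(0);

// JSON flattens the Date into an ISO string on the way through...
const viaJSON = JSON.parse(JSON.stringify({ when: d }));
// viaJSON.when is the string "1970-01-01T00:00:00.000Z", not a Date

// ...while structured clone keeps it a Date with the same timestamp.
const viaClone = structuredClone({ when: d });
// viaClone.when instanceof Date, viaClone.when.getTime() === 0
```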

Jamesernator commented 2 years ago

Think it would be handy if we could 1) clone something into a (shared)ArrayBuffer or Blob

I think you're more so asking for: #3517

WebReflection commented 4 months ago

No progress, but the demand is greater and greater these days... my proposal was to add a Symbol.clone that acts just like toJSON.

I don't have a strong opinion on the Symbol.clone name; it could well be more verbose, as long as something makes it possible to postMessage proxies without breaking or requiring manual intervention from users.

Thanks for considering any progress around this topic; it's essential in WASM-related projects too, especially when WASM code runs in Workers.

Ginden commented 4 months ago

We could even imagine something a bit less dynamic if that would help implementations by having an explicit register step (as previously suggested by @Ginden) akin to how customElements.define works:

An explicit registration step, using either strings or symbols stored in the global symbol registry, is the only solution I can think of that would reasonably satisfy the following constraints:

Just registering Point is not enough, because receiving side can't identify that Point class matches Point on sender side - duplicated class names are pretty common.

Jamesernator commented 4 months ago

Just registering Point is not enough, because receiving side can't identify that Point class matches Point on sender side - duplicated class names are pretty common.

I wasn't suggesting using the class name whatsoever, rather the prototype itself is the registry key.

Yes, my suggestion doesn't support storage, i.e. it only supports postMessage/structuredClone, similar to other non-storage types (SharedArrayBuffer, MessagePort, etc.), but it has the advantage of being able to deserialize within the agent cluster without registering per agent (i.e. structuredClone.register would allow the engine to prepare the deserializer in whichever agents it wants).

WebReflection commented 4 months ago

FWIW I don't think registering is helpful, and if it uses global symbols it still collides. In my specific use case, proxies in a realm don't actually want/need to be deserialized, or can't be; they are forwarded back (Atomics + Proxy), so serialization is all that's needed.

Once serialization can return something else compatible with the structured clone algorithm, I think it'd be up to the user/developer to decide what to do with that serialized data. Automagic deserialization looks more dangerous than useful to me, as it requires registering things twice, once per realm, and if the registrations are not aligned, who knows what happens. If there is just enough control to decide what to send in a postMessage-related dance, we should be good, as we've been good to date using just toJSON for more complex use cases/data.

WebReflection commented 1 month ago

Another day in WASM / JS interop, another issue to tackle around this topic.

First of all, I read the whole thread again, and I think we don't strictly need both Symbol.serialize and Symbol.unserialize. Let me expand on this: the same way toJSON has already been good enough to solve similar cases, the fromJSON dance is an implementation detail of the receiver of the data.

Having just "a say" in how stuff should be passed along when postMessage happens, for example with deeply nested data, would save tons of CPU cycles for whoever, like me, needs to traverse simple-to-complex data trees to find out if something is a Proxy. And proxies are the most common identity anyone could find in the WASM/JS interop world: they proxy internal WASM pointers to the JS world, and that's the story, no strings attached.

The moment any of those proxies need to survive a postMessage dance is the moment:

Accordingly, I would suggest focusing solely on a Symbol.structuredClone (no hard opinion on the name) that allows WASM-targeting projects, as well as JS-targeting libraries, to define how a specific Proxy or special class should travel across realms.

Thanks in advance for eventually considering that part in particular, and hopefully for moving this still-open requirement for the Web platform forward.

WebReflection commented 1 month ago

On the other hand... I wonder if this whole request could be confined to a special symbol/behavior for proxies only, so that it could be moved from something broader to something like a Reflect.structuredProxy trap:

new Proxy(thing, {
  structuredProxy(target) {
    // custom behavior:
    return { thing: [...Object.entries(target)] };
    // or the default behavior:
    // return Reflect.structuredProxy(target);
  }
});

This would confine the issue to proxies only, which is the main/major pain point (imho), without exposing the Proxy nature of the instance directly, as it needs to be cloned anyway and there's no direct way to tell, on the other end, whether that object literal, for example, was a proxy or not before.

If this makes sense to anyone, I'd be super happy to help move it forward.

P.S. at the Reflect level it could be named just clone, or even postMessage, as that is where it matters the most, and it could also be confined to that use case only.

bakkot commented 1 month ago

I'm opposed to making this available only for Proxies. That will lead to people using them where not otherwise necessary just so they can customize cloning, which would be bad.

WebReflection commented 1 month ago

@bakkot I am not opposed to you being opposed to my latest idea; I am just out of ideas/use cases to present to eventually move this forward, so I am trying to stretch the goal ... let's scratch that Proxy-only idea already, but please let's try to figure out a way forward around this issue, thank you!

edit ... because if anyone sees that happening (people using proxies just to have this ability), it means this ability is more than desired, so please let's find a way forward!

bakkot commented 1 month ago

I agree this would be nice but alas I do not personally have the time to push for anything concrete here.

WebReflection commented 1 month ago

@bakkot fair enough ... now, imagine I could patch postMessage globally so that it crawls any crawlable reference before actually posting data, and if it finds a Symbol.clone method on any reference it invokes it, so that we could have a real-world implementation of this idea for tests and such ... would that help move this topic forward? If it does, I'm pretty quick at "monkey patching the world", so just let me know, thanks.
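A minimal version of that patch might look like the sketch below (kClone stands in for the proposed Symbol.clone; Maps, Sets, and other exotic containers are skipped for brevity):

```javascript
// kClone stands in for the proposed Symbol.clone.
const kClone = Symbol('Symbol.clone');

// Before cloning, walk the value and replace anything exposing the
// clone hook with the hook's result (cycles handled via `seen`).
function preClone(value, seen = new Map()) {
  if (value === null || typeof value !== 'object') return value;
  if (seen.has(value)) return seen.get(value);
  if (typeof value[kClone] === 'function') {
    return preClone(value[kClone](), seen);
  }
  const out = Array.isArray(value) ? [] : {};
  seen.set(value, out);
  for (const [key, child] of Object.entries(value)) {
    out[key] = preClone(child, seen);
  }
  return out;
}

// "Monkey patch the world": run the crawl before the real clone.
const originalStructuredClone = globalThis.structuredClone;
globalThis.structuredClone = (value, options) =>
  originalStructuredClone(preClone(value), options);
```

postMessage could be wrapped the same way on both the sending and worker side.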

WebReflection commented 1 month ago

@bakkot ... better done than said ...

symbol-structured-clone

index.html test page

<!doctype html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width,initial-scale=1.0">
    <script type="module">
      import 'symbol-structured-clone';

      class LiteralMap extends Map {
        // when cloned or via postMessage
        // it will return an object literal instead
        [Symbol.structuredClone]() {
          return Object.fromEntries([...this]);
        }
      }

      const lm = new LiteralMap([['a', 1], ['b', 2]]);

      structuredClone(lm);
      // {"a": 1, "b": 2}

      postMessage(lm);
      // event.data received as
      // {"a": 1, "b": 2}
    </script>
  </head>
</html>

From the little I've tested, I think this demo shows what users need to do in order to fix possible issues before postMessage or structuredClone is invoked.

The dance is terse, but it took a while in the field to get it right, and even if I publish this tomorrow on npm, not all users would understand what it does, why it's needed, and most importantly, why their code is suddenly more correct and less error-prone, but also slower in every raw benchmark result.

I hope this helps in understanding what we, Web users, need to deal with when it comes to cross-realm or cloning situations, and I still hope this issue will move forward sooner rather than later.

WebReflection commented 1 month ago

I went ahead and created a polyfill based on previous code that automatically patches structuredClone and postMessage in either the main or worker thread: https://github.com/WebReflection/symbol-structured-clone#readme

I am not suggesting anyone use this in production, but it surely shows that it is possible to have this feature, and if it were baked into the specs, and accepted as a new symbol by TC39 too, this could be a polyfill to use until all vendors are aligned.

Note that if Symbol.structuredClone already exists, the little code that patches the world does literally nothing. I find that name pretty explicit and convenient, but if there's any interest in moving this forward and a different name is used, please let me know and I'll just update the polyfill.

Thanks again for eventually considering this hugely desired improvement to the Web platform.

Offroaders123 commented 1 month ago

My personal use case for this would be the ability to add my own custom primitives that can be passed along via postMessage() between Web Workers.

Right now, I'm using my own custom class Int8 extends Number {}, which adds what would be an int8 number type to JavaScript (in my use cases, I need to preserve the original precision of specific number types on an object, with regard to what the user set them to originally, and using number on its own isn't specific enough, because that would essentially be float64).

Right now, when passing my Int8 instance to the worker with postMessage(), the resulting object is just a plain number again; it loses the branding of the type it's meant to describe.
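Until something like this lands, one workaround is an explicit tag on the sending side. A sketch, where the tag format and helpers are invented for illustration, and structuredClone stands in for the postMessage round trip:

```javascript
// Int8 is from the comment above; the $int8 tag is invented here.
class Int8 extends Number {}

// Sending side: replace the Int8 wrapper with a tagged plain object so
// the branding survives structured cloning.
function tagInt8(value) {
  return value instanceof Int8 ? { $int8: value.valueOf() } : value;
}

// Receiving side: restore the wrapper from the tag. (Single-value
// sketch; a real version would walk the whole message.)
function untagInt8(value) {
  return value !== null && typeof value === 'object' && '$int8' in value
    ? new Int8(value.$int8)
    : value;
}

const restored = untagInt8(structuredClone(tagInt8(new Int8(42))));
// restored is an Int8 again, with restored.valueOf() === 42
```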