API for structured serialized data

devsnek commented 6 years ago

Problem

JSON is being left in the dust as we get more and more stuff for JS, and it probably won't be getting any updates.
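To make the gap concrete, here is a minimal sketch of what a JSON round trip does to a few built-in types. The variable names are illustrative only.

```javascript
// Values JSON cannot represent faithfully.
const original = {
  when: new Date(0),
  lookup: new Map([['a', 1]]),
  tags: new Set(['x']),
  missing: undefined,
};

const viaJson = JSON.parse(JSON.stringify(original));

console.log(typeof viaJson.when);  // the Date came back as a string
console.log(viaJson.lookup);       // the Map came back as an empty object
console.log(viaJson.tags);         // the Set came back as an empty object
console.log('missing' in viaJson); // undefined properties are dropped

// BigInt is rejected outright with a TypeError.
let bigIntError = null;
try {
  JSON.stringify({ n: 10n });
} catch (err) {
  bigIntError = err;
}
console.log(bigIntError instanceof TypeError);
```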

Goals

support all the js types (except for things like weakmaps, functions, etc, keep state safe)
supports cold storage
standardized format so we can use it between engines
forward compatibility

Intuitions

JSON.parse/JSON.stringify type api
StructuredSerializeForStorage
optionally async/streaming(?)
binary format

Prior art

https://github.com/addaleax/cold-storage
https://nodejs.org/api/v8.html#v8_serialization_api (just a wrapper around v8's serdes)

Other considerations

non-js servers wanting to generate this format for client consumption (maintain a reference C impl? steal an existing system like ETF?)
new mime type, new method for fetch Response

bmeck commented 6 years ago

Symbol.for should be cross realm safe, idk if we can show exceptions on the differences for these versus other symbols.

macdja38 commented 6 years ago

Would be nice to have circular references for something like this as well.

guest271314 commented 6 years ago

One possible option would be creating dynamic modules which could be exported and imported https://github.com/w3c/FileAPI/issues/97#issuecomment-366764094.

devsnek commented 6 years ago

@guest271314 this format shouldn't perform any evaluation (all stored data should be completely static)

guest271314 commented 6 years ago

The data can be stored as text and only evaluated when necessary.

devsnek commented 6 years ago

this format cannot rely on any form of evaluation; it presents inherent security and performance concerns.

guest271314 commented 6 years ago

Is a goal to encrypt the data? Plain text could be converted to and from an ArrayBuffer. Are import and export insufficient? What are the use cases?

devsnek commented 6 years ago

the goal here is a static format for storing serialized javascript values analogous to JSON but with support for many more types. this api would most likely be similar to https://nodejs.org/api/v8.html#v8_serialization_api but without the class or transferables support.
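For reference, a rough sketch of the Node.js API linked above, run in Node.js only. It round-trips a few types JSON cannot carry; the sample object is made up for illustration.

```javascript
const v8 = require('node:v8');

const value = {
  createdAt: new Date(0),
  counts: new Map([['a', 1]]),
  seen: new Set([1, 2]),
  big: 123n,
};

// serialize() returns a Node.js Buffer; deserialize() rebuilds the value.
const buffer = v8.serialize(value);
const restored = v8.deserialize(buffer);

console.log(Buffer.isBuffer(buffer));
console.log(restored.createdAt instanceof Date);
console.log(restored.counts.get('a'));
console.log(restored.seen.has(2));
console.log(restored.big === 123n);
```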

guest271314 commented 6 years ago

Map and WeakMap do not meet requirement?

jakearchibald commented 6 years ago

It feels like the intent here is to expose https://html.spec.whatwg.org/multipage/structured-data.html#serializable-objects as some kind of format.

devsnek commented 6 years ago

@jakearchibald indeed that is what inspired me to open this

Jamesernator commented 5 years ago

There might be security considerations for platform objects (step 18) that are [Serializable] as they might expose data that isn't usually exposed to user-land.

pierogitus commented 3 years ago

I think this is worth another look. As I just mentioned in the fetch repo, CBOR with tags is well suited to this problem. It's also part of WebAuthn, so there must be some CBOR logic already happening in all the major browsers.
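As a rough illustration of "CBOR with tags", the sketch below hand-encodes a few JS values following RFC 8949. It is not a complete encoder and not anything proposed in this thread: cborHead and cborEncode are hypothetical helper names, only non-negative integers below 2^32, strings, arrays, Maps and Dates are handled, and Dates are truncated to whole seconds under tag 1 (epoch-based time).

```javascript
// Header: 3 bits of major type plus 5 bits of "additional information".
function cborHead(majorType, n) {
  if (n < 24) return [(majorType << 5) | n];
  if (n < 0x100) return [(majorType << 5) | 24, n];
  if (n < 0x10000) return [(majorType << 5) | 25, n >> 8, n & 0xff];
  if (n < 0x100000000) {
    return [(majorType << 5) | 26,
      (n >>> 24) & 0xff, (n >>> 16) & 0xff, (n >>> 8) & 0xff, n & 0xff];
  }
  throw new RangeError('value too large for this sketch');
}

function cborEncode(value) {
  // Major type 0: unsigned integer.
  if (Number.isInteger(value) && value >= 0) return cborHead(0, value);
  // Major type 3: UTF-8 text string.
  if (typeof value === 'string') {
    const utf8 = new TextEncoder().encode(value);
    return [...cborHead(3, utf8.length), ...utf8];
  }
  // Major type 4: array.
  if (Array.isArray(value)) {
    return [...cborHead(4, value.length), ...value.flatMap((item) => cborEncode(item))];
  }
  // Major type 6: tag. Tag 1 marks the next number as seconds since the epoch.
  if (value instanceof Date) {
    return [...cborHead(6, 1), ...cborEncode(Math.floor(value.getTime() / 1000))];
  }
  // Major type 5: map of key/value pairs.
  if (value instanceof Map) {
    const body = [];
    for (const [k, v] of value) body.push(...cborEncode(k), ...cborEncode(v));
    return [...cborHead(5, value.size), ...body];
  }
  throw new TypeError('type not covered by this sketch');
}

const encoded = cborEncode(new Map([['at', new Date(1000)]]));
console.log(encoded.map((b) => b.toString(16).padStart(2, '0')).join(' '));
```

The tag is what lets a decoder tell a Date from a plain number, which is the property that makes tagged CBOR a candidate for types JSON lacks.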

WebReflection commented 1 month ago

Symbol.for should be cross realm safe

it's not, I had to implement my own orchestration to make that useful cross realm. It works, but it's not ideal.

I think this is worth another look.

I work with this algorithm daily and I am super sad to see my "yet another beg for it", with reasons and use cases behind it, being just shut down due to this long-standing, stale issue that is going nowhere 😥

WebReflection commented 1 month ago

If anyone is still following this ... what is blocking Chromium-based browsers from offering what NodeJS has been offering for a long time?

This API is allowed where security usually matters the most, the back-end, and it's not flagged as deprecated or something + security concerns are not even mentioned: it would just allow users on the Web too to store JS values as buffers and get these back without needing user-land projects to do that (I maintain structured-clone polyfill which offers that toJSON and fromJSON convention but it's weird I need to offer that because the platform cannot).

It would be lovely from this group to explain what are the caveats, blockers, security concerns, or issues, for something that internally seems to be already implemented and used outside the Web ... I love providing polyfills and yet I always can't wait for these to be redundant, unnecessary, just overhead with modern features offered by the Web.

Thank You!

WebReflection commented 1 month ago

It would be lovely from this group to explain what are the caveats, blockers, security concerns, or issues ...

I don't want to be the sloppy one here so I've read this thread and I'd like to summarize my thoughts as a user.

Symbol.for should be cross realm safe

agreed, but also all known symbols, as I do it already in my code. Any Symbol.wellKnown can be mapped and re-proposed at the end target out of that realm Symbol, this works well and it's the easy part.
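A minimal sketch of the kind of mapping described here, assuming symbols travel as small plain descriptors. describeSymbol and reviveSymbol are hypothetical helper names, not part of any API or of the polyfill mentioned in this thread.

```javascript
// Two-way table for well-known symbols (Symbol.iterator, Symbol.asyncIterator, ...).
const wellKnownByName = new Map();
const nameByWellKnown = new Map();
for (const name of Object.getOwnPropertyNames(Symbol)) {
  const candidate = Symbol[name];
  if (typeof candidate === 'symbol') {
    wellKnownByName.set(name, candidate);
    nameByWellKnown.set(candidate, name);
  }
}

// Sending side: turn a symbol into a plain, serializable descriptor.
function describeSymbol(symbol) {
  const key = Symbol.keyFor(symbol);
  if (key !== undefined) return { kind: 'registered', key };
  const name = nameByWellKnown.get(symbol);
  if (name !== undefined) return { kind: 'well-known', name };
  throw new TypeError('unique symbols have no identity outside their realm');
}

// Receiving side: look the symbol up again in the target realm.
function reviveSymbol(descriptor) {
  if (descriptor.kind === 'registered') return Symbol.for(descriptor.key);
  const symbol = wellKnownByName.get(descriptor.name);
  if (symbol === undefined) throw new TypeError('unknown well-known symbol');
  return symbol;
}

console.log(reviveSymbol(describeSymbol(Symbol.iterator)) === Symbol.iterator);
console.log(reviveSymbol(describeSymbol(Symbol.for('app.id'))) === Symbol.for('app.id'));
```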

Would be nice to have circular references for something like this as well.

My polyfill handles circular references already and I think (???) v8.serialize does that too ... after all, anything structuredClone based gives that feature out of the box, as specified, right?
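A quick check of that assumption, as a sketch for Node.js 17 or later (where structuredClone is a global). Both the structured clone algorithm and v8.serialize keep cycles intact here.

```javascript
const v8 = require('node:v8');

const node = { name: 'root', children: [] };
node.self = node;                         // direct cycle
node.children.push({ parent: node });     // cycle through a child

const cloned = structuredClone(node);
const revived = v8.deserialize(v8.serialize(node));

console.log(cloned !== node);                        // a new object graph
console.log(cloned.self === cloned);                 // cycle preserved
console.log(cloned.children[0].parent === cloned);
console.log(revived.self === revived);
console.log(revived.children[0].parent === revived);
```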

Plain text could be converted to and from an ArrayBuffer. Are import and export insufficient? What are the use cases?

My use case, which is in production already, is the following:

I have a worker that has a proxy reference to the main (page) global context
I use Atomics.wait to ask for the length of that recursively safe serialized data as stringified text (and I need my polyfill for that) and then a second Atomics.wait right after to read that UTF-16 data from the SharedArrayBuffer with the proper length, truncate it to avoid 0 at the end, and parse back the result ... this works, but it necessarily requires some orchestration, especially around the serialization and deserialization process. On the Worker side of affairs the operation is fully synchronous, allowing me to enable use cases such as a = input("what's your name?") in Pyodide, MicroPython, or any other WASM targeting PLs
again, this works, but in benchmarks this is suboptimal, and the bottleneck is mostly around that serialize / deserialize dance, which is surely faster when a basic postMessage(structuredClonableData) is performed to communicate data x-realms ... too bad that while Atomics.wait is blocking, no message listener can ever trigger with results on the Worker
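The data-copy half of that dance can be sketched as below. This is not the production code described above: writeUtf16 and readUtf16 are hypothetical helpers, JSON.stringify stands in for the polyfill's serializer, and everything runs on one thread, so the blocking Atomics.wait call a real Worker would make is only indicated in a comment.

```javascript
// Main-thread side: copy a string into shared memory as UTF-16 code units.
function writeUtf16(sharedBuffer, text) {
  const units = new Uint16Array(sharedBuffer);
  if (text.length > units.length) throw new RangeError('shared buffer too small');
  for (let i = 0; i < text.length; i++) units[i] = text.charCodeAt(i);
  return text.length;
}

// Worker side: read exactly `length` code units, ignoring the zero padding.
function readUtf16(sharedBuffer, length) {
  const units = new Uint16Array(sharedBuffer, 0, length);
  const chunk = 8192; // keep the argument list of fromCharCode bounded
  let text = '';
  for (let i = 0; i < length; i += chunk) {
    text += String.fromCharCode(...units.subarray(i, i + chunk));
  }
  return text;
}

const payload = JSON.stringify({ answer: 'Zoë 🐍' });
const shared = new SharedArrayBuffer(1024);
const control = new Int32Array(new SharedArrayBuffer(4));

// Publish the data, then the length, then wake any waiter.
const written = writeUtf16(shared, payload);
Atomics.store(control, 0, written);
Atomics.notify(control, 0);

// A Worker would block here with Atomics.wait(control, 0, 0) before loading.
const length = Atomics.load(control, 0);
const roundTripped = JSON.parse(readUtf16(shared, length));
console.log(roundTripped.answer);
```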

Most WASM targeting PLs are better off the main thread because they block on bootstrap, WASM blocks on bootstrap, all the things block on bootstrap when these are not "that tiny", so many WASM targeting PLs are choosing the fully async Worker way to provide their PL, but that easily fails on anything REPL-like. You can see a fully working MicroPython REPL here, make it a Pyodide if you like, and see the main thread is never blocked.

I hope one of the use cases is clear here:

I could just do the double Atomics.wait dance to know the length of the v8.serialize(any) buffer
I could just v8.deserialize(buffer) truncated at the right length ... no user-land code needed
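In Node.js terms, where the inverse of v8.serialize is named v8.deserialize, the two steps above might look like the sketch below. The 256-byte shared buffer and the variable names are assumptions for illustration, and the Atomics signalling is left out.

```javascript
const v8 = require('node:v8');

// Sending side: serialize natively and copy the bytes into shared memory.
const bytes = v8.serialize(new Map([['name', 'Ada']]));
const sharedBytes = new Uint8Array(new SharedArrayBuffer(256));
sharedBytes.set(bytes);

// Shared memory is over-allocated, so the real length travels separately.
const byteLength = bytes.length;

// Receiving side: copy out exactly byteLength bytes, then deserialize.
const received = Buffer.from(sharedBytes.subarray(0, byteLength));
const result = v8.deserialize(received);

console.log(result instanceof Map);
console.log(result.get('name'));
```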

Map and WeakMap do not meet requirement?

I believe an awesome achievement would be to provide any valid type supported by the algorithm and nothing else ... there Map would be safe.
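The existing algorithm already draws that line, as this small sketch shows (Node.js 17 or later, or any browser with structuredClone): a Map is cloned, while a WeakMap is rejected.

```javascript
const clonedMap = structuredClone(new Map([['k', { v: 1 }]]));
console.log(clonedMap instanceof Map);
console.log(clonedMap.get('k').v);

// WeakMap is not a serializable type, so cloning throws a DataCloneError.
let weakMapError = null;
try {
  structuredClone(new WeakMap());
} catch (err) {
  weakMapError = err;
}
console.log(weakMapError.name);
```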

It feels like the intent here is to expose https://html.spec.whatwg.org/multipage/structured-data.html#serializable-objects as some kind of format.

Yes and no. Operations here are per-browser and don't need to be universally the same ... meaning there's no need to agree on a standard buffer result to me, any vendor is free to use the convention they like or use already internally.

This could be an enabler for the feature to land sooner than later, as no bike-shedding is needed for the intermediate buffer:

if it's used via SharedArrayBuffer and Atomics, nobody cares
if it's used to store data in IndexedDB "just because we can", that IndexedDB works only on that browser, it should be OK to not have a portable buffer format
if it's used to store data on the back-end to be able to retrieve it later, the most common scenario here is that the user would still use the same browser to store that and eventually retrieve that, but it's fairly trivial from any service to state: "look, you gave us this data from Chrome/ium/Edge browser, Safari can't really handle it the same ... please login again with Chrome/ium/Edge to have access"

I hope these thoughts make sense.

There might be security considerations for platform objects (step 18) that are [Serializable] as they might expose data that isn't usually exposed to user-land.

This one concerns me but I wonder why that's not an issue when using postMessage ... any way to expand on this concern? Thanks!

Hopefully I've grabbed all relevant topics and maybe helped this forward ... it's just a hope, I take no as an answer, I just would like to understand the why as nothing, behind the scene, seems to be missing.

edit

the double Atomics.wait dance will be unnecessary once resizable SharedArrayBuffer lands cross browser but that's still not the point; it's the need to convert into a UTF-16 string and back that is not ideal and likely slower than anything native code could do
I did write the wrong link for supported structuredClone types, now fixed
if this would land anyhow, I can already write a polyfill for it because "my dance" is proven already to work well and such polyfill would be trivial to propose as all moving parts are already there ... once that polyfill won't be needed anymore, anyone will benefit from this extra feature that is so welcomed in the modern Web, thanks!

annevk commented 1 month ago

Operations here are per-browser and don't need to be universally the same

FWIW, that's a non-starter. People will write code that depends on some serialization format sooner rather than later. That's also what makes this hard, it needs to be a universal format that has buy-in from all parties.

What makes platform objects hard: e.g., a Blob or File has a wildly varying data model that often involves a pointer to some disk-backed data structure. Actually, probably similar for SharedArrayBuffer? When the serialization format is opaque as it is with postMessage() and cannot be constructed with arbitrary inputs you run into none of this.

All of this might be solvable, but it's a fairly large undertaking that compared to other issues hasn't gotten an awful lot of traction.

WebReflection commented 1 month ago

@annevk I hear you, but that's why I think that should rather be "the starter", or this won't ever happen (or it'll take forever).

On the other hand, we have already tons of unpredictable API results on the Web:

what canvas.toDataURL() produces as result, with or without base64 around ... that's not even strictly browser related, rather users' hardware choices too
what media produces as audio or video quality / format streams
what even DOMParser.parseFromString could produce, as AFAIK there's no standard implication for the underlying library to perform such parsing (so that subtle differences might be expected?)
what a customElements.define API call could produce or work with ...

All I am saying is that this requirement would benefit a lot of projects that understand the caveats around it; it's like asking JSON.parse to understand php.serialize(value) (metaphorically speaking). But if that's a non-starter for everyone instead, how can we start a conversation about a reasonable format able to represent and satisfy cross-engine requirements?

After 1+ year working with WASM I've learned everyone is using their own convention around FinalizationRegistry and whatnot to make it happen, and that actually worked to date ... so here I am asking: what is the use case to make it cross-browser when presented use-cases don't need that and at the documentation level we can all say "you can't do this or that" like it's already the case for many other APIs? Thanks.

WebReflection commented 1 month ago

@annevk last from me .... could FlatBuffers be a starting discussion point to provide such API? It's already x-platform/browser and IIRC implemented in most vendors for a reason or another ... I just think that if the "agreement on the format" is what's blocking this, we have previous work around similar topics and FlatBuffers seemed to address most issues (personal experience with a company that implemented those).

ggoodman commented 1 month ago

FWIW, that's a non-starter. People will write code that depends on some serialization format sooner rather than later. That's also what makes this hard, it needs to be a universal format that has buy-in from all parties.

@annevk is there a world in which the desired serialization format can be specified as an argument? That would potentially allow the delivery of immediate value with a vendor-specific format without compromising the longer-term goal of shipping a vendor-neutral format.

annevk commented 1 month ago

No, part of what makes standardization hard is that you have to think through and solve for the edge cases as you will be stuck with it essentially forever.

WebReflection commented 1 month ago

@annevk if that argument though is the reason this issue has been stuck for 6 years, is it necessary and productive to block intermediate pragmatic approaches? ‘cause the result is otherwise no progress, and the issue being stuck “forever” due to arguments about not wanting such issue to be stuck forever … I’m seeing a catch-22 / dead loop here and I’m trying to propose “no need to standardize the format, keep it opaque and move forward” but also “how about FlatBuffers to start moving forward standardizing it?”

annevk commented 1 month ago

I think the reason it's been stuck is because in part there's not enough web developer demand for this functionality and mostly because nobody has taken it upon themselves to try to solve it. Having a serializer and deserializer though where the intermediate format is exposed but implementation-defined is just not something that I see succeeding. Implementation-defined behavior needs to be extremely well motivated and this does not meet that bar at all.

WebReflection commented 1 month ago

so you are saying v8.serialize and v8.deserialize are something implemented for the sake of it?

the thing is, Atomics and SharedArrayBuffer after meltdown and spectre got low adoption due to tons of friction around these primitives to start with ... I've found a way to circumvent those issues without even needing special headers around so it's time to make these primitives shine again, but of course as long as perf is subpar, nobody would use these primitives ... for those who do anyway, having these "niche" APIs working well together is crucial, so another catch-22 to me ... nobody wants to use features nobody needs because they don't know they might need such features. After all, before Atomics or SharedArrayBuffer existed, who was proposing these APIs? I hope the answer is not "some internal" or "some member of the group" because there's no way through that from users' perspective.

Again, I am not trying to be hostile or anything, but not wanting APIs because non existent so that not even people using all the bricks around can say "but there are use cases!" feels off from Web standards users' perspective.

I am trying to propose valid use cases that already exist out there (we collaborate with Universities too and we use all these primitives behind the scene) and trying to unlock by proposing APIs already known, such as FlatBuffers ... what else can a user do, as you mentioned it's my fault my interactions here are not productive? I don't see ways around or forward and it saddens me. This is open ... since 2018 ... use cases only increased from then, not decreased, we wouldn't be here discussing this otherwise.

devsnek commented 1 month ago

I think if you want this api to exist you will need to convince individual whatwg members that it would be a good idea and get them to implement it or implement it for them (note that this can be difficult, they might each say "we'll do it if another browser does it first" for example) and from that effort you can put together a spec and a test suite.

WebReflection commented 1 month ago

@devsnek fair enough ... but again, v8.serialize is there ... I haven't investigated if structuredClone uses it, but if it does (and it should in Chromium?) I just don't know how to convince people an API used already internally is useful externally too. If that's not the case though, I might check Chromium internals and see where's the catch/deal around it. Still, if the argument is "not enough users showed interest around this API" my counter argument would be "by reading this thread/issue, they wouldn't dare/care about asking further" which is the part that saddens me.

WebReflection commented 1 month ago

This conversation is going in parallel at TC39 too and that summarizes my latest thoughts around this matter ... here again for the WHATWG audience:

We have already Compression and Decompression Streams where the user is in charge of picking deflate over gzip (too bad brotli is not an option) so, if there is previous work around this topic, we can let the user decide which "transformer" is desired as long as all of them are compatible with structuredClone types?

Internally, all browsers already have a preferred (ad-hoc) choice for that, so that the API I can see is something like:

const serializer = new Serializer('CBOR' || 'syrup' || 'default');
// default means ... whatever the current browser/engine can provide itself

const buffer = serializer.serialize(anyStructuredCloneFriendlyData);
const clone = serializer.unserialize(buffer);
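To show the shape only, here is a hypothetical userland stand-in for that proposal, sketched for Node.js. No Serializer class exists on any platform; this version backs the 'default' format with node:v8 and simply rejects every other format name.

```javascript
const v8 = require('node:v8');

// Hypothetical stand-in for the proposed API; not a real platform class.
class Serializer {
  constructor(format = 'default') {
    if (format !== 'default') {
      throw new TypeError(`format "${format}" is not implemented in this sketch`);
    }
    this.format = format;
  }

  // 'default' means whatever the current engine provides; here, V8's own format.
  serialize(value) {
    return v8.serialize(value);
  }

  unserialize(buffer) {
    return v8.deserialize(buffer);
  }
}

const sketchSerializer = new Serializer('default');
const sketchBuffer = sketchSerializer.serialize({ when: new Date(0), ids: new Set([1]) });
const sketchClone = sketchSerializer.unserialize(sketchBuffer);
console.log(sketchClone.ids.has(1));
```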

It doesn't even need to be synchronous for Atomics.wait use cases, as it can land async and then be resolved into the SharedArrayBuffer so anything similar would work to me plus it does answer a few points:

like it is for compressing deflate VS gzip, it's the user's responsibility to pick a format that's more convenient, without expecting everyone else to be able to consume that format differently. We are OK with TextEncoder and TextDecoder too, where the former only produces utf-8 but the latter can handle utf-16 too (my use case for my stringified logic). Nobody (to my memory) raised concerns that different buffers might or might not work here or there ... so why couldn't this be the key to move this forward?
for one-off use cases anyone can use the hopefully faster default option but when that buffer is meant to be stored and reused, people can instead use the CBOR or syrup or the format the WHATWG agreed to offer as portable one ... is this really that bad as compromise to enable the best per each engine while allowing portable options too?
in my specific case I'd use that native API to create a buffer, notify the worker the minimum needed size to work is X, so it can wait synchronously on a new SharedArrayBuffer where that buffer can be copied into and that's it, no userland String.fromCharCode(i) is ever needed, no library at all to mimic in a slower and not as capable structuredClone thing is needed, everyone wins?

I hope this opens a chance to at least think about a similar API that can be incrementally landed so that users of the first kind, the default one (or call it internal, temporary or unstable or not-portable) can live happily ever after.

pshaughn commented 1 month ago

I feel like comments in this issue thread are coming from at least two subtly different expectations, and disambiguating them might help. Here's a question that might reveal some implicit assumptions: what does it mean to do this to a Blob or File?

If the idea is cold storage or cross-network communication, does that imply it wants to serialize the contents of the Blob or File (which is no longer really doing the same thing structuredClone does)?

If the idea is to mimic exactly what structuredClone does but into a plain bucket of bytes, how do we know whether a Blob or File reference found in a particular bucket of bytes is referring to something that can be meaningfully reinflated by the current process?

WebReflection commented 1 month ago

My assumption is that:

structuredClone deals with Blob and File already … interestingly enough, my poly cannot deal with those opaque types indeed
structuredClone can pass, via postMessage, those types too … admittedly I haven’t lurked implementation details behind the algorithm, but I’m assuming there’s a “share-nothing” there (or it’d be as vulnerable as SharedArrayBuffer?) so that data can safely be passed as binary
if previous assumptions are correct, I might understand the argument around the opaque entity suddenly somehow revealed, but I’m not sure that’s a real-world issue as File and Blob can be posted via forms anyway (unless I’m missing the bigger issue/worry behind)

it’s true though that in this issue implications for Blob or File are nowhere mentioned or explained, for what I could read, but hopefully it’s clear now where I come from, what I’m interested in (expose somehow the in/out process of that algorithm) so maybe I’ve answered part of your question?

Kaiido commented 1 month ago

so that data can safely be passed as binary

I believe this is where the misunderstanding of the issue comes from. In case of a Blob, no data is copied, the Blob object itself is just a pointer to another location where the data is supposed to be accessible. For instance it can be a pointer to an actual file on the user's drive. When postMessaging to another context a new pointer to the same location is added "magically" on the new Blob instance, but that pointer wasn't serialized, it's all part of the "opaque" implementation.

Maybe the case of an OffscreenCanvas transferred from a DOM <canvas> element would be clearer? These need to keep an internal pointer to the DOM element they were transferred from so that they can be painted there. But how do you serialize a particular DOM element so that its deserialization points to the exact same node?

So for your case structuredClone actually does too much, it seems you want an API in between JSON and structuredClone that would serialize only JS objects and not platform objects.

WebReflection commented 1 month ago

But how do you serialize a particular DOM element so that its deserialization points to the exact same node?

if I read the MDN correctly IndexedDB should be able to do that ... how can it restore an opaque entity if that's gone and the pointer wouldn't have the original reference?

it seems you want an API in between JSON and structuredClone that would serialize only JS objects and not platform objects.

the polyfill I am using (and maintaining) can deal with all structuredClone capable data ... except:

not supported yet: Blob, File, FileList, ImageBitmap, ImageData, and ArrayBuffer; typed arrays are supported without major issues, but u/int8, u/int16, and u/int32 are the only ones safely supported (right now).

What I would need, in an ideal world, is everything but Blob which is the only case I understand problematic.

This "can't Blob" could be a limitation of the new Serializer primitive I've proposed before. If there are other opaque use cases, it'd be OK to not have those either, as long as everything else doesn't require a conversion to string after crawling data to find and solve recursion and back from string to then re-define the original data ... there's nothing optimal in this process, it's just a workaround but it's a needed one to survive cross realm or Atomics wait right now which is why I am exposing non standard utils too.

whatwg / html