whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

API for structured serialized data #3517

Open devsnek opened 6 years ago

devsnek commented 6 years ago

Problem

JSON is being left in the dust as we get more and more stuff for JS, and it probably won't be getting any updates.
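
A minimal sketch of that gap, assuming a runtime with BigInt support (the sample object is made up for illustration and is not from the original post):

```javascript
// Values that JSON cannot represent faithfully.
const value = {
  when: new Date(0),
  lookup: new Map([['a', 1]]),
  tags: new Set(['x']),
  missing: undefined,
};

const roundTripped = JSON.parse(JSON.stringify(value));

console.log(typeof roundTripped.when);  // the Date came back as a string
console.log(roundTripped.lookup);       // the Map came back as an empty object
console.log('missing' in roundTripped); // the undefined property was dropped

// BigInt is not merely lossy: JSON.stringify rejects it outright.
let bigintError = null;
try {
  JSON.stringify({ n: 10n });
} catch (err) {
  bigintError = err;
}
console.log(bigintError instanceof TypeError);
```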

Goals

Intuitions

Prior art

Other considerations

bmeck commented 6 years ago

Symbol.for should be cross realm safe, idk if we can show exceptions on the differences for these versus other symbols.

macdja38 commented 6 years ago

Would be nice to have circular references for something like this as well.

guest271314 commented 6 years ago

One possible option would be creating dynamic modules which could be exported and imported https://github.com/w3c/FileAPI/issues/97#issuecomment-366764094.

devsnek commented 6 years ago

@guest271314 this format shouldn't perform any evaluation (all stored data should be completely static)

guest271314 commented 6 years ago

The data can be stored as text and only evaluated when necessary.

devsnek commented 6 years ago

this format cannot rely on any form of evaluation; it presents inherent security and performance concerns.

guest271314 commented 6 years ago

Is a goal to encrypt the data? Plain text could be converted to and from an ArrayBuffer. Are import and export insufficient? What are the use cases?

devsnek commented 6 years ago

the goal here is a static format for storing serialized javascript values analogous to JSON but with support for many more types. this api would most likely be similar to https://nodejs.org/api/v8.html#v8_serialization_api but without the class or transferables support.
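
For reference, a small sketch of the Node.js API linked above. It assumes a Node.js runtime (this is not available in browsers), and the sample data is invented for illustration:

```javascript
const v8 = require('node:v8');

const original = {
  when: new Date(0),
  lookup: new Map([['a', 1n]]),
  bytes: new Uint8Array([1, 2, 3]),
};

// serialize() returns a Node.js Buffer; deserialize() rebuilds the value from it.
const buffer = v8.serialize(original);
const copy = v8.deserialize(buffer);

console.log(Buffer.isBuffer(buffer));
console.log(copy.when instanceof Date);
console.log(copy.lookup.get('a') === 1n);
console.log(copy.bytes instanceof Uint8Array);
```

No code is evaluated on the way back in: the buffer only describes data, which is the property asked for earlier in the thread.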

guest271314 commented 6 years ago

Map and WeakMap do not meet requirement?

jakearchibald commented 6 years ago

It feels like the intent here is to expose https://html.spec.whatwg.org/multipage/structured-data.html#serializable-objects as some kind of format.

devsnek commented 6 years ago

@jakearchibald indeed that is what inspired me to open this

Jamesernator commented 5 years ago

There might be security considerations for platform objects (step 18) that are [Serializable] as they might expose data that isn't usually exposed to user-land.

pierogitus commented 3 years ago

I think this is worth another look. As I just mentioned in the fetch repo, CBOR with tags is well suited to this problem. It's also part of WebAuthn so there must be some CBOR logic already happening in all the major browsers.

WebReflection commented 1 month ago

Symbol.for should be cross realm safe

it's not, I had to implement my own orchestration to make that useful cross realm. It works, but it's not ideal.

I think this is worth another look.

I work with this algorithm daily and I am super sad to see my "yet another beg for it", with reasons and use cases behind it, being just shut down due to this long-standing, stale issue that is going nowhere 😄

WebReflection commented 1 month ago

If anyone is still following this ... what is blocking Chromium based browsers from offering what NodeJS has been offering for a long time?

This API is allowed where security usually matters the most, the back end, and it's not flagged as deprecated or anything; security concerns are not even mentioned. It would just allow users on the Web, too, to store JS values as buffers and get these back without needing user-land projects to do that (I maintain the structured-clone polyfill, which offers a toJSON and fromJSON convention, but it's weird that I need to offer that because the platform cannot).

It would be lovely from this group to explain what are the caveats, blockers, security concerns, or issues, for something that internally seems to be already implemented and used outside the Web ... I love providing polyfills and yet I always can't wait for these to be redundant, unnecessary, just overhead with modern features offered by the Web.

Thank You!

WebReflection commented 1 month ago

It would be lovely from this group to explain what are the caveats, blockers, security concerns, or issues ...

I don't want to be the sloppy one here so I've read this thread and I'd like to summarize my thoughts as a user.

Symbol.for should be cross realm safe

agreed, but the same goes for all well-known symbols, as I already do in my code. Any well-known symbol (Symbol.iterator and friends) can be mapped by name and re-created at the target from that realm's own Symbol; this works well and it's the easy part.
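
One way such a mapping could look, as a rough sketch: encodeSymbol, decodeSymbol and the descriptor shape are hypothetical names made up here, not the code referred to above. Registered symbols travel as their Symbol.for key, well-known symbols travel as their property name on Symbol, and unique symbols are rejected because nothing identifies them on the other side.

```javascript
// Hypothetical helpers: encode a symbol as a plain descriptor that can cross
// a realm or worker boundary, then resolve it again on the receiving side.
const WELL_KNOWN = new Map();
for (const name of Object.getOwnPropertyNames(Symbol)) {
  if (typeof Symbol[name] === 'symbol') WELL_KNOWN.set(Symbol[name], name);
}

function encodeSymbol(symbol) {
  if (WELL_KNOWN.has(symbol)) {
    return { kind: 'well-known', name: WELL_KNOWN.get(symbol) };
  }
  const key = Symbol.keyFor(symbol); // undefined unless created via Symbol.for
  if (key !== undefined) return { kind: 'registered', key };
  throw new TypeError('Unique symbols cannot be identified outside their origin');
}

function decodeSymbol(descriptor) {
  return descriptor.kind === 'well-known'
    ? Symbol[descriptor.name]
    : Symbol.for(descriptor.key);
}

// The descriptor is plain data, so it survives any text or binary transport.
const wireForm = JSON.stringify(encodeSymbol(Symbol.for('app.key')));
console.log(decodeSymbol(JSON.parse(wireForm)) === Symbol.for('app.key'));
```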

Would be nice to have circular references for something like this as well.

My polyfill handles circular references already and I think (???) v8.serialize does that too ... after all, anything structuredClone based gives that feature out of the box, as specified, right?
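
A quick check of the structuredClone part of that claim, assuming a runtime that exposes structuredClone globally (recent browsers, Node.js 17+); the object shape is invented for illustration:

```javascript
const node = { name: 'root', children: [] };
node.self = node;                      // direct cycle
node.children.push({ parent: node });  // cycle through a nested object

// JSON refuses cyclic structures.
let jsonError = null;
try {
  JSON.stringify(node);
} catch (err) {
  jsonError = err;
}

// structuredClone() preserves them: the cycles point at the clone, not the original.
const clone = structuredClone(node);

console.log(jsonError instanceof TypeError);
console.log(clone !== node);
console.log(clone.self === clone);
console.log(clone.children[0].parent === clone);
```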

Plain text could be converted to and from an ArrayBuffer. Are import and export insufficient? What are the use cases?

My use case, which is in production already, is the following:

Most programming languages that target WASM are better off the main thread because they block on bootstrap: WASM blocks on bootstrap, and everything blocks on bootstrap when it is not "that tiny". As a result, many of these languages are choosing the fully async Worker route to ship their runtime, but that easily fails for anything REPL-like. You can see a fully working MicroPython REPL here, make it a Pyodide if you like, and see the main thread is never blocked.

I hope one of the use cases is clear here:

Map and WeakMap do not meet requirement?

I believe an awesome achievement would be to provide any valid type supported by the algorithm and nothing else ... there Map would be safe.

It feels like the intent here is to expose https://html.spec.whatwg.org/multipage/structured-data.html#serializable-objects as some kind of format.

Yes and no. Operations here are per-browser and don't need to be universally the same ... meaning there's no need to agree on a standard buffer result to me, any vendor is free to use the convention they like or use already internally.

This could be an enabler for the feature to land sooner rather than later, as no bike-shedding is needed for the intermediate buffer:

I hope these thoughts make sense.

There might be security considerations for platform objects (step 18) that are [Serializable] as they might expose data that isn't usually exposed to user-land.

This one concerns me but I wonder why that's not an issue when using postMessage ... any way to expand on this concern? Thanks!

Hopefully I've grabbed all relevant topics and maybe helped move this forward ... it's just a hope, I can take no for an answer, I just would like to understand the why, as nothing seems to be missing behind the scenes.



annevk commented 1 month ago

Operations here are per-browser and don't need to be universally the same

FWIW, that's a non-starter. People will write code that depends on some serialization format sooner rather than later. That's also what makes this hard, it needs to be a universal format that has buy-in from all parties.

What makes platform objects hard: e.g., a Blob or File has a wildly varying data model that often involves a pointer to some disk-backed data structure. Actually, probably similar for SharedArrayBuffer? When the serialization format is opaque as it is with postMessage() and cannot be constructed with arbitrary inputs you run into none of this.

All of this might be solvable, but it's a fairly large undertaking that compared to other issues hasn't gotten an awful lot of traction.

WebReflection commented 1 month ago

@annevk I hear you, but that's why I think that should rather be "the starter", or this won't ever happen (or it'll take forever).

On the other hand, we have already tons of unpredictable API results on the Web:

All I am saying is that this feature would benefit a lot of projects that understand the caveats around it; it's like asking JSON.parse to understand php.serialize(value) (metaphorically speaking). But if that's a non-starter for everyone instead, how can we start a conversation about a reasonable format able to represent and satisfy cross-engine requirements?

After 1+ year working with WASM I've learned everyone is using their own convention around FinalizationRegistry and whatnot to make it happen, and that actually worked to date ... so here I am asking: what is the use case to make it cross-browser when presented use-cases don't need that and at the documentation level we can all say "you can't do this or that" like it's already the case for many other APIs? Thanks.

WebReflection commented 1 month ago

@annevk last from me .... could FlatBuffers be a starting discussion point to provide such an API? It's already x-platform/browser and IIRC implemented by most vendors for one reason or another ... I just think that if the "agreement on the format" is what's blocking this, we have previous work around similar topics and FlatBuffers seemed to address most issues (personal experience with a company that implemented those).

ggoodman commented 1 month ago

FWIW, that's a non-starter. People will write code that depends on some serialization format sooner rather than later. That's also what makes this hard, it needs to be a universal format that has buy-in from all parties.

@annevk is there a world in which the desired serialization format can be specified as an argument? That would potentially allow the delivery of immediate value with a vendor-specific format without compromising the longer-term goal of shipping a vendor-neutral format.

annevk commented 1 month ago

No, part of what makes standardization hard is that you have to think through and solve for the edge cases as you will be stuck with it essentially forever.

WebReflection commented 1 month ago

@annevk if that argument though is the reason this issue has been stuck for 6 years, is it necessary and productive to block intermediate pragmatic approaches? ‘cause the result is otherwise no progress, and the issue being stuck “forever” due to arguments about not wanting such issue to be stuck forever … I’m seeing a catch-22 / dead loop here and I’m trying to propose “no need to standardize the format, keep it opaque and move forward” but also “how about FlatBuffers to start moving forward standardizing it?”

annevk commented 1 month ago

I think the reason it's been stuck is because in part there's not enough web developer demand for this functionality and mostly because nobody has taken it upon themselves to try to solve it. Having a serializer and deserializer though where the intermediate format is exposed but implementation-defined is just not something that I see succeeding. Implementation-defined behavior needs to be extremely well motivated and this does not meet that bar at all.

WebReflection commented 1 month ago

so you are saying v8.serialize and v8.deserialize are something implemented for the sake of it?

the thing is, Atomics and SharedArrayBuffer got low adoption after Meltdown and Spectre due to tons of friction around these primitives to start with ... I've found a way to circumvent those issues without even needing special headers, so it's time to make these primitives shine again, but of course as long as perf is subpar, nobody will use them ... for those who do anyway, having these "niche" APIs working well together is crucial, so it's another catch-22 to me ... nobody wants to use features nobody needs because they don't know they might need such features. After all, before Atomics or SharedArrayBuffer existed, who was proposing these APIs? I hope the answer is not "some internal" or "some member of the group" because there's no way through that from a user's perspective.

Again, I am not trying to be hostile or anything, but not wanting an API because it doesn't exist yet, so that not even people already using all the bricks around it can say "but there are use cases!", feels off from a Web standards user's perspective.

I am trying to propose valid use cases that already exist out there (we collaborate with Universities too and we use all these primitives behind the scenes) and trying to unblock this by proposing formats that are already known, such as FlatBuffers ... what else can a user do, as you mentioned it's my fault my interactions here are not productive? I don't see a way around or forward and it saddens me. This has been open ... since 2018 ... use cases have only increased since then, not decreased; we wouldn't be here discussing this otherwise.

devsnek commented 1 month ago

I think if you want this api to exist you will need to convince individual whatwg members that it would be a good idea and get them to implement it or implement it for them (note that this can be difficult, they might each say "we'll do it if another browser does it first" for example) and from that effort you can put together a spec and a test suite.

WebReflection commented 1 month ago

@devsnek fair enough ... but again, v8.serialize is there ... I haven't investigated if structuredClone uses it, but if it does (and it should in Chromium?) I just don't know how to convince people an API already used internally is useful externally too. If that's not the case though, I might check Chromium internals and see where the catch/deal around it is. Still, if the argument is "not enough users showed interest around this API" my counter argument would be "by reading this thread/issue, they wouldn't dare/care about asking further" which is the part that saddens me.

WebReflection commented 1 month ago

This conversation is going in parallel at TC39 too and that summarizes my latest thoughts around this matter ... here again for the WHATWG audience:


We have already Compression and Decompression Streams where the user is in charge of picking deflate over gzip (too bad brotli is not an option) so, if there is previous work around this topic, we can let the user decide which "transformer" is desired as long as all of them are compatible with structuredClone types?

Internally, all browsers already have a preferred (ad-hoc) choice for that, so that the API I can see is something like:

const serializer = new Serializer('CBOR' || 'syrup' || 'default');
// default means ... whatever the current browser/engine can provide itself

const buffer = serializer.serialize(anyStructuredCloneFriendlyData);
const clone = serializer.unserialize(buffer);

It doesn't even need to be synchronous for Atomics.wait use cases, as it can land async and then be resolved into the SharedArrayBuffer, so anything similar would work for me, plus it does answer a few points:

I hope this opens a chance to at least think about a similar API that can be incrementally landed so that users of the first kind, the default one (or call it internal, temporary or unstable or not-portable) can live happily ever after.
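
To make the shape of that proposal concrete, here is a rough sketch only. Serializer and its format names do not exist in any runtime; this version is an assumption-laden stand-in that runs on Node.js and backs the 'default' format with the v8 module, playing the role of "whatever the engine already has". Throwing a TypeError for an unknown format is an arbitrary choice made for the sketch.

```javascript
const v8 = require('node:v8');

// Hypothetical API: neither this class nor its format names are specified anywhere.
class Serializer {
  // Registry of format name -> codec. Only 'default' is wired up here;
  // 'CBOR' or 'syrup' would be further entries if they were ever agreed on.
  static formats = new Map([
    ['default', {
      encode: (value) => v8.serialize(value),
      decode: (bytes) => v8.deserialize(bytes),
    }],
  ]);

  constructor(format = 'default') {
    const codec = Serializer.formats.get(format);
    if (!codec) throw new TypeError(`Unsupported format: ${format}`);
    this.codec = codec;
  }

  serialize(value) {
    return this.codec.encode(value);
  }

  unserialize(bytes) {
    return this.codec.decode(bytes);
  }
}

const serializer = new Serializer('default');
const buffer = serializer.serialize(new Map([['answer', 42]]));
const clone = serializer.unserialize(buffer);
console.log(clone.get('answer'));
```

The format argument is the part that would allow landing incrementally: callers who ask for 'default' accept a non-portable buffer, while a named format would only become available once it is agreed on.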

pshaughn commented 1 month ago

I feel like comments in this issue thread are coming from at least two subtly different expectations, and disambiguating them might help. Here's a question that might reveal some implicit assumptions: what does it mean to do this to a Blob or File?

If the idea is cold storage or cross-network communication, does that imply it wants to serialize the contents of the Blob or File (which is no longer really doing the same thing structuredClone does)?

If the idea is to mimic exactly what structuredClone does but into a plain bucket of bytes, how do we know whether a Blob or File reference found in a particular bucket of bytes is referring to something that can be meaningfully reinflated by the current process?

WebReflection commented 1 month ago

My assumption is that:

it’s true though that in this issue implications for Blob or File are nowhere mentioned or explained, for what I could read, but hopefully it’s clear now where I come from, what I’m interested in (expose somehow the in/out process of that algorithm) so maybe I’ve answered part of your question?

Kaiido commented 1 month ago

so that data can safely be passed as binary

I believe this is where the misunderstanding of the issue comes from. In case of a Blob, no data is copied, the Blob object itself is just a pointer to another location where the data is supposed to be accessible. For instance it can be a pointer to an actual file on the user's drive. When postMessaging to another context a new pointer to the same location is added "magically" on the new Blob instance, but that pointer wasn't serialized, it's all part of the "opaque" implementation.

Maybe the case of an OffscreenCanvas transferred from a DOM <canvas> element would be clearer? These need to keep an internal pointer to the DOM element they were transferred from so that they can be painted there. But how do you serialize a particular DOM element so that its deserialization points to the exact same node?

So for your case structuredClone actually does too much, it seems you want an API in between JSON and structuredClone that would serialize only JS objects and not platform objects.

WebReflection commented 1 month ago

But how do you serialize a particular DOM element so that its deserialization points to the exact same node?

if I read the MDN correctly IndexedDB should be able to do that ... how can it restore an opaque entity if that's gone and the pointer wouldn't have the original reference?

it seems you want an API in between JSON and structuredClone that would serialize only JS objects and not platform objects.

the polyfill I am using (and maintaining) can deal with all structuredClone capable data ... except:

not supported yet: Blob, File, FileList, ImageBitmap, ImageData, and ArrayBuffer. Typed arrays are supported without major issues, but u/int8, u/int16, and u/int32 are the only ones safely supported (right now).

What I would need, in an ideal world, is everything but Blob, which is the only case I understand to be problematic.

This "can't Blob" could be a limitation of the new Serializer primitive I've proposed before. If there are other opaque use cases, it'd be OK not to have those in either, as long as everything else doesn't require crawling the data to find and resolve recursion, converting it to a string, and then going back from that string to re-define the original data ... there's nothing optimal in this process, it's just a workaround, but it's a needed one to survive cross realm or Atomics.wait right now, which is why I am exposing non-standard utils too.
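
To show what that crawl-then-stringify workaround involves, here is a deliberately small sketch. It is not the polyfill's actual code: flatten and inflate are names invented here, and only plain objects, arrays and JSON-safe primitives are handled. Every value becomes a record, and containers refer to their members by index, so a cycle is just an index that points back.

```javascript
// Crawl the value and emit one record per distinct value; returns a string.
function flatten(value) {
  const records = [];
  const seen = new Map(); // container -> index, resolves repeated and cyclic references

  const visit = (item) => {
    if (seen.has(item)) return seen.get(item);
    const index = records.length;
    if (Array.isArray(item)) {
      seen.set(item, index);
      const slots = [];
      records.push(['array', slots]);
      for (const entry of item) slots.push(visit(entry));
    } else if (item !== null && typeof item === 'object') {
      seen.set(item, index);
      const pairs = [];
      records.push(['object', pairs]);
      for (const [key, entry] of Object.entries(item)) pairs.push([key, visit(entry)]);
    } else {
      records.push(['primitive', item]);
    }
    return index;
  };

  visit(value);
  return JSON.stringify(records);
}

// Rebuild the value: create empty containers first, then wire up references.
function inflate(text) {
  const records = JSON.parse(text);
  const values = records.map(([kind, payload]) =>
    kind === 'array' ? [] : kind === 'object' ? {} : payload);
  records.forEach(([kind, payload], index) => {
    if (kind === 'array') for (const ref of payload) values[index].push(values[ref]);
    if (kind === 'object') for (const [key, ref] of payload) values[index][key] = values[ref];
  });
  return values[0];
}
```

Both directions walk the whole graph and go through a string, which is the overhead described above; a native serializer could skip the intermediate text entirely.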