whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/
Other
8.19k stars 2.71k forks source link

Add `getInnerHTML()` that optionally serializes declarative Shadow DOM #8867

Closed mfreed7 closed 8 months ago

mfreed7 commented 1 year ago

This issue is split out from the discussion that starts roughly here https://github.com/whatwg/html/pull/5465#discussion_r579593543. The original Declarative Shadow DOM spec PR included this functionality, which allowed a DOM sub-tree to be serialized in a way that doesn't throw away all of the contained shadow roots. To facilitate landing that PR, this functionality was removed. I'd like to use this issue to resume that discussion.

The problem statement is that DOM content containing developer shadow roots (i.e. not UA shadow roots) cannot be serialized. Doing so today requires a manual tree walk in JS land, manually grabbing shadow roots and serializing them into <template shadowrootmode> elements.

The ideal solution would be for const html = element.innerHTML; to just start serializing shadow roots like it already serializes the rest of the DOM into HTML. However, this is almost surely not web compatible. One option is to provide an attribute on Document and/or ShadowRoot that opts-in to this serialization behavior. However, given the recent push to define new parser entrypoints that are functions instead of attributes, it seems to make more sense to also define a new functional getter that optionally (or always?) serializes shadow roots:

const html = element.getInnerHTML({includeShadowRoots: true});
// html contains `<template shadowrootmode=...>` for any contained "open" shadow roots.

To preserve the encapsulation of "closed" shadow roots, another option would be needed in the case that "closed" shadow roots want to be serialized:

const html = element.getInnerHTML({
  includeShadowRoots: true,
  closedRoots: [shadowRoot1, shadowRoot2, ...]
});

Thoughts? Concerns?

hsivonen commented 1 year ago

innerHTML is unfortunate in that it changes behavior depending on the HTML vs. XML bit of the owning document. Having this API that says "HTML" always serialize as HTML (and the "set" counterpart parse as HTML), then having XML counterparts, and having these on all nodes regardless of the HTML vs. XML bit of the owning document would probably be a less problematic design than the old innerHTML behavior that varies with the HTML vs. XML bit of the owner doc, since a given method would always do the same thing regardless of owner-document-level state.

hsivonen commented 1 year ago

Oops, the previous comment went to the wrong issue. For this issue, being able to serialize what can be parsed makes sense for symmetry.

mfreed7 commented 1 year ago

Oops, the previous comment went to the wrong issue. For this issue, being able to serialize what can be parsed makes sense for symmetry.

Thanks for the (positive) feedback! Any thoughts on the proposed shape of the API?

mfreed7 commented 1 year ago

Now that the DSD patches have all landed, can we come back to this one? Any objections to adding:

const html = element.getInnerHTML({
  includeShadowRoots: true,
  closedRoots: [shadowRoot1, shadowRoot2, ...]
});

?

annevk commented 1 year ago

So the new parser entry points are called setHTMLUnsafe() and setHTML(). For symmetry I'd expect this to be called getHTML(), assuming we don't need the safe/unsafe distinction.

As for the shape of the API, I don't like that it makes a distinction between closed and open shadow roots. I thought that many years ago we reached agreement that what you need for closed shadow roots we'd also use for open.

mfreed7 commented 1 year ago

So the new parser entry points are called setHTMLUnsafe() and setHTML(). For symmetry I'd expect this to be called getHTML(), assuming we don't need the safe/unsafe distinction.

Yep, makes sense to me. I don't believe (?) there's any need for an "unsafe" version, at least that I can see.

As for the shape of the API, I don't like that it makes a distinction between closed and open shadow roots. I thought that many years ago we reached agreement that what you need for closed shadow roots we'd also use for open.

The problem is that providing a list of the contained roots is fairly cumbersome. You have to walk the entire tree, before getHTML then walks it again. I can see this performance penalty alone, let alone the ergonomics, being a developer complaint. How about:

const html = element.getHTML({
  includeShadowRoots: true,
});

which just serializes them all? That certainly treats them equally.

mfreed7 commented 1 year ago

I thought that many years ago we reached agreement that what you need for closed shadow roots we'd also use for open.

Incidentally, can you link me to that agreement? I see it cited often, but I don't quite understand the rationale. I'd love to read the back-story. (I was not part of the "we" in that sentence.)

hsivonen commented 1 year ago

Oops, the previous comment went to the wrong issue. For this issue, being able to serialize what can be parsed makes sense for symmetry.

Thanks for the (positive) feedback! Any thoughts on the proposed shape of the API?

The shape looks OK. I think it would be worthwhile to write down that if a way to obtain the XML serialization was added, it should be a second method and not a flag on the option bag (for easy feature detectability and for not having something as big as selection of the serialization syntax as an option bag item). We don't need to add an XML counterpart right now absent evidence of Web developer demand, but I think we should think through and write down how we'd go about introducing an XML counterpart if demand shows up.

hsivonen commented 1 year ago

The shape looks OK.

I should clarify, that I meant the shape on the level of 1) new (relative to innerHTML) HTML serialization getter, 2) it being about the HTML serialization (potential XML serialization as a distinct method), and 3) tunables within the HTML serialization as an option bag.

I didn't review the specifics of the tunable options.

mfreed7 commented 1 year ago

The shape looks OK. I think it would be worthwhile to write down that if a way to obtain the XML serialization was added, it should be a second method and not a flag on the option bag (for easy feature detectability and for not having something as big as selection of the serialization syntax as an option bag item). We don't need to add an XML counterpart right now absent evidence of Web developer demand, but I think we should think through and write down how we'd go about introducing an XML counterpart if demand shows up.

Thanks! I agree with your points about XML. Where do you think we should write this down, just as a Note in the spec?

Coming back to the shape, what about this (small) tweak?

const html = element.getHTML({
  shadowRoots: [shadowRoot1, shadowRoot2, ...]
  includeAllShadowRoots: true,
});

The idea would be:

el.getHTML({shadowRoots: [root1, root2, ...]}) // Only serializes into `root1`, `root2`, ... and skips other roots. Roots can be open or closed here.
el.getHTML({includeAllShadowRoots:true}) // Serializes all open shadow roots, and no closed ones (because they're closed).
el.getHTML({includeAllShadowRoots:true, shadowRoots: [root1, ...]}) // Serializes all open shadow roots plus anything in the shadowRoots array.

I'm just trying to avoid the problem that this API should be usable without having to do full recursive DOM traversal first. I'm trying to incorporate the "open and closed roots should be the same" feedback, admittedly without completely understanding it.

mfreed7 commented 1 year ago

So we discussed this at the https://github.com/whatwg/html/issues/9937 meeting, and came to rough consensus on a model like this:

  1. add an option to attachShadow() and <template shadowrootmode> to allow the shadow root to say "I opt in to serialization of this shadow root without providing a reference to me".
  2. provide a boolean option to getHTML() that says "please serialize opted-in shadow roots". The default for this option is false.
  3. provide an option to getHTML() containing a sequence of shadowRoots that should be serialized, independent of whether they opted in via the flag above.

Assuming the above is correct, I propose this set of names:

element.attachShadow({serializable: true});  // (1)
// Or <template shadowrootmode serializable> // (1)
const html = element.getHTML({
  includeSerializable: true,   // (2)
  shadowRoots: [shadowRoot1, shadowRoot2, ...]  // (3)
});

Comments? Suggestions for better names?

annevk commented 1 year ago

I think includeShadowRoots (which defaults to true if you supply shadowRoots and throws if you set it to false while also supplying shadowRoots) makes more sense than includeSerializable.

We probably want to have ShadowRoot.prototype.serializable as well for introspection (similar to the clonable suggestion).

mfreed7 commented 1 year ago

I think includeShadowRoots (which defaults to true if you supply shadowRoots and throws if you set it to false while also supplying shadowRoots) makes more sense than includeSerializable.

Yep - good suggestions, both.

We probably want to have ShadowRoot.prototype.serializable as well for introspection (similar to the clonable suggestion).

I agree.

mfreed7 commented 11 months ago

Quick update to include the last set of comments:

element.attachShadow({serializable: true});  // (1)
// Or <template shadowrootmode serializable> // (1)
const html = element.getHTML({
  includeShadowRoots: true,   // (2)
  shadowRoots: [shadowRoot1, shadowRoot2, ...]  // (3)
});

// Examples
element.getHTML(); // does not serialize shadow roots
element.getHTML({includeShadowRoots: true}); // returns roots whose root.serializable is true
element.getHTML({shadowRoots: roots}); // also returns roots whose root.serializable is true
element.getHTML({includeShadowRoots: false, shadowRoots: roots}); // throws an exception
element.shadowRoot.serializable; // Returns boolean

If no objections, I'll put up a spec PR to add this.

mfreed7 commented 10 months ago

I'd like to clarify two things that came up as part of the Chromium review of the changes listed above.


First is that in this example:

element.getHTML({includeShadowRoots: true}); // returns roots whose root.serializable is true

I assumed, but would like to confirm, that closed roots, even with serializable set to true, will not be returned by the code above. The only way to serialize a closed shadow root is to specify it explicitly using shadowRoots:[root].


The second question is about the includeShadowRoots argument. I had assumed it was some kind of tri-state:

element.getHTML();                           // A) does not serialize *any* shadow roots
element.getHTML({includeShadowRoots: true}); // B) serializes all roots marked as serializable
element.getHTML({shadowRoots: roots});       // C) serializes the specific `roots` provided
element.getHTML({includeShadowRoots: false, shadowRoots: roots}); // D) throws an exception

The question is related to example C). Does providing shadowRoots effectively set includeShadowRoots to true or false? As I wrote the code, this behavior felt like it would be weird and surprising:

  1. the default for includeShadowRoots is clearly false, from A).
  2. providing shadowRoots (as in C) silently sets includeShadowRoots to true which will result in additional (non-specified) roots being serialized.

From a developer point of view, the least surprising behavior would seem to be that the default for includeShadowRoots is false, and providing shadowRoots doesn't change that. So in C), only the provided roots are serialized, and not others that are marked as serializable. To get those, use element.getHTML({includeShadowRoots: true, shadowRoots: roots}). The only weird inconsistency would then be D) which throws an exception if includeShadowRoots is explicitly set to false while also providing shadowRoots. That makes includeShadowRoots kind of tri-state. That's fine with me, but I wanted to confirm.

annevk commented 10 months ago

I see, your assumption is that when you specify particular shadow roots (with shadowRoots), you're not necessarily also interested in serializable shadow roots (includeShadowRoots)? I feel ambivalent to that reframing, but I think we should rename includeShadowRoots to serializableShadowRoots if we go that way.

mfreed7 commented 10 months ago

I see, your assumption is that when you specify particular shadow roots (with shadowRoots), you're not necessarily also interested in serializable shadow roots (includeShadowRoots)? I feel ambivalent to that reframing, but I think we should rename includeShadowRoots to serializableShadowRoots if we go that way.

I'm kind of ambivalent also. It just feels a little odd that getHTML() returns none, and getHTML({shadowRoots: list}) returns list plus others. Sounds like you're ok with my proposed behavior, as long as there's a rename? @smaug---- any thoughts?

mfreed7 commented 9 months ago

I'm kind of ambivalent also. It just feels a little odd that getHTML() returns none, and getHTML({shadowRoots: list}) returns list plus others. Sounds like you're ok with my proposed behavior, as long as there's a rename? @smaug---- any thoughts?

Alright, I've put up HTML and DOM spec PRs for this behavior. It's on the larger side of the spec PRs I've written, so please go easy on me. Suggestions for improvement are welcome.

One thing: having written it, with the rename to serializableShadowRoots, it now feels weird to throw when serializableShadowRoots is false and shadowRoots is non-empty. It feels like perhaps this should be the behavior instead:

element.getHTML();                                // A) does not serialize *any* shadow roots
element.getHTML({serializableShadowRoots: true}); // B) serializes all roots marked as serializable
element.getHTML({shadowRoots: roots});            // C) serializes the specific `roots` provided
element.getHTML({serializableShadowRoots: false, shadowRoots: roots}); // D) same as (C)

I.e. without (D), it'll be quite awkward to have to remove the serializableShadowRoots member in order to just serialize specific roots. I also don't see the danger or need for an exception.

annevk commented 9 months ago

Yeah, that makes sense.

annevk commented 9 months ago

Thinking about this method more I have a concern. On the parsing side we get to choose what parser to use and therefore can enforce HTML. And therefore setHTML[Unsafe]() are good names. However, I doubt that is the plan on the serialization side?

This will need some thought. If we just want to reuse the existing serialization algorithm maybe serialize() is a reasonable name. Enforcing HTML serialization seems hard. (What if someone inserts a CDATASection, ProcessingInstruction, or non-HTML element?)

@zcorpan might have thoughts about this. Also curious to hear from @hsivonen and @smaug----.

mfreed7 commented 9 months ago

I.e. without (D), it'll be quite awkward to have to remove the serializableShadowRoots member in order to just serialize specific roots. I also don't see the danger or need for an exception. Yeah, that makes sense.

Great, thanks, I've made that change to the PR.

Thinking about this method more I have a concern. On the parsing side we get to choose what parser to use and therefore can enforce HTML. And therefore setHTML[Unsafe]() are good names. However, I doubt that is the plan on the serialization side?

Well, in the PR I explicitly run the HTML fragment serialization algorithm, which should correctly serialize a contained foreign element, right? I suppose we could rename the method to serialize() and instead call DOMParsing's fragment serializing algorithm. I kind of liked the parallelism to setHTML[Unsafe]() here. But I don't have terribly strong opinions.

smaug---- commented 9 months ago

I'd assume this work similarly to innerHTML getter in HTML documents, if there are things like ProcessingInstructions

annevk commented 9 months ago

innerHTML's getter has an "XML code path". Given that we explicitly opted to not give setHTML() such a code path, it would be strange for getHTML() to have one. And I think it would also be strange if it worked on non-HTML elements but didn't serialize things such as namespaces.

domenic commented 9 months ago

For what it's worth it seems totally fine to me, and in fact desirable, to have a method that specifically runs the HTML fragment serialization algorithm. There are plenty of non-serializable things possible in the DOM, no matter what serializer you're using. I don't think we should be worried about what happens if authors use old discouraged things like namespaces or processing instructions, in combination with this new method.

zcorpan commented 9 months ago

@annevk https://w3c.github.io/DOM-Parsing/#dom-innerhtml-innerhtml is exists on all elements, but switches on the HTMLness of the document, so any strangeness is not new. https://software.hixie.ch/utilities/js/live-dom-viewer/saved/12394

I agree that getHTML() shouldn't have an XML code path, like @hsivonen said. Instead we can add getXML() if fragment serialization as XML is needed.

mfreed7 commented 9 months ago

innerHTML's getter has an "XML code path". Given that we explicitly opted to not give setHTML() such a code path, it would be strange for getHTML() to have one. And I think it would also be strange if it worked on non-HTML elements but didn't serialize things such as namespaces.

Right - as I mentioned, the PR points getHTML() to the HTML fragment serialization algorithm, so it'll be parallel to both setHTML() and calling innerHTML from within an HTML document.

Given the other comments, it sounds like maybe the PR is ok as-is then, modulo other issues?

mfreed7 commented 8 months ago

Thanks everyone!