But stepping back, the real API I want is to be able to create a tree of DOM and easily and cheaply get references to particular nodes in it.
This is the same thing I've discussed already, but here they are proposing something better: while you are parsing, you can intercept and pollute the stream at runtime. It's more powerful, but I understand the async pain point you made.
Here (https://github.com/whatwg/html/issues/2993#issuecomment-326547102) I've proposed a way to retrieve a UID from either attributes or content, which is, I believe, similar to the API you want for retrieving particular templates.
It does seem this issue has kind of exploded. Everyone has interpreted the OP as being interested in helping with their particular problem.
This is all interesting discussion, and I don't want to discourage it. But I do want to highlight that it's unlikely we'll end up solving all of these problems with a single API. The OP was specifically spun off of #2827, which is more about progressive rendering (thus, async, streams, and no template interpolation). I just want people to be aware that we may solve that separately, and leave templating and an integer-based instruction set for tree construction to other APIs.
Of course by now this thread is mostly about templating engines and their needs, so maybe it should be repurposed. We'll see where the discussion takes us :). Personally I am with @justinfagnani that template parts is the most promising direction so far for that particular problem.
I think @wycats has a reasonable point that if we want to provide a low-level parser API that can also be used in workers, the result of that needs to be some kind of changeset that can be applied to trees. That's also roughly how off-the-main-thread parsers in browsers need to be modeled today. Finding the primitives upon which the whole thing is built seems like the best idea to me given past frustration with higher-level alternatives that don't quite address the needs.
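For illustration only, here is a rough sketch of what replaying such a worker-produced changeset on the main thread might look like; the operation format and the `applyChangeList` name are invented for this example, not anyone's concrete proposal:

```js
// Hypothetical: a worker-side parse emits a flat list of tree-construction
// operations; the main thread replays them onto real nodes.
function applyChangeList(root, ops) {
  const nodes = [root]; // ops refer to nodes by index; index 0 is the root
  for (const op of ops) {
    switch (op.type) {
      case 'createElement':
        nodes.push(document.createElement(op.name));
        break;
      case 'setAttribute':
        nodes[op.target].setAttribute(op.name, op.value);
        break;
      case 'appendText':
        nodes[op.parent].append(op.value);
        break;
      case 'append':
        nodes[op.parent].append(nodes[op.target]);
        break;
    }
  }
}

// e.g. a change list producing <div class="a">hi</div>:
applyChangeList(document.createElement('section'), [
  { type: 'createElement', name: 'div' }, // becomes nodes[1]
  { type: 'setAttribute', target: 1, name: 'class', value: 'a' },
  { type: 'appendText', parent: 1, value: 'hi' },
  { type: 'append', parent: 0, target: 1 },
]);
```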
@annevk sure, if the point of this feature was workers, then a change list might make sense. But I am personally more interested in allowing the browser to do work in an asynchronous and streaming fashion, instead of focusing on workers. That async/streaming fashion could be potentially off-thread if that provides some benefit, but purely as an implementation detail.
For example, browsers are likely to start by just using the main thread, asynchronously, to prevent jank. Then later they may investigate using native threads, which they can do much more effectively than JS. JS can only put binary data in shared memory, whereas browsers could put actual C++ Node objects there. (Not that those are thread-safe today, but it's a possible future implementation strategy.)
Stated another way, I don't think it's correct to identify a change list as an underlying primitive. It's a new high-level API aimed at a very specific, new use case; it's not related to the primitives that currently underlie streaming HTML parsing.
What kinds of errors? @jakearchibald, @inikulin, et al.
I mean where the HTML spec says to "report an error." I strongly agree with @domenic et al. that we should not make a different/strict syntax for HTML. Whether/where errors are reported is a relatively unimportant detail; let's worry about it later.
@jakearchibald, I think we would need to build a controller which balances not only parsing and tree construction work, but append, style, layout and paint work. In the worst case, let's say the controller is terrible and behaves like innerHTML: is this feature still worth it? What's the point where it becomes compelling?
One benefit of keeping this API pretty high-level without hooks into substitution, states, etc. is that in future when there's something like "async append" the existing uses of the API could become async appends, just with the UA doing the commit ASAP.
Crudely, DOM Ranges can point to an element and an offset. The DOM doesn't have a way to point to tag names, attribute names or values. Additionally, tag names can't be changed later. So if hole finding is a thing, it needs to either happen as part of tokenization or have some restrictions placed on it to make sense.
The HTML templating system I worked on in 2007 got a lot of benefit out of requiring that a substitution did not flip the HTML parser into different modes. That system also cared about semantics, for example, that a given attribute value contains script. The HTML parser doesn't, but our (Blink's) built-in XSS protection knows which attribute names are event handlers and so on. A tokenization-level API supporting (or even requiring) those sorts of restrictions could be useful.
I think if you want to construct elements from a worker, HTML's text serialization is a hard format to write correctly and read correctly. This parsing API could help make the reading side of things better but doesn't do anything for writers. HTML's text serialization is primarily useful because there's a good chance you already have your data in this format.
@wycats' proposal for DOM tree construction command buffers is reminiscent of the HTML parser's operations, but you have to squint (appendHTML is a bit like "inject document.written bytes"). I'm not sure how usable HTML's thing would be as an API. You have to do attributes before content, for example. There are also some things missing: HTML knows the context it is parsing into before it starts (like "fragment parsing into a table", etc.) and has some wild operations (like "reconstruct the active formatting elements") and so on.
The thread here seems to assume a lot of context that is not stated explicitly. What should I read to learn about the use cases?
@dvoytenko , could you tell us more about your specific use case for the document.open/write/close code in #2827?
This thread has 3-4 different proposals, yet no clear goal or use cases for this API. What problem(s) are we trying to solve here?
@rniwa Quite a few sites, including GitHub, hijack link clicks and perform the navigation themselves to avoid reparsing/executing the same JavaScript on the next page. However, this can become a lot slower on long GitHub pages, as you lose the benefit of streaming with innerHTML. See https://jakearchibald.com/2016/fun-hacks-faster-content/.
Thinking of the extensible web manifesto, a streaming parser would expose this existing browser behaviour to JavaScript, without having to resort to iframe hacks.
If it could also expose some parser state mid-parse, it would help (although not completely solve) some template cases. But helping somewhat unrelated cases feels like a win in terms of the extensible web.
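For concreteness, here is the iframe hack in question, condensed from that post (the `/comments.inc` endpoint and `#content` target are placeholders):

```js
// A hidden same-origin iframe provides the streaming parser, and the element
// it parses into is adopted by the visible page.
const iframe = document.createElement('iframe');
iframe.style.display = 'none';
document.body.appendChild(iframe);

iframe.onload = async () => {
  iframe.onload = null;
  const doc = iframe.contentDocument;
  doc.write('<streaming-element>');
  // Adopt the element; the parser keeps appending into it after the move.
  document.querySelector('#content').appendChild(doc.querySelector('streaming-element'));

  const response = await fetch('/comments.inc');
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    doc.write(decoder.decode(value, { stream: true })); // parsed incrementally
  }
  doc.write('</streaming-element>');
  doc.close();
};
iframe.src = '';
```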
@domenic

> Personally I am with @justinfagnani that template parts is the most promising direction so far for that particular problem.

I agree that there are better targeted solutions for that particular case, but isn't this "appcaching" it? Offering low-level parser details feels like it would help more use cases.
@dominiccooney

> I think we would need to build a controller which balances not only parsing and tree construction work, but append, style, layout and paint work

I think that's compatible. If this thing supports streams, it can support back-pressure. Also, the "get parser state" method could return a promise that waits for the queued HTML to flush.
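As a purely hypothetical shape (every name here is invented for this sketch, not a proposal): a parser exposed as a writable stream would get back-pressure from the usual writer mechanics.

```js
// Invented API shape: a parser behind a WritableStream.
const parser = new HTMLParserStream({ context: document.querySelector('#target') });
const writer = parser.writable.getWriter();

await writer.ready;                    // back-pressure: wait for queue capacity
await writer.write('<p>hello');        // resolves when the chunk is accepted
const state = await parser.getParserState(); // resolves once queued HTML has flushed
```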
> Additionally, tag names can't be changed later. So if hole finding is a thing, it needs to either happen as part of tokenization or have some restrictions placed on it to make sense.
You could achieve this with some parser info:

```js
whatever`
<${foo} src="hi">
`;
```

After flushing `<`, the parser should know it's in the tag-open state.
@hsivonen
https://jakearchibald.com/2016/fun-hacks-faster-content/ might help.
FWIW I see this pattern as a footgun:

```js
whatever`
<${foo} src="hi">
`;
```

Attributes have different meanings according to the kind of node you put them on. It might play well with strings on the server side, but on the DOM side I don't see it as a must-have, quite the opposite.
What I mean is that the following, which at this point would presumably be allowed too, doesn't look good at all:

```js
whatever`
<${foo} ${attr}=${value}></${foo}>
`;
```
What if `foo` is a `br`, or any other void element, or vice versa (you wrote `<${foo} />` but it's not void)? IMO this goes a bit too far from what I (personally) have ever needed from a template parser/engine.
@WebReflection I agree that this kind of interpolation wouldn't make sense for HyperHTML, but it would make it easy to detect this situation and throw a meaningful error.
I'm much more interested in exposing existing browser internals to create new possibilities and make existing things easier, than creating a new inflexible API that solves one use-case.
it wasn't about hyperHTML, it was more about common sense.
What does the following produce?

```js
whatever`
<${bar} ${attr}=${value}>
<${foo} ${attr}=${value}></${foo}>
`;
```
It's absolutely unpredictable, and it's also XSS-prone, IMO. But surely I don't want to block anyone from exploring anything; I'm just thinking out loud about that pattern.
Any system that's piping text directly into a parser needs to be very careful with user input. Using parser state, the developer can pick the appropriate escaping method, or throw if it's a state they don't want/wish to support.
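As a sketch of what that could look like (the coarse state names and the `escapeFor` helper are hypothetical, not a proposed API):

```js
// Pick an escaper based on a (hypothetical) coarse parser state, or refuse.
function escapeFor(state, value) {
  switch (state) {
    case 'text':
      return String(value).replace(/&/g, '&amp;').replace(/</g, '&lt;');
    case 'attribute-value-double-quoted':
      return String(value).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
    default:
      // e.g. inside a <script>, a comment, or a tag name: refuse user input
      throw new TypeError(`Refusing to interpolate user input in state "${state}"`);
  }
}
```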
@jakearchibald from what I can tell you're still thinking of the parser states as a "real" thing, and as a low-level primitive. But as @inikulin pointed out, they're not really primitives, they're just implementation details and spec devices we use to navigate through the algorithm. I also don't think we should expose them.
@dominiccooney We use an inactive document's open/write to stream shadow DOM. We display relatively big documents in shadow roots, and streaming helps a lot with perceived latency. The way it works is this (sketched in code below):

a. We create a buffer (inactive) document via `document.implementation.createHTMLDocument` and call `open` on it.
b. For each new chunk arriving via XHR, we call `document.write` on the buffer document.
c. We do some preprocessing on `bufferDocument.head`.
d. For each new `bufferDocument.body.firstChild`, we move it to the real attached shadow root.

These steps achieve rather good perceived streaming performance. Once a node is moved, the subsequent streaming happens in the shadow root. It works much like https://jakearchibald.com/2016/fun-hacks-faster-content/ suggests, but we don't want to use iframes.
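A minimal sketch of those steps, assuming a `host` element for the shadow root and some transport calling `onChunk` per XHR chunk:

```js
const bufferDoc = document.implementation.createHTMLDocument('');
bufferDoc.open(); // (a) inactive buffer document, opened for writing

const shadowRoot = host.attachShadow({ mode: 'open' }); // the real target

function onChunk(text) {
  bufferDoc.write(text); // (b) feed each chunk to the buffer's parser
  // (c) would preprocess bufferDoc.head here
  // (d) move nodes into the shadow root; once moved, subsequent
  // streaming happens into them in place
  while (bufferDoc.body && bufferDoc.body.firstChild) {
    shadowRoot.appendChild(bufferDoc.body.firstChild);
  }
}
```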
@dvoytenko Thanks for those details. Roughly how much content are we talking about here?
@jakearchibald Your "fun hacks faster content" is awesome—this is the kind of scenario I have in mind. (Actually I read "fun hacks" with interest in 2016 and it has been irritating me ever since. It bothers me how hard it is to do this; that you have to break up chunks yourself; that Blink runs script elements adopted from the iframe when the spec says don't do that; etc.)
@rniwa, @justinfagnani I think template filling is meeting a different set of use cases. The abstraction and staging are different: template filling seems more focused on DOM, whereas this is about splicing characters without breaking tokenization; template filling seems more focused on having an artifact which is instantiated, maybe multiple times, whereas this is about streaming, taking bytes from somewhere and rehydrating them exactly once. I could even envisage these things being used together; for example, you stream a component definition and use the API proposed here to parse it; that includes a template you fill when an instance of the component is newed up.
@inikulin, @rniwa, is that satisfactory? Do you have any follow up questions about use cases?
@jakearchibald wrote:

> I think that's compatible. If this thing supports streams, it can support back-pressure. Also, the "get parser state" method could return a promise that waits for the queued HTML to flush.
I agree! I'm just worried that there's a path dependence here. How naive could an implementation of this API be and still be useful?
@WebReflection, below is an extended meditation on how we might ameliorate the XSS problem you mention in your example. This doesn't solve the problem of self-closing tags making the structure unpredictable, though; I don't think that is a terrible problem. A conservative author could just always write immediate closing tags for any spliced tag names; I believe it is always safe to write closing tags, even for self-closing tags. (@domenic?)
I agree that exposing HTML parser states is a bad idea because it will limit parser evolution and probably just annoy authors anyway ("oh, I handled the comment state but forgot to handle the comment end dash state".)
What if we exposed a smaller set of states?
For example, we could map the tag open state and tag name state into one "abstract" state, say, tag name. After feeding the splice to the parser, we require the parser to be in the HTML spec tag name state; if not, then that might be a hard error.
We could start conservatively by allowing splicing in a small set of states (tag names, attribute names, attribute values, and text) and impose restrictions. For example, maybe parsing the splice in `` html`<div>${thing}</div>` `` is only allowed in the HTML spec data, character reference, named character reference, numeric character reference, hexadecimal character reference, etc. states, and must end in the data state. This would allow `thing` to be `"fun & games"` but not `"fun <script>alert('and games')"` (we would abort when we hit the `<` and try to transition into the tag open state) or `"fun &"` (we would abort when we finish parsing the splice and find ourselves in the character reference state and not the data state).
I expect the implementation would carry around a bitset of allowed states which it tests on transitions. There's a bunch of states but many could be collapsed because we never allow splicing near the DOCTYPE states and so on. This could slow main document parsing, but making the parser yield more often probably means we're on a slow path anyway. I think it's probably fine.
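To make the shape of that concrete, a deliberately tiny sketch for the data (text) state only; the checks are invented for illustration and collapse many real tokenizer states:

```js
// Reject splices whose parse would leave the data state in a bad way.
function checkTextSplice(spliced) {
  if (spliced.includes('<')) {
    // would transition into the tag-open state: hard error, as described above
    throw new SyntaxError('Splice may not open a tag from the data state');
  }
  if (/&[a-zA-Z0-9#]*$/.test(spliced)) {
    // parse would end inside a character-reference state, not the data state
    throw new SyntaxError('Splice ends mid character reference');
  }
  return spliced;
}

checkTextSplice('fun & games');             // ok: ends back in the data state
// checkTextSplice("fun <script>alert(1)")  // throws: would hit tag-open
// checkTextSplice('fun &')                 // throws: stuck in character reference
```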
We also have the option of implementing different syntax for splices so you can splice a string and not worry about whether it's being spliced into text or an attribute, and whether that attribute was single quoted, double quoted, or unquoted.
But say in future we want to allow arbitrary markup there. We could do this with a set of functions authors use to communicate how they want the splice handled; these return an object that the outer `html` function interprets, for example `` html`<div>${hi_parser_trust_me`${thing}`}</div>` ``, where `hi_parser_trust_me` is another platform function which returns an object that the outer `html` function knows to interpret with relaxed parsing rules. Of course we'd need to take care with the design: define a useful set of those functions with intuitive names, and make shorthands like `` html`<div>${hi_parser_trust_me(thing)}</div>` `` work too.
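The wrapper itself could be tiny; a sketch (the branding mechanism here is illustrative, not a proposal):

```js
// The marker function just brands the value; the outer `html` function
// checks for the brand before relaxing its parsing rules.
const TRUSTED = Symbol('hi_parser_trust_me');

function hi_parser_trust_me(markup) {
  return { [TRUSTED]: true, markup: String(markup) };
}

function isTrustedSplice(value) {
  return typeof value === 'object' && value !== null && value[TRUSTED] === true;
}
```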
I still don't understand what the use cases of this feature are. If GitHub is using a single call to innerHTML, and it's slow, the correct fix is to break that up into chunks and then process each chunk separately. I have a hard time believing that GitHub's scenario warrants a completely new streaming HTML parser.

Please give us a list of concrete use cases for which this feature is required. This is a massive feature which requires a ton of engineering effort to implement in browser engines, and I'd like to have clear-cut, important use cases that can't be satisfied without it; not something websites can very easily work around.
> If GitHub is using a single call to innerHTML, and it's slow, the correct fix is to break that up into chunks and then process each chunk separately.

It is hard to break up a chunk of HTML without parsing it.

> I have a hard time believing that GitHub's scenario warrants a completely new streaming HTML parser.

There's already a streaming HTML parser: the main document parser. It is just hardwired to work in certain settings and not others.

> I'd like to have clear-cut important use cases that can't be satisfied without it; not something websites can very easily work around.

I think it is helpful to have use cases, so yeah, let's sharpen them up. What is "clear-cut" and "important" might be a bit subjective; what's your standard?
I want to push back on this idea that workarounds are OK. If authors end up having to rely on lots of workarounds, the accumulated burden can be significant. I think @jakearchibald's post about streaming load performance is worth studying: How long does it take authors to discover this iframe, document.write hack? How resource intensive is spinning up an iframe? How bad is it to enshrine that Safari/Chrome/Edge script running bug?
> > If GitHub is using a single call to innerHTML, and it's slow, the correct fix is to break that up into chunks and then process each chunk separately.
>
> It is hard to break up a chunk of HTML without parsing it.

As far as I can tell, GitHub is generating HTML for the entire comment section and sending it over XHR. Unless their backend somehow parses HTML each time it has to modify an issue page, they should have a mechanism to generate HTML per comment. At that point, they could split up the markup per comment, batch the chunks, and send them via XHR.
Also, browser engines could implement an optimization to speculatively tokenize and parse DOM nodes when content with the `text/html` MIME type is fetched.
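A rough sketch of that batching workaround (the endpoint and JSON shape are made up for illustration):

```js
// The server sends one self-contained HTML string per comment; the client
// appends them in rAF-sized batches instead of one giant innerHTML.
async function loadComments(container) {
  const { comments } = await (await fetch('/comments.json')).json();
  let i = 0;
  (function appendBatch() {
    const frag = document.createDocumentFragment();
    const end = Math.min(i + 20, comments.length);
    for (; i < end; i++) {
      const tpl = document.createElement('template');
      tpl.innerHTML = comments[i]; // each per-comment chunk parses on its own
      frag.appendChild(tpl.content);
    }
    container.appendChild(frag);
    if (i < comments.length) requestAnimationFrame(appendBatch);
  })();
}
```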
> > I have a hard time believing that GitHub's scenario warrants a completely new streaming HTML parser.
>
> There's already a streaming HTML parser: the main document parser. It is just hardwired to work in certain settings and not others.
I'm saying that exposing and maintaining that as a JS API without introducing a major security vulnerability would require a significant engineering effort.
> I want to push back on this idea that workarounds are OK. If authors end up having to rely on lots of workarounds, the accumulated burden can be significant.
It all depends on the cost. It would be great if we could make the DOM thread-safe and expose it to Web workers without perf and security issues, but the engineering effort required to do that is akin to rewriting WebKit from scratch, so we wouldn't lightly propose such a thing. That's a bit of an extreme case, but there's always a cost/benefit trade-off for every feature we add to the Web platform, and there's also an opportunity cost: the time I spend implementing this feature is time I could spend fixing other bugs and implementing other features in WebKit.
Since this feature has a significant implementation cost (at least to WebKit), the justification for it needs to be correspondingly strong.
> to speculatively tokenize and parse DOM nodes when content with the `text/html` MIME type is fetched

That doesn't help with actually adding these elements to the DOM in a streaming fashion, though.
Let's, as a first step, minimize the required API for @jakearchibald's original use case to something like:

```js
document.getElementById('div').appendChildStream(respStream);
```

What are the new security implications of this that are not already present for innerHTML? What is the added implementation cost that is not covered by the main document parser and/or the iframe hack?
I understand @rniwa's argument, which is why I hoped for a very simple scenario that I believe would already solve 99% of use cases: attribute values and content chunks. I also agree with @RReverser that this should start as small as possible, or it won't ever land.
```js
const tag = document.createStreamTag((node, value) => {
  if (node.nodeType === Node.ATTRIBUTE_NODE) {
    // we have an attribute: we can reach its owner,
    // and deal with its name and the value as content
  } else {
    // we have a Node.ELEMENT_NODE:
    // it's still open, with N childNodes;
    // we can append a new node, discard the value, do whatever
  }
});

// parse & stream
tag`<div class=${'name'} onclick=${fn}>
  ${'some content'}
</div>`;
```
The `tag` stream will always return a `DocumentFragment` (in this case containing a div), and the above example will invoke the callback 3 times:

1. the `class` attribute node, with the value `"name"`;
2. the `onclick` attribute node, with `fn` as-is as the listener (no implicit `.toString()` or anything);
3. the `div` node itself, with `childNodes.length` equal to `1`, which is the text before the chunk. The value of this third invocation will be the text `"some content"`, but it could also be anything else, including a `Promise` object.
If this was possible through the platform, it'd be quite revolutionary.
All the primitives to enrich the logic on top would be there. The only missing bit to cover all my use cases (already implemented and available to check out, if you want) is the fact that HTML is case-insensitive, so an attribute like `onCustomEvent` would result in a DOM attribute named `oncustomevent` instead. The latter is not a huge limitation, but maybe somebody has an idea of how it could be solved.
@domenic

> from what I can tell you're still thinking of the parser states as a "real" thing, and as a low-level primitive. But as @inikulin pointed out, they're not really primitives, they're just implementation details and spec devices we use to navigate through the algorithm.

If browsers don't implement it and don't intend to, what's the point of having it in a spec? I realise that browsers may use different terms internally, but unless they're implementing something wildly different to the spec, and intend to continue doing so, those states could be mapped to something standard.

> I also don't think we should expose them.

Why?
@dominiccooney

> What if we exposed a smaller set of states?

Agreed. We could even start by exposing nothing, but design the parser in a way that allows this in future.
@jakearchibald

> but unless they're implementing something wildly different to the spec

Sometimes they do. E.g. Blink doesn't use the spec states dedicated to entity parsing, and uses a custom state machine for that instead: https://chromium.googlesource.com/chromium/blink/+/master/Source/core/html/parser/HTMLEntityParser.cpp
Right, generally specifications define some kind of process that brings you from A to B. The details of that process are not important and implementations are encouraged to compete in that area. The moment you want to expose more details of that process to the outside world it starts mattering a whole lot more what those details are and how they function, as the moment you expose them you prevent all kinds of optimizations and code refactoring that could otherwise take place.
Fair enough. It'd be good to expose these states at some point, but it doesn't need to be v1.
@WebReflection I agree having events for separate pieces of the HTML as it goes through would be quite nice, but I'd say it's already a little bit more advanced than the "as small as possible", more like version 2. For version 1, it would be nice at least to be able to insert streaming content into the DOM even without hooks for separate parts of it.
events are just attributes ... what I've written intercepts/pauses at DOM chunks and/or attributes, no matter which attribute it is or what it does ... attributes :smile:
@WebReflection Sure, but as I said, it's a bit more advanced because it requires providing hooks from inside of the parser. I want to start with something that will be definitely possible to get implemented by vendors with pretty much no changes or hooks that are not already there, and then iterate on top of that.
@dominiccooney

> Thanks for those details. Roughly how much content are we talking about here?

These are really full-size docs, anywhere between 10K and 200K. I don't know what the averages are, tbh.
https://github.com/whatwg/html/issues/2142 – previous issue where a streaming parsing API was discussed
Another important question: do we want it to behave like a streaming innerHTML? If so, such functionality can't be achieved with the fragment approach, since we don't know the parsing context ahead of time. Consider a `<textarea>` element: with the innerHTML setter, the parser knows that content will be parsed in the context of the `<textarea>` element and switches the tokeniser to text parsing mode, so e.g. `<div></div>` will be parsed as text content, whereas with a fragment we'll parse it as a `div` tag. If we use the same machinery for the fragment parsing approach as we use for `<template>` parsing, we can work around some of the cases, such as parsing table content (though e.g. foster parenting will not work), but everything that involves adjustment of the tokeniser state will be a problem.
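The context problem in miniature: the same markup tokenizes differently depending on the element it's parsed into.

```js
const ta = document.createElement('textarea');
ta.innerHTML = '<div></div>';
console.log(ta.firstChild.nodeType === Node.TEXT_NODE);     // true: parsed as text

const div = document.createElement('div');
div.innerHTML = '<div></div>';
console.log(div.firstChild.nodeType === Node.ELEMENT_NODE); // true: parsed as a tag
```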
@inikulin The fragment could buffer text until it's appended, at which point it knows its context. Although I guess it's a bit weird that you wouldn't be able to look at stuff in the fragment.
The API could take an option that would give it context ahead of time, so nodes could be created before insertion.
@jakearchibald What if we modify the API a bit? We'll introduce a new entity, let's call it `StreamingParser` for now:

```js
// If we provide a context element, then content is streamed directly to it.
let parser = new StreamingParser(contentElement);
let response = await fetch(url);

response.body.pipeTo(parser.stream);

// You can examine parsed content at any moment using the `parser.fragment`
// property, which is a fragment mapped to the parsed content in the context element.
console.log(parser.fragment.childNodes.length);

// If a context element is not provided, we don't stream content anywhere;
// however, you can still use `parser.fragment` to examine the content or
// attach it to some node.
parser = new StreamingParser();
// ...
```
If you don't provide the content element, how is the content parsed?
In that case `parser.fragment` (or, even better, call it `parser.target`) will be a `DocumentFragment` implicitly created by the parser.
Is that a valid context for a parser? As in, if I push `<path/>` to the parser, what ends up in `parser.fragment`?
A `DocumentFragment` itself is not a valid context for the parser. I forgot to elaborate here: if we don't provide a context element for the parser, it creates a `<template>` element under the hood and pipes content into it; `parser.target` will be `template.content` in this case.
It'd still be nice to have the nodes created before the target. A "context" option could do this. The option could take a `Range`, an `Element` (treated like a range that starts within the element), or a `DOMString`, which is treated as an element that would be created by `document.createElement(string)`.
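For comparison, a sketch using the existing `Range#createContextualFragment`, which already resolves a parsing context this way for one-shot parsing:

```js
const range = document.createRange();
range.selectNodeContents(document.querySelector('table'));
const frag = range.createContextualFragment('<tr><td>hi</td></tr>');
// frag contains real row elements rather than just the text "hi",
// because the context resolved to the table.
```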
How will it behave if we pass a `Range` as a context?
@jakearchibald Seems like I got it: in the case of a `Range`, we'll stream to all elements in the `Range`? If so, we'll need a separate instance of the parser for each element in the `Range`.
@inikulin whoa, I really thought I'd replied to this, sorry. The `Range` would simply be used to figure out the context, like https://w3c.github.io/DOM-Parsing/#idl-def-range-createcontextualfragment(fragment). There'd only be one parser instance.
@jakearchibald Thanks for the clarification. We've just discussed possible behaviours with @RReverser, and we were wondering if parsing should affect the context element's ambient context: e.g. if we stream inside a `<table>` and the provided markup contains text outside a table cell, should we move this text above the context `<table>` element (foster-parent it), as is done in full document parsing? Or should we behave exactly like innerHTML and keep the text inside the `<table>`?
Hmm, that's a tough one. It'd be difficult to do what the parser does while giving access to the nodes before they're inserted. As in:

```js
const streamingFragment = document.createStreamingFragment({ context: 'table' });
const writer = streamingFragment.writable.getWriter();
await writer.write('hello');
// Is 'hello' anywhere in streamingFragment.childNodes?
```

In cases where the node would be moved outside of the context, we could do the innerHTML thing, or discard the node (it's been moved outside of the fragment, to nowhere).
I'd want to avoid as many of the innerHTML behaviours as possible, but I guess that isn't possible here.
Another concern we discussed with @inikulin (also related to the discussion in the last few comments) is that content being parsed might contain closing tags and so leave the parent context. In that regard, the behaviour of innerHTML or createContextualFragment seems better in that it keeps the content isolated, although we're still not sure how stable the machinery for the latter API is (given that it does more than innerHTML, e.g. executing scripts is allowed).
In an offline discussion, @sebmarkbage brought up the helpful point that if we added a `Response`-accepting `srcObject` to iframe (see https://github.com/whatwg/html/issues/3972), this would also serve as a streaming parsing API, albeit only in iframes.
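A sketch of how that might be used (a `Response`-accepting `srcObject` is the #3972 proposal, not a shipped API):

```js
async function streamIntoIframe(iframe) {
  const { readable, writable } = new TransformStream();
  iframe.srcObject = new Response(readable, {
    headers: { 'content-type': 'text/html' },
  });

  const writer = writable.getWriter();
  const encode = (s) => new TextEncoder().encode(s);
  await writer.write(encode('<h1>streamed</h1>'));
  // ...keep writing chunks; the iframe parses them as they arrive...
  await writer.close();
}
```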
@domenic Hmm, I'm not sure how it would help with streaming parsing? Seems to mostly help with streaming generation of content?
@RReverser The parsing would also be done in a streaming fashion, just like it is currently done for iframes loaded from network-derived lowercase-"r" responses.
What I mean is, I don't see how this helps with actually parsing HTML from JS side (and getting tokens etc.), it rather seems to help with generating and delivering HTML to the renderer.
TL;DR HTML should provide an API for parsing. Why? "Textual" HTML is a widely used syntax. HTML parsing is complex enough to want to use the browser's parser, plus browsers can do implementation tricks with how they create elements, etc.
Unfortunately, the way the HTML parser is exposed in the web platform is a hodge-podge. Streaming parsing is only available for main document loads; other things rely on strings, which puts pressure on memory. innerHTML is synchronous and could cause jank for large documents (although I would like to see data on this, because it is pretty fast).
Here are some strawman requirements:
Commentary:
One big question is when this API exposes the tree it is operating on. Main document parsing does expose the tree and handles mutations to it pretty happily; innerHTML parsing does not until the nodes are adopted into the target node's document (which may start running custom element stuff.)
One minor question is what to do with errors.
Being asynchronous has implications for documents and/or custom elements. If you allow creating stuff in the main document, then you have to run the custom element constructors sometime, so to make it not jank you probably can't run them together. This is probably a feature worth addressing.
See also: #2827