But stepping back, the real API I want is to be able to create a tree of DOM and easily and cheaply get references to particular nodes in it.
This is the same thing I've discussed already, but here they are proposing something better: while you are parsing, you can intercept and pollute the stream at runtime. It's more powerful, but I understand the async pain point you made.
Here (https://github.com/whatwg/html/issues/2993#issuecomment-326547102) I've proposed a way to retrieve a UID from either attributes or content, which is, I believe, similar to the API you want for retrieving particular templates.
It does seem this issue has kind of exploded. Everyone has interpreted the OP as being interested in helping with their particular problem.
This is all interesting discussion, and I don't want to discourage it. But I do want to highlight that it's unlikely we'll end up solving all of these problems with a single API. The OP was specifically spun off of #2827, which is more about progressive rendering (thus, async, streams, and no template interpolation). I just want people to be aware that we may solve that separately, and leave templating and an integer-based instruction set for tree construction to other APIs.
Of course by now this thread is mostly about templating engines and their needs, so maybe it should be repurposed. We'll see where the discussion takes us :). Personally I am with @justinfagnani that template parts is the most promising direction so far for that particular problem.
I think @wycats has a reasonable point that if we want to provide a low-level parser API that can also be used in workers, the result of that needs to be some kind of changeset that can be applied to trees. That's also roughly how off-the-main-thread parsers in browsers need to be modeled today. Finding the primitives upon which the whole thing is built seems like the best idea to me given past frustration with higher-level alternatives that don't quite address the needs.
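For illustration only, here is a rough sketch of what replaying such a worker-produced changeset on the main thread might look like; the operation format and the `applyChangeList` name are invented for this example, not anyone's concrete proposal:

```js
// Hypothetical: a worker-side parse emits a flat list of tree-construction
// operations; the main thread replays them onto real nodes.
function applyChangeList(root, ops) {
  const nodes = [root]; // ops refer to nodes by index; index 0 is the root
  for (const op of ops) {
    switch (op.type) {
      case 'createElement':
        nodes.push(document.createElement(op.name));
        break;
      case 'setAttribute':
        nodes[op.target].setAttribute(op.name, op.value);
        break;
      case 'appendText':
        nodes[op.parent].append(op.value);
        break;
      case 'append':
        nodes[op.parent].append(nodes[op.target]);
        break;
    }
  }
}

// e.g. a change list producing <div class="a">hi</div>:
applyChangeList(document.createElement('section'), [
  { type: 'createElement', name: 'div' }, // becomes nodes[1]
  { type: 'setAttribute', target: 1, name: 'class', value: 'a' },
  { type: 'appendText', parent: 1, value: 'hi' },
  { type: 'append', parent: 0, target: 1 },
]);
```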
@annevk sure, if the point of this feature was workers, then a change list might make sense. But I am personally more interested in allowing the browser to do work in an asynchronous and streaming fashion, instead of focusing on workers. That async/streaming fashion could be potentially off-thread if that provides some benefit, but purely as an implementation detail.
For example, browsers are likely to start by just using the main thread, asynchronously, to prevent jank. Then later they may investigate using native threads, which they can do much more effectively than JS. JS can only put binary data in shared memory, whereas browsers could put actual C++ Node objects there. (Not that those are thread-safe today, but it's a possible future implementation strategy.)
Stated another way, I don't think it's correct to identify a change list as an underlying primitive. It's a new high-level API aimed at a very specific, new use case; it's not related to the primitives that currently underlie streaming HTML parsing.
What kinds of errors? @jakearchibald, @inikulin, et al.
I mean where the HTML spec says to "report an error." I strongly agree with @domenic et al. that we should not make a different/strict syntax for HTML. Whether/where errors are reported is a relatively unimportant detail; let's worry about it later.
@jakearchibald, I think we would need to build a controller which balances not only parsing and tree construction work, but append, style, layout and paint work. In the worst case, let's say the controller is terrible and behaves like innerHTML: is this feature still worth it? What's the point where it becomes compelling?
One benefit of keeping this API pretty high-level without hooks into substitution, states, etc. is that in future when there's something like "async append" the existing uses of the API could become async appends, just with the UA doing the commit ASAP.
Crudely, DOM Ranges can point to an element and an offset. The DOM doesn't have a way to point to tag names, attribute names or values. Additionally, tag names can't be changed later. So if hole finding is a thing, it needs to either happen as part of tokenization or have some restrictions placed on it to make sense.
The HTML templating system I worked on in 2007 got a lot of benefit out of requiring that a substitution did not flip the HTML parser into different modes. That system also cared about semantics, for example, that a given attribute value contains script. The HTML parser doesn't, but our (Blink's) built-in XSS protection knows which attribute names are event handlers and so on. A tokenization-level API supporting (or even requiring) those sorts of restrictions could be useful.
I think if you want to construct elements from a worker, HTML's text serialization is a hard format to write correctly and read correctly. This parsing API could help make the reading side of things better but doesn't do anything for writers. HTML's text serialization is primarily useful because there's a good chance you already have your data in this format.
@wycats' proposal for DOM tree construction command buffers is reminiscent of the HTML parser's operations, but you have to squint (appendHTML is a bit like "inject document.written bytes"). I'm not sure how usable HTML's thing would be as an API. You have to do attributes before content, for example. There are also some things missing: HTML knows the context it is parsing into before it starts (like "fragment parsing into a table", etc.) and has some wild operations (like "reconstruct the active formatting elements") and so on.
The thread here seems to assume a lot of context that is not stated explicitly. What should I read to learn about the use cases?
@dvoytenko , could you tell us more about your specific use case for the document.open/write/close code in #2827?
This thread has 3-4 different proposals, yet no clear goal or use cases for this API. What problem(s) are we trying to solve here?
@rniwa Quite a few sites, including GitHub, hijack link clicks and perform the navigation themselves to avoid reparsing/executing the same JavaScript on the next page. However, this can become a lot slower on long GitHub pages, as you lose the benefit of streaming with innerHTML. See https://jakearchibald.com/2016/fun-hacks-faster-content/.
Thinking of the extensible web manifesto, a streaming parser would expose this existing browser behaviour to JavaScript, without having to resort to iframe hacks.
If it could also expose some parser state mid-parse, it would help (although not completely solve) some template cases. But helping somewhat unrelated cases feels like a win in terms of the extensible web.
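For concreteness, here is the iframe hack in question, condensed from that post (the `/comments.inc` endpoint and `#content` target are placeholders):

```js
// A hidden same-origin iframe provides the streaming parser, and the element
// it parses into is adopted by the visible page.
const iframe = document.createElement('iframe');
iframe.style.display = 'none';
document.body.appendChild(iframe);

iframe.onload = async () => {
  iframe.onload = null;
  const doc = iframe.contentDocument;
  doc.write('<streaming-element>');
  // Adopt the element; the parser keeps appending into it after the move.
  document.querySelector('#content').appendChild(doc.querySelector('streaming-element'));

  const response = await fetch('/comments.inc');
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    doc.write(decoder.decode(value, { stream: true })); // parsed incrementally
  }
  doc.write('</streaming-element>');
  doc.close();
};
iframe.src = '';
```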
@domenic

> Personally I am with @justinfagnani that template parts is the most promising direction so far for that particular problem.

I agree that there are better targeted solutions for that particular case, but isn't this "appcaching" it? Offering low-level parser details feels like it would help more use cases.
@dominiccooney

> I think we would need to build a controller which balances not only parsing and tree construction work, but append, style, layout and paint work

I think that's compatible. If this thing supports streams, it can support back-pressure. Also, the "get parser state" method could return a promise that waits for the queued HTML to flush.
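As a purely hypothetical shape (every name here is invented for this sketch, not a proposal): a parser exposed as a writable stream would get back-pressure from the usual writer mechanics.

```js
// Invented API shape: a parser behind a WritableStream.
const parser = new HTMLParserStream({ context: document.querySelector('#target') });
const writer = parser.writable.getWriter();

await writer.ready;                    // back-pressure: wait for queue capacity
await writer.write('<p>hello');        // resolves when the chunk is accepted
const state = await parser.getParserState(); // resolves once queued HTML has flushed
```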
> Additionally, tag names can't be changed later. So if hole finding is a thing, it needs to either happen as part of tokenization or have some restrictions placed on it to make sense.
You could achieve this with some parser info:

```js
whatever`
<${foo} src="hi">
`;
```

After flushing `<`, the parser should know it's in the tag-open state.
@hsivonen
https://jakearchibald.com/2016/fun-hacks-faster-content/ might help.
FWIW I see this pattern as a footgun:

```js
whatever`
<${foo} src="hi">
`;
```

Attributes have different meanings according to the kind of node you put them on. It might play well with strings on the server side, but on the DOM side I don't see it as a must-have, quite the opposite.
What I mean is that the following, which at this point would presumably be allowed too, doesn't look good at all:

```js
whatever`
<${foo} ${attr}=${value}></${foo}>
`;
```
What if `foo` is a `br`, or any other void element, or vice versa (you wrote `<${foo} />` but it's not void)? IMO this goes a bit too far from what I (personally) have ever needed from a template parser/engine.
@WebReflection I agree that this kind of interpolation wouldn't make sense for HyperHTML, but it would make it easy to detect this situation and throw a meaningful error.
I'm much more interested in exposing existing browser internals to create new possibilities and make existing things easier, than creating a new inflexible API that solves one use-case.
it wasn't about hyperHTML, it was more about common sense.
What does the following produce?

```js
whatever`
<${bar} ${attr}=${value}>
<${foo} ${attr}=${value}></${foo}>
`;
```
It's absolutely unpredictable, and it's also XSS-prone, IMO. But surely I don't want to block anyone from exploring anything; I'm just thinking out loud about that pattern.
Any system that's piping text directly into a parser needs to be very careful with user input. Using parser state, the developer can pick the appropriate escaping method, or throw if it's a state they don't want/wish to support.
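As a sketch of what that could look like (the coarse state names and the `escapeFor` helper are hypothetical, not a proposed API):

```js
// Pick an escaper based on a (hypothetical) coarse parser state, or refuse.
function escapeFor(state, value) {
  switch (state) {
    case 'text':
      return String(value).replace(/&/g, '&amp;').replace(/</g, '&lt;');
    case 'attribute-value-double-quoted':
      return String(value).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
    default:
      // e.g. inside a <script>, a comment, or a tag name: refuse user input
      throw new TypeError(`Refusing to interpolate user input in state "${state}"`);
  }
}
```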
@jakearchibald from what I can tell you're still thinking of the parser states as a "real" thing, and as a low-level primitive. But as @inikulin pointed out, they're not really primitives, they're just implementation details and spec devices we use to navigate through the algorithm. I also don't think we should expose them.
@dominiccooney We use an inactive document's open/write to stream shadow DOM. We display relatively big documents in shadow roots, and streaming helps a lot with perceived latency. The way it works is this (sketched in code below):

a. We create a buffer (inactive) document via `document.implementation.createHTMLDocument` and call `open` on it.
b. For each new chunk arriving via XHR, we call `document.write` on the buffer document.
c. We do some preprocessing on `bufferDocument.head`.
d. For each new `bufferDocument.body.firstChild`, we move it to the real attached shadow root.

These steps achieve rather good perceived streaming performance. Once a node is moved, the subsequent streaming happens in the shadow root. It works much like https://jakearchibald.com/2016/fun-hacks-faster-content/ suggests, but we don't want to use iframes.
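A minimal sketch of those steps, assuming a `host` element for the shadow root and some transport calling `onChunk` per XHR chunk:

```js
const bufferDoc = document.implementation.createHTMLDocument('');
bufferDoc.open(); // (a) inactive buffer document, opened for writing

const shadowRoot = host.attachShadow({ mode: 'open' }); // the real target

function onChunk(text) {
  bufferDoc.write(text); // (b) feed each chunk to the buffer's parser
  // (c) would preprocess bufferDoc.head here
  // (d) move nodes into the shadow root; once moved, subsequent
  // streaming happens into them in place
  while (bufferDoc.body && bufferDoc.body.firstChild) {
    shadowRoot.appendChild(bufferDoc.body.firstChild);
  }
}
```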
@dvoytenko Thanks for those details. Roughly how much content are we talking about here?
@jakearchibald Your "fun hacks faster content" is awesome—this is the kind of scenario I have in mind. (Actually I read "fun hacks" with interest in 2016 and it has been irritating me ever since. It bothers me how hard it is to do this; that you have to break up chunks yourself; that Blink runs script elements adopted from the iframe when the spec says don't do that; etc.)
@rniwa, @justinfagnani I think template filling is meeting a different set of use cases. The abstraction and staging are different: template filling seems more focused on DOM, whereas this is about splicing characters without breaking tokenization; template filling seems more focused on having an artifact which is instantiated, maybe multiple times, whereas this is about streaming, taking bytes from somewhere and rehydrating them exactly once. I could even envisage these things being used together; for example, you stream a component definition and use the API proposed here to parse it; that includes a template you fill when an instance of the component is newed up.
@inikulin, @rniwa, is that satisfactory? Do you have any follow up questions about use cases?
@jakearchibald wrote:

> I think that's compatible. If this thing supports streams, it can support back-pressure. Also, the "get parser state" method could return a promise that waits for the queued HTML to flush.
I agree! I'm just worried that there's a path dependence here. How naive could an implementation of this API be and still be useful?
@WebReflection, below is an extended meditation on how we might ameliorate the XSS problem you mention in your example. This doesn't solve the problem of self-closing tags making the structure unpredictable, though; I don't think that is a terrible problem. A conservative author could just always write immediate closing tags for any spliced tag names; I believe it is always safe to write closing tags, even for self-closing tags. (@domenic?)
I agree that exposing HTML parser states is a bad idea because it will limit parser evolution and probably just annoy authors anyway ("oh, I handled the comment state but forgot to handle the comment end dash state".)
What if we exposed a smaller set of states?
For example, we could map the tag open state and tag name state into one "abstract" state, say, tag name. After feeding the splice to the parser, we require the parser to be in the HTML spec tag name state; if not, then that might be a hard error.
We could start conservatively by allowing splicing in a small set of states (tag names, attribute names, attribute values, and text) and impose restrictions. For example, maybe parsing the splice in `` html`<div>${thing}</div>` `` is only allowed in the HTML spec data, character reference, named character reference, numeric character reference, hexadecimal character reference, etc. states, and must end in the data state. This would allow `thing` to be `"fun & games"` but not `"fun <script>alert('and games')"` (we would abort when we hit the `<` and try to transition into the tag open state) or `"fun &"` (we would abort when we finish parsing the splice and find ourselves in the character reference state and not the data state).
I expect the implementation would carry around a bitset of allowed states which it tests on transitions. There's a bunch of states but many could be collapsed because we never allow splicing near the DOCTYPE states and so on. This could slow main document parsing, but making the parser yield more often probably means we're on a slow path anyway. I think it's probably fine.
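To make the shape of that concrete, a deliberately tiny sketch for the data (text) state only; the checks are invented for illustration and collapse many real tokenizer states:

```js
// Reject splices whose parse would leave the data state in a bad way.
function checkTextSplice(spliced) {
  if (spliced.includes('<')) {
    // would transition into the tag-open state: hard error, as described above
    throw new SyntaxError('Splice may not open a tag from the data state');
  }
  if (/&[a-zA-Z0-9#]*$/.test(spliced)) {
    // parse would end inside a character-reference state, not the data state
    throw new SyntaxError('Splice ends mid character reference');
  }
  return spliced;
}

checkTextSplice('fun & games');             // ok: ends back in the data state
// checkTextSplice("fun <script>alert(1)")  // throws: would hit tag-open
// checkTextSplice('fun &')                 // throws: stuck in character reference
```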
We also have the option of implementing different syntax for splices so you can splice a string and not worry about whether it's being spliced into text or an attribute, and whether that attribute was single quoted, double quoted, or unquoted.
But say in future we want to allow arbitrary markup there. We could do this with a set of functions authors use to communicate how they want the splice handled; these return an object that the outer `html` function interprets, for example `` html`<div>${hi_parser_trust_me`${thing}`}</div>` ``, where `hi_parser_trust_me` is another platform function which returns an object that the outer `html` function knows to interpret with relaxed parsing rules. Of course we'd need to take care with the design: define a useful set of those functions with intuitive names, and make shorthands like `` html`<div>${hi_parser_trust_me(thing)}</div>` `` work too.
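The wrapper itself could be tiny; a sketch (the branding mechanism here is illustrative, not a proposal):

```js
// The marker function just brands the value; the outer `html` function
// checks for the brand before relaxing its parsing rules.
const TRUSTED = Symbol('hi_parser_trust_me');

function hi_parser_trust_me(markup) {
  return { [TRUSTED]: true, markup: String(markup) };
}

function isTrustedSplice(value) {
  return typeof value === 'object' && value !== null && value[TRUSTED] === true;
}
```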
I still don't understand what the use cases of this feature are. If GitHub is using a single call to innerHTML, and it's slow, the correct fix is to break that up into chunks and then process each chunk separately. I have a hard time believing that GitHub's scenario warrants a completely new streaming HTML parser.

Please give us a list of concrete use cases for which this feature is required. This is a massive feature which requires a ton of engineering effort to implement in browser engines, and I'd like to have clear-cut, important use cases that can't be satisfied without it; not something websites can very easily work around.
> If GitHub is using a single call to innerHTML, and it's slow, the correct fix is to break that up into chunks and then process each chunk separately.

It is hard to break up a chunk of HTML without parsing it.

> I have a hard time believing that GitHub's scenario warrants a completely new streaming HTML parser.

There's already a streaming HTML parser: the main document parser. It is just hardwired to work in certain settings and not others.

> I'd like to have clear-cut important use cases that can't be satisfied without it; not something websites can very easily work around.

I think it is helpful to have use cases, so yeah, let's sharpen them up. What is "clear-cut" and "important" might be a bit subjective; what's your standard?
I want to push back on this idea that workarounds are OK. If authors end up having to rely on lots of workarounds, the accumulated burden can be significant. I think @jakearchibald's post about streaming load performance is worth studying: How long does it take authors to discover this iframe, document.write hack? How resource intensive is spinning up an iframe? How bad is it to enshrine that Safari/Chrome/Edge script running bug?
> > If GitHub is using a single call to innerHTML, and it's slow, the correct fix is to break that up into chunks and then process each chunk separately.
>
> It is hard to break up a chunk of HTML without parsing it.

As far as I can tell, GitHub is generating HTML for the entire comment section and sending it over XHR. Unless their backend somehow parses HTML each time it has to modify an issue page, they should have a mechanism to generate HTML per comment. At that point, they could split up the markup per comment, batch the chunks, and send them via XHR.
Also, browser engines could implement an optimization to speculatively tokenize and parse DOM nodes when content with the `text/html` MIME type is fetched.
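A rough sketch of that batching workaround (the endpoint and JSON shape are made up for illustration):

```js
// The server sends one self-contained HTML string per comment; the client
// appends them in rAF-sized batches instead of one giant innerHTML.
async function loadComments(container) {
  const { comments } = await (await fetch('/comments.json')).json();
  let i = 0;
  (function appendBatch() {
    const frag = document.createDocumentFragment();
    const end = Math.min(i + 20, comments.length);
    for (; i < end; i++) {
      const tpl = document.createElement('template');
      tpl.innerHTML = comments[i]; // each per-comment chunk parses on its own
      frag.appendChild(tpl.content);
    }
    container.appendChild(frag);
    if (i < comments.length) requestAnimationFrame(appendBatch);
  })();
}
```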
> > I have a hard time believing that GitHub's scenario warrants a completely new streaming HTML parser.
>
> There's already a streaming HTML parser: the main document parser. It is just hardwired to work in certain settings and not others.
I'm saying that exposing and maintaining that as a JS API without introducing a major security vulnerability would require a significant engineering effort.
> I want to push back on this idea that workarounds are OK. If authors end up having to rely on lots of workarounds, the accumulated burden can be significant.
It all depends on the cost. It would be great if we could make the DOM thread-safe and expose it to Web workers without perf and security issues, but the engineering effort required to do that is akin to rewriting WebKit from scratch, so we wouldn't lightly propose such a thing. That's a bit of an extreme case, but there's always a cost/benefit trade-off for every feature we add to the Web platform, and there's also an opportunity cost: the time I spend implementing this feature is time I could spend fixing other bugs and implementing other features in WebKit.
Since this feature has a significant implementation cost (at least to WebKit), the justification for it needs to be correspondingly strong.
> to speculatively tokenize and parse DOM nodes when content with the `text/html` MIME type is fetched

That doesn't help with actually adding these elements to the DOM in a streaming fashion, though.
Let's, as a first step, minimize the required API for @jakearchibald's original use case to something like:

```js
document.getElementById('div').appendChildStream(respStream);
```

What are the new security implications of this that are not already present for innerHTML? What is the added implementation cost that is not covered by the main document parser and/or the iframe hack?
I understand @rniwa's argument, which is why I hoped for a very simple scenario that I believe would already solve 99% of use cases: attribute values and content chunks. I also agree with @RReverser that this should start as small as possible, or it won't ever land.
```js
const tag = document.createStreamTag((node, value) => {
  if (node.nodeType === Node.ATTRIBUTE_NODE) {
    // we have an attribute: we can reach its owner,
    // and deal with its name and the value as content
  } else {
    // we have a Node.ELEMENT_NODE:
    // it's still open, with N childNodes;
    // we can append a new node, discard the value, do whatever
  }
});

// parse & stream
tag`<div class=${'name'} onclick=${fn}>
  ${'some content'}
</div>`;
```
The `tag` stream will always return a `DocumentFragment` (in this case containing a div), and the above example will invoke the callback 3 times:

1. the `class` attribute node, with the value `"name"`;
2. the `onclick` attribute node, with `fn` as-is as the listener (no implicit `.toString()` or anything);
3. the `div` node itself, with `childNodes.length` equal to `1`, which is the text before the chunk. The value of this third invocation will be the text `"some content"`, but it could also be anything else, including a `Promise` object.
If this was possible through the platform, it'd be quite revolutionary.
All the primitives to enrich the logic on top would be there. The only missing bit to cover all my use cases (already implemented and available to check out, if you want) is the fact that HTML is case-insensitive, so an attribute like `onCustomEvent` would result in a DOM attribute named `oncustomevent` instead. The latter is not a huge limitation, but maybe somebody has an idea of how it could be solved.
@domenic

> from what I can tell you're still thinking of the parser states as a "real" thing, and as a low-level primitive. But as @inikulin pointed out, they're not really primitives, they're just implementation details and spec devices we use to navigate through the algorithm.

If browsers don't implement it and don't intend to, what's the point of having it in a spec? I realise that browsers may use different terms internally, but unless they're implementing something wildly different to the spec, and intend to continue doing so, those states could be mapped to something standard.

> I also don't think we should expose them.

Why?
@dominiccooney

> What if we exposed a smaller set of states?

Agreed. We could even start by exposing nothing, but design the parser in a way that allows this in future.
@jakearchibald

> but unless they're implementing something wildly different to the spec

Sometimes they do. E.g. Blink doesn't use the spec states dedicated to entity parsing, and uses a custom state machine for that instead: https://chromium.googlesource.com/chromium/blink/+/master/Source/core/html/parser/HTMLEntityParser.cpp
Right, generally specifications define some kind of process that brings you from A to B. The details of that process are not important and implementations are encouraged to compete in that area. The moment you want to expose more details of that process to the outside world it starts mattering a whole lot more what those details are and how they function, as the moment you expose them you prevent all kinds of optimizations and code refactoring that could otherwise take place.
Fair enough. It'd be good to expose these states at some point, but it doesn't need to be v1.
@WebReflection I agree having events for separate pieces of the HTML as it goes through would be quite nice, but I'd say it's already a little bit more advanced than the "as small as possible", more like version 2. For version 1, it would be nice at least to be able to insert streaming content into the DOM even without hooks for separate parts of it.
events are just attributes ... what I've written intercepts/pauses at DOM chunks and/or attributes, no matter which attribute it is or what it does ... attributes :smile:
@WebReflection Sure, but as I said, it's a bit more advanced because it requires providing hooks from inside of the parser. I want to start with something that will be definitely possible to get implemented by vendors with pretty much no changes or hooks that are not already there, and then iterate on top of that.
@dominiccooney

> Thanks for those details. Roughly how much content are we talking about here?

These are really full-size docs, anywhere between 10K and 200K. I don't know what the averages are, tbh.
https://github.com/whatwg/html/issues/2142 – previous issue where a streaming parsing API was discussed
Another important question: do we want it to behave like a streaming innerHTML? If so, such functionality can't be achieved with the fragment approach, since we don't know the parsing context ahead of time. Consider a `<textarea>` element: with the innerHTML setter, the parser knows that content will be parsed in the context of the `<textarea>` element and switches the tokeniser to text parsing mode, so e.g. `<div></div>` will be parsed as text content, whereas with a fragment we'll parse it as a `div` tag. If we use the same machinery for the fragment parsing approach as we use for `<template>` parsing, we can work around some of the cases, such as parsing table content (though e.g. foster parenting will not work), but everything that involves adjustment of the tokeniser state will be a problem.
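The context problem in miniature: the same markup tokenizes differently depending on the element it's parsed into.

```js
const ta = document.createElement('textarea');
ta.innerHTML = '<div></div>';
console.log(ta.firstChild.nodeType === Node.TEXT_NODE);     // true: parsed as text

const div = document.createElement('div');
div.innerHTML = '<div></div>';
console.log(div.firstChild.nodeType === Node.ELEMENT_NODE); // true: parsed as a tag
```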
@inikulin The fragment could buffer text until it's appended, at which point it knows its context. Although I guess it's a bit weird that you wouldn't be able to look at stuff in the fragment.
The API could take an option that would give it context ahead of time, so nodes could be created before insertion.
@jakearchibald What if we modify the API a bit? We'll introduce a new entity, let's call it `StreamingParser` for now:

```js
// If we provide a context element, then content is streamed directly to it.
let parser = new StreamingParser(contentElement);
let response = await fetch(url);

response.body.pipeTo(parser.stream);

// You can examine parsed content at any moment using the `parser.fragment`
// property, which is a fragment mapped to the parsed content in the context element.
console.log(parser.fragment.childNodes.length);

// If a context element is not provided, we don't stream content anywhere;
// however, you can still use `parser.fragment` to examine the content or
// attach it to some node.
parser = new StreamingParser();
// ...
```
If you don't provide the content element, how is the content parsed?
In that case `parser.fragment` (or, even better, call it `parser.target`) will be a `DocumentFragment` implicitly created by the parser.
Is that a valid context for a parser? As in, if I push `<path/>` to the parser, what ends up in `parser.fragment`?
A `DocumentFragment` itself is not a valid context for the parser. I forgot to elaborate here: if we don't provide a context element for the parser, it creates a `<template>` element under the hood and pipes content into it; `parser.target` will be `template.content` in this case.
It'd still be nice to have the nodes created before the target. A "context" option could do this. The option could take a `Range`, an `Element` (treated like a range that starts within the element), or a `DOMString`, which is treated as an element that would be created by `document.createElement(string)`.
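For comparison, a sketch using the existing `Range#createContextualFragment`, which already resolves a parsing context this way for one-shot parsing:

```js
const range = document.createRange();
range.selectNodeContents(document.querySelector('table'));
const frag = range.createContextualFragment('<tr><td>hi</td></tr>');
// frag contains real row elements rather than just the text "hi",
// because the context resolved to the table.
```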
How will it behave if we pass a `Range` as a context?
@jakearchibald Seems like I got it: in the case of a `Range`, we'll stream to all elements in the `Range`? If so, we'll need a separate instance of the parser for each element in the `Range`.
@inikulin whoa, I really thought I'd replied to this, sorry. The `Range` would simply be used to figure out the context, like https://w3c.github.io/DOM-Parsing/#idl-def-range-createcontextualfragment(fragment). There'd only be one parser instance.
@jakearchibald Thanks for the clarification. We've just discussed possible behaviours with @RReverser, and we were wondering if parsing should affect the context element's ambient context: e.g. if we stream inside a `<table>` and the provided markup contains text outside a table cell, should we move this text above the context `<table>` element (foster-parent it), as is done in full document parsing? Or should we behave exactly like innerHTML and keep the text inside the `<table>`?
Hmm, that's a tough one. It'd be difficult to do what the parser does while giving access to the nodes before they're inserted. As in:

```js
const streamingFragment = document.createStreamingFragment({ context: 'table' });
const writer = streamingFragment.writable.getWriter();
await writer.write('hello');
// Is 'hello' anywhere in streamingFragment.childNodes?
```

In cases where the node would be moved outside of the context, we could do the innerHTML thing, or discard the node (it's been moved outside of the fragment, to nowhere).
I'd want to avoid as many of the innerHTML behaviours as possible, but I guess that isn't possible here.
Another concern we discussed with @inikulin (also related to the discussion in the last few comments) is that content being parsed might contain closing tags and so leave the parent context. In that regard, the behaviour of innerHTML or createContextualFragment seems better in that it keeps the content isolated, although we're still not sure how stable the machinery for the latter API is (given that it does more than innerHTML, e.g. executing scripts is allowed).
In an offline discussion, @sebmarkbage brought up the helpful point that if we added a `Response`-accepting `srcObject` to iframe (see https://github.com/whatwg/html/issues/3972), this would also serve as a streaming parsing API, albeit only in iframes.
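A sketch of how that might be used (a `Response`-accepting `srcObject` is the #3972 proposal, not a shipped API):

```js
async function streamIntoIframe(iframe) {
  const { readable, writable } = new TransformStream();
  iframe.srcObject = new Response(readable, {
    headers: { 'content-type': 'text/html' },
  });

  const writer = writable.getWriter();
  const encode = (s) => new TextEncoder().encode(s);
  await writer.write(encode('<h1>streamed</h1>'));
  // ...keep writing chunks; the iframe parses them as they arrive...
  await writer.close();
}
```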
@domenic Hmm, I'm not sure how it would help with streaming parsing? Seems to mostly help with streaming generation of content?
@RReverser The parsing would also be done in a streaming fashion, just like it is currently done for iframes loaded from network-derived lowercase-"r" responses.
What I mean is, I don't see how this helps with actually parsing HTML from JS side (and getting tokens etc.), it rather seems to help with generating and delivering HTML to the renderer.
TL;DR HTML should provide an API for parsing. Why? "Textual" HTML is a widely used syntax. HTML parsing is complex enough to want to use the browser's parser, plus browsers can do implementation tricks with how they create elements, etc.
Unfortunately, the way the HTML parser is exposed in the web platform is a hodge-podge. Streaming parsing is only available for main document loads; other things rely on strings, which puts pressure on memory. innerHTML is synchronous and could cause jank for large documents (although I would like to see data on this, because it is pretty fast).
Here are some strawman requirements:
Commentary:
One big question is when this API exposes the tree it is operating on. Main document parsing does expose the tree and handles mutations to it pretty happily; innerHTML parsing does not until the nodes are adopted into the target node's document (which may start running custom element stuff.)
One minor question is what to do with errors.
Being asynchronous has implications for documents and/or custom elements. If you allow creating stuff in the main document, then you have to run the custom element constructors sometime, so to make it not jank you probably can't run them together. This is probably a feature worth addressing.
See also: #2827