whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Add a "modern" parsing API #2993

Closed dominiccooney closed 3 years ago

dominiccooney commented 7 years ago

TL;DR HTML should provide an API for parsing. Why? "Textual" HTML is a widely used syntax. HTML parsing is complex enough to want to use the browser's parser, plus browsers can do implementation tricks with how they create elements, etc.

Unfortunately, the way the HTML parser is exposed in the web platform is a hodge-podge. Streaming parsing is only available for main document loads; everything else relies on strings, which puts pressure on memory. innerHTML is synchronous and could cause jank for large documents (although I would like to see data on this, because it is pretty fast).

Here are some strawman requirements:

Commentary:

One big question is when this API exposes the tree it is operating on. Main document parsing does expose the tree and handles mutations to it pretty happily; innerHTML parsing does not until the nodes are adopted into the target node's document (which may start running custom element stuff.)

One minor question is what to do with errors.

Being asynchronous has implications for documents and/or custom elements. If you allow creating stuff in the main document, then you have to run the custom element constructors sometime, so to make it not jank you probably can't run them together. This is probably a feature worth addressing.

See also:

Issue 2827

annevk commented 7 years ago

cc @jakearchibald @whatwg/html-parser

jakearchibald commented 7 years ago

One big question is when this API exposes the tree it is operating on.

I'd like this API to support progressive rendering, so I guess my preference is "as soon as possible".

const streamingFragment = document.createStreamingFragment();

const response = await fetch(url);
response.body
  .pipeThrough(new TextDecoderStream())
  .pipeTo(streamingFragment.writable);

document.body.append(streamingFragment);

I'd like the above to progressively render. The parsing would follow the "in template" insertion mode, although we may want options to handle other cases, like SVG.

One minor question is what to do with errors

What kinds of errors?

jakearchibald commented 7 years ago

There are a few libraries that use tagged template literals to build HTML, I think their code would be simpler if they knew what state the parser was in at a given point. This might be an opportunity.

Eg:

const fragment = whatever`
  <p>${someContent}</p>
  <img src=${someImgSrc}>
`;

These libraries allow someContent to be text, an element, a promise for text/element. someImgSrc would be text in this case, but may be a function if it's assigning to an event listener. Right now these libraries insert a UID, then crawl the created elements for those UIDs so they can perform the interpolation.
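The UID hack can be sketched, very roughly, like this (the tag name `html` and the marker format are made up for illustration; real libraries are considerably more involved):

```javascript
// Illustrative sketch of the "UID hack": interleave a unique comment
// marker between the static chunks. A real library would then parse the
// resulting markup and walk the tree looking for the marker in order to
// perform the interpolation.
const UID = 'uid-2993';

function html(statics, ...values) {
  // The values are applied later, once the marker positions have been
  // located in the parsed tree; here we only build the marked-up string.
  return { markup: statics.join(`<!--${UID}-->`), values };
}

const { markup, values } = html`<p>${'hello'}</p>`;
// markup === '<p><!--uid-2993--></p>', values === ['hello']
```

A native parser-state API could report "you are in text position inside a p" at the interpolation point, making both the marker and the later tree crawl unnecessary.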

I wonder if something like streamingFragment could provide enough details to avoid the UID hack.

const streamingFragment = document.createStreamingFragment();
const writer = streamingFragment.writable.getWriter();

await writer.write('<p>');
let parserState = await streamingFragment.getParserState();
parserState.currentNode; // paragraph

await writer.write('</p><img src=');
parserState = await streamingFragment.getParserState();

…I guess this last bit is more complicated, but ideally it should know it's in the "before attribute value" state for "src" within tag "img". Ideally there should be a way to get the resulting attribute & element as a promise.

+@justinfagnani @webreflection

inikulin commented 7 years ago

@dominiccooney HTML can have conformance errors, but there are recovery mechanisms for all of them, and user agents don't bail out on errors. So any input can be consumed by the HTML parser without a problem.

I like @jakearchibald's API. However, I wonder whether we need to support a full-document streaming parser, and what its API would look like. Also, with the streaming fragment approach, will it be possible to perform consecutive writes to the fragment (e.g. pipe one response into the fragment and afterwards another one)? If so, how will it behave: overwrite the content of the fragment, or append to the end of it?

inikulin commented 7 years ago

@jakearchibald

I think their code would be simpler if they knew what state the parser was in at a given point.

What do you mean by state here? Parser insertion mode, tokeniser state or something else?

jakearchibald commented 7 years ago

@inikulin

I wonder if we need to support full document streaming parser

Hmm yeah. I'm not sure what the best pattern is to use for that.

will it be possible to perform consecutive writes to the fragment (e.g. pipe one response into the fragment and afterwards another one)? If so, how will it behave: overwrite the content of the fragment, or append to the end of it?

Yeah, you can do this with streams, either with individual writes or by piping with {preventClose: true}. This will follow the same rules as if you mess with elements' content during the initial page load.

As in, if the parser eats:

<p>Hello

…then you:

document.querySelector('p').append(', how are you today?');

…you get:

<p>Hello, how are you today?

…if the parser then receives " everyone", I believe you get:

<p>Hello everyone, how are you today?

…as the parser has a pointer to the first text node of the paragraph.

inikulin commented 7 years ago

@jakearchibald There is a problem with this approach. Suppose we have two streams: one writes <div>Hey and the other writes ya. Usually, when the parser encounters the end of the stream, it finalises the tree; therefore, the result of feeding the first stream to the parser will be <div>Hey</div> (the parser emits the implied end tag here). So when the second stream writes ya, you'll get <div>Hey</div>ya as a result, which is pretty much the same as creating a second fragment and appending it after the first one. On the other hand, we could have an API that explicitly tells the parser to treat the second stream as a continuation of the first one.

WebReflection commented 7 years ago

Thanks @jakearchibald for thinking of us.

I can speak for my 6+ months on the template-literals-vs-DOM pattern, so that you can have as much info as possible about implementations/proposals/APIs, etc.

I'll try to split this post in topics.


Not just a UID

I am not using just a UID, I'm using a comment that contains some UID.

// dumb example
function tag(statics, ...interpolations) {
  const out = [statics[0]];
  for (let i = 1; i < statics.length; i++)
    out.push('<!-- MY UID -->', statics[i]);
  return out.join('');
}

tag`<p>a ${'b'} c</p>`;

This gives me the ability to let the HTML parser split the text content into chunks for me, and to verify that if the nodeType of the <p>'s childNodes[x] is Node.COMMENT_NODE and its textContent is my UID, I'm fine.

The reason I'm using comments, besides letting the browser do the splitting job for me, is that browsers without native HTMLTemplateElement support will discard partial table, col, or option layouts, but they won't discard comments.

var brokenWorkAround = document.createElement('div');
brokenWorkAround.innerHTML = '<td>goodbye TD</td>';
brokenWorkAround.childNodes; // [#text] — the <td> wrapper was dropped
brokenWorkAround.outerHTML;  // '<div>goodbye TD</div>'

You can read about this issue in the webcomponents template polyfill issues: https://github.com/webcomponents/template/issues

In summary, if every browser natively supported the template element, along with the fact that it doesn't ignore any kind of node, the only thing parsers like mine would need is a way to know when the HTML engine encounters a "special node", in my case represented by a comment with special content.

Right now we all need to traverse the whole tree after creating it, in search of special placeholders.

This is fast enough as a one-off operation, and thank gosh template literals are unique so it's easy to perform the traversal only once, but it wouldn't scale to huge documents, especially now that I've learned that for browsers, due to legacy, simply checking nodeType is a hell of a performance nightmare!


Attributes are "doomed"

Now that I've explained the basics for the content, let's talk about attributes.

If you inject a comment as an attribute value and there are no quotes around it, the layout is destroyed.

<nope nopity=<!-- nope -->>nayh</nope>

So, for attributes, having a similar mechanism to define a unique entity/value to be notified about would be ace! Right now the content is sanitized upfront before injection. It works darn well, but it's not ideal as a solution.

More on attributes

If you put a placeholder in attributes you have the following possible issues:

HTML is very forgiving in many parts; attributes are quite the opposite in various scenarios.

In summary, if some mechanism could tell the browser that any attribute with such special content should be ignored, all these problems would disappear.
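A rough sketch of the upfront sanitisation described above (the function name, marker, and regex are illustrative assumptions, not hyperHTML's actual code): inspect the static chunk that precedes an interpolation to decide whether a comment marker is safe there.

```javascript
// If the static chunk before an interpolation ends in attribute-value
// position, a comment placeholder would destroy the layout
// (attr=<!--uid--> is not valid markup), so fall back to a plain value.
const UID = 'uid-placeholder';

function placeholderFor(staticChunk) {
  // Ends with `name=`, `name="`, or `name='` → attribute-value position.
  return /=\s*(["'])?$/.test(staticChunk)
    ? UID                // safe inside an attribute value
    : `<!--${UID}-->`;   // safe in text/content position
}

placeholderFor('<img src='); // → 'uid-placeholder'
placeholderFor('<p>');       // → '<!--uid-placeholder-->'
```

The proposal discussed in this thread would remove the need for this guessing: the parser itself would report that the interpolation point sits in an attribute-value state.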


Backward compatibility

As much as I'd love to have help from the platform itself regarding the template literals pattern, I'm afraid it won't ever land in production until all browsers out there support it (or there is a reliable polyfill for it).

That means that exposing the internal HTML parser through a new API would surely benefit future projects, but it would likely not land in all browsers for 5+ years.

This last point is just my consideration about the effort/results ratio.

Thanks again for helping out regardless.

jakearchibald commented 7 years ago

@inikulin

There is a problem with this approach

I don't think it's a problem. If you use {preventClose: true}, it doesn't encounter "end of stream". So:

await textStream1.pipeTo(streamingFragment.writable, { preventClose: true });
await textStream2.pipeTo(streamingFragment.writable);

The streaming fragment would consume the streams as if they were a single concatenated stream.

await textStream3.pipeTo(streamingFragment.writable);

The above would fail, as the writable has now closed.
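The preventClose behaviour can be demonstrated with plain web streams today (a minimal sketch; the array sink merely stands in for the hypothetical streaming fragment's writable):

```javascript
// Two sources feed one sink as a single logical stream. The first pipe
// uses preventClose so the sink stays open; the second pipe closes it.
const chunks = [];
const sink = new WritableStream({
  write(chunk) { chunks.push(chunk); },
  close() { chunks.push('<closed>'); },
});

const streamOf = text => new ReadableStream({
  start(controller) {
    controller.enqueue(text);
    controller.close();
  },
});

async function demo() {
  await streamOf('<div>Hey').pipeTo(sink, { preventClose: true });
  await streamOf(' ya</div>').pipeTo(sink);
  return chunks;
}
```

A third pipeTo after demo() resolves would reject, since the sink's writable has closed by then.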

WebReflection commented 7 years ago

P.S. Just in case my wishes come true ... what both I and (most likely) Justin would love to have natively exposed is a document.queryRawContent(UID) that would return, in linear order, attributes with such a value, or comment nodes with such content.

<html lang=UID>
<body> Hello <!--UID-->! <p class=UID></p></body>

The JS counterpart would be:

const result = document.queryRawContent(UID);
// result would contain, in document order:
// [
//   the html lang attribute,
//   the comment childNodes[1] of the body,
//   the p class attribute
// ]

Now that, in core, would make my parser a no-brainer (besides the issue with comments and attributes, but upfront RegExps are very good at that and blazing fast).

[edit] It would even work while streaming; actually it'd be even better, since it's one pass for the browser.
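At the string level, the behaviour of the wished-for queryRawContent can be approximated with a regex scan (a userland toy, not the proposed native API, and far less robust than a real parser):

```javascript
// Scan raw markup for attributes whose value is `uid` and comments whose
// content is `uid`, reporting them in document order. Real attribute
// parsing is subtler; this only handles simple, well-formed cases.
function queryRawContent(markup, uid) {
  const results = [];
  const re = new RegExp(
    `([a-zA-Z-]+)=["']?${uid}["']?|<!--\\s*${uid}\\s*-->`, 'g');
  let m;
  while ((m = re.exec(markup)) !== null) {
    results.push(m[1] ? { type: 'attribute', name: m[1] } : { type: 'comment' });
  }
  return results;
}

queryRawContent(
  '<html lang=UID><body> Hello <!--UID-->! <p class=UID></p></body>',
  'UID');
// → [ { type: 'attribute', name: 'lang' },
//     { type: 'comment' },
//     { type: 'attribute', name: 'class' } ]
```

A native, streaming version would indeed be one pass for the browser, with live attribute and comment nodes instead of string offsets.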

WebReflection commented 7 years ago

Also, since I know that for many, code is better than a thousand words, this is the TL;DR version of what hyperHTML does.

function tag(statics, ...interpolations) {
  if (this.statics !== statics) {
    this.statics = statics;
    this.updates = parse.call(this, statics, '<!--WUT-->');
  }
  this.updates(interpolations);
}

function parse(statics, lookFor) {
  const updates = [];
  this.innerHTML = statics.join(lookFor);
  traverse(this, updates, lookFor);
  const update = (value, i) => updates[i](value);
  return interpolations => interpolations.forEach(update);
}

function traverse(node, updates, lookFor) {
  switch (node.nodeType) {
    case Node.ELEMENT_NODE:
      [].forEach.call(node.attributes, attr => {
        if (attr.value === lookFor)
          updates.push(v => attr.value = v);
      });
      [].forEach.call(node.childNodes,
        node => traverse(node, updates, lookFor));
      break;
    case Node.COMMENT_NODE:
      if (`<!--${node.textContent}-->` === lookFor) {
        const text = node.ownerDocument.createTextNode('');
        node.parentNode.replaceChild(text, node);
        updates.push(value => text.textContent = value);
      }
  }
}

const body = tag.bind(document.body);

setInterval(() => {
  body`
  <div class="${'my-class'}">
    <p> It's ${(new Date).toLocaleTimeString()} </p>
  </div>`;
}, 1000);

The slow path is the traverse function; the not-so-cool part is the innerHTML injection (as a regular node, template, or whatever it is) without the ability to intercept, while parsing the string, all placeholders/attributes and act on them accordingly.

OK, I'll let you discuss the rest now :smile:

jakearchibald commented 7 years ago

@WebReflection

I think the UID scanner you're talking about might not be necessary. Consider:

const fragment = whatever`
  <p>${someContent}</p>
  <img src=${someImgSrc}>
`;

Where whatever could do something like this:

async function whatever(strings, ...values) {
  const streamingFragment = document.createStreamingFragment();
  const writer = streamingFragment.writable.getWriter();

  for (const str of strings) {
    // str is:
    // <p>
    // </p> <img src=
    // >
    // (with extra whitespace of course)
    await writer.write(str);
    let parserState = await streamingFragment.getParserState();

    if (parserState.tokenState == 'data') {
      // This is the case for <p>, and >
      await writer.write('<!-- -->');
      parserState.currentTarget.lastChild; // this is the comment you just created.
      // Swap it out for the interpolated value
    }
    else if (parserState.tokenState.includes('attr-value')) {
      // await the creation of this attr node
      parserState.attrNode.then(attr => {
        // Add the interpolated value, or remove it and add an event listener instead etc etc.
      });
    }
  }
}
WebReflection commented 7 years ago

Yes, that might work. As long as these scenarios are allowed:

const fragment = whatever`
  <ul>${...}</ul>
  ${...}
  <p data-a=${....} onclick=${....}>also ${...} and</p>
  <img a=${...} b=${...} src=${someImgSrc}>
  <table><tr>${...}</tr></table>
`;

which looks like it'd be the case.

jakearchibald commented 7 years ago

@WebReflection Interpolation should be allowed anywhere.

whatever`
  <${'img'} src="hi">
`;

In the above case tokenState would be "tag-open" or similar. At this point you could either throw a helpful error, or just pass the interpolated value through.

inikulin commented 7 years ago

@jakearchibald Do you expect tokenState to be one of the tokeniser states defined in https://html.spec.whatwg.org/multipage/parsing.html#tokenization? If so, I'm afraid we can't do that; they are part of the parser's internals and are subject to change. Moreover, some of them can be meaningless to a user.

jakearchibald commented 7 years ago

@inikulin yeah, that's what I was hoping to expose, or something equivalent. Why can't we expose it?

WebReflection commented 7 years ago

@jakearchibald

What about the following?

whatever`
  <${'button'} ${'disabled'}>
`;

I actually don't mind having that be possible, because boolean attributes need boolean values, so ${obj.disabled ? 'disabled' : ''} doesn't look like a great option to me; but I'd be curious to know whether "attribute-name" would be exposed too.

Anyway, having my example covered would be already awesome.

jakearchibald commented 7 years ago

@WebReflection The tokeniser calls that the "Before attribute name state", so if we could expose that, it'd be possible.
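In the meantime, the coarse subset of states being discussed here can be approximated in userland by inspecting the static chunk that precedes an interpolation (the function name and state strings are invented for this sketch; a native parserState would replace this kind of guessing):

```javascript
// Rough classification of "where is the parser?" after consuming the
// static chunk before an interpolation. Covers only the states mentioned
// in this thread; a real tokeniser has many more.
function coarseState(staticChunk) {
  if (/<[a-zA-Z][^>]*=\s*(["'])?$/.test(staticChunk)) return 'attr-value';
  if (/<\s*$/.test(staticChunk)) return 'tag-open';
  if (/<[a-zA-Z][^>]*\s$/.test(staticChunk)) return 'before-attr-name';
  return 'data';
}

coarseState('<p>');       // → 'data'
coarseState('<img src='); // → 'attr-value'
coarseState('<');         // → 'tag-open'
coarseState('<button ');  // → 'before-attr-name'
```

This is exactly the fragile string-sniffing that exposing real parser state would make unnecessary.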

WebReflection commented 7 years ago

Not sure whether this is just extra noise or something valuable, but if it can simplify anything: viperHTML uses a similar mechanism to parse once on the Node.js side.

The parser is the pretty awesome htmlparser2.

Probably inspiring as an API? I use the comment trick there too, but since there is a .write mechanism, I believe it could be made incremental.

inikulin commented 7 years ago

@jakearchibald These states are part of the parser's intrinsic mechanism and are subject to change; we've even removed/introduced a few recently just to fix a conformance-error-related bug in the parser. So exposing them to end users would require us to freeze the current list of states, which would significantly complicate further development of the parser spec. Moreover, I believe some of them would be quite confusing for end users, e.g. the "Comment less-than sign bang dash dash state".

WebReflection commented 7 years ago

@inikulin Would a subset be reasonable? As an example, data and attr-value would already cover 100% of hyperHTML's use cases, and I believe those two will never change in the history of HTML ... right?

jakearchibald commented 7 years ago

I'm keen on exposing some parser state to help libraries, but I'm happy for us to add it later rather than block streaming parsing on it.

inikulin commented 7 years ago

@WebReflection Yes, that could be a solution. But I have some use cases in mind that could be confusing for the end user. Consider <div data-foo="bar". We'll emit the attr-value state in that case; however, this markup will not produce an attribute in the tree (it will not even produce a tag, since unclosed tags at the end of the input stream are dropped).

WebReflection commented 7 years ago

@inikulin If someone writes broken HTML, I don't expect anything other than throwing errors and breaking everything right away (when using a new parser API).

Template literals are static; there's no way one of them would fail the parser only sometimes ... it either works or fails forever, since these are also frozen arrays.

Accordingly, I understand this API is not necessarily for template literals only, but if the streamer goes bananas due to wrong input, it's the developer's fault.

Today it's the developer's fault regardless, but they'll never notice, due to the silent failure.

inikulin commented 7 years ago

If someone writes broken HTML, I don't expect anything other than throwing errors and breaking everything right away.

You would be surprised looking at real-world markup around the web. Also, there is no such thing as "broken markup" anymore. There is non-conforming markup, but a modern HTML parser can swallow anything. So, to conclude: you suggest bailing out with an error in case of invalid markup in this new streaming API?

WebReflection commented 7 years ago

You will be surprised looking at the real world markup around the web.

you missed the edit: when using a new parser API

So, to conclude, you suggest to bail out with an error in case of invalid markup in this new streaming API?

If the alternative is to not have it, yes please.

I'm tired of missed opportunities due to lazy developers that need to be coddled by standards for their mistakes.

inikulin commented 7 years ago

If the alternative is to not have it, yes please.

I'm tired of missed opportunities due to lazy developers that need to be coddled by standards for their mistakes.

I'm not keen on this approach, to be honest; it brings us back to the times of XHTML. One of the advantages of HTML5 was its flexibility regarding parse errors and, hence, document authoring.

WebReflection commented 7 years ago

This API's goal is different, and developers want to know if they wrote a broken template.

Not knowing hurts them, and since there is no HTML highlighting by default inside strings, it's also a safety belt for them.

So throw, like any failed asynchronous operation would throw, and let them decide whether they want to fall back to innerHTML or fix that template literal instead, and forever.

WebReflection commented 7 years ago

To be more explicit, nobody on earth would write the following or, if they do by accident, nobody wants that to succeed.

template`<div data-foo="bar"`;

so why is that a concern?

In JavaScript, something similar would be a SyntaxError, and it would break everything.

jakearchibald commented 7 years ago

@inikulin

Consider <div data-foo="bar". We'll emit the attr-value state in that case; however, this markup will not produce an attribute in the tree (it will not even produce a tag, since unclosed tags at the end of the input stream are dropped).

FWIW this would be fine in an API like my example above. The promise that returns the currently-in-progress element/attribute would reject in this case, but the stream would still write successfully.

I agree that a radically different parsing style would be bad. I'd prefer it to be closer to the regular document parser than innerHTML.

inikulin commented 7 years ago

I agree that a radically different parsing style would be bad. I'd prefer it to be closer to the regular document parser than innerHTML.

@jakearchibald They are pretty much the same, with the exception that for innerHTML the parser adjusts its state according to the context element before parsing.

jakearchibald commented 7 years ago

@inikulin innerHTML behaves differently regarding script elements. I hope we could avoid those differences with this API.

inikulin commented 7 years ago

@WebReflection

this API goal is different, and developers want to know if they wrote a broken template. To be more explicit, nobody on earth would write the following, and nobody wants that to succeed.

This would be somewhat true if templates were the only use case for this API. What if I want to fetch some arbitrary content provided by a 3rd party? E.g. user-supplied comments or something else?

WebReflection commented 7 years ago

What if I want to fetch some arbitrary content provided by 3rd party? E.g. user-supplied comments or something else?

What about it? You'll never have partial content, just the whole content. Or are you thinking about runtime evaluation of some user content that puts a ${value} inside a comment?

In the latter case, I don't see a realistic scenario. In the "just parse/stream it all" case I don't see any issue; you'll never have a token in the first place.

Anyway, if it's about missed notifications due to silent failures and internal adjustments, I'm also OK with that. It'll heavily punish developers that don't test their templates, and I'm fine with that too.

inikulin commented 7 years ago

@WebReflection To be clear, we are not talking about partial content only. There are many other cases where you can get non-conforming markup.

WebReflection commented 7 years ago

@inikulin I honestly see your argument as being like fetch(randomThing).then(b => b.text()).then(eval), which I fail to see as an ever-desired use case.

But like I've said, I wouldn't care if the silent failure/adjustment happens. I'm fine with the parser never breaking; it'll be somebody else's problem, as long as the parser can exist, exposing what it can, when it can, which would cover 99% of the desired use cases for me.

Is this possible? Or is this a won't-fix/won't-implement?

This is the bit I'm not sure I understand from your answers. I read about potential limits, but no proposed alternatives/solutions.

domenic commented 7 years ago

To be clear, we're not interested in introducing a new, third parser (besides the HTML and XML ones) that only accepts conforming content.

WebReflection commented 7 years ago

XML already accepts only conforming content, and I believe this parser would need to be compatible with SVG too.

However, like I've said, it works for me either way.

TL;DR can this parser expose data and attr-value tokens/states whenever these are valid?

If so, great, that solves everything.

All other cases are (IMO) irrelevant, but not having this because of a possible lack of tokens in broken layouts would be a hugely missed opportunity for the Web.

I hope I've also made my point of view clear.

jakearchibald commented 7 years ago

Here are some requirements which I think sum up what's been discussed so far:

inikulin commented 7 years ago

@jakearchibald BTW, regarding script execution: maybe we could make it optional? For example, if I parse HTML from some untrusted source, it would be nice to be able to prevent script execution for the parsed fragment.

jakearchibald commented 7 years ago

@inikulin I fear that may be a false sense of security. Although innerHTML doesn't download/execute script elements, it doesn't block attributes that are executed later (e.g. onclick attributes).

Seems safer to defer to existing methods that control script download & execution, like CSP and sandbox.

inikulin commented 7 years ago

@jakearchibald Thinking about it a bit more, I wonder how the fragment approach is supposed to work, considering that when you append a fragment into a node, its children are adopted by the new parent node: https://dom.spec.whatwg.org/#concept-node-insert. So if we insert the fragment while content is still being piped into it, how should it behave? Make the parent node the receiver of all subsequent HTML content? In that case we'd need machinery to pipe HTML content into an element. In that regard, it would make more sense to implement the streaming parser API for elements and document fragments without introducing a new node type (something like element.writable and fragment.writable).

jakearchibald commented 7 years ago

@inikulin In terms of adopting, how does https://jakearchibald.com/2016/fun-hacks-faster-content/#using-iframes-and-documentwrite-to-improve-performance work?

I don't like element.writable as it doesn't really fit with how writables can only be written to once. That's how I ended up with a special streaming fragment. It may be the same node type as a regular fragment though.

inikulin commented 7 years ago

Hmm, it's a bit confusing that the fragment becomes some kind of proxy entity for piping HTML into an element, considering that new nodes will not appear in the fragment. But maybe it's just my perception...

jakearchibald commented 7 years ago

They'll appear in the fragment until the fragment is appended to an element.

It's no stranger than https://jakearchibald.com/2016/fun-hacks-faster-content/#using-iframes-and-documentwrite-to-improve-performance, but I guess that's pretty strange.

wycats commented 7 years ago

I very much share the goal of being able to streaming-parse HTML without blocking the main thread.

This goal is pretty connected to some of the goals I had in the DOMChangeList proposal (specifically the DOMTreeConstruction part of that proposal). Here's a sketch of how we could enhance that proposal to support these goals:

DOMTreeConstruction is already intended to provide a low-level API that can be used in a worker and transferred from a worker to the UI thread (without having to deal with the thorny questions of making a transferrable DOM available in workers). That makes it a nice fit for async parsing and possibly even streaming parsing.

This thread is really about a missing piece of the other proposal: DOMChangeList provides a way to go from operations to actual DOM, but it doesn't provide a compliant way of going from HTML to operations. If we added one, we could break up the entire processing pipeline and do arbitrary parts of the process in workers (anything up to putting the operations into the real DOM).

wycats commented 7 years ago

As an unrelated aside, I would find it very helpful to have an API that provided a stream of tokenizer events that could be intercepted on the way to the parser. That would allow Glimmer to implement the {{ extension in user space ({{ text isn't legal in all of the places where you would want it to be meaningful, and has different meanings in text positions vs. attributes). Today we are forced to require a bundler for HTML, but I would love to be able to use more of the browser's infrastructure instead.

wycats commented 7 years ago

@domenic said:

To be clear, we're not interested in introducing a new, third parser (besides the HTML and XML ones) that only accepts conforming content.

Doesn't the existing HTML parser spec specifically describe a mode that aborts with an exception on the first error?

For non-streaming content, it would probably be sufficient just to expose whether an error had occurred at all (and then userspace could throw away the tree). For streaming content, it might also be sufficient (userspace could "roll back" by deleting any nodes that were already inserted?)

justinfagnani commented 7 years ago

Wow, long thread is long. I had a busy morning, so I'll try to hit two points I caught just now.

  1. Async API: This would make it difficult to use this API in many scenarios. Right now, when you create and attach an element, you may expect that the element has rendered synchronously. With an async parser API, if the element has to parse its template to render, that breaks. In essence, using an async parser API would be similar to using <template> today, but with asyncAppend instead of append. Lots of code would get more complex as element state itself becomes async and we don't have a standard way of waiting for an element to be "ready".

    Of course, if we had top-level await, we could hide that async API behind module initialization.

  2. Being able to get parser state while parsing fragments would be awesome, but in order to avoid inserting sentinels altogether, we'd need a few more features, like

    1. Get a reference to the under-construction element, or previously constructed text node.
    2. Prevent collapsing consecutive text nodes, i.e. if we parse <div>abc and then def</div>, we'd need a way to get references to abc and def and not collapse them into a single node.

But stepping back, the real API I want is to be able to create a tree of DOM and cheaply get references to particular nodes. https://github.com/whatwg/html/issues/2254 (at least the "Template Parts" idea in there) would solve my use case completely.

Another thing that would help is a variation on the TreeWalker API that didn't return a node from nextNode() so that I could navigate to a node without creating wrappers for all preceding nodes.

WebReflection commented 7 years ago

@justinfagnani I think @jakearchibald already solved your points 1 and 2: https://github.com/whatwg/html/issues/2993#issuecomment-326552132

You can write a comment and retrieve it right away as your placeholder, so that you'd have abc, then your content, and later on whatever comes next, including def, and eventually another data position where you can add another comment:

`<div>a ${'b'} c ${'but also d'} e</div>`