whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Add a "modern" parsing API #2993

Closed dominiccooney closed 3 years ago

dominiccooney commented 7 years ago

TL;DR: HTML should provide an API for parsing. Why? "Textual" HTML is a widely used syntax. HTML parsing is complex enough that authors want to use the browser's parser, and browsers can apply implementation tricks to how they create elements, etc.

Unfortunately, the way the HTML parser is exposed in the web platform is a hodgepodge. Streaming parsing is only available for main document loads; everything else relies on strings, which put pressure on memory. innerHTML is synchronous and could cause jank for large documents (although I would like to see data on this, because it is pretty fast).

Here are some strawman requirements:

Commentary:

One big question is when this API exposes the tree it is operating on. Main document parsing does expose the tree and handles mutations to it pretty happily; innerHTML parsing does not until the nodes are adopted into the target node's document (which may start running custom element stuff.)

One minor question is what to do with errors.

Being asynchronous has implications for documents and/or custom elements. If you allow creating stuff in the main document, then you have to run the custom element constructors at some point, so to avoid jank you probably can't run them all together. This is probably a feature worth addressing.

See also:

Issue 2827

RReverser commented 5 years ago

Actually, never mind. I realised that half of this old thread was already about the "delivery to the renderer" problem and not actual parsing. That is useful too, but it seems confusing to mix both in the same discussion.

dead-claudia commented 3 years ago

So I have an idea: how about something like this on HTMLElement: Promise<void> replaceChildrenWithHTML((ReadableStream or DOMString) stream)? This would lock the children list (attempts to read the children fail with an error) and return a promise resolved once it's unlocked and ready to manipulate again. This in effect would be an asynchronous elem.innerHTML = ..., and would be easy to make efficient with background DOM parsing. Note that the browser can append elements at any time, and while you can't manipulate elements themselves, addition can still be detected by properties like outerHeight. (This is so they can pipeline it - it makes for a better user experience.)
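
The locking behaviour described above can be sketched roughly as follows. This is only a model, not a real DOM API: `ModelElement` and its `content` getter are hypothetical stand-ins for `HTMLElement` and its children list, an `AsyncIterable<string>` stands in for the `ReadableStream`, and no actual HTML parsing happens. As a simplification, the sketch commits the new content only once the stream ends, whereas the proposal lets the browser append elements at any time.

```typescript
// Hypothetical model of the proposed replaceChildrenWithHTML(); not a real DOM API.
// ModelElement stands in for HTMLElement so the sketch runs without a browser.
class ModelElement {
  private locked = false;
  private html = "";

  // Reading the "children" fails while a replacement is in flight.
  get content(): string {
    if (this.locked) {
      throw new Error("children are locked while parsing is in progress");
    }
    return this.html;
  }

  // Accepts either a whole string or an async sequence of string chunks.
  // The returned promise settles once the children are unlocked again.
  async replaceChildrenWithHTML(
    source: string | AsyncIterable<string>
  ): Promise<void> {
    if (this.locked) {
      throw new Error("a replacement is already in progress");
    }
    this.locked = true;
    let next = "";
    try {
      // Treat a plain string as a single chunk so both inputs take the same path.
      const chunks = typeof source === "string" ? [source] : source;
      for await (const chunk of chunks) {
        next += chunk;
      }
      // Simplification: commit everything at the end instead of incrementally.
      this.html = next;
    } finally {
      // Unlock even if the stream fails; the old content is kept in that case.
      this.locked = false;
    }
  }
}
```

With this model, `await el.replaceChildrenWithHTML(chunks)` behaves like an asynchronous `el.innerHTML = ...`: reading `el.content` before the promise settles throws, and afterwards it returns the concatenated chunks.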

As for why a generic readable stream: such an API could be immensely useful not just for things like displaying Markdown documents from the server, but also for things like displaying large CI logs and large files. In more advanced cases, a developer might choose to use the scroll position plus the current outer height to determine a threshold to render more items, simply buffering the logs until they're ready to render them. (I could totally see a browser doing this for displayed text files whose sizes are over 100MB. They might even choose to buffer the rest to disk to save memory and just read from there after they've received everything from the network, only pre-loading things that are remotely close to the viewport.)
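
The threshold idea for large logs could look something like the sketch below. Everything here is an assumption made for illustration: `BufferedLogRenderer`, `receive`, and `onScroll` are hypothetical names, the threshold is an arbitrary pixel count, and "rendering" is modelled as moving a chunk from one array to another rather than touching the DOM.

```typescript
// Hypothetical helper: holds received log chunks and releases them only when
// the reader has scrolled close to the end of what is already rendered.
class BufferedLogRenderer {
  private pending: string[] = [];
  private readonly thresholdPx: number;
  readonly rendered: string[] = [];

  constructor(thresholdPx: number) {
    this.thresholdPx = thresholdPx;
  }

  // Called as chunks arrive from the network; nothing is rendered yet.
  receive(chunk: string): void {
    this.pending.push(chunk);
  }

  // Called on scroll. `scrollBottom` is the scroll position plus the viewport
  // height; `contentHeight` is the current height of the rendered content.
  // Releases one buffered chunk if the viewport is within the threshold of the
  // end, and reports whether anything was released.
  onScroll(scrollBottom: number, contentHeight: number): boolean {
    const nearEnd = contentHeight - scrollBottom <= this.thresholdPx;
    const chunk = nearEnd ? this.pending.shift() : undefined;
    if (chunk === undefined) {
      return false;
    }
    this.rendered.push(chunk);
    return true;
  }
}
```

In a page, the two numbers might plausibly come from a scroll container's `scrollTop + clientHeight` and `scrollHeight`, with each released chunk handed to whatever parsing API is available.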


I'm aware of how old this bug is. I still want to resurrect it.

annevk commented 3 years ago

Let's close this issue. This is probably best started in https://wicg.io/ or a personal repository before it reaches a point where it can be more seriously considered.