whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

HTML needs a mechanism for extending the parsing algorithm #8114

Open plinss opened 2 years ago

plinss commented 2 years ago

The HTML parsing algorithm is supposed to allow generating a consistent DOM from any given HTML input. However, as new elements are added, changes are made to the parsing algorithm from time to time, e.g. adding new elements as flow content.

While this is ultimately convenient for authors, it results in a different DOM structure in older clients.

HTML should have a mechanism (ideally declarative) for expressing parsing behavior so that older clients can produce the correct DOM when handling new content. This would also allow web component authors to opt in to the same kinds of authoring improvements.

domenic commented 2 years ago

This is just XHTML, right?

plinss commented 2 years ago

In theory, linking to a formal schema document could satisfy this, but it doesn't need such a heavyweight solution.

One possibility could be a meta tag that describes a single element's parsing behavior, another could be a micro-syntax within the element's open tag (like maybe a sigil just before the >).

While XHTML had its issues, it did offer some flexibility which we lost. We traded that flexibility for authoring simplicity and a parser algorithm that was supposed to be invariant. That invariance has been broken several times, and likely will be again. Let's try to find a better solution that allows the flexibility to innovate while not breaking code.

annevk commented 2 years ago

While in theory this seems interesting, in practice I haven't seen a proposal for this that maintains all the good qualities of HTML syntax. Meta-syntax is just not very ergonomic (or internally consistent, at this point) and also introduces its own set of risks.

hsivonen commented 2 years ago

Having site-supplied declarations that affect parsing would cause parsing actions at a distance that would be hard to connect in a sensible way to all entry points to parsing. (It seems unlikely that a meta would travel into fragment parsing invocations, for example.)

Moreover, a new solution introduced now would only work prospectively, the next time we come across this problem. It wouldn't solve the issue at hand relative to implementations of the current parsing algorithm already out there.

However, if a site is willing to take extra steps to accommodate already-deployed implementations, we already have syntax for that: using explicit end tags, i.e. not omitting any end tags (</p> in particular) that the spec says are permissible to omit. This can even be automated on the server side by parsing (with an up-to-date implementation) and immediately reserializing.
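
To make the explicit-end-tag workaround concrete, here is a minimal sketch in Python. It is a toy, not a conforming HTML parser: it tracks a single open <p> instead of a full stack of open elements, drops comments and doctypes, and ignores every other implied-end-tag rule. The names OLD_P_CLOSERS, NEW_P_CLOSERS, ExplicitEndTagSerializer and reserialize are hypothetical, and the two rule sets are simplified assumptions standing in for "a parser deployed before <dialog> was added to the tags that close an open <p>" and "an up-to-date parser".

```python
from html import escape
from html.parser import HTMLParser

# Assumed, heavily simplified rule sets: start tags that implicitly close an open <p>.
OLD_P_CLOSERS = {"div", "p", "ul", "section"}
NEW_P_CLOSERS = OLD_P_CLOSERS | {"dialog"}


class ExplicitEndTagSerializer(HTMLParser):
    """Toy reserializer that writes out the </p> tags a rule set implies."""

    def __init__(self, p_closers):
        super().__init__(convert_charrefs=True)
        self.p_closers = p_closers
        self.out = []
        self.open_p = False  # toy model: one open <p>, no element stack

    def handle_starttag(self, tag, attrs):
        if self.open_p and tag in self.p_closers:
            self.out.append("</p>")  # make the implied end tag explicit
            self.open_p = False
        if tag == "p":
            self.open_p = True
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # In HTML, "/>" does not close non-void elements, so treat it as a start tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "p":
            self.open_p = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser, so escape again.
        self.out.append(escape(data, quote=False))


def reserialize(markup, p_closers):
    """Parse markup under the given rule set and emit it with explicit </p> tags."""
    serializer = ExplicitEndTagSerializer(p_closers)
    serializer.feed(markup)
    serializer.close()  # flush any buffered trailing text
    if serializer.open_p:
        serializer.out.append("</p>")  # a <p> still open at end of input
    return "".join(serializer.out)


implicit = "<p>Hi<dialog>x</dialog>"
print(reserialize(implicit, OLD_P_CLOSERS))  # <p>Hi<dialog>x</dialog></p>
print(reserialize(implicit, NEW_P_CLOSERS))  # <p>Hi</p><dialog>x</dialog>

# Reserialized with the newer rules, the markup no longer depends on the rule set.
explicit = reserialize(implicit, NEW_P_CLOSERS)
print(reserialize(explicit, OLD_P_CLOSERS) == reserialize(explicit, NEW_P_CLOSERS))  # True
```

With the end tag omitted, the two rule sets disagree about whether the dialog sits inside the paragraph. Once the server has reserialized with the newer rules, both rule sets read the markup the same way. A real deployment would use a spec-conforming parser and serializer for this step rather than the toy above.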