whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

"in foreign content" is missing from list of insertion modes #4015

Open zcorpan opened 5 years ago

zcorpan commented 5 years ago

https://html.spec.whatwg.org/multipage/parsing.html#the-insertion-mode

Initially, the insertion mode is "initial". It can change to "before html", "before head", "in head", "in head noscript", "after head", "in body", "text", "in table", "in table text", "in caption", "in column group", "in table body", "in row", "in cell", "in select", "in select in table", "in template", "after body", "in frameset", "after frameset", "after after body", and "after after frameset" during the course of the parsing, as described in the tree construction stage. The insertion mode affects how tokens are processed and whether CDATA sections are supported.

https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inforeign is not mentioned, although it is where CDATA sections are supported. If in foreign content is not an insertion mode, then the insertion mode doesn't affect whether CDATA sections are supported...

annevk commented 6 months ago

I looked into this as part of https://bugs.webkit.org/show_bug.cgi?id=189431.

  1. I don't think it's an insertion mode. The note is incorrect. Instead it's a set of rules that https://html.spec.whatwg.org/multipage/parsing.html#tree-construction-dispatcher uses, but the underlying insertion mode remains unchanged. E.g., typically it will be "in body".
  2. Whether a CDATA token is supported hinges on this:

If there is an adjusted current node and it is not an element in the HTML namespace, then switch to the CDATA section state.

As per the WebKit bug, Chromium and WebKit additionally require that the parser not be in an integration point.
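
To make the difference concrete, here is a minimal sketch of the two conditions. It is not taken from any engine: the `Element` class, the `is_integration_point` flag, and both function names are hypothetical. The second function is only one reading of the sentence above about Chromium and WebKit.

```python
from dataclasses import dataclass
from typing import Optional

HTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for a node on the stack of open elements."""
    namespace: str
    local_name: str
    # Assumption: the caller has already decided whether this element is an
    # integration point (e.g. SVG title, desc, foreignObject).
    is_integration_point: bool = False


def cdata_allowed_per_spec(adjusted_current_node: Optional[Element]) -> bool:
    """The quoted spec condition: a non-HTML adjusted current node exists."""
    return (
        adjusted_current_node is not None
        and adjusted_current_node.namespace != HTML_NS
    )


def cdata_allowed_with_integration_point_check(
    adjusted_current_node: Optional[Element],
) -> bool:
    """Spec condition plus the extra check described for Chromium and WebKit."""
    return (
        cdata_allowed_per_spec(adjusted_current_node)
        and not adjusted_current_node.is_integration_point
    )


svg_title = Element(SVG_NS, "title", is_integration_point=True)
print(cdata_allowed_per_spec(svg_title))                      # True
print(cdata_allowed_with_integration_point_check(svg_title))  # False
```

Under these assumptions the two conditions disagree only when the adjusted current node is a foreign element that is also an integration point, which is the `<svg><title>` case.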

Now the interesting test is the combination of CDATA and U+0000 in this state.

http://software.hixie.ch/utilities/js/live-dom-viewer/?%3Cscript%3Edocument.write(%22%3Csvg%3E%3Ctitle%3E%3C!%5BCDATA%5B%5Cu0000fdsf%5D%5D%3E%22)%3B%3C%2Fscript%3E

Gecko again appears to follow the specification and drops U+0000, as "in body" requires, but is that what we want? We should probably handle U+0000 explicitly when tokenizing CDATA sections so that it does not get lost.
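
A rough sketch of why the character disappears, assuming the usual tree construction rules: "in body" ignores a U+0000 character token, while the rules for foreign content replace it with U+FFFD. The function names are hypothetical, and the real dispatcher checks more conditions than the single boolean used here.

```python
def in_body_characters(data: str) -> str:
    """'in body': a U+0000 character token is a parse error and is ignored."""
    return data.replace("\u0000", "")


def foreign_content_characters(data: str) -> str:
    """Rules for foreign content: U+0000 is replaced with U+FFFD."""
    return data.replace("\u0000", "\ufffd")


def dispatch_characters(data: str, in_html_integration_point: bool) -> str:
    """Simplified dispatcher: only models the integration point check.

    Assumption: the adjusted current node is a foreign element. Inside an
    HTML integration point, character tokens go to the current insertion
    mode (taken to be "in body" here); otherwise they go to the rules for
    foreign content.
    """
    if in_html_integration_point:
        return in_body_characters(data)
    return foreign_content_characters(data)


cdata_text = "\u0000fdsf"  # contents of the CDATA section in the test above

# <svg><title> is an HTML integration point, so "in body" handles the text.
print(repr(dispatch_characters(cdata_text, in_html_integration_point=True)))
# The same text directly inside <svg> would get U+FFFD instead.
print(repr(dispatch_characters(cdata_text, in_html_integration_point=False)))
```

In this model the tokenizer passes U+0000 through unchanged inside the CDATA section, so whether it survives depends entirely on which set of rules the dispatcher picks.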

cc @whatwg/html-parser

hsivonen commented 6 months ago

"in foreign content" indeed isn't an insertion mode.

My understanding of why CDATA sections are allowed as children of HTML integration points is that SVG desc and title, as well as MathML annotation-xml, could legitimately contain non-HTML content, so supporting CDATA sections maximizes compatibility with copy-paste from XML.

annevk commented 6 months ago

What do you think about the U+0000 case?