XML vs HTML parser (Item 10)

zbraniecki commented 5 years ago

In our list of new features, we wanted to switch from HTML parser to XML parser for DOM Overlays in order to allow for self-closing elements.

This makes both code and especially localization a bit easier to read:

key1 = Click on <a data-l10n-name="link1"></a> to go to <a data-l10n-name="link2"></a>.

vs.

key1 = Click on <a data-l10n-name="link1"/> to go to <a data-l10n-name="link2"/>.

The scope of the problem is debatable. On one hand, HTML curates a list of void-elements which seems to be specifically tailored for elements that likely should not have text content value.

This overlaps pretty well with Fluent scenarios where an element that shouldn't have content value, like <img/> can in HTML parser be self-closed, while an element that will rarely go without a value, such as <a></a> cannot. So, while the example above may look tempting, we should remember that in most cases the links2 will have text content in localization and won't benefit from being self-closed.

On the other hand, a developer would likely prefer to write:

<p data-l10n-id="key1">
  <a href="https://www.mozilla.org"/>
  <a href="https://www.firefox.com"/>
</p>

over

<p data-l10n-id="key1">
  <a href="https://www.mozilla.org"></a>
  <a href="https://www.firefox.com"></a>
</p>

and what's worse, it's a pretty tricky bug to notice because it implicitly and without any error turns:

<p data-l10n-id="key1">
  <a href="https://www.mozilla.org"/>
  <img/>
</p>

into:

<p data-l10n-id="key1">
  <a href="https://www.mozilla.org">
    <img/>
  </a>
</p>

which then, overlayed with Fluent may take time for the developer to find out the cause of weird visual output of such DOM.

The other issue is that as we get more of React and Custom Elements, we start encountering cases where a custom element or react element is meant to not have content, but the HTML parser doesn't allow it to be self-closing, which causes the same papercut as above.

Unfortunately, switching away from HTML to XML/XHTML parsing comes at a cost. While XML/XHTML allows all tags to be self-closed, it forbids attributes without values, and attribute values without quotes making those two errors:

<p data-l10n-id="key1">
  <a data-l10n-opaque></a>
  <a data-l10n-pos=2></a>
</p>

Independently of if we'll want to support either of those features, we can argue that developers got used to being able to add attributes such as hidden or disabled on elements without their value, and placing simple attribute values such as class without quotes. Both of them result in hard parsing error without any form of recovery.

This issue is meant to help us make a decision on this.

gijsk commented 5 years ago

I'm fine with the current parsing.

I think switching to XHTML will bring back issues when localized content cannot be parsed (a la yellow screen of death). What would we do in that case?

Pike commented 5 years ago

If we'd follow some thoughts from projectfluent/fluent/issues/237, we might tokenize markup explicitly without relying on an XML or HTML parser, and we had full control over the behavior and error scenarios.

But yeah, we shouldn't be going back to our YSOD parsing errors. Even if they'd just kill a single string and force language fallback.

zbraniecki commented 5 years ago

I propose we resolve this issue as WONTFIX and stick to HTML parser.

zbraniecki / fluent-domoverlays-js

XML vs HTML parser (Item 10) #4