When a self-closing tag is processed (such as <p/>), the output is an incorrectly unclosed tag (such as <p>). This causes significant structural issues when the content is read back in.
Self-closing <p/> elements pose a similar issue. While many browsers will force-close adjacent unclosed <p> elements due to their block-element-ness, many parsers (such as lxml) do not, and a similar cascade of misclosed <p> tags occurs there too.
We are able to work around it as follows:
import re

# Expand every self-closing tag, e.g. <p/> becomes <p></p>.
html = re.sub(
    r"<([^\s>]+)([^>]*)/>",
    r"<\1\2></\1>",
    html,
    flags=re.DOTALL,
)
but a proper fix would be better (and more efficient, since we process tens of thousands of HTML files at a time). Either self-closing tags should stay self-closed by default (it costs only one more character), or they should be preserved when keep_closing_tags==True (for use with a downstream parser that expects predominantly well-formed HTML).
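One caveat with the regex workaround above: it also rewrites void elements, so <br/> becomes <br></br>, and HTML5 parsers typically treat a stray </br> as a second <br>. A minimal sketch of a variant that leaves void elements alone follows. The names expand_self_closing and VOID_ELEMENTS are illustrative only and not part of any library, and like the original regex it assumes attribute values never contain ">".

```python
import re

# HTML void elements: these never take an end tag, so they are left untouched.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Same pattern as the workaround above: tag name, then everything up to "/>".
# Compiled once so it can be reused across many files.
_SELF_CLOSING = re.compile(r"<([^\s>]+)([^>]*)/>")


def expand_self_closing(html: str) -> str:
    """Rewrite <tag .../> as <tag ...></tag>, skipping void elements."""

    def _expand(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)  # e.g. <br/> is returned unchanged
        return f"<{name}{attrs}></{name}>"

    return _SELF_CLOSING.sub(_expand, html)
```

It would be applied to the input the same way as the original workaround, for example `html = expand_self_closing(html)`, before the content is handed to the processing step.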
This is a more specific follow-up to #181.
For example, the following code:
results in the following HTML (added linefeeds are mine):
which is interpreted by a browser (Firefox) as follows: