wilsonzlin / minify-html

Extremely fast and smart HTML + JS + CSS minifier, available for Rust, Deno, Java, Node.js, Python, Ruby, and WASM
MIT License

Escaped `<` characters (`&lt;`) are processed incorrectly #191

Open chrispy-snps opened 1 month ago

chrispy-snps commented 1 month ago

This is a more specific follow-up to #182.

When the &lt; escape sequence is processed, it is incorrectly converted to &LT (uppercase, with no trailing semicolon) instead of being kept as-is:

>>> import minify_html_onepass
>>> print(minify_html_onepass.minify("&lt;"))
<

>>> print(minify_html_onepass.minify("&lt;faketag"))
&LTfaketag

>>> print(minify_html_onepass.minify("&lt;faketag&gt;"))
&LTfaketag>

Strangely, a bare &lt; by itself is processed correctly. It is only when followed by content that it breaks.

The issue occurs in both minify_html and minify_html_onepass.

We are able to work around it as follows:

html = html.replace("&lt;", "AMP_LT_WORKAROUND")
html_minified = minify_html.minify(html)
html_minified = html_minified.replace("AMP_LT_WORKAROUND", "&lt;")

but a proper fix would be better (and more efficient, as we process tens of thousands of HTML files at a time).
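
For anyone applying the same workaround across many files, here is a minimal sketch of it as a reusable helper. The function name `minify_preserving_lt` and the random placeholder scheme are illustrative assumptions, not part of minify-html; the minifier is passed in as a callable so the helper does not depend on a particular module.

```python
import uuid
from typing import Callable


def minify_preserving_lt(html: str, minify: Callable[[str], str]) -> str:
    """Hypothetical helper: shield &lt; from the minifier with a placeholder."""
    # Choose a placeholder that does not already occur in the input,
    # so restoring it afterwards cannot touch unrelated text.
    placeholder = f"LT{uuid.uuid4().hex}"
    while placeholder in html:
        placeholder = f"LT{uuid.uuid4().hex}"

    protected = html.replace("&lt;", placeholder)
    minified = minify(protected)
    # Restore on the minified output, not on the original input.
    return minified.replace(placeholder, "&lt;")
```

It would be called as `minify_preserving_lt(html, minify_html.minify)`. The placeholder is plain alphanumeric text, so the minifier may treat surrounding whitespace differently than it would next to a real `&lt;`; this is still a stopgap rather than a fix.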

codingjerk commented 5 days ago

Hi @chrispy-snps, thank you for the workaround