wilsonzlin / minify-html

Extremely fast and smart HTML + JS + CSS minifier, available for Rust, Deno, Java, Node.js, Python, Ruby, and WASM
MIT License
848 stars 36 forks source link

whitespaces between tags in `<div>` should be collapsed instead of being cleaned #76

Open Rongronggg9 opened 2 years ago

Rongronggg9 commented 2 years ago

whitespaces between tags in <div> should be collapsed instead of being cleaned

input:

<div>
    <span>blablabla</span>
    <div>
        <span>blablabla</span>
        <b>Bold</b> <i>Italic</i> <u>Underlined</u>
        <a href="https://example.com">URL1</a>
        <a href="https://example.com">URL2</a>
        <a href="https://example.com">URL3</a>
        <a href="https://example.com">URL4</a>
    </div>
</div>

Firefox:

Oops, minify_html messed it up:

<div><span>blablabla</span><div><span>blablabla</span><b>Bold</b><i>Italic</i><u>Underlined</u><a href=https://example.com>URL1</a><a href=https://example.com>URL2</a><a href=https://example.com>URL3</a><a href=https://example.com>URL4</a></div></div>

So the intended output should be:

<div><span>blablabla</span> <div><span>blablabla</span> <b>Bold</b> <i>Italic</i> <u>Underlined</u> <a href=https://example.com>URL1</a> <a href=https://example.com>URL2</a> <a href=https://example.com>URL3</a> <a href=https://example.com>URL4</a></div></div>

Note that the space between the first <span> and the inner <div> is needed, because a <div> can be floated or inline if the CSS set it to do so. That's also true for <p> (Unfortunately if I replace the outer <div> with <p>, this space will be still absent, so it is another bug).

FYI: Python 3.9.10 (main, Feb 22 2022, 13:54:07) [GCC 11.2.0] on linux minify-html 0.8.0

wilsonzlin commented 2 years ago

As per the README, minify-html considers <div> a layout element and can only have layout or content elements (which elements belong in which categories are mentioned in the README). In this case, you have used formatting elements <span>, <b>, <i>, <u>, and <a> inside <div>, which breaks this assumption. I would suggest replacing all <div> usages with <p>, as that should be more semantically correct in this case.

It's true that using CSS can change a <div> to act like a content element and/or make it inline. However, this is true of any element. Therefore minify-html has to make assumptions (which are based on semantic best practices); otherwise it's impossible for minify-html to do anything. After all, any element could be like <pre> with the CSS: * { white-space: pre; }, in which case removing any whitespace is wrong.

Rongronggg9 commented 2 years ago

I would suggest replacing all <div> usages with <p>, as that should be more semantically correct in this case.

No, we can't do that, it's invalid to have nested <p>s, but nested <div>s are valid though. And what's more, that HTML is not my work, I just use this lib to deal with tons of HTML from RSS feeds. (Maybe it is not an intended usage? haha) Many of them have <div>s like this. Having a huge <div> block consisting of formatting elements unwrapped in layout or content elements is not uncommon as far as I met, though it is a bit strange as you have mentioned.

minify-html considers <div> a layout element and can only have layout or content elements

However, as MDN documented, the permitted content of <div> is Flow content. So having formatting elements unwrapped in layout or content elements in <div> is totally valid and should be expected to a certain extent... I can understand your point, restricting expected content to be layout or content elements creates much ease and optimized output, and most websites do obey this rule. If changing this rule does introduce much complexity, what about having an option to deal <div>s just like <p>s mostly (since you suggest me replacing all <div> usages with <p>)? Anyway, that's OK to make no change and I agree that in most cases and for most users this is not really a problem.

Thanks for your answer and your fancy work. This lib is the best one in similar libs I've met.

wilsonzlin commented 2 years ago

Good point about the nested <p>, I forgot about that rule, which is strictly enforced by browsers. Normally I'd suggest changing the structure but it sounds like you are using minify-html for other content (i.e. out of your control). I'm open to the possibility of having a special configuration that changes the meaning of <div> so that it could contain formatting elements, as it is a very widely used "generic" element, and many will probably use it in the case you have mentioned.

Brixy commented 1 year ago

Thanks a lot for your minifier!

I came here for a similar situation.

Your differentiation between content, formatting and layout tags does make a lot of sense and promotes the use of sematic HTML.

Would’t it make sense to have an option like

do_not_collapse_whitespace_for_consecutive_formatting_tags: true

This way

<div> <span>test</span> </div>

would be normally collapsed, but

<div> <span>test 1<span> <span>test 2</span> </div>

would be minified to:

<div><span>test 1<span> <span>test 2</span></div>

, and @Rongronggg9’s code to

<div><span>blablabla</span><div><span>blablabla</span> <b>Bold</b> <i>Italic</i> <u>Underlined</u> <a href=https://example.com>URL1</a> <a href=https://example.com>URL2</a> <a href=https://example.com>URL3</a> <a href=https://example.com>URL4</a></div></div>
chrispy-snps commented 4 months ago

@wilsonzlin - you already quite nicely categorize tags into various categories. Maybe providing a generalized way to change a tag's default category could work?

For my purposes, I do want the default behavior of <div> collapsing all possible whitespace. But for the cases above, perhaps reclassifying <div> into another category would resolve the issue?