zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License
2.63k stars 375 forks source link

The html rendering result is different from the html output result #529

Closed colindcli closed 7 months ago

colindcli commented 9 months ago

When the p span does not close the tag, it is followed by the div tag. The HAP output is: <p><span><span><div></div></span></span></p> The Chrome output is: <p><span><span></span></span></p><div></div>

chrome

var html = @"<p><span><span><div></div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var newHtml = doc.DocumentNode.OuterHtml;  //<p><span><span><div></div></span></span></p>
var res = @"<p><span><span></span></span></p><div></div>"; //chrome or edge rendering results
var b = newHtml == res; //false
JonathanMagnan commented 9 months ago

Hello @colindcli ,

Thank you for reporting. We will look at it.

Best Regards,

Jon

JonathanMagnan commented 9 months ago

Hello @colindcli,

The v1.11.57 has been released

The "span" is not being closed before a "div" has been fixed." However, there is a slight difference with your result as the "p" currently has an implicit ending (before the "div") and not an explicit ending.

By default, the "p" tag behavior is really bad. I recommend you to use of of the 2 followings options:

HtmlDocument.DisableBehaviorTagP = true;

// or

doc.BackwardCompatibility = false;

Here is the current result with your code:

var html = @"<p><span><span><div></div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
//HtmlDocument.DisableBehaviorTagP = true;
doc.BackwardCompatibility = false;

var newHtml = doc.DocumentNode.OuterHtml;
var res = @"<p><span><span></span></span><div></div>";
var b = newHtml == res; // true

As mentioned, the "p" tag is not closed in this example as this is an implicit close, which is valid in HTML.

Adding an option to force an "Explicit" end could be possible when an "Implicit" end is found if you really require this.

Let me know if everything is working correctly for you now.

Best Regards,

Jon

colindcli commented 9 months ago

@JonathanMagnan There is no problem at all in the test, very good! Thank you so much!

JonathanMagnan commented 8 months ago

Hello @colindcli ,

Unfortunately, we had to undo our change.

As a few people reported, our code was breaking some valid rules.

I tried hard to make it work with the right behavior (finding the parent recursively), but unfortunately, it was leading to some other potential issue that I could already easily find.

Making the fix correctly for this specific case is easy, but making the fix correctly to make it work in general (with more complex HTML) is just too complicated as too many rules are missing to make it work correctly.

Unfortunately, with all the time and effort I put tried to make it happen, I failed, and the only option possible is to abandon it or wait for someone that make a pull request that could fix it without impacting anything.

So, the change has been reverted in v1.11.58

Best Regards,

Jon

colindcli commented 8 months ago

Hello @JonathanMagnan,

I'm very sorry because an update-wide error occurred due to a bug I mentioned. This is not what I want to see. If the impact is really large, version rollback is also acceptable.

Best Regards