zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

Parent OuterLength wrong when using <br/> #567

Closed (meum closed this issue 2 weeks ago)

meum commented 1 month ago

The / inside <br/> is not included in the parent node's OuterLength and OuterHtml, even though both values are correct on the br node itself.

Example:

var htmlDoc2 = new HtmlDocument();
htmlDoc2.LoadHtml("<html><body><br/></body></html>");
Console.WriteLine(htmlDoc2.DocumentNode.OuterLength.ToString()); // One too low because it doesn't count the /
Console.WriteLine(htmlDoc2.DocumentNode.OuterHtml); // Missing the /
Console.WriteLine(htmlDoc2.DocumentNode.SelectSingleNode("//br").OuterLength.ToString()); // Correct
Console.WriteLine(htmlDoc2.DocumentNode.SelectSingleNode("//br").OuterHtml); // Correct

Expected output:

31
<html><body><br/></body></html>
5
<br/>

Actual output:

30
<html><body><br></body></html>
5
<br/>

JonathanMagnan commented 1 month ago

Hello @meum ,

Thank you for reporting.

Here is what we have found so far:

Some nodes, such as DocumentNode, have their OuterHtml rewritten because _changed = true, so the UpdateHtml method is called.

When you access the "br" node directly, _changed = false, which means it takes the text directly from the original source instead: https://github.com/zzzprojects/html-agility-pack/blob/master/src/HtmlAgilityPack.Shared/HtmlNode.cs#L681

We will dig further into this issue, but at least we now understand why the two cases behave differently.
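
To make the explanation above concrete, here is a minimal toy model of the two code paths. It is only a sketch: ToyNode, Regenerate, and the Changed property are hypothetical names invented for illustration and are not HtmlAgilityPack's actual HtmlNode implementation. It assumes only what is described above, namely that a changed node regenerates its markup while an unchanged node returns a slice of the original text.

```csharp
using System;

// Toy model only: ToyNode is hypothetical and is NOT HtmlAgilityPack's HtmlNode.
class ToyNode
{
    private readonly string _sourceText; // original text the node was parsed from
    private readonly string _name;       // tag name, used when regenerating markup

    // Stands in for the _changed flag mentioned above.
    public bool Changed { get; }

    public ToyNode(string sourceText, string name, bool changed)
    {
        _sourceText = sourceText;
        _name = name;
        Changed = changed;
    }

    // Stand-in for UpdateHtml: rebuilds the markup from the tag name alone,
    // so the "/" present in the original text is lost.
    private string Regenerate() => "<" + _name + ">";

    // Changed nodes are rewritten; unchanged nodes return the original text.
    public string OuterHtml => Changed ? Regenerate() : _sourceText;

    public int OuterLength => OuterHtml.Length;
}

static class Program
{
    static void Main()
    {
        var untouched = new ToyNode("<br/>", "br", changed: false);
        var rewritten = new ToyNode("<br/>", "br", changed: true);

        Console.WriteLine(untouched.OuterHtml + " " + untouched.OuterLength); // <br/> 5
        Console.WriteLine(rewritten.OuterHtml + " " + rewritten.OuterLength); // <br> 4
    }
}
```

In this model the one-character difference comes entirely from which path produces the string, which matches the off-by-one between the expected and actual output reported above.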

Best Regards,

Jon

JonathanMagnan commented 1 month ago

Hello @meum,

A new option has been added starting from v1.11.65: OptionWriteEmptyNodesWithoutSpace

To write an "empty node" such as br with an ending tag, you need to use the option OptionWriteEmptyNodes = true; unfortunately, it also adds an extra space. Setting OptionWriteEmptyNodesWithoutSpace = true as well removes this extra space. This is not a perfect fix, as keeping the original ending would probably have been better, but it is certainly better than the current behavior:

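var htmlDoc2 = new HtmlDocument();
htmlDoc2.OptionWriteEmptyNodes = true;             // write empty nodes such as br with an ending tag
htmlDoc2.OptionWriteEmptyNodesWithoutSpace = true; // v1.11.65+: remove the extra space this otherwise adds
htmlDoc2.LoadHtml("<html><body><br/></body></html>");
Console.WriteLine(htmlDoc2.DocumentNode.OuterHtml);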

Best Regards,

Jon