zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# that reads and writes the DOM and supports plain XPath or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

HtmlDocument shows `<link>foo</link>` tag as just `<link>foo` #524

Davidsv closed 11 months ago

Davidsv commented 11 months ago

Description

If I pass a raw string that contains <link>foo</link> to htmlDocument.LoadHtml(raw), then output htmlDocument.DocumentNode.OuterHtml, it will show up as <link>foo (without the closing tag).

Similarly, if I set htmlDocument.OptionWriteEmptyNodes = true, the output will be <link />foo, which perhaps indicates that it thinks it's an empty node?

Note: my input is not strictly expected to be a web page, I know <link> might have special meaning. But I'd still like to be able to load it as a regular node.
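
For reference, the behaviour described above can be reproduced with a sketch along these lines. The names LinkRepro and Render are made up for illustration, and the outputs in the comments are the ones reported in this issue rather than independently verified results.

```csharp
using System;
using HtmlAgilityPack;

public static class LinkRepro
{
    // Hypothetical helper: parses the raw string and returns the serialized document.
    public static string Render(string raw, bool writeEmptyNodes)
    {
        var htmlDocument = new HtmlDocument();
        htmlDocument.OptionWriteEmptyNodes = writeEmptyNodes;
        htmlDocument.LoadHtml(raw);
        return htmlDocument.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Reported in this issue as: <link>foo
        Console.WriteLine(Render("<link>foo</link>", false));

        // Reported in this issue as: <link />foo
        Console.WriteLine(Render("<link>foo</link>", true));
    }
}
```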

Fiddle

https://dotnetfiddle.net/QASHg5

elgonzo commented 11 months ago

The <link> element in HTML does not allow any content, only attributes, and therefore has no end tag (see the specification). HAP, being an HTML parser, parses it as a regular HTML <link> element, which is why you get the output you see.

elgonzo commented 11 months ago

Looking around a bit in HAP's source code, there seems to be a way to achieve what you want. The HtmlAgilityPack.HtmlNode class maintains a static dictionary, HtmlNode.ElementsFlags, that assigns certain element characteristics to certain element names. For the link element name, the dictionary marks it as an empty element.

Since HtmlNode.ElementsFlags is publicly accessible, it is sufficient to remove the entry for link from this dictionary to get the desired result:

HtmlNode.ElementsFlags.Remove("link");

var html = @"<root><link>foo</link><url>bar</url></root>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
...

Note that due to HtmlNode.ElementsFlags being a static field, modifying or replacing its assigned dictionary will affect all parsing done by HAP in your application.
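
One way to limit that global effect might be to remove the entry only for the duration of a parse and put it back afterwards. The following is a minimal sketch, assuming single-threaded use: the dictionary is shared, so any parsing that runs concurrently elsewhere in the application would still see the change. LinkAsRegularNode and ParseWithLinkContent are hypothetical names, not part of HAP.

```csharp
using System;
using HtmlAgilityPack;

public static class LinkAsRegularNode
{
    // Hypothetical helper: parses raw with "link" treated as a regular element,
    // then restores whatever flag HAP had registered for it.
    public static string ParseWithLinkContent(string raw)
    {
        // Remember the current flag (if any) so it can be restored afterwards.
        bool hadFlag = HtmlNode.ElementsFlags.TryGetValue("link", out HtmlElementFlag saved);
        HtmlNode.ElementsFlags.Remove("link");
        try
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(raw);
            return doc.DocumentNode.OuterHtml;
        }
        finally
        {
            // Put the original entry back, even if parsing threw.
            if (hadFlag)
            {
                HtmlNode.ElementsFlags["link"] = saved;
            }
        }
    }
}
```

Calling LinkAsRegularNode.ParseWithLinkContent(html) with the sample string above should then leave HtmlNode.ElementsFlags as it was before the call. This does not make the approach thread-safe; it only narrows the window in which other code sees the modified dictionary.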

(P.S.: I am just a user of HAP and not associated with the project nor its authors/maintainers.)

JonathanMagnan commented 11 months ago

Thank you @elgonzo for your help again. Your answer is 100% correct.

Let us know if you have any additional questions about this, @Davidsv.

Best Regards,

Jon

Davidsv commented 11 months ago

Perfect, this is good enough for me. Thank you both. Closing.