zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License
2.64k stars 375 forks source link

Closing tag is missing within <![CDATA #571

Open bytecode77 opened 1 week ago

bytecode77 commented 1 week ago

1. Description

Closing tag is missing within <![CDATA object.

Background: This issue surfaced when parsing HTML <code> blocks. The HTML content is given by Confluence, where the HTML is parsed from.

3. Fiddle or Project

https://dotnetfiddle.net/XmcOB6

Expected Result

<![CDATA[<a href="foo">
bar
</a>]]>

Actual Result

<![CDATA[<a href="foo">
bar
]]>

4. Any further technical details

JonathanMagnan commented 1 week ago

Hello @bytecode77 ,

Looking at Firefox and Chrome, I believe the current actual result from HAP is the right one:

FireFox:

image

The "comment" section end at the first occurrence of the > found. This means the end tag </a> is considered part of the HTML, but an end tag alone cannot really exist, so it gets removed.

I'm not exactly sure of all the rules that is currently applied, but the current actual result is indeed what I'm expecting.

Let me know if that explains the reason correctly.

Best Regards,

Jon

bytecode77 commented 1 week ago

Thanks for your quick response Jonathan!

For a comment, I think the end tag is not useful to the DOM. However, the code does not represent a comment:

<code><![CDATA[<a href=\"foo\">\r\nbar\r\n</a>]]></code>

Specifically, this code was retrieved by the Confluence API when reading Confluence pages. I'm not sure what the purpose of <![CDATA[ here is, though.

JonathanMagnan commented 1 week ago

However, the code does not represent a comment:

Indeed, you are right. I just assumed this, looking at the result, but that's not the case.

I never had to really use the <![CDATA[ as far as I remember, but looking at what I see on internet, it looks more related to be used within a script tag but not exclusively to this.

At this moment, it still looks like the current behavior looks more like the normal behavior unless I'm proving wrong. Again, I'm not familiar with this tag, so I could definitely be wrong.

Best Regards,

Jon

bytecode77 commented 1 week ago

This is the original HTML that is a Confluence page export that I'm parsing.

It does contain a <![CDATA[ within the text-body of a <ac:structured-macro ac:name="code" and it represents a Confluence code box.

Yes, it was used within <script> tags way before javascript was common to not offend non-supporting browsers. But it remains valid HTML that is, indeed, used:

<ac:structured-macro ac:name="code" ac:schema-version="1" ac:macro-id="70aacf91-111a-4b25-8c3c-543aa6fd0af9">
    <ac:plain-text-body>
        <![CDATA[<a href="foo" target="_blank">
  bar
</a>]]>
    </ac:plain-text-body>
JonathanMagnan commented 6 days ago

Hello @bytecode77 ,

Thank you for the additional info. I have looked at the HAP code, and the <!CDATA[ tag is not supported. The current behavior is more a combination of "Comment" and "Text" nodes that show the same result as Firefox.

I would not like to change the current default behavior, but I'm open to looking more at it to support it the way you want through an option that you will need to enable.

I should be able to look more at it later this week

Best Regards,

Jon

bytecode77 commented 6 days ago

Thanks for looking into it, Jon!

I'll stay tuned for your updates :)