zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License
2.66k stars 377 forks source link

Closing tag is missing within <![CDATA #571

Closed bytecode77 closed 1 month ago

bytecode77 commented 1 month ago

1. Description

Closing tag is missing within <![CDATA object.

Background: This issue surfaced when parsing HTML <code> blocks. The HTML content is given by Confluence, where the HTML is parsed from.

3. Fiddle or Project

https://dotnetfiddle.net/XmcOB6

Expected Result

<![CDATA[<a href="foo">
bar
</a>]]>

Actual Result

<![CDATA[<a href="foo">
bar
]]>

4. Any further technical details

JonathanMagnan commented 1 month ago

Hello @bytecode77 ,

Looking at Firefox and Chrome, I believe the current actual result from HAP is the right one:

FireFox:

image

The "comment" section end at the first occurrence of the > found. This means the end tag </a> is considered part of the HTML, but an end tag alone cannot really exist, so it gets removed.

I'm not exactly sure of all the rules that is currently applied, but the current actual result is indeed what I'm expecting.

Let me know if that explains the reason correctly.

Best Regards,

Jon

bytecode77 commented 1 month ago

Thanks for your quick response Jonathan!

For a comment, I think the end tag is not useful to the DOM. However, the code does not represent a comment:

<code><![CDATA[<a href=\"foo\">\r\nbar\r\n</a>]]></code>

Specifically, this code was retrieved by the Confluence API when reading Confluence pages. I'm not sure what the purpose of <![CDATA[ here is, though.

JonathanMagnan commented 1 month ago

However, the code does not represent a comment:

Indeed, you are right. I just assumed this, looking at the result, but that's not the case.

I never had to really use the <![CDATA[ as far as I remember, but looking at what I see on internet, it looks more related to be used within a script tag but not exclusively to this.

At this moment, it still looks like the current behavior looks more like the normal behavior unless I'm proving wrong. Again, I'm not familiar with this tag, so I could definitely be wrong.

Best Regards,

Jon

bytecode77 commented 1 month ago

This is the original HTML that is a Confluence page export that I'm parsing.

It does contain a <![CDATA[ within the text-body of a <ac:structured-macro ac:name="code" and it represents a Confluence code box.

Yes, it was used within <script> tags way before javascript was common to not offend non-supporting browsers. But it remains valid HTML that is, indeed, used:

<ac:structured-macro ac:name="code" ac:schema-version="1" ac:macro-id="70aacf91-111a-4b25-8c3c-543aa6fd0af9">
    <ac:plain-text-body>
        <![CDATA[<a href="foo" target="_blank">
  bar
</a>]]>
    </ac:plain-text-body>
JonathanMagnan commented 1 month ago

Hello @bytecode77 ,

Thank you for the additional info. I have looked at the HAP code, and the <!CDATA[ tag is not supported. The current behavior is more a combination of "Comment" and "Text" nodes that show the same result as Firefox.

I would not like to change the current default behavior, but I'm open to looking more at it to support it the way you want through an option that you will need to enable.

I should be able to look more at it later this week

Best Regards,

Jon

bytecode77 commented 1 month ago

Thanks for looking into it, Jon!

I'll stay tuned for your updates :)

JonathanMagnan commented 1 month ago

Hello @bytecode77 ,

A new version has been released today: https://github.com/zzzprojects/html-agility-pack/releases/tag/v1.11.68

Could you try the new option and let us know if everything is working as expected.

Best Regards,

Jon

bytecode77 commented 1 month ago

Hi Jonathan!

thank's for providing a new version! However, unfortunately the result is the same as before, see the dotnetfiddle that I posted initially.

To provide a more concrete example: When importing HTML from Confluence, this is what the HTML string looks like:

...
<ac:plain-text-body>
<![CDATA[<iframe src="xxxxxx" width="600" height="1000" frameborder="0">
  <a href="xxxxx" target="_blank">
    xxxxx
  </a>

</iframe>]]>
</ac:plain-text-body>
...

It's a code box showing some HTML.

However, the InnerHtml property is missing the closing tag:

<![CDATA[<iframe src="https://www.terminland.de/hno/?mode=frame" width="600" height="1000" frameborder="0">
  <a href="https://www.terminland.de/hno" target="_blank">
    Online-Terminbuchung
  </a>

]]>
JonathanMagnan commented 1 month ago

Hello @bytecode77 ,

Fiddle currently still uses the previous version of HAP and not the latest one (don't ask me why!).

You need to turn on the following options to make it works:

HtmlDocument document = new HtmlDocument();
document.OptionThreatCDataBlockAsComment = true;

I tested both your examples (the first one and the one you just posted), and both seem to work very fine with this option.

Best Regards,

Jon

bytecode77 commented 1 month ago

Ah, I see. However, unfortunately the option OptionThreatCDataBlockAsComment is not included in your latest build. I confirmed this by installing 1.11.68 in a completely blank project on another computer to make sure that this isn't a nuget cache issue of some sort.

JonathanMagnan commented 1 month ago

Oh god my bad, It looks like I did a bad release!

The v1.11.69 is now available. I double-checked to ensure I had not made the same mistake twice 😆

Best Regards,

Jon

JonathanMagnan commented 1 month ago

Let me know if this version was working

bytecode77 commented 1 month ago

Works like a charm! Thanks for your support, Jonathan, and keep up the nice work. I really love HtmlAgilityPack and I've been using it for many years.

Bye!

wo80 commented 1 month ago

@JonathanMagnan Regarding the spelling of OptionThreatCDataBlockAsComment: you probably meant treat and not threat ;-)

bytecode77 commented 1 month ago

You're right, wo80, didn't notice :D

@JonathanMagnan I think if you change the typo and break upward compatibility, there is only one person (me) who would be affected, but that's no problem. When I update some time in the future I will just change the wording on the calling site, too.

JonathanMagnan commented 1 month ago

Do'h! You are 100% right @wo80

~It will be changed the next time we release HAP, which will probably happen next month.~

JonathanMagnan commented 1 month ago

Hello,

The v1.11.70 is now available with the option renamed as proposed by @wo80 ;)

Best Regards,

Jon