zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License
2.65k stars 375 forks source link

Inconsistency between `node.OuterHtml`, `node.EndNode.OuterHtml`, and `node.Closed` #553

Open rvishruth opened 4 months ago

rvishruth commented 4 months ago

1. Description

Inconsistency between node.OuterHtml, node.EndNode.OuterHtml, and node.Closed

HtmlAgilityPack automatically closes certain nodes. However, when it does, the EndNode does not get updated accordingly.

Related follow-up questions:

  1. Is it possible to disable this auto-closing behavior?
  2. If not, how can one detect if any changes were made to the original HTML?
  3. Why are there no ParseErrors being logged?

2. Exception

N/A

3. Fiddle or Project

https://dotnetfiddle.net/wDNJfT

From output, p, <p>Hello, , <>, True - Endnode is <> and the <p> tag is not closed strong, <strong>world!</strong>, <>, True - EndNode is <> even though the OuterHTML has a closing tag

JonathanMagnan commented 4 months ago

Hello @rvishruth ,

1. Is it possible to disable this auto-closing behavior?

No, it's currently not possible. However, it look to be easily possible by adding 2 simple options:

The <> is usually when there is no EndNote. There is indeed no implicit end in the case of an implicit end.

We can surely add those options, but then you will have some side impact as some of your tag is not closed. So you will probably not get the desired behavior even with those options.

2. If not, how can one detect if any changes were made to the original HTML?

Besides comparing the HTML, I don't think there is anyway.

3. Why are there no ParseErrors being logged?

Not all parsing errors is logged. Implicit ending like <p>a<p>b is not a parsing error but a valid HTML syntax, so therefore, it's not logged.

Let me know if that answer correctly to your question.

Best Regards,

Jon

rvishruth commented 4 months ago

Hi @JonathanMagnan! Thank you for your response!

(2) and (3) makes sense to me!

Regarding (1), could you expand on what you mean by

The <> is usually when there is no EndNote. There is indeed no implicit end in the case of an implicit end.

Is this not a bug (because the Endnode is <> even though the there was a closing tag added)?

JonathanMagnan commented 4 months ago

Hello @rvishruth ,

Is this not a bug (because the Endnode is <> even though the there was a closing tag added)?

Indeed, it can be seen as a bug. I would have expected a null value for the EndNode as there is no EndNode. I would not have expected the OuterHtml to either be <> but empty instead.

But we didn't think about it years ago when we added the ImplicitEnd logic, and modifying this part of the code would cause some breaking change impact for other developers, so that's not something we are currently comfortable fixing.

Best Regards,

Jon

rvishruth commented 4 months ago

I understand, thanks @JonathanMagnan! Few more follow up questions:

  1. Indeed, it can be seen as a bug. I would have expected a null value for the EndNode as there is no EndNode. I would not have expected the OuterHtml to either be <> but empty instead.

This makes sense for the p tag case, where there was no closing tag added. But for the strong tag case, HAP adds a closing tag, so, I'm not sure if a future change will make it so that the EndNode will be </strong>. With this in mind, what would be the way to detect if the original HTML had any improperly closed tags present? Would it be to iterate through every node and check if the EndNode == "<>"?

  1. Why is node.Closed True for p, <p>Hello, , <>, True (see fiddle)?
JonathanMagnan commented 4 months ago

Hello @rvishruth ,

Just to let you know, I'm not the original creator of this library (we purchased it as it was no longer maintained), so what I'm saying is what I understand.

The strong tag is automatically closed because the p tag just before gets closed (so everything inside the p tag needs to be closed.

Unlike a p tag where the end tag is optional (See Tag omission section here, the strong requires an end tag (see Tag omission here.

So HAP automatically added the end tag when writing the HTML, but when he parsed the HTML, there was indeed no end tag, so no EndNode. Is it a bug? Is it a normal behavior? At this point, I have no idea. I can just explain what is happening.

I'm not really sure what the best way to achieve your actual behavior would be. HAP is not 100% HTML compatible so you will surely get some weird behavior especially with invalid HTML.

rvishruth commented 4 months ago

Thank you for the explanation @JonathanMagnan! I Given the current functionality, I think checking if EndNode is <> might be the best way to detect if there was a closing tag missing (apart from comparing with the original HTML).

I leave it up to you if you want to keep the issue open. I do think the current behavior is inconsistent, but its also unclear if its a bug.